Leading items
Welcome to the LWN.net Weekly Edition for April 19, 2018
This edition contains the following feature content:
- Counting beans—and more—with Beancount: a look at a popular text-based accounting application.
- PostgreSQL's fsync() surprise: the way fsync() works on Linux (and beyond) may lead to silent data corruption in response to I/O errors. What can be done to improve the situation?
- The second half of the 4.17 merge window: the rest of the changes merged for 4.17.
- The rhashtable documentation I wanted to read: Neil Brown examines (and describes) the kernel's rhashtable data structure.
- A look at terminal emulators, part 2: the second half of the terminal-emulator series, with a focus on latency and performance.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Counting beans—and more—with Beancount
It is normally the grumpy editor's job to look at accounting software; he does so with an eye toward getting the business off of the proprietary QuickBooks application and moving to something free. It may be that Beancount deserves a look of that nature before too long but, in the meantime, a slightly less grumpy editor has been messing with this text-based accounting tool for a variety of much smaller projects. It is an interesting system, with a lot of capabilities, but its reliance on hand-rolling for various pieces may scare some folks off.
Text
"Text based" means just what it says on the tin—all of the information used by Beancount is stored in regular text files. Those who are used to GUI tools for accounting may be daunted, but text editors are remarkably flexible tools with a wide array of customization possibilities. Emacs is one editor that is noted for its integration with other tools, such as compilers, build systems, debuggers, and source-code control—making it into an IDE for some. An Emacs minor mode for Beancount is available and there are also customization packages for Vim and Sublime.
At its core, Beancount "simply" implements double-entry bookkeeping on various types of items (which could be pork bellies, money in multiple currencies, vacation hours, beans, air miles, or anything else that can be counted). A typical transaction might look something like:
2018-04-01 * "yearly subscription"
  Expenses:Subscriptions:LWN           75.60 USD
  Liabilities:US:BigBankCorp:TuxCard
That adds the amount of an LWN subscription to the appropriate expense account, while deducting the same amount from the "TuxCard" liability account. The "-75.60 USD" needed to balance out the transaction is automatically applied to that final entry. On a complicated transaction, with several different expense buckets, it may be easier to let Beancount do the math; for this simple transaction, though, it makes just as much sense to put the value into the entry explicitly:
  ...
  Liabilities:US:BigBankCorp:TuxCard  -75.60 USD
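Conversely, for a multi-bucket transaction, the balancing leg can simply be left blank. A sketch of what that might look like (the travel sub-accounts and amounts here are invented for illustration); Beancount supplies the balancing -312.75 USD on the final leg:
2018-04-03 * "conference trip"
  Expenses:Travel:Hotel             250.00 USD
  Expenses:Travel:Meals              45.50 USD
  Expenses:Travel:Taxi               17.25 USD
  Liabilities:US:BigBankCorp:TuxCard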
The "balancing" nature of transactions is one of the fundamental concepts behind Beancount (and double-entry accounting as a whole, I think, though I am far from an expert): all transactions must sum to zero. This central idea is described in "Beancount - The Double-Entry Counting Method" by Beancount developer Martin Blais. As with much of the Beancount documentation, that document is helpful and well-written, but in some ways the documentation is like the tool itself: it opens up huge possibilities, but doesn't grab new users by the scruff of the neck and say "do it this way". In addition, some of the documentation is starting to get a bit long in the tooth, with dates from 2014 and 2015, so it may not all accurately reflect the current state.
The open-ended nature of the documentation seems clearly in keeping with the philosophy of the project as a whole: provide lots of power and only minimal guidance. Which is not to say that there is no guidance at all, just that it is rather general and non-prescriptive. The "Getting Started" document is helpful, as is the "Tutorial & Example" file. But, what to track, how to do so, and how to integrate that tracking into your life or business is largely left up to you. For those who are not afraid of the command line, though, Beancount does not lock users into a particular scheme: changing account names or the hierarchy is easily done with a regular expression in the editor or a simple script.
In addition, Beancount makes it easy to start tracking new things at any point—simply establish the account, perhaps a starting balance, and begin recording transactions. There is no need for everything to be rooted to a single "epoch" date, so things can change as needs do. Account names are largely free-form (though spaces are not allowed) and hierarchies can be established using colon separation:
1970-01-01 open Expenses:Dining
1970-01-01 open Expenses:Dining:Drinks
1970-01-01 open Expenses:Dining:Tips
2017-10-23 open Expenses:Dining:FaveRestaurant ; how much am i spending here?
That creates a hierarchy of dining expense accounts that can be used to allocate parts of the tab into different buckets—or all into Expenses:Dining if desired or the details have been lost. The dates are arbitrary to some extent, though expenses cannot be recorded to an account prior to its "open" date (or after its "close" date for closed accounts, naturally). Comments start with a ";", but Beancount silently ignores lines that do not fit syntactically, which allows adding things like headers for Emacs Org Mode to help organize the input file.
Beancount is written in Python and released under the GPLv2. Installation from the Python Package Index (PyPI) using pip is straightforward, but it does require Python 3.5 or higher. Once installed, there are a dozen or so utilities available, all starting with "bean-". If you already have some data in a file, bean-check filename will tell you if there are any problems with the file; if you don't have any data, capturing the output of bean-example into a file will provide you with a detailed test bed (and crib-sheet) full of fictional Beancount data.
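As a quick sketch of that workflow (the file name here is arbitrary):
$ bean-example > example.beancount
$ bean-check example.beancount
If bean-check prints nothing and exits successfully, the file should have parsed and balanced cleanly; otherwise it reports the problems it found.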
Imports and reports
It is certainly possible to add all of the data of interest by hand (or with assistance from an editor plugin), but importing transactions directly from CSV and other file types from banks, credit card companies, brokerages, and the like is possible as well. Once again, it requires some hand assembly, but the process to create importers is documented; Beancount has a few different commands that provide a framework for gathering useful information from the myriad of different formats. Being able to make use of Beancount without spending inordinate amounts of time manually entering data will likely depend on having enough of these for the sources of interest for your business or personal finances.
Reports of various sorts are, of course, a staple of any accounting system. Beancount provides several different ways to get reports, starting with the bean-report command. It has many different report options, both in terms of what is reported and how (e.g. balance reports, balance sheets, income reports, and net worth) as well as what format to produce the report in (e.g. text, CSV, HTML, and so on).
The inclusion of "HTML" in the output formats leads us to one—two, actually—of the most useful ways to examine Beancount data. The first comes with Beancount; bean-web starts a web server on localhost:8080 that provides different reports, over monthly and yearly time frames, that can be explored in the browser. A screen shot of the interface using the example data can be seen on the right. In addition, the bean-bake command will create a directory tree or ZIP file containing all of the processed data in a format suitable for sending to accountants or others to browse on their own systems.
The web pages created by bean-web are fairly bare-bones—simple HTML that just presents the data at hand. For those looking for a more eye-catching, interactive, JavaScript-based site, Fava may fit the bill. It runs a web server at localhost:5000 that provides many of the same kinds of reports as bean-web, but in a more interactive form. Fava has its own unique ways of looking at the data too, as can be seen on the left.
Fava has its own editor component that can be used to enter transactions in a way that will be familiar to those coming from the GUI-accounting world. The file itself can be edited directly using that interface as well. Another thing that Fava gives direct access to is Beancount's query capabilities, which can also be accessed from the command line using bean-query. For both, the Beancount Query Language (BQL) can be used to query the data set in ways that will seem familiar to SQL users. As noted in the BQL document, though, there are specific needs for accounting queries that are not met by simply pulling the data into, say, SQLite tables and doing queries on those.
Another interesting aspect of Beancount is that it knows little about what it counts. US Dollars, zorkmids, stocks, shopping points, or Whuffie are all treated the same way. The text file can establish its "operating currency" (or currencies), but that simply affects how some reports are generated. Any of the commodities can be priced relative to another (on a given date) using the "price" directive, as with this fictional gold price:
2018-04-13 price GLD 254.62 USD
Prices can also be specified as part of a transaction, like a stock buy or a currency conversion; as long as the price information is available, any commodity can be converted to any other so that reports can all be in a single currency (e.g. "What is our net worth in zorkmids?").
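As a sketch of that transaction-level form (the accounts and exchange rate are invented for illustration), a currency conversion can carry an "@" price annotation on the converted leg:
2018-04-14 * "currency conversion"
  Assets:EU:EuroBank:Checking      -100.00 EUR @ 1.23 USD
  Assets:US:BigBankCorp:Checking    123.00 USD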
Conclusion
This just scratches the surface of what Beancount can do, of course. It is not the first, nor the only, text-based accounting system out there, as the Plain Text Accounting page will attest. In addition, it will not come as a surprise that text files are relatively easy to convert between these different systems, so trying out one does not preclude trying out or using one of the others down the road.
Text offers a number of advantages over binary formats; it is more durable, usable without any application at all, easy to diff or store in Git (or another version-control system), and so on. Editors may not seem like the first choice for an interface to an accounting system, and they may not work out in the long run; it is too soon to tell. But Beancount has enough going for it that further investigation, including using it for a small-business project, seems indicated. That may well lead to more articles here; stay tuned ...
PostgreSQL's fsync() surprise
Developers of database management systems are, by necessity, concerned about getting data safely to persistent storage. So when the PostgreSQL community found out that the way the kernel handles I/O errors could result in data being lost without any errors being reported to user space, a fair amount of unhappiness resulted. The problem, which is exacerbated by the way PostgreSQL performs buffered I/O, turns out not to be unique to Linux, and will not be easy to solve even there.
Craig Ringer first reported the problem to the pgsql-hackers mailing list at the end of March. In short, PostgreSQL assumes that a successful call to fsync() indicates that all data written since the last successful call made it safely to persistent storage. But that is not what the kernel actually does. When a buffered I/O write fails due to a hardware-level error, filesystems will respond differently, but that behavior usually includes discarding the data in the affected pages and marking them as being clean. So a read of the blocks that were just written will likely return something other than the data that was written.
What about error status reporting? One year ago, the Linux Filesystem, Storage, and Memory-Management Summit (LSFMM) included a session on error reporting, wherein it was described as "a mess"; errors could easily be lost so that no application would ever see them. Some patches merged during the 4.13 development cycle improved the situation somewhat (and 4.16 had some changes to improve it further), but there are still ways for error notifications to be lost, as will be described below. If that happens to a PostgreSQL server, the result can be silent corruption of the database.
PostgreSQL developers were not pleased. Tom Lane described it as "kernel brain damage", while Robert Haas called it "100% unreasonable". In the early part of the discussion, the PostgreSQL developers were clear enough on what they thought the kernel's behavior should be: pages that fail to be written out should be kept in memory in the "dirty" state (for later retries), and the relevant file descriptor should be put into a permanent error state so that the PostgreSQL server cannot miss the existence of a problem.
Where things go wrong
Even before the kernel community came into the discussion, though, it started to become clear that the situation was not quite as simple as it might seem. Thomas Munro reported that Linux is not unique in behaving this way; OpenBSD and NetBSD can also fail to report write errors to user space. And, as it turns out, the way that PostgreSQL handles buffered I/O complicates the picture considerably.
That mechanism was described in detail by Haas. The PostgreSQL server runs as a collection of processes, many of which can perform I/O to the database files. The job of calling fsync(), however, is handled in a single "checkpointer" process, which is concerned with keeping on-disk storage in a consistent state that can recover from failures. The checkpointer doesn't normally keep all of the relevant files open, so it often has to open a file before calling fsync() on it. That is where the problem comes in: even in 4.13 and later kernels, the checkpointer will not see any errors that happened before it opened the file. If something bad happens before the checkpointer's open() call, the subsequent fsync() call will return successfully. There are a number of ways in which an I/O error can happen outside of an fsync() call; the kernel could encounter one while performing background writeback, for example. Somebody calling sync() could also encounter an I/O error — and consume the resulting error status.
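The problematic pattern is easy to see in a small sketch (this is illustrative C only, not PostgreSQL's actual checkpointer code): the file is opened just to be synced, after the writes, and possibly the failed writeback, have already happened.
/* Illustrative sketch only; not PostgreSQL code.  The file was written
 * earlier by other processes; this function reopens it just to fsync(). */
#include <fcntl.h>
#include <unistd.h>

int checkpoint_one_file(const char *path)
{
        int fd = open(path, O_RDWR);

        if (fd < 0)
                return -1;
        /*
         * If writeback of earlier buffered writes failed before this
         * open(), this fsync() can still return success; that is the
         * surprise described above.
         */
        if (fsync(fd) < 0) {
                close(fd);
                return -1;
        }
        return close(fd);
}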
Haas described this behavior as failing to live up to what PostgreSQL expects from fsync().
Joshua Drake eventually moved the conversation over to the ext4 development list, bringing in part of the kernel development community. Dave Chinner quickly described this behavior as "a recipe for disaster, especially on cross-platform code where every OS platform behaves differently and almost never to expectation". Ted Ts'o, instead, explained why the affected pages are marked clean after an I/O error occurs; in short, the most common cause of I/O errors, by far, is a user pulling out a USB drive at the wrong time. If some process was copying a lot of data to that drive, the result will be an accumulation of dirty pages in memory, perhaps to the point that the system as a whole runs out of memory for anything else. So those pages cannot be kept if the user wants the system to remain usable after such an event.
Both Chinner and Ts'o, along with others, said that the proper solution is for PostgreSQL to move to direct I/O (DIO) instead. Using DIO gives a greater level of control over writeback and I/O in general; that includes access to information on exactly which I/O operations might have failed.
Andres Freund, like a number of other PostgreSQL developers, has acknowledged that DIO is the best long-term solution. But he also noted that getting there is "a metric ton of work" that isn't going to happen anytime soon. Meanwhile, he said, there are other programs (he mentioned dpkg) that are also affected by this behavior.
Toward a short-term solution
As the discussion went on, a fair amount of attention was paid to the idea that write failures should result in the affected pages being kept in memory, in their dirty state. But the PostgreSQL developers had quickly moved on from that idea and were not asking for it. What they really need, in the end, is a reliable way to know that something has gone wrong. Given that, the normal PostgreSQL mechanisms for dealing with errors can take over; in its absence, though, there is little that can be done.
One idea that came up a few times was to respond to an I/O error by marking the file itself (in the inode) as being in a persistent error state. Such a change, though, would take Linux behavior further away from what POSIX mandates and would raise some other questions, including: when and how would that flag ever be cleared? So this change seems unlikely to happen.
At one point in the discussion, Ts'o mentioned that Google has its own mechanism for handling I/O errors. The kernel has been instrumented to report I/O errors via a netlink socket; a dedicated process gets those notifications and responds accordingly. This mechanism has never made it upstream, though. Freund indicated that this kind of mechanism would be "perfect" for PostgreSQL, so it may make a public appearance in the near future.
Meanwhile, Jeff Layton pondered another idea: setting a flag in the filesystem superblock when an I/O error occurs. A call to syncfs() would then clear that flag and return an error if it had been set. The PostgreSQL checkpointer could make an occasional syncfs() call as a way of polling for errors on the filesystem holding the database. Freund agreed that this might be a viable solution to the problem.
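A checkpointer-side sketch of how that polling might look under the semantics Layton proposed (hypothetical only; current kernels do not clear and report a per-filesystem error flag this way), where dbdir_fd is any descriptor on the filesystem holding the database:
#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>

/* Hypothetical: returns 0 if no I/O error was flagged since the last
 * call, or a negative errno if the proposed error flag had been set. */
static int poll_filesystem_errors(int dbdir_fd)
{
        if (syncfs(dbdir_fd) < 0)
                return -errno;
        return 0;
}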
Any such mechanism will only appear in new kernels, of course; meanwhile, PostgreSQL installations tend to run on old kernels maintained by enterprise distributions. Those kernels are likely to lack even the improvements merged in 4.13. For such systems, there is little that can be done to help PostgreSQL detect I/O errors. It may come down to running a daemon that scans the system log, looking for reports of I/O errors there. Not the most elegant solution, and one that is complicated by the fact that different block drivers and filesystems tend to report errors differently, but it may be the best option available.
The next step is likely to be a discussion at the 2018 LSFMM event, which happens to start on April 23. With luck, some sort of solution will emerge that will work for the parties involved. One thing that will not change, though, is the simple fact that error handling is hard to get right.
The second half of the 4.17 merge window
By the time the 4.17 merge window was closed and 4.17-rc1 was released, 11,769 non-merge changesets had been pulled into the mainline repository. 4.17 thus looks to be a typically busy development cycle, with a merge window only slightly busier than 4.16's. Some 6,000 of those changes were pulled after last week's summary was written. There was a lot of the usual maintenance work in those patches (over 10% of those changes were to device-tree files, for example), but also some more significant changes, including:
Core kernel
- The CLOCK_MONOTONIC and CLOCK_BOOTTIME clocks used to differ only in that the latter is fast-forwarded after a suspend-and-resume cycle. As of 4.17, CLOCK_MONOTONIC is also moved forward to reflect the time that the system spent suspended. As a result, the two timers are now identical and have been unified within the kernel. Among other things, that change eliminates a potentially surprising behavior wherein the offset between the monotonic and realtime clocks would change after a resume. Thomas Gleixner noted: "There might be side effects in applications, which rely on the (unfortunately) well documented behaviour of the MONOTONIC clock, but the downsides of the existing behaviour are probably worse." If applications do break, this change may have to be reverted. Meanwhile, there is a new clock (CLOCK_MONOTONIC_ACTIVE) that only advances when the system is actually running.
- The new INOTIFY_IOC_SETNEXTWD ioctl() command allows inotify users to specify the number of the descriptor they would like to see returned for the next watch descriptor they create. This is used for checkpoint/restart.
- After a few years of waiting, the histogram trigger feature was added to the tracing subsystem. This mechanism enables the easy creation, in kernel space, of histograms from tracing data.
- The mmap() system call supports a new MAP_FIXED_NOREPLACE option. Like MAP_FIXED, it tries to place the new memory region at a user-supplied address. Unlike MAP_FIXED, though, it will not replace an existing mapping at that address; instead, it will fail with EEXIST if such a mapping exists. This is the change that was discussed last year as MAP_FIXED_SAFE; it seems that the battle over the proper name for the feature has finally been resolved.
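A minimal sketch of how a user-space program might probe for a free region with the new flag (hint_addr and length are placeholders for the caller's choices; MAP_FIXED_NOREPLACE requires 4.17-era headers):
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Try to map exactly at hint_addr without clobbering anything. */
static void *map_at_hint(void *hint_addr, size_t length)
{
        void *p = mmap(hint_addr, length, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);

        if (p == MAP_FAILED && errno == EEXIST) {
                /* something is already mapped at hint_addr; fall back */
        }
        return p;
}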
Architecture-specific
- The ARM architecture has gained support for the "system control and management interface", or SCMI. It is a set of standards for system management and, in particular, power management.
- 64-Bit PowerPC systems now have the ability to address up to 4PB of memory.
- Support for POWER4 processors was accidentally (they swear) broken in 2016, and nobody complained. So support for those processors has been removed entirely on the assumption that nobody is using them anymore.
Filesystems
- The overlayfs filesystem can present different inode numbers for the same file at different times, potentially confusing applications that use those numbers. The "xino" option added for 4.17 will store the filesystem ID in the upper part of the inode number, which allows overlayfs to present inode numbers that will not change over time. Some information can be found in Documentation/filesystems/overlayfs.txt.
Security-related
- The kernel now supports the Speck cipher, a block cipher that is said to outperform AES on systems without hardware AES support.
- AES encryption in Cipher Feedback Mode is now supported; this is required for TPM2 cryptography.
- The SM4 symmetric cipher algorithm is supported; it is "an authorized cryptographic algorithm for use within China", according to the commit.
- The SCTP protocol now has complete SELinux support; see Documentation/security/SELinux-sctp.rst for details.
- The AppArmor security module has gained basic support for the control of socket use. See this commit for a little bit of documentation.
Hardware support
- Audio: Texas Instruments PCM1789 codecs, AKM AK4458 and AK5558 codecs, Rohm BD28623 codecs, Motorola CPCAP codecs, Maxim MAX9759 speaker amplifiers, ST TDA7419 audio processors, and UniPhier AIO audio subsystems.
- Cryptographic: ARM TrustZone CryptoCell security processors and TI Keystone NETCP SA hardware random-number generators.
- Industrial I/O: Melexis MLX90632 infrared sensors, Analog Devices AD5272 digital potentiometers, On Semiconductor LV0104CS ambient light sensors, and Microchip MCP4017/18/19 digital potentiometers.
- USB: HiSilicon STB SoCs COMB PHYs, AMLogic Meson GXL and GXM USB3 PHYs, STMicroelectronics STM32 USB HS PHY controllers, HiSilicon INNO USB2 PHYs, Motorola Mapphone MDM6600 USB PHYs, Pericom PI3USB30532 Type-C cross switches, ELAN USB touchpads, and devices supporting USB class 3 audio.
- Miscellaneous: QCOM on-chip GENI based serial ports, MediaTek SoC gigabit Ethernet controllers, Raspberry Pi 3 GPIO expanders, Nintendo Wii GPIO controllers, Spreadtrum SC9860 platform GPIO controllers, RAVE SP power buttons, PhoenixRC flight controller adapters, HiSilicon hi3660 mailbox controllers, Socionext SynQuacer I2C controllers, Intersil ISL12026 realtime clocks, Nuvoton NPCM750 watchdog timers, Mediatek MT2701 audsys clocks, Allwinner H6 clock controllers, Silicon Labs 544 I2C clock generators, Synopsys DesignWare AXI DMA controllers, and MediaTek High-Speed DMA controllers.
Other
- The ABI for 32-bit RDMA users has changed in incompatible ways. The changes are justified with the claim that there are no actual users of the 32-bit mode now, but some may be coming in the future.
Internal kernel changes
- The way that system calls are invoked on the x86-64 architecture has been reworked to make it more uniform and flexible. The new scheme has also been designed to prevent unused (but caller-controlled) data from getting onto the call stack — where it could perhaps be used in a speculative-execution attack.
- The lexer and parser modules used by the kernel build process are now themselves built on the target system (requiring flex and bison) rather than being shipped in the kernel repository.
As expected, the final diffstat for this merge window shows that more lines of code were deleted than added — 191,000 more. This is only the third time in the kernel's history that a release has been smaller than its predecessor.
Also possibly worthy of note is that the final SCSI pull pushed the kernel repository to over six-million objects. Linus added: "I was joking around that that's when I should switch to 5.0, because 3.0 happened at the 2M mark, and 4.0 happened at 4M objects. But probably not, even if numerology is about as good a reason as any."
This kernel now enters the stabilization process, which will culminate in the final 4.17 (or maybe 5.0?) release in early June.
The rhashtable documentation I wanted to read
The rhashtable data structure is a generic resizable hash-table implementation in the Linux kernel, which LWN first introduced as "relativistic hash tables" back in 2014. I thought at the time that it might be fun to make use of rhashtables, but didn't, until an opportunity arose through my work on the Lustre filesystem. Lustre is a cluster filesystem that is currently in drivers/staging while the code is revised to meet upstream requirements. One of those requirements is to avoid duplicating similar functionality where possible. As Lustre contains a resizable hash table, it really needs to be converted to use rhashtables instead — at last I have my opportunity.
It didn't take me long to discover that the rhashtable implementation in Linux 4.15 is quite different from the one that originally landed in Linux 3.17, so the original LWN introduction is now barely relevant. I also quickly discovered that the in-kernel documentation was partially wrong, far from complete, and didn't provide any sort of "getting started" guide. Nevertheless I persisted and eventually developed a fairly complete understanding of the code, which seems worth sharing. This article gives an introduction to the use of the rhashtable interfaces without getting into too many internal implementation details. A followup will explain how rhashtables work internally and show how some of the mechanism details leak through the interfaces.
Creating your first hash table
Suppose you have some data structure that you want to store in a hash table. This data structure will have a key and, usually, some other content. To work with rhashtables, the data structure must contain a struct rhash_head field which is used for table linkage, so the whole structure might be:
struct object {
        int key;
        struct rhash_head linkage;
        char content[64];
        refcount_t ref;
        struct rcu_head rcu_head;
};
This example includes a reference counter and an rcu_head for lifetime management, which will be used in later examples.
Creating a hash table to store some of these structures requires declaring some parameters and calling rhashtable_init(), for example:
static const struct rhashtable_params object_params = {
        .key_len     = sizeof(int),
        .key_offset  = offsetof(struct object, key),
        .head_offset = offsetof(struct object, linkage),
};
struct rhashtable my_objects;
success = rhashtable_init(&my_objects, &object_params);
rhashtable_init() can fail with -EINVAL if some of the parameters given are invalid (this example only shows a small selection of the available parameters) or -ENOMEM if it failed to allocate some required memory.
Providing just the length and offset of the key is sufficient for simple keys; rhashtables will provide code to compute a hash over the given number of bytes and do a byte-by-byte comparison to see if a given key matches an object. Often, keys are not that simple. Examples where length-plus-offset is not sufficient include:
- a variable-length key, such as a NUL-terminated string
- a key that is not directly contained in the object, but is accessed through a pointer in the object
- a key that is a C struct that might contain padding bytes. It is hard to guarantee these bytes will be zero, so a direct comparison is unlikely to work.
In such cases, a hash function (.hashfn()) and object comparison function (.obj_cmpfn()) can be provided with signatures matching:
struct rhashtable_compare_arg {
        struct rhashtable *ht;
        const void *key;
};

typedef u32 (*rht_hashfn_t)(const void *data, u32 len, u32 seed);
typedef int (*rht_obj_cmpfn_t)(struct rhashtable_compare_arg *arg,
                               const void *obj);
When these are provided, the key_len is largely ignored, though it should be given a non-zero value since a key_len of zero indicates a more complex approach to hashing that is described later. The obj_cmpfn() should return zero for a match and non-zero otherwise, much like memcmp(). I like to return a negative error code, such as -ESRCH, rather than 1 as some in-kernel users do; it makes it more obvious when reading the code that zero means "success".
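As a sketch of how those hooks fit together for the variable-length case (the structure and function names here are invented; the key is a NUL-terminated string stored inline in the object, and jhash() comes from <linux/jhash.h>):
struct str_object {
        char name[32];                  /* the key */
        struct rhash_head linkage;
};

static u32 str_hashfn(const void *data, u32 len, u32 seed)
{
        /* "data" is the string, whether it came from a lookup key or
         * from key_offset within an object; the declared len is ignored. */
        return jhash(data, strlen(data), seed);
}

static int str_obj_cmpfn(struct rhashtable_compare_arg *arg, const void *obj)
{
        const struct str_object *o = obj;

        /* zero means "match", as with memcmp() */
        return strcmp(arg->key, o->name) ? -ESRCH : 0;
}

static const struct rhashtable_params str_params = {
        .key_len     = 1,       /* non-zero, but otherwise unused here */
        .key_offset  = offsetof(struct str_object, name),
        .head_offset = offsetof(struct str_object, linkage),
        .hashfn      = str_hashfn,
        .obj_cmpfn   = str_obj_cmpfn,
};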
Putting objects in the table
The most general way to insert into a hash table is to call:
old_obj = rhashtable_lookup_get_insert_fast(&my_objects, &new_obj.linkage, object_params);
If an object already exists in the table with the same key, it is returned as old_obj and new_obj is not inserted. If no such old object exists, new_obj will normally be inserted and NULL will be returned. It is technically possible for insertion to fail, in which case an ERR_PTR() is returned. These error cases will be explained in the next article once the required context has been presented.
If you don't want the old object (i.e. don't want to "get" it), or don't even want to check if it exists (no "lookup" wanted), other interfaces are available. The three main insert interfaces are:
void *rhashtable_lookup_get_insert_fast(struct rhashtable *ht, struct rhash_head *obj,
const struct rhashtable_params params);
int rhashtable_lookup_insert_fast(struct rhashtable *ht, struct rhash_head *obj,
const struct rhashtable_params params);
int rhashtable_insert_fast(struct rhashtable *ht, struct rhash_head *obj,
const struct rhashtable_params params);
The fact that they all end in the word fast is probably a historical accident; there are no "slow" versions. All rhashtable interfaces are "fast" in that they never sleep and can be called from interrupt handlers, while holding spinlocks, or in any other atomic context. Note that rhashtable_insert_fast() can insert duplicates in the table, so it must be used with caution.
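A sketch of handling the return value of the most general form, following the running example (handle_error() and use_existing() are placeholders):
struct object *old;

old = rhashtable_lookup_get_insert_fast(&my_objects, &new_obj.linkage,
                                        object_params);
if (old == NULL) {
        /* new_obj is now in the table */
} else if (IS_ERR(old)) {
        handle_error(PTR_ERR(old));     /* insertion failed, e.g. -ENOMEM */
} else {
        use_existing(old);              /* duplicate key; new_obj not inserted */
}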
The passing of the params argument to these functions by value deserves some explanation. The parameters were already passed when the table was created, so the ht argument should already contain the parameter information — why pass it again?
All of these insert functions, as well as the lookup and remove functions that we will meet shortly, are defined in include/linux/rhashtable.h as static inline functions, so the compiler generates a new copy of the code every place they are called. Because params is declared as a const by-value argument, the compiler can treat it as constant during code optimization and can discard code that is never used. In the example given above, no hashfn() is specified, so the code that ultimately gets run won't even test to see if there is a hash function. It will go directly to the default hash function (jhash2() in this case — the Jenkins hash function). This approach allows a lot of flexibility in the C code, with minimal overhead at run time.
Lookup and object lifetimes
An object can be found in the hash table using rhashtable_lookup_fast():
void *rhashtable_lookup_fast(struct rhashtable *ht, const void *key,
const struct rhashtable_params params);
for example:
int key = 42;
struct object *found = rhashtable_lookup_fast(&my_objects, &key, object_params);
Note that there is a slight asymmetry between insert and lookup. When inserting an object, the address of the linkage (the rhash_head structure embedded in that object) is passed in. When looking up an object, the address of the whole object is returned. This is a stable pattern across the whole API and is easy to get used to: pass in the address of the linkage, get back the address of the object.
When performing a lookup, you need always to be mindful of object lifetimes to ensure you don't find an object just as it is being deleted. In my previous experience with hash tables, each object in the table has a reference count which is incremented as part of the lookup operation while a hash-table lock is still held. Object removal takes the same lock and ensures that a newly looked-up object isn't removed. The resizable hash-table implementation in Lustre understands this and can be told how to increment or decrement a reference counter so that race-free lookup is easy. Rhashtables provide no such support.
While this may seem like a weakness in rhashtables, it is really a strength. There are many different protocols that can be used to manage object lifetimes and, even with the fairly common reference-counter approach, there are different rules as to what happens when the count reaches zero. Encoding all these in the rhashtable code would be unwieldy; leaving it to the caller allows complete flexibility.
One of the benefits of integrating object lifetimes with the hash table is that the hash-table lock can be leveraged to help with lifetimes. Without that benefit, a caller must provide their own locking. In many cases the RCU read lock is quite sufficient, and its cost is usually indistinguishable from zero.
A common lifetime rule has an object removed from the hash table only when the reference count becomes zero. In this case there is a small window between zero being reached and the removal completing. This can be allowed for on lookup with:
key = 42;
rcu_read_lock();
obj = rhashtable_lookup_fast(&my_objects, &key, params);
if (obj && !refcount_inc_not_zero(&obj->ref))
        obj = NULL;
rcu_read_unlock();
Object removal
To remove an object from the table, use rhashtable_remove_fast():
int rhashtable_remove_fast(struct rhashtable *ht, struct rhash_head *obj,
const struct rhashtable_params params);
This call will fail with -ENOENT if the object isn't found. There is no interface to remove an object given only the key, but you can easily combine lookup with removal:
rcu_read_lock();
obj = rhashtable_lookup_fast(table, key, params);
if (obj && rhashtable_remove_fast(table, &obj->linkage, params) == 0)
        kfree_rcu(obj, rcu_head);
rcu_read_unlock();
When using RCU to help manage object lifetimes, it is important to use kfree_rcu() or similar to free the object, so the memory doesn't get reused while some other thread is performing a concurrent lookup, hence its inclusion in this example.
To complete the example started earlier where an object is removed when the reference count becomes zero, the removal code might look like:
if (refcount_dec_and_test(&obj->ref)) {
        rhashtable_remove_fast(&my_objects, &obj->linkage, object_params);
        kfree_rcu(obj, rcu_head);
}
Walking all the objects in the table
It is occasionally useful to iterate (or "walk") over all the objects in a hash table. Rhashtables make this fairly easy, though there are some caveats. To support walking there is one data structure and six functions:
struct rhashtable_iter;
void rhashtable_walk_enter(struct rhashtable *ht, struct rhashtable_iter *iter);
int rhashtable_walk_start_check(struct rhashtable_iter *iter);
void rhashtable_walk_start(struct rhashtable_iter *iter);
void *rhashtable_walk_next(struct rhashtable_iter *iter);
void rhashtable_walk_stop(struct rhashtable_iter *iter);
void rhashtable_walk_exit(struct rhashtable_iter *iter);
There are two different protocols for keeping track of the "current" location in the table during a walk, one that assumes the RCU read lock is held and one that does not. These can be mixed, so RCU might be held while walking for a while, then it might be dropped for a while, at which point the non-RCU protocol must be used to continue. This subtlety explains why there is both an enter/exit pair of functions, and a start/stop pair. The enter/exit pair must be called at the beginning and end, respectively, and they set up, then tear down, the iterator (struct rhashtable_iter). During a walk, the start/stop pair must be called at least once and all calls to rhashtable_walk_next() must occur between rhashtable_walk_start() and rhashtable_walk_stop().
To make this more concrete, a simple table walk that proceeds entirely under the RCU read lock might be:
struct rhashtable_iter iter;
rhashtable_walk_enter(&my_objects, &iter);
rhashtable_walk_start(&iter);
while ((obj = rhashtable_walk_next(&iter)) != NULL) {
        if (IS_ERR(obj))
                continue;
        printk_obj(obj);
}
rhashtable_walk_stop(&iter);
rhashtable_walk_exit(&iter);
The RCU read lock is held between rhashtable_walk_start() and rhashtable_walk_stop(), so nothing in the body of the walk is permitted to sleep. This walk is completely safe against concurrent insertions and removals: any object that was in the table at the start and is still there at the end will be printed. Objects inserted or removed during the walk may or may not be seen. It is quite safe to remove the object that you are visiting when walking like this.
If the hash table is resized during the walk, it is possible to see duplicates and rhashtable_walk_next() can return with the error -EAGAIN at some point, hence the test. The implication of this will be detailed in the followup article.
If the action on some objects needs to potentially sleep, then the walk must be temporarily "stopped", for example:
struct rhashtable_iter iter;
rhashtable_walk_enter(&my_objects, &iter);
rhashtable_walk_start(&iter);
while ((obj = rhashtable_walk_next(&iter)) != NULL) {
        if (IS_ERR(obj))
                continue;
        if (uninteresting(obj))
                continue;
        if (!refcount_inc_not_zero(&obj->ref))
                continue;
        rhashtable_walk_stop(&iter);
        do_something_slow(obj);
        rhashtable_walk_start(&iter);
        obj_put(obj);
}
rhashtable_walk_stop(&iter);
rhashtable_walk_exit(&iter);
When rhashtable_walk_stop() drops the RCU read lock, the rhashtable code can no longer be sure that obj will remain in the table, so rhashtable_walk_next() cannot just get the "next" object from the linkage structure. Instead it remembers which hash chain the object was in, and how many elements along the chain it was. If a concurrent insertion or removal changes the chain, the skip-count can be wrong and so the walk might miss an object, or see a duplicate.
In our my_objects running example, we need to claim a reference to an object when dropping the RCU lock; this will prevent the object from being removed from the table. In this case the object is, in fact, still in the table, so rhashtable_walk_start() should be able to find it and restart the walk exactly where it left off. A patch has been submitted to implement this and, once accepted, it should be possible to walk the hash table and be guaranteed not to miss anything. Guaranteeing that no duplicates will be seen is not possible if concurrent resizes are permitted. The implications of this will be examined more closely when we dig into resizing next time.
The rest of the API
That covers most of the interesting parts of the API, but for completeness we should quickly cover the remainder.
When you have finished with an rhashtable it must be destroyed, typically with rhashtable_free_and_destroy():
void rhashtable_free_and_destroy(struct rhashtable *ht,
void (*free_fn)(void *ptr, void *arg),
void *arg);
This will call free_fn() on every object in the table, then free all internally allocated data structures. If there is nothing in the table, or if cleanup should be avoided for some other reason, then the simpler rhashtable_destroy() can be used.
void rhashtable_destroy(struct rhashtable *ht);
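For the running my_objects example, a minimal teardown sketch (free_object() is a name invented here) might be:
static void free_object(void *ptr, void *arg)
{
        kfree(ptr);             /* ptr is the struct object itself */
}

rhashtable_free_and_destroy(&my_objects, free_object, NULL);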
The lookup-and-insert interfaces we looked at above contain an implicit assumption that the key embedded in an object is in exactly the same format as the key that would be passed to rhashtable_lookup(). This is not always the case, as seen in an example from the Mellanox Ethernet driver that has a lookup key with a linked list of values, and an object that contains an array of these values. As the obj_cmpfn() expects a lookup key and an object, this circumstance can only be managed if the insert function is passed both a key and an object. Further, a table like this will need two different hash functions, one to hash a standalone key and one to hash the key that is embedded in an object.
To use rhashtable with this sort of key, three changes to the normal approach are needed:
- .key_len must be set to 0 (or left unset).
- A new function .obj_hashfn must be provided in the parameters. It has the same signature as hashfn() but is passed the object rather than the key.
- Instead of the lookup_insert functions already listed, one of the insert_key functions must be used:
static inline void *rhashtable_lookup_get_insert_key(struct rhashtable *ht, const void *key,
                                                     struct rhash_head *obj,
                                                     const struct rhashtable_params params);
static inline int rhashtable_lookup_insert_key(struct rhashtable *ht, const void *key,
                                               struct rhash_head *obj,
                                               const struct rhashtable_params params);
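A sketch of those three changes for a key reached through a pointer (again with invented names; my_table, new, and old are placeholders). Note that the standalone-key hash and the object hash must agree:
struct pobject {
        char *name;                     /* key, reached through a pointer */
        struct rhash_head linkage;
};

static u32 pobj_key_hashfn(const void *data, u32 len, u32 seed)
{
        return jhash(data, strlen(data), seed);         /* data is the string */
}

static u32 pobj_obj_hashfn(const void *data, u32 len, u32 seed)
{
        const struct pobject *o = data;                 /* data is the object */

        return jhash(o->name, strlen(o->name), seed);
}

static int pobj_cmpfn(struct rhashtable_compare_arg *arg, const void *obj)
{
        const struct pobject *o = obj;

        return strcmp(arg->key, o->name) ? -ESRCH : 0;
}

static const struct rhashtable_params pobj_params = {
        /* .key_len left at zero: the key is not embedded directly */
        .head_offset = offsetof(struct pobject, linkage),
        .hashfn      = pobj_key_hashfn,
        .obj_hashfn  = pobj_obj_hashfn,
        .obj_cmpfn   = pobj_cmpfn,
};

old = rhashtable_lookup_get_insert_key(&my_table, new->name, &new->linkage,
                                       pobj_params);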
Sometimes you might want a hash table to potentially contain multiple objects for any given key. In that case you can use "rhltables" — rhashtables with lists of objects. The interfaces are much the same as for rhashtables except that the table is represented by struct rhltable and the function names start with rhltable_ instead of rhashtable_:
int rhltable_init(struct rhltable *hlt, const struct rhashtable_params *params);
struct rhlist_head *rhltable_lookup(struct rhltable *hlt, const void *key,
const struct rhashtable_params params);
int rhltable_insert(struct rhltable *hlt, struct rhlist_head *list,
const struct rhashtable_params params);
int rhltable_insert_key(struct rhltable *hlt, const void *key,
struct rhlist_head *list, const struct rhashtable_params params);
int rhltable_remove(struct rhltable *hlt, struct rhlist_head *list,
const struct rhashtable_params params);
void rhltable_walk_enter(struct rhltable *hlt, struct rhashtable_iter *iter);
void rhltable_free_and_destroy(struct rhltable *hlt,
void (*free_fn)(void *ptr, void *arg), void *arg);
What's next?
This completes the "getting started" introduction, which has already helped improve code quality as I found, while writing this, that some of my Lustre conversion was wrong. It should be enough for others to successfully use rhashtables in their own code, but it is not yet complete. There are more configuration parameters, and various subtleties related to resizing that need to be understood before you can be sure that all cases are covered. This will be explored in the second half of this series.
A look at terminal emulators, part 2
A comparison of the feature sets for a handful of terminal emulators was the subject of a recent article; here I follow that up by examining the performance of those terminals. This might seem like a lesser concern, but as it turns out, terminals exhibit surprisingly high latency for such fundamental programs. I also examine what is traditionally considered "speed" (but is really scroll bandwidth) and memory usage, with the understanding that the impact of memory use is less than it was when I looked at this a decade ago (in French).
Latency
After thorough research on terminal emulator performance, I have come to the conclusion that its most important aspect is latency. In his Typing with pleasure article, Pavel Fatin reviewed the latency of various text editors and hinted that terminal emulators might be slower than the fastest text editors. That is what eventually led me to run my own tests on terminal emulators and write this series.
But what is latency and why does it matter? In his article, Fatin defined latency as "a delay between the keystroke and corresponding screen update" and quoted the Handbook of Human-Computer Interaction which says: "Delay of visual feedback on a computer display have important effects on typist behavior and satisfaction."
Fatin explained that latency has more profound effects than just satisfaction: "typing becomes slower, more errors occur, eye strain increases, and muscle strain increases". In other words, latency can lead to typos but also to lesser code quality as it imposes extra cognitive load on the brain. But worse, "eye and muscle strain increase" seems to imply that latency can also lead to physical repetitive strain injuries.
Some of those effects have been known for a long time, with some results published in the Ergonomics journal in 1976 showing that a hundred-millisecond delay "significantly impairs the keying speed". More recently, the GNOME Human Interface Guidelines set the acceptable response time at ten milliseconds and, pushing this limit down even further, this video from Microsoft Research shows that the ideal target might even be as low as one millisecond.
Fatin performed his tests on editors, but he created a portable tool called Typometer that I used to test latency in terminal emulators. Keep in mind that the test is a simulation: in reality, we also need to take into account input (keyboard, USB controller, etc.) and output (graphics card and monitor buffers) latency. Those typically add up to more than 20ms in typical configurations, according to Fatin. With more specialized "gaming" hardware, the bare minimum is around three milliseconds. There is therefore little room for applications to add any latency to the pipeline. Fatin's goal is to bring that extra latency down to one millisecond or even zero-latency typing, which was released as part of IntelliJ IDEA 15. Here are my measurements, which include some text editors, showing that my results are consistent with Fatin's:
Latency in milliseconds

Program          mean   std   min   90%   max
uxterm            1.7   0.3   0.7     2   2.4
mlterm            1.8   0.3   0.7   2.2   2.5
Vim (Athena)      2.8   1.1   0.4   3.5  12.7
Vim (GTK2)        3.9   1.2   0.7   4.8  11.9
Emacs             4.8   2.3   0.5   5.8  32.5
gedit             8.9   3.4   2.8  12.5  14.2
Konsole          13.4   1.2  11.5    15  16.1
Alacritty        15.1   1.2  12.8  15.9  26.3
st               15.7   3.9  10.6  19.4  19.6
Vim (GTK3)       16.5   7.9   0.4  21.9  27.2
urxvt            18.3   0.3  17.3  18.7    19
pterm            23.4   0.9  21.7  24.5  25.4
GNOME Terminal   27.1     1  25.9  27.5  39.3
Xfce Terminal    27.4   0.4  26.4  27.9  28.7
Terminator       28.1   0.7  26.4    29  29.4
The first thing that struck me is that old programs like xterm and mlterm have the best response time, with a worst-case latency (2.4ms) better than the best case for all other terminals (10.6ms for st). No modern terminal crosses the ten-millisecond threshold. In particular, Alacritty doesn't seem to live up to its "fastest terminal emulator in existence" claims either, although results have improved since I first tested the program in July 2017. Indeed, the project seems to be aware of the situation and is working on improving the display pipeline with threads. We can also note that Vim using GTK3 is slower than its GTK2 counterpart by an order of magnitude. It might therefore be possible that the GTK3 framework introduces extra latency; we can also observe that other GTK3-based terminals (Terminator, Xfce4 Terminal, and GNOME Terminal, for example) have higher latency.
You might not notice those differences. As Fatin explains: "one does not necessarily need to perceive latency consciously to be affected by it". Fatin also warns about standard deviation (the std column above and the width of error bars in the graph): "any irregularities in delay durations (so called jitter) pose additional problem because of their inherent unpredictability".
The results above are from a clean Debian 9 (stretch) profile with the i3 window manager. That environment gives the best results in the latency test: as it turns out, GNOME introduces about 20ms of latency to all measurements. A possible explanation could be that there are programs running that synchronously handle input events: Fatin gives the example of Workrave, which adds latency by processing all input events synchronously. By default, GNOME also includes a compositing window manager (Mutter), an extra buffering layer that adds at least eight milliseconds in Fatin's tests.
The same tests were also run on Fedora 27 with GNOME running on X.org, and the change is drastic: latency at least doubled and, in some cases, is ten times larger. Forget the one-millisecond target: all terminals go far beyond the ten-millisecond budget. The VTE family gets closer to fifty milliseconds, with Terminology and GNOME Terminal having spikes well above that threshold. There is also more jitter in those tests. Even with the added latency, mlterm and, to a lesser extent, xterm still perform better than their closest competitors, Konsole and st.
Scrolling speed
The next test is the traditional "speed" or "bandwidth" test that measures how fast the terminal can scroll by displaying a large amount of text on the terminal at once. The mechanics of the test vary; the original test I found was simply to generate the same test string repeatedly using the seq command. Other tests include one from Thomas E. Dickey, the xterm maintainer, which dumps the terminfo.src file repeatedly. In another review of terminal performance, Dan Luu uses a base32-encoded string of random bytes that is simply dumped on the terminal with cat. Luu considers that kind of test to be "as useless a benchmark as I can think of" and suggests using the terminal's responsiveness during the test as a metric instead. Dickey also dismisses that test as misleading.
Yet both authors recognize that bandwidth can be a problem: Luu discovered the Emacs Eshell hangs while showing large files and Dickey implemented an optimization to work around the perceived slowness of xterm. There is therefore still some value in this test as the rendering process varies a lot between terminals; it also serves as a good test harness for testing other parameters.
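One simple variant of this kind of test (the exact command and line count used for the measurements here may differ) is just to time a large burst of seq output in the terminal under test:
$ time seq 1 100000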
In this test, rxvt and st are ahead of all others, closely followed by the much newer Alacritty, which was expressly designed for speed. Xfce (representing the VTE family) and Konsole are next, taking almost twice the time, while xterm comes last, almost five times as slow as rxvt. During the test, xterm also had jitter in the display: it was difficult to see the actual text going by, even if it was always the same string. Konsole was fast, but it was cheating: the display would hang from time to time, showing a blank or partially drawn display. Other terminals, including st, Alacritty, and rxvt, generally displayed all lines faithfully.
Dickey explains that performance variations are due to the design of scrollback buffers in the different terminals. Specifically, he blames the disparity on rxvt and other terminals "not following the same rules".
To fix this perceived slowness of xterm, Dickey introduced the fastScroll resource to allow xterm to drop some screen updates to catch up with the flow and, indeed, my tests confirm that the resource improves performance enough to match rxvt. It is, however, a rather crude implementation, as Dickey explains: "sometimes xterm — like konsole — appears to stop, since it is waiting for a new set of screen updates after having discarded some". In this case, it seems that other terminals found a better compromise between speed and display integrity.
Resource usage
Regardless of the worthiness of bandwidth as a performance metric, it does provide a way to simulate load on the terminals, which in turn allows us to measure other parameters like memory or disk usage. The metrics here were obtained by running the above seq benchmark under the supervision of a Python process that collected the results of getrusage() counters for ru_maxrss, the sum of ru_oublock and ru_inblock, and a simple timer for wall clock time.
St comes first in this benchmark with the smallest memory footprint, 8MB on average, which was no surprise considering the project's focus on simplicity. Slightly larger are mlterm, xterm, and rxvt at around 12MB. Another notable result is Alacritty, which takes a surprising amount of memory at 30MB. Next come the VTE family members, which vary between 40 and 60MB, a higher result that could be explained by those programs' use of higher-level libraries like GTK. Konsole comes last with a whopping 65MB of memory usage during the tests, although that might be excused due to its large feature set.
Compared with the results I had a decade ago, all programs take up much more memory. Xterm used to take 4MB of memory, but now takes 15MB just on startup. A similar increase also affects rxvt, which now takes 16MB of memory out of the box. The Xfce Terminal now takes 34MB, a three-fold increase, yet GNOME Terminal only takes 20MB on startup. Of course, the previous tests were done on a 32-bit architecture. At LCA 2012, Rusty Russell also explained there are many more subtle reasons that could explain such an increase. Besides, maybe this is something we can live with in this modern day and age of multi-gigabyte core memory sizes.
Yet I can't help but feel there's a waste of resources for something so fundamental as a terminal. Those programs should be the smallest of the small and should be able to run in a shoe box, when those eventually run Linux (you know they will). Yet with those numbers, memory usage would be a concern only when running multiple terminals in anything but the most limited of environments. To compensate, GNOME Terminal, Konsole, urxvt, Terminator, and Xfce Terminal feature a daemon mode that manages multiple terminals through a single process which limits the impact of their larger memory footprint.
Another result I have found surprising in my tests is actual disk I/O: I did not expect any, yet some terminals write voluminous amounts of data to disk. It turns out the VTE library actually writes the scrollback buffer to disk, a "feature" that was noticed back in 2010 and that is still present in modern implementations. At least the file contents are now encrypted with AES256 GCM since 0.39.2, but this raises the question of what's so special about the VTE library that it requires such an exotic approach.
Conclusion
In the previous article, we found that VTE-based terminals have a good feature set, yet here we see that this comes with some performance costs. Memory isn't a big issue since all VTE terminals are spawned from a single daemon process that limits memory usage. Old systems tight on core memory might still need older terminals with lower memory usage, however. While VTE terminals behave well in bandwidth tests, their latency is higher than the criterion set in the GNOME Human Interface Guidelines, which is probably something that the developers of the VTE library should look into. Considering how inevitable the terminal is even for novice users in Linux, those improvements might make the experience slightly less traumatic. For seasoned geeks, changing from the default terminal might even mean quality improvements and fewer injuries during long work sessions. Unfortunately, only the old xterm and mlterm get us to the magic 10ms range, which might involve unacceptable compromises for some.
The latency benchmarks also show there are serious tradeoffs that came with the introduction of compositors in Linux graphical environments. Some users might want to take a look at conventional window managers, since they provide significant latency improvements. Unfortunately, it was not possible to run latency tests in Wayland: the Typometer program does exactly the kind of things Wayland was designed to prevent, namely inject keystrokes and spy on other windows. Hopefully, Wayland compositors are better than X.org at performance and someone will come up with a way of benchmarking latency in those environments in the future.
Page editor: Jonathan Corbet