Leading items
Welcome to the LWN.net Weekly Edition for March 5, 2020
This edition contains the following feature content:
- The costs of continuous integration: freedesktop.org's continuous-integration system has been a huge success, but can the project afford to keep it running?
- Attestation for kernel patches: a proposal for a way to secure the provenance of patches for the kernel.
- An end to high memory?: a technique needed to make 32-bit systems work is reaching the end of its welcome in the kernel.
- Unexporting kallsyms_lookup_name(): a proposal to remove a way to circumvent the kernel's module symbol system, coming from what some might see as a surprising source.
- Python time-zone handling: adding symbolic time zones to Python's datetime class is not as easy as it might seem.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
The costs of continuous integration
By most accounts, the freedesktop.org (fd.o) GitLab instance has been a roaring success; lots of projects are using it, including Mesa, Linux kernel graphics drivers, NetworkManager, PipeWire, and many others. In addition, a great deal of continuous-integration (CI) testing is being done on a variety of projects under the fd.o umbrella. That success has come at a price, however. A recent message from the X.Org Foundation, which merged with fd.o in 2019, has made it clear that the current situation is untenable from a financial perspective. Given its current resources, X.Org cannot continue covering those costs beyond another few months.
X.Org board member Daniel Vetter posted a message to multiple mailing lists on February 27. In it, he noted that the GitLab instance has become quite popular, including the CI integration, which is good news. But there is some bad news to go with it:
The expense growth is mainly from "storing and serving build artifacts and images to outside CI runners sponsored by various companies". Beyond that, all of that growth means that a system administrator is needed to maintain the infrastructure, so "X.org is therefore also looking for admin sponsorship, at least medium term". Without more sponsors for the costs of the CI piece, it looks like those services would need to be turned off in May or June, he said. The board is working on finding additional sponsorship money, but that takes time, so it wanted to get the word out.
That set off a discussion of the problem and some possible solutions. Matt Turner was concerned that the bandwidth expense had not been adequately considered when the decision was made to self-host the GitLab instance. "Perhaps people building the CI would make different decisions about its structure if they knew it was going to wipe out the bank account."
He wondered if the tradeoff was worth the cost:
Daniel Stone, one of the administrators for the fd.o infrastructure (who gave a talk on the organization's history at the 2018 X.Org Developers Conference), filled in some of the numbers for the costs involved. He said that the bill for January was based on 17.9TB of network egress (mostly copying CI images to the test-running systems) and 16TB of storage for "CI artifacts, container images, file uploads, [and] Git LFS". That totaled to almost $4,000, so the $75,000 projection takes into account further growth. In a follow-up message, he detailed the growth as well:
Stone also noted that the Google Cloud Platform (where the GitLab instance is hosted) does not provide all of the platform types needed for running the CI system. For example, Arm-based DragonBoards are needed, so some copying to external testing systems will be required. Using the cloud services means that bandwidth is metered, which is not necessarily true in other hosting setups, such as virtual private servers, as Jan Engelhardt pointed out. That would require more system administration costs, however, which Stone thinks would now make sense:
Dave Airlie argued that the CI infrastructure should be shut down "until we work out what a sustainable system would look like within the budget we have". He thought that it would be difficult to attract sponsors to effectively pay Google and suggested that it would make more sense for Google to cut out the middleman: "Having google sponsor the credits costs google substantially less than having any other company give us money to do it."
Vetter said that Google has provided $30,000 in hosting credits over the last year, but that money "simply ran out _much_ faster than anyone planned for". In addition, there are plenty of other ways that companies can sponsor the CI system:
The lack of any oversight of what gets run in the CI system and which projects are responsible for it is part of the problem, Airlie said. "You can't have a system in place that lets CI users burn [large] sums of money without authorisation, and that is what we have now." Vetter more or less agreed, but said that the speed of the growth caught the board by surprise, "so we're a bit behind on the controlling aspect". There is an effort to be able to track the costs by project, which will make it easier to account for where the money is going—and to take action if needed.
As part of the reassessment process, Kristian Høgsberg wanted to make sure that the "tremendous success" of the system was recognized. "Between gitlab and the CI, our workflow has improved and code quality has gone up." He said that it would have been hard to anticipate the growth:
Reducing costs
The conversation soon turned toward how to reduce the cost in ways that would not really impact the overall benefit that the system is providing. There may be some low-hanging fruit in terms of which kinds of changes actually need testing on all of the different hardware. As Erik Faye-Lund put it:
[...] We could also do stuff like reducing the amount of tests we run on each commit, and punt some testing to a per-weekend test-run or [something] like that. We don't *need* to know about every problem up front, just the stuff that's about to be released, really. The other stuff is just nice to have. If it's too expensive, I would say drop it.
There were other suggestions along those lines, as well as discussion of how to use GitLab features to reduce some of the "waste" in the amount of CI testing that is being done. It is useful to look at all of that, but Jason Ekstrand cautioned against getting too carried away:
He continued by noting that more data will help guide the process, but he is worried about the effect on the development process of reducing the amount of CI testing:
[...] I'm fairly hopeful that, once we understand better what the costs are (or even with just the new data we have), we can bring it down to reasonable and/or come up with money to pay for it in fairly short order.
Vetter is also worried that it could be somewhat difficult to figure out what tests are needed for a given change, which could result in missing out on important test runs:
In any case, the community is now aware of the problem and is pitching in to start figuring out how best to attack it. Presumably some will also be working with their companies to see if they can contribute as well. Any of the solutions is likely to require some effort, both from developers using the infrastructure and from the administrators of the system.
GitLab's new open-source program manager, Nuritzi Sanchez, also chimed in; the company is interested in ensuring that community efforts like fd.o are supported beyond just the migration help that was already provided, she said. "We’ll be exploring ways for GitLab to help make sure there isn’t a gap in coverage during the time that freedesktop looks for sponsors."
While it may have come as a bit of a shock to some in the community, the announcement would seem to have served its purpose. The community has now been informed and can start working on the problem from various directions. Given the runaway (perhaps too runaway) success of the CI infrastructure, one suspects that a sustainable model will be found before too long—probably well ahead of the (northern hemisphere) summer cutoff date.
Attestation for kernel patches
The kernel development process is based on trust at many levels — trust in developers, but also in the infrastructure that supports the community. In some cases, that trust may not be entirely deserved; most of us have long since learned not to trust much of anything that shows up in email, for example, but developers still generally trust that emailed patches will be what they appear to be. In his ongoing effort to bring more security to kernel development, Konstantin Ryabitsev has proposed a patch attestation scheme that could help subsystem maintainers verify the provenance of the patches showing up in their mailboxes.
One might wonder why this work is needed at all, given that email attestation has been widely available for almost exactly as long as the kernel has existed; Phil Zimmermann first released PGP in 1991. PGP (and its successor, GnuPG) have always been painful to use, though, even before considering their interference with patches and the review process in particular; PGP-signed mail can require escaping characters or be mangled by mail transfer agents. It is safe to say that nobody bothers checking the few PGP signatures that exist on patches sent via email.
Ryabitsev's goal is to make attestation easy enough that even busy kernel developers will be willing to add it to their workflow. The scheme he has come up with is, for now, meant for integration with processes that involve using git send-email to send out a set of patches, though it is not tightly tied to that workflow. A developer can add attestation to their process by creating a directory full of patches and sending them out via git send-email in the usual manner; attestation is then done as a separate step, involving an additional email message.
In particular, the developer will run the attest-patches tool found in Ryabitsev's korg-helpers repository. It will look at each patch and split it into three components:
- Some patch metadata: specifically the author's name and email address, along with the subject line.
- The commit message.
- The patch itself.
The tool will use sha256sum to create a separate SHA-256 checksum for each of the three components. The three checksums are then joined, in an abbreviated form, to create a sort of unique ID for the patch that looks like:
2a02abe0-215cf3f1-2acb5798
The attest-patches tool creates a file containing this "attestation ID", along with the full checksums for all three components:
2a02abe0-215cf3f1-2acb5798:
i: 2a02abe02216f626105622aee2f26ab10c155b6442e23441d90fc5fe4071b86e
m: 215cf3f133478917ad147a6eda1010a9c4bba1846e7dd35295e9a0081559e9b0
p: 2acb5798c366f97501f8feacb873327bac161951ce83e90f04bbcde32e993865
A block like this is generated for each patch given to attest-patches. The result happens to be a file in the YAML format, but one can live in ignorance of that fact without ill effect. The file is then passed to GnuPG for signing. The final step is to email this file to signatures@kernel.org, where it will appear on a public mailing list; attest-patches can perform this step automatically.
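For illustration, here is a minimal Python sketch of how such an ID can be derived with hashlib; it assumes, as the example above suggests, that the abbreviated form is simply the first eight hex digits of each component's SHA-256 checksum, and it ignores whatever normalization attest-patches may apply to the components before hashing:
import hashlib

def attestation_id(metadata, commit_message, patch):
    # One SHA-256 per component: patch metadata ("i"), commit message ("m"),
    # and the patch itself ("p").
    digests = [hashlib.sha256(part.encode("utf-8")).hexdigest()
               for part in (metadata, commit_message, patch)]
    # Join the first eight hex digits of each digest, giving something like
    # "2a02abe0-215cf3f1-2acb5798".
    return "-".join(d[:8] for d in digests)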
On the receiving end, a reviewer or subsystem maintainer runs get-lore-mbox with the -aA options; -A does not actually exist yet but one assumes it will appear shortly. As the tool creates a mailbox file suitable for feeding to git am, it will verify the attestation for each of the patches it finds. That is done by generating its own attestation ID for each patch, then using that ID to search for messages on the signatures mailing list. If any messages are found, the full checksum for each of the three patch components is checked. The GPG signature in the file is also checked, of course.
If the checks pass — meaning that an applicable signature message exists, the checksums match the patches in question, and the message is signed by a developer known to the recipient — then get-lore-mbox will create the requested mailbox file, adding a couple of tags to each patch describing the attestation that was found. Otherwise the tool will abort after describing where things went wrong.
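The checksum-matching part of that process is conceptually simple; the hypothetical helper below (which leaves out the lore.kernel.org search and the GnuPG signature check that get-lore-mbox actually performs) just compares locally computed digests against the full "i", "m", and "p" values from one block of an attestation file:
import hashlib

def digests_match(block, metadata, commit_message, patch):
    # "block" maps "i", "m", and "p" to the full checksums taken from a
    # signed attestation message, as in the example file shown earlier.
    expected = {
        "i": hashlib.sha256(metadata.encode("utf-8")).hexdigest(),
        "m": hashlib.sha256(commit_message.encode("utf-8")).hexdigest(),
        "p": hashlib.sha256(patch.encode("utf-8")).hexdigest(),
    }
    return all(block.get(key) == value for key, value in expected.items())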
A test run of the system has already been done; Kees Cook generated an attestation message for this patch series. He said that this mechanism would be "utterly trivial" to add to his normal patch-generation workflow.
Jason Donenfeld, instead, was unconvinced of the value of this infrastructure. He argued that "maintainers should be reading commits as they come in on the mailing list" and that attestation would make the contribution process harder. He asked: "is the lack of signatures on email patches a real problem we're facing?"
Ryabitsev responded that he saw this mechanism as addressing two specific threats:
- An "
overworked, tired maintainer
" may be tempted to perform cursory reviews of patches from trusted developers; attestation at least lets them know that those patches actually came from their alleged author. - Maintainers might diligently review patches arriving in email, then use a tool like get-lore-mbox to fetch those patches for easy application. If lore.kernel.org has been compromised, it could return a modified form of those patches and the maintainer may well never notice. Once again, attestation should block any such attack.
He ended with a hope that the process he has developed is easy enough that developers will actually use it.
Whether that will actually happen remains to be seen. The use of signed tags on pull requests is still far from universal, despite the fact that they, too, are easy to generate and Linus Torvalds requires them for repositories not hosted on kernel.org. Based on past discussions, it seems unlikely that Torvalds will require attestation for emailed patches. So if patch attestation is to become widespread in the kernel community, it will be as a result of lower-level maintainers deciding that it makes sense. Of course, a successful attack could change attitudes quickly.
An end to high memory?
This patch from Johannes Weiner seemed like a straightforward way to improve memory-reclaim performance; without it, the virtual filesystem layer throws away memory that the memory-management subsystem thinks is still worth keeping. But that patch quickly ran afoul of a feature (or "misfeature" depending on who one asks) from the distant past, one which goes by the name of "high memory". Now, more than 20 years after its addition, high memory may be brought down low, as developers consider whether it should be deprecated and eventually removed from the kernel altogether.
A high-memory refresher
The younger readers out there may be forgiven for not remembering just what high memory is, so a quick refresher seems in order. We'll start by noting, for the oldest among our readers, that it has nothing to do with the "high memory" concept found on early personal computers. That, of course, was memory above the hardware-implemented hole at 640KB — memory that was, according to a famous quote often attributed to Bill Gates, surplus to the requirements of any reasonable user. The kernel's notion of high memory, instead, is a software construct, not directly driven by the hardware.
Since the earliest days, the kernel has maintained a "direct map", wherein all of physical memory is mapped into a single, large, linear array in kernel space. The direct map makes it easy for the kernel to manipulate any page in the system; it also, on somewhat newer hardware, is relatively efficient since it is mapped using huge pages.
A problem arose, though, as memory sizes increased. A 32-bit system has the ability to address 4GB of virtual memory; while user space and the kernel could have distinct 4GB address spaces, arranging things that way imposes a significant performance cost resulting from the need for frequent translation lookaside buffer flushes. To avoid paying this cost, Linux used the same address space for both kernel and user mode, with the memory protections set to prevent user space from accessing the kernel's portion of the shared space. This arrangement saved a great deal of CPU time — at least, until the Meltdown vulnerability hit and forced the isolation of the kernel's address space.
The kernel, by default, divided the 4GB virtual address space by assigning 3GB to user space and keeping the uppermost 1GB for itself. The kernel itself fits comfortably in 1GB, of course — even 5.x kernels are smaller than that. But the direct memory map, which is naturally as large as the system's installed physical memory, must also fit into that space. Early kernels could only manage memory that could be directly mapped, so Linux systems, for some years, could only make use of a bit under 1GB of physical memory. That worked for a surprisingly long time; even largish server systems didn't exceed that amount.
Eventually, though, it became clear that the need to support larger installed memory sizes was coming rather more quickly than 64-bit systems were, so something would need to be done. The answer was to remove the need for all physical memory to be in the direct map, which would only contain as much memory as the available address space would allow. Memory above that limit was deemed "high memory". Where the dividing line sat depended entirely on the kernel configuration and how much address space was dedicated to kernel use, rather than on the hardware.
In many ways, high memory works like any other; it can be mapped into user space and the recipients don't see any difference. But being absent from the direct map means that the kernel cannot access it without creating a temporary, single-page mapping, which is expensive. That implies that high memory cannot hold anything that the kernel must be able to access quickly; in practice, that means any kernel data structure at all. Those structures must live in low memory; that turns low memory into a highly contended resource on many systems.
64-Bit systems do not have the 4GB virtual address space limitation, so they have never needed the high-memory concept. But high memory remains for 32-bit systems, and traces of it can be seen throughout the kernel. Consider, for example, all of the calls to kmap() and kmap_atomic(); they do nothing on 64-bit systems, but are needed to access high memory on smaller systems. And, sometimes, high memory affects development decisions being made today.
Inode-cache shrinking vs. highmem
When a file is accessed on a Linux system, the kernel loads an inode structure describing it; those structures are cached, since a file that is accessed once will frequently be accessed again in the near future. Pages of data associated with that file are also cached in the page cache as they are accessed; they are associated with the cached inode. Neither cache can be allowed to grow without bound, of course, so the memory-management system has mechanisms to remove data from the caches when memory gets tight. For the inode cache, that is done by a "shrinker" function provided by the virtual filesystem layer.
In his patch description, Weiner notes that the inode-cache shrinker is allowed to remove inodes that have associated pages in the page cache; that causes those pages to also be reclaimed. This happens despite the fact that the inode-cache shrinker has no way of knowing if those pages are in active use or not. This is, he noted, old behavior that no longer makes sense:
Andrew Morton, it turns out, is the developer responsible for this behavior, which is driven by the constraints of high memory. Inodes, being kernel data structures, must live in low memory; page-cache pages, instead, can be placed in high memory. But if the existence of pages in the page cache can prevent inode structures from being reclaimed, then a few high-memory pages can prevent the freeing of precious low memory. On a system using high memory, sacrificing many pages worth of cached data may well be worth it to gain a few hundred bytes of low memory. Morton said that the problem being solved was real, and that the solution cannot be tossed even now; "a 7GB highmem machine isn't crazy and I expect the inode has become larger since those days".
The conversation took a bit of a turn, though, when Linus Torvalds interjected that "in the intervening years a 7GB highmem machine has indeed become crazy". He continued that high memory should now be considered to be deprecated: "In this day and age, there is no excuse for running a 32-bit kernel with lots of physical memory". Others were quick to add their support for this idea; removing high memory would simplify the memory-management code significantly, with no negative effects on the 64-bit systems that everyone is using now.
Except, of course, not every system has a 64-bit CPU in it. The area of biggest concern is the Arm architecture, where 32-bit CPUs are still being built, sold, and deployed. Russell King noted that there are a lot of 32-bit Arm systems with more than 1GB of installed memory being sold: "You're probably talking about crippling support for any 32-bit ARM system produced in the last 8 to 10 years".
Arnd Bergmann provided a rather more detailed look at the state of 32-bit Arm systems; he noted that there is one TI CPU that is being actively marketed with the ability to handle up to 8GB of RAM. But, he said, many new Arm-based devices are actually shipping with smaller installed memory because memory sizes up to 512MB are cheap to provide. There are phones out there with 2GB of memory that still need to be supported, though it may be possible to support them without high memory by increasing the kernel's part of the address space to 2GB. Larger systems still exist, he said, though systems with 3GB or more "are getting very rare". Rare is not the same as nonexistent, though.
The conversation wound down without any real conclusions about the fate of high-memory support. Reading between the lines, one might conclude that, while it is still a bit early to deprecate high memory, the pressure to do so will only increase in the coming years. In the meantime, though, nobody will try to force the issue by regressing performance on high-memory systems; the second version of Weiner's patch retains the current behavior on such machines. So users of systems needing high memory are safe — for now.
Unexporting kallsyms_lookup_name()
One of the basic rules of kernel-module development is that modules can only access symbols (functions and data structures) that have been explicitly exported. Even then, many symbols are restricted so that only modules with a GPL-compatible license can access them. It turns out, though, that there is a readily available workaround that makes it easy for a module to access any symbol it wants. That workaround seems likely to be removed soon despite some possible inconvenience for some out-of-tree users; the reason why that is happening turns out to be relatively interesting.
The backdoor in question is kallsyms_lookup_name(), which will return the address associated with any symbol in the kernel's symbol table. Modular code that wants to access a symbol ordinarily denied to it can use kallsyms_lookup_name() to get the address of its target, then dereference it in the usual ways. This function itself is exported with the GPL-only restriction, which theoretically limits its use to free software. But if a proprietary module somewhere were to falsely claim a free license to get at GPL-only symbols, it would not be the first time.
Will Deacon has posted a patch series that removes the export for kallsyms_lookup_name() (and kallsyms_on_each_symbol(), which is also open to abuse). There were some immediate positive responses; few developers are favorably inclined toward module authors trying to get around the export system, after all. There were, however, a couple of concerns expressed.
One of those is that there is, it seems, a class of out-of-tree users of kallsyms_lookup_name() that is generally considered to be legitimate: live-patching systems for the kernel. Irritatingly, kernel bugs often stubbornly refuse to restrict themselves to exported functions, so a live patch must be able to locate (and patch out) any function in the kernel; kallsyms_lookup_name() is a convenient way to do that. After some discussion, Joe Lawrence let it be known that the kpatch system has all of its needed infrastructure in the mainline kernel, and so has no further need for kallsyms_lookup_name(). The Ksplice system evidently still uses it but, as Miroslav Benes observed, "no one cares about ksplice in upstream now". So it would appear that live patching will not be an obstacle to the merging of this patch.
A different sort of concern was raised by Masami Hiramatsu, who noted that there are a number of other ways to find the address associated with a kernel symbol. User space could place some kprobes to extract that information, or a kernel module could, if time and CPU use are not a concern, use snprintf() with the "%pF" format (which prints the function associated with a given address) to search for the address of interest. He worried that the change would make life harder for casual developers while not really getting in the way of anybody who is determined to abuse the module mechanism.
In response, Deacon posted an interesting message about what is driving this particular change. Kernel developers are happy to make changes just to make life difficult for developers they see as abusing the system, but that is not quite what is happening here. Instead, it is addressing a support issue at Google.
Back in 2018, LWN reported on work being done to bring the Android kernel closer to the mainline. One of the steps in that direction is moving the kernel itself into the Android generic system image (GSI), an Android build that must boot and run on a device for that device to be considered compliant with the Android requirements. Putting the kernel into the GSI means that hardware vendors can no longer modify it; they will be limited to what they can do by adding kernel modules to the GSI.
Restricting vendors to supplying kernel modules greatly limits the kind of changes they can make; there will be no more Android devices that replace the CPU scheduler with some vendor's special version, for example. But that only holds if modules are restricted to the exported-symbol interface; if they start to reach into arbitrary parts of the kernel, all bets are off. Deacon doesn't say so, but it seems clear that some vendors are, at a minimum, thinking about doing exactly that. The business-friendly explanation for removing this capability is: "Monitoring and managing the ABI surface is not feasible if it effectively includes all data and functions via kallsyms_lookup_name()".
After seeing this explanation, Hiramatsu agreed that the patch makes sense and offered a Reviewed-by tag. So this concern, too, seems unlikely to prevent this patch set from being merged.
It's worth repeating that discouraging module developers from bypassing the export mechanism is generally seen as more than sufficient motivation to merge a change like this. But it is also interesting to see a large company supporting that kind of change as well. By more closely tying the Android kernel to the mainline, Google would appear to be better aligning its own interests with the long-term interests of the development community — on this point, at least. That, hopefully, will lead to better kernels on our devices that also happen to be a lot closer to mainline kernels.
Python time-zone handling
Handling time zones is a pretty messy affair overall, but language runtimes may have even bigger problems. As a recent discussion on the Python discussion forum shows, there are considerations beyond those that an operating system or distribution needs to handle. Adding support for the IANA time zone database to the Python standard library, which would allow using names like "America/Mazatlan" to designate time zones, is more complicated than one might think—especially for a language trying to support multiple platforms.
It may come as a surprise to some that Python has no support in the standard library for getting time-zone information from the IANA database (also known as the Olson database after its founder). The datetime module in the standard library has the idea of a "time zone" but populating an instance from the database is typically done using one of two modules from the Python Package Index (PyPI): pytz or dateutil. Paul Ganssle is the maintainer of dateutil and a contributor to datetime; he has put out a draft Python Enhancement Proposal (PEP) to add IANA database support as a new standard library module.
Ganssle gave a presentation at the 2019 Python Language Summit about the problem. On February 25, he posted a draft of PEP 615 ("Support for the IANA Time Zone Database in the Standard Library"). The original posted version of the PEP can be found in the PEPs GitHub repository.
The datetime.tzinfo abstract base class provides ways "to implement arbitrarily complex time zone rules", but he has observed that users want to work with three time-zone types: fixed offsets from UTC, the system time zone, and IANA time zones. The standard library supports the first type with datetime.timezone objects, and the second to a certain extent, but does not support IANA time zones at all.
There are some wrinkles to handling time zones, starting with the fact that they change—frequently. The IANA database is updated multiple times per year; "between 1997 and 2020, there have been between 3 and 21 releases per year, often in response to changes in time zone rules with little to no notice". Linux and macOS have packages with that information which get updated as usual, but the situation for Windows is more complicated. Beyond that, there is a question of what should happen in a running program when the time-zone information changes out from under it.
The PEP proposes adding a top-level zoneinfo standard library module with a zoneinfo.ZoneInfo class for objects corresponding to a particular time zone. A call like:
tz = zoneinfo.ZoneInfo("Australia/Brisbane")
will search for a corresponding Time Zone Information Format (TZif) file in various locations to populate the object. The zoneinfo.TZPATH list will be consulted to find the file of interest.
On Unix-like systems, that variable will be set to a list of the standard locations (e.g. /usr/share/zoneinfo, /etc/zoneinfo) where the time-zone data files are normally stored. On Windows, there is no official location for the system-wide time-zone information, so TZPATH will initially be empty. The PEP proposes that a data-only tzdata package be created for PyPI that would be maintained by the CPython core developers. That could be used on Windows systems to provide a source for the IANA database information.
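Conceptually, the lookup described above amounts to walking the search path and taking the first matching TZif file; the sketch below is only an illustration of that idea (find_tzif is a hypothetical helper, not part of the proposal, and zoneinfo here refers to the module the PEP proposes):
import os
import zoneinfo  # the module proposed by PEP 615

def find_tzif(key, tzpath=None):
    # Walk the configured search locations and return the first TZif file
    # whose relative path matches the requested key, e.g. "Australia/Brisbane".
    for base in (tzpath if tzpath is not None else zoneinfo.TZPATH):
        candidate = os.path.join(base, key)
        if os.path.isfile(candidate):
            return candidate
    return None  # fall back to the tzdata package, if it is installed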
By default, ZoneInfo objects would effectively be singletons; a cache would be maintained so that repeated uses of the same time-zone name would return the exact same object. That is not specifically being done for efficiency reasons, but to ensure that times in the same time zone will be handled correctly. The existing datetime arithmetic operations only consider time zones to be equal if they are the same object, not just if they contain the same information. But caching also protects running programs from strange behavior if the underlying time-zone data changes. Effectively, the data will be read once, on first use, and never change again until the interpreter is restarted.
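Assuming the interface described in the draft PEP, that caching behavior looks like this:
from zoneinfo import ZoneInfo

tz1 = ZoneInfo("Australia/Brisbane")
tz2 = ZoneInfo("Australia/Brisbane")
# The constructor consults a cache, so repeated lookups of the same key
# return the very same object, not merely an equal one.
assert tz1 is tz2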
There is support for loading time zones without consulting (or changing) the cache, as well as for clearing the cache, which would effectively reload the time zone for any new ZoneInfo object. But getting updates to time zones mid-stream is problematic in its own right, Ganssle said:
I will note that there is some precedent in this very area: local time information is only updated in response to a call to time.tzset(), and even that doesn’t work on Windows. The equivalent to calling time.tzset() to get updated time zone information would be calling ZoneInfo.clear_cache() to force ZoneInfo to use the updated data (or to always bypass the main constructor and use the .nocache() constructor).
But Florian Weimer was concerned that users would want those time-zone updates to automatically be incorporated, so he sees the caching behavior as problematic. "I do not think that users would want to restart their application (with a scheduled downtime) just to apply one of those updates." Ganssle acknowledged the concern, "but there are a lot of reasons to use the cache, and good reasons to believe that using the cache won’t be a problem". He went on to note that both pytz and dateutil already behave this way and he has heard no complaints. He also gave an example of surprising behavior without any caching:
>>> from datetime import *
>>> from zoneinfo import ZoneInfo
>>> dt0 = datetime(2020, 3, 8, tzinfo=ZoneInfo.nocache("America/New_York"))
>>> dt1 = dt0 + timedelta(1)
>>> dt2 = dt1.replace(tzinfo=ZoneInfo.nocache("America/New_York"))
>>> dt2 == dt1
True
Each call to ZoneInfo.nocache() will return a different object, even if the time-zone name is the same. So dt1 and dt2 have the same time-zone information, but different ZoneInfo objects. The two datetime objects compare "equal" (==) because they represent the same "wall time", but that does not mean that arithmetic operations will behave as one might expect:
>>> print(dt2 - dt1)
0:00:00
>>> print(dt2 - dt0)
23:00:00
>>> print(dt1 - dt0)
1 day, 0:00:00
March 8, 2020 is the day of the daylight saving time transition in the US, so adding one day (i.e. timedelta(1)) crosses that boundary. In a follow-up message, he explained more about the oddities of datetime math that are shown by the example:
[...] So dt2 - dt0 is treated as two different zones and the math is done in UTC, whereas dt1 - dt0 is treated as the same zone, and the math is done in local time.
dt1 will necessarily be the same zone as dt0, because it’s the result of an arithmetical operation on dt0. dt2 is a different zone because I bypassed the cache, but if it hit the cache, the two would be the same.
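For comparison, here is the same scenario using the normal, cache-consulting constructor (again assuming the draft PEP's interface); because every lookup returns the same object, the subtraction is done in wall time and gives the expected one-day difference:
>>> from datetime import datetime, timedelta
>>> from zoneinfo import ZoneInfo
>>> dt0 = datetime(2020, 3, 8, tzinfo=ZoneInfo("America/New_York"))
>>> dt1 = dt0 + timedelta(1)
>>> dt2 = dt1.replace(tzinfo=ZoneInfo("America/New_York"))
>>> dt2.tzinfo is dt0.tzinfo    # the cache hands back the same object
True
>>> print(dt2 - dt0)            # so the math is done in local ("wall") time
1 day, 0:00:00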
Using the pickle object-serialization mechanism on ZoneInfo objects was also discussed. The PEP originally proposed that pickling a ZoneInfo object would serialize all of the information from the object (e.g. all of the current and historical transition dates), rather than simply serializing the key (e.g. "America/New_York"). Only serializing the key could lead to problems when de-serializing the object with a different set of time-zone data (e.g. the "Asia/Qostanay" time zone was added in 2018).
But, as pytz maintainer Stuart Bishop pointed out, serializing all of the transition data is likely to lead to other, worse problems:
The PEP specifies that datetimes get serialized with all transition data. That seems unnecessary, as the transition data is reasonably likely to be wrong when it is de-serialized, and I can’t think of any use cases where you want to continue using the wrong data.
Ganssle agreed that it makes more sense to pickle ZoneInfo objects "by reference" (i.e. by time-zone name), though providing a way to also pickle "by value" for those who need or want it would be an option. Guido van Rossum had suggested an approach where a RawZoneInfo class would underlie ZoneInfo objects. Pickling a RawZoneInfo could be done by value. Ganssle liked that idea but thought that it could always be added later if there was a need for it; dateutil.tz already gives the by-value ability, so that could be used in the interim if needed.
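As a rough illustration of what pickling "by reference" means here (this is not the PEP's actual mechanism, which was still being settled, and it assumes the time-zone key is exposed as a .key attribute), a class can tell pickle to record only its key and rebuild the object through the normal, cache-consulting constructor when it is loaded again:
import pickle
from zoneinfo import ZoneInfo

class KeyOnlyZone(ZoneInfo):
    # A hypothetical subclass that always pickles by reference: only the
    # key is stored, and unpickling goes back through the constructor
    # (and therefore the cache, with whatever data is current then).
    def __reduce__(self):
        return (self.__class__, (self.key,))

tz = KeyOnlyZone("America/New_York")
restored = pickle.loads(pickle.dumps(tz))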
Overall, the reaction to the PEP seems quite favorable. Bishop said that he looks forward to "being able to deprecate pytz, making it a thin wrapper around the standard library when run with a supported Python". Ganssle is still working out some of the details, particularly around whether to automatically install the tzdata module for platforms where there is no system-supplied IANA database. It seems likely that we will soon see support for IANA time zones in Python—presumably in Python 3.9 in October.
Page editor: Jonathan Corbet