
LWN.net Weekly Edition for March 14, 2019

Welcome to the LWN.net Weekly Edition for March 14, 2019

This edition contains the following feature content:

  • Turris: secure open-source routers: how CZ.NIC's router-security research grew into a line of open-source routers.
  • Python dictionary "addition" and "subtraction": a proposed dictionary merge operator leads to PEP 584 and two long python-ideas threads.
  • Motivations and pitfalls for new "open-source" licenses: a FOSDEM talk on the recent wave of restrictive licenses.
  • 5.1 Merge window part 1: the first half of the changes merged for the 5.1 kernel.
  • Controlling device peer-to-peer access from user space: making PCIe peer-to-peer memory regions available from user space.
  • Leaderless Debian: what happens when nobody runs for Debian project leader.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Turris: secure open-source routers

By Jake Edge
March 13, 2019

SCALE

CZ.NIC, the registry for the Czech Republic's top-level domain, wondered about the safety of home routers, so it set out to gather some information on the prevalence of attacks against them. It turns out that one good way to do that is to create a home router that logs statistics and other information. Michal Hrušecký from CZ.NIC came to the 2019 Southern California Linux Expo (SCALE 17x) in Pasadena, CA to describe the experiment and how it grew into a larger project that makes and sells open-source routers.

CZ.NIC is legally an association of competing companies, but in reality it is run like a non-profit, Hrušecký said. Beyond just domain registration, CZ.NIC has various other activities around making the internet more accessible and secure. That includes projects like the BIRD internet routing daemon and the Knot DNS resolver, as well as books, translations, and even a television series on "How to handle the internet". Beyond that, the Czech Computer Security Incident Response Team (CSIRT) is part of CZ.NIC.

[Michal Hrušecký]

One of the other things it is doing is creating open-source home routers. It started because CZ.NIC wondered how safe home users are from network attacks. Are there active attacks against home users? And, if so, how frequent are they and what kinds of attacks are being made? To answer those questions, the organization started Project Turris to build a secure router that it gave away. These routers would monitor the network and report suspicious traffic back to the project. They also served as endpoints for some honeypots that the project was running.

CZ.NIC wanted to make the Turris router "the right way", he said, so the organization made it all open source. The router has automatic security updates and users are given root access on the device. It also sported some "interesting hardware", Hrušecký said; it had a two-core PowerPC CPU, 2GB of RAM, and 256MB of NAND flash.

Based on the information provided by the Turris routers, CZ.NIC researchers started publishing reports about what they were finding. That led some people to ask if they could get the routers themselves, because they felt that other router makers were "not doing things right". That led to the creation of commercial Turris routers: the Turris Omnia (which was reviewed here in 2016) and the upcoming Turris Mox. Those routers will still allow people to participate in the research if they choose to.

Building the routers with free and open-source software (FOSS) is really the only way to go, he said. The project knew that it was not going to be able to compete with small, cheap routers, so it created more capable routers that can run many different kinds of services. FOSS makes it easy to get started on a project like this because there is plenty of available software that can be easily integrated into the OS.

These routers allow users to do whatever they want, and some users believe they are more capable than they truly are, Hrušecký said. That means they break things in "really creative ways". Sometimes they will make custom changes, completely outside of the OS framework, which get overwritten with the next automatic update. These are "tricky problems" that the project would not have to handle if it locked its users out; in some "dark moments", he understands why some companies do that.

Another tricky piece is upstreaming, he said. Turris works on getting its code upstream, but it takes longer than "anyone would want". The project can take shortcuts that the upstream project will find lacking; upstream projects want the code to be polished and generalized, which takes time. The "upstream project" in this case is OpenWrt, which is a distribution for routers, but OpenWrt is optimized for routers with far fewer resources than the Turris routers have.

Typically, OpenWrt installs a highly compressed filesystem image into flash and has a small overlay where packages can be installed—generally only a few, however. Turris is not using the compressed image, but is instead using the "coolest filesystem for Linux": Btrfs. The project is using Btrfs snapshots and "went crazy" with them. It does a snapshot automatically weekly and before any update; it also allows manual snapshots. A "factory" reset can go back to the previous snapshot, the factory snapshot, or reflash the system.

Turris created its own web interface for the router, which is simpler than the standard OpenWrt interface. OpenWrt is targeted at more-technical users, while Turris wanted an interface for less-technical users that still allows them to use advanced features, such as VPNs or adding a guest WiFi network. Since Turris does things a bit differently, it sometimes runs into problems that OpenWrt does not have. In addition, OpenWrt packages are sometimes too trimmed down feature-wise, so Turris must build its own versions. Some packages use LXC containers, which may seem crazy, he said, but it does make sense in some cases; it requires a different kernel configuration from the standard OpenWrt kernels, though.

Hrušecký introduced the honeypots as a service (HaaS) project by saying: "Honeypots are cool, right? Everyone wants a honeypot at home." But it takes time to set up and maintain a honeypot and there is some risk, so why not have someone else run it for you? CZ.NIC will run the honeypot; users just need to run the HaaS proxy on their system, which will relay potentially malicious traffic (e.g. connections to the SSH port) to a honeypot on the HaaS server. It will simulate a device and record what the attacker sends. Users can then check out the attacks aimed at their server on the HaaS web site. HaaS is something that came from the security research but has now been separated out from the router project so it can be used elsewhere.

Turris Sentinel is a work in progress that will make some of the other security-research pieces available outside of the router framework. It will collect firewall logs and send them to a central location. It also has "minipots", which pretend to be a service on some port (e.g. telnet, HTTP), ask for login credentials that get logged, then close the connection. There was an earlier version of this, but it was closely tied to the Turris routers, so it has been rewritten to be more general. The Turris project was a bit surprised at how willing people are to provide this data to it. The data will be made available on the site eventually, but is currently being shared with the Czech and other countries' CSIRT organizations.
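
The minipot idea is simple enough to sketch in a few lines of Python. The following is only an illustrative toy, not the Turris implementation (which is its own daemon): it listens on an arbitrary port, prompts for credentials, logs whatever the client sends, then drops the connection.

    import logging
    import socketserver

    logging.basicConfig(filename="minipot.log", level=logging.INFO,
                        format="%(asctime)s %(message)s")

    class MiniPot(socketserver.StreamRequestHandler):
        """Pretend to be a login service: prompt, log the credentials, hang up."""

        def handle(self):
            peer = self.client_address[0]
            self.wfile.write(b"login: ")
            user = self.rfile.readline(256).strip()
            self.wfile.write(b"password: ")
            password = self.rfile.readline(256).strip()
            logging.info("connection from %s user=%r password=%r",
                         peer, user, password)
            self.wfile.write(b"Login incorrect\r\n")
            # The connection is closed when handle() returns.

    if __name__ == "__main__":
        # Port 2323 stands in for telnet's port 23, which needs privileges.
        with socketserver.ThreadingTCPServer(("0.0.0.0", 2323), MiniPot) as srv:
            srv.serve_forever()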

The project has integrated the Suricata open-source intrusion-detection system (IDS) and intrusion-prevention system (IPS). It can look deeply into the packets and log or block a network flow based on its rules. For unencrypted communication, it has access to all of the information exchanged. But even for encrypted connections, there is a fair amount of information that can be extracted from things like the IP and MAC addresses, parts of the certificate exchange, the length of the connection, and the amount of data transferred.

Suricata can be used to monitor untrusted devices and detect suspicious anomalies. There are open-source rules available to detect malware attacks, which can be used to block the traffic, for example. There is also the PaKon tool for Suricata that will aggregate information about the traffic on the network. It will alert when a new computer connects to the network. It will allow you to find out what your refrigerator is doing on the network when you are not at home, Hrušecký said with a grin.

Something that has come out of the research that CZ.NIC is doing is a "list of bad guys". If certain hosts are repeatedly attacking servers and routers that are reporting back, they will get added to the list, which is sent out to all of the routers. They can then block those IP addresses to reduce the malicious traffic they are handling.

Something that people have been asking for is Nextcloud support. It makes good sense, he said, because the Turris routers are the ultimate in self-hosting. They live at your home and you are root, so it is a natural fit. Turris is also working with Nextcloud on a device that will specifically target hosting that service. So far, much of the software side is working, though there are still some areas that need work.

One of the "little bit crazier" uses is to turn the router into a digital video recorder (DVR). Adding a USB DVB-T device and a disk drive gives you a DVR, he said. Adding TVHeadend along with Nextcloud turns the Turris device into a router and home server combo box.

Hrušecký demonstrated the HaaS interface and took questions at the end of the talk. He showed how you can look at the kinds of attacks that are being attempted against your router (but are actually being handled by the CZ.NIC HaaS server) including the credentials used, the commands the attacks are trying, and the locations where they are trying to download code from. The Turris routers cost around €300, he said in response to an attendee question. They are not directly available in the US, but that is being worked on; there is lots of paperwork that needs to be completed. Until then, he suggested looking on eBay and similar sites.

A YouTube video of the talk is available.

[I would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Pasadena for SCALE.]

Comments (33 posted)

Python dictionary "addition" and "subtraction"

By Jake Edge
March 13, 2019

A proposal to add a new dictionary operator for Python has spawned a PEP and two large threads on the python-ideas mailing list. To a certain extent, it is starting to look a bit like the "PEP 572 mess"; there are plenty of opinions on whether the feature should be implemented and how it should be spelled, for example. As yet, there has been no formal decision made on how the new steering council will be handling PEP pronouncements, though a review of open PEPs is the council's "highest priority". This PEP will presumably be added into the process; it is likely too late to be included in Python 3.8 even if it were accepted soon, so there is plenty of time to figure it all out before 3.9 is released sometime in 2021.

João Matos raised the idea in a post at the end of February. (That email is largely an HTML attachment, which did not get archived cleanly; its contents can be seen here or quoted by others in the thread.) In that message, he suggested adding two new operators for dictionaries: "+" and "+=". They would be used to merge two dictionaries:

    tmp = dict_a + dict_b      # results in all keys/values from both dict_a and dict_b
    dict_a += dict_b           # same but done "in place"

In both cases, any key that appears in both dictionaries would take its value from dict_b. There are several existing ways to perform this "update" operation, including using the dictionary unpacking operator "**" specified in PEP 448 ("Additional Unpacking Generalizations"):

    tmp = { **dict_a, **dict_b }     # like tmp = dict_a + dict_b
    dict_a = { **dict_a, **dict_b }  # like dict_a += dict_b

Or, as Rhodri James pointed out, one can also use the dictionary update() method:

    tmp = dict_a.copy(); tmp.update(dict_b)  # like tmp = dict_a + dict_b
    dict_a.update(dict_b)                    # like dict_a += dict_b

Matos's idea drew the attention of Guido van Rossum, who liked it. It is analogous to the "+=" operator for mutable sequence types, he said.

This is likely to be controversial. But I like the idea. After all, we have `list.extend(x)` ~~ `list += x`. The key conundrum that needs to be solved is what to do for `d1 + d2` when there are overlapping keys. I propose to make d2 win in this case, which is what happens in `d1.update(d2)` anyways. If you want it the other way, simply write `d2 + d1`.

There were some questions about non-commutative "+" operators (i.e. where a+b is not the same as b+a), but several pointed out that there are a number of uses of the "+" operator in Python that are not commutative, including string concatenation (e.g. 'a'+'b' is not the same as 'b'+'a'). Steven D'Aprano suggested that instead of "addition", the operation should be treated like a set union, which uses the "|" operator. He had some other thoughts on new dictionary operators as well:

That also suggests d1 & d2 for the intersection between two dicts, but which value should win?

More useful than intersection is, I think, dict subtraction: d1 - d2 being a new dict with the keys/values from d1 which aren't in d2.
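
The semantics under discussion are easy to prototype with a small dict subclass; the following is only an illustrative sketch of the proposed behavior (the right-hand operand wins on key collisions, and subtraction keeps the keys that do not appear on the right-hand side), not code from the PEP:

    class MergeDict(dict):
        """Toy dict subclass approximating the proposed "+" and "-" semantics."""

        def __add__(self, other):
            # Merge: keys from both sides; values from 'other' win on collisions.
            result = MergeDict(self)
            result.update(other)
            return result

        def __sub__(self, other):
            # Subtraction: keep only the keys that do not appear in 'other'.
            return MergeDict((k, v) for k, v in self.items() if k not in other)

    d1 = MergeDict(a=1, b=2)
    d2 = MergeDict(b=3, c=4)
    print(d1 + d2)   # {'a': 1, 'b': 3, 'c': 4}
    print(d1 - d2)   # {'a': 1}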

All of this hearkens back to a 2015 python-ideas thread that was summarized here at LWN. As a comment from Raymond Hettinger on a Python bug tracker issue notes, however, Van Rossum rejected the idea back in 2015. Some things have changed since then: for one, Van Rossum is no longer the final decision-maker on changes to the language (though his opinion still carries a fair amount of weight, of course), but he has also changed his mind on the feature.

Hettinger made it clear that he opposes adding a new "+" operator, both in another comment on the bug and in a python-ideas post. He does not see this operation as being "additive" in some sense, so using "+" does not seem like the right choice. In addition, there are already readable alternatives, including using the collections.ChainMap() class:

[...] I'm not sure we actually need a short-cut for "d=e.copy(); d.update(f)". Code like this comes-up for me perhaps once a year. Having a plus operator on dicts would likely save me five seconds per year.

If the existing code were in the form of "d=e.copy(); d.update(f); d.update(g); d.update(h)", converting it to "d = e + f + g + h" would be a tempting but algorithmically poor thing to do (because the behavior is quadratic). Most likely, the right thing to do would be "d = ChainMap(e, f, g, h)" for a zero-copy solution or "d = dict(ChainMap(e, f, g, h))" to flatten the result without incurring quadratic costs. Both of those are short and clear.
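
ChainMap is already in the standard library, so the zero-copy alternative can be tried directly; note that a ChainMap gives priority to the leftmost mapping, so matching the "last update wins" behavior of update() means listing the mappings in reverse order. A brief example (not from the thread):

    from collections import ChainMap

    d1 = {"a": 1, "b": 2}
    d2 = {"b": 3, "c": 4}

    # To get the same result as "tmp = d1.copy(); tmp.update(d2)", where d2's
    # values win, d2 must come first in the ChainMap.
    merged_view = ChainMap(d2, d1)   # zero-copy view over both dicts
    flattened = dict(merged_view)    # one flattened dict, copied once

    print(merged_view["b"])          # 3 -- d2's value wins
    print(flattened)                 # {'a': 1, 'b': 3, 'c': 4}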

Several thought "|" made more sense than "+", but that was not universal. For the most part, any opposition to the idea is similar to Hettinger's: dictionary addition is unneeded and potentially confusing. It also violates "some of the unifying ideas about Python", Hettinger said.

Van Rossum suggested that D'Aprano write a PEP for a new dictionary operator. Van Rossum also thought the PEP should include dictionary subtraction for consideration, though that idea seemed to have even less support among participants in the thread. D'Aprano did create PEP 584 ("Add + and - operators to the built-in dict class"); that post set off another long thread.

As part of that, Jimmy Girardet asked what problem is being solved by adding a new operator, given that there are several alternatives that do the same thing. Stefan Behnel noted that "+" for lists and tuples, along with "|" for sets, provide a kind of basic expression for combining those types, but that dictionaries lack that, which is "a gap in the language". Adding such an operator would "enable the obvious syntax for something that should be obvious". Van Rossum has similar thoughts:

[...] even if you know about **d in simpler contexts, if you were to ask a typical Python user how to combine two dicts into a new one, I doubt many people would think of {**d1, **d2}. I know I myself had forgotten about it when this thread started! If you were to ask a newbie who has learned a few things (e.g. sequence concatenation) they would much more likely guess d1+d2.

There is an implementation of the PEP available, but it is still seemingly a long way from being accepted—if it ever is. The PEP itself may need some updating. It would seem that the arguments for switching to "|" may be gaining ground. Early on, Van Rossum favored "+", but there have been some strong arguments in favor of switching. More recently, Van Rossum said that he is "warming up to '|' as well". Beyond his early arguments, few have strongly advocated "+" over "|".

There have also been discussions of corner cases and misunderstandings that might arise from the operators. But the main objections seem to mostly fall, as they did in the PEP 572 "discussion", on the question of each person's "vision" for the language. It could certainly be argued that the PEP 572 assignment expression (i.e. using ":=" for assignments within statements) is a more radical change to the overall language, but many of the arguments against dictionary "addition" sound eerily familiar.

It is, as yet, unclear how PEPs will be decided; that will be up to the steering council. It may well be somewhat telling that none of the other four steering council members have been seriously involved in the discussion, but it is also the case that many may have tuned out python-ideas as a forum that is difficult to keep up with. Only time will tell what happens with PEP 584 (and the rest of the PEPs that are pending).

Comments (17 posted)

Motivations and pitfalls for new "open-source" licenses

March 12, 2019

This article was contributed by Tom Yates


FOSDEM

One of the bigger developments of the last year has been the introduction of licenses that purport to address perceived shortcomings in existing free and open-source software licenses. Much has been said and written about them, some of it here, and they are clearly much on the community's mind. At FOSDEM 2019, Michael Cheng gave his view on the motivations for the introduction of these licenses, whether they've been effective in addressing those motivations, what unintended consequences they may also have had, and the need for the community to develop some ground rules about them going forward.

In the past year we have seen several unusual new licenses, the Server Side Public License (SSPL), the Commons Clause license addendum, the CockroachDB Community License, and the Confluent Community License among them. All either perturb the historical copyleft norm of "you must distribute derivative works under the same license" by extending the scope past what's covered under the definition of a derivative work, or they exclude some historically permitted form of activity such as building similar works or making money. These developments have been of concern to many; talks at FOSDEM and the immediately-following Copyleft Conference with titles like "Redis Labs and the tragedy of the Commons Clause", "Who wants you to think nobody uses the AGPL and why", and "What is the maximum permissible scope for copyleft?" leave little room to doubt how many people are mulling over them.

[Michael Cheng]

Cheng is a lawyer at Facebook who helps manage the company's open-source program, though, as is traditional, we were warned that the talk contained only his own opinions. But those opinions drive day-to-day licensing decisions inside Facebook, so they are of interest. He started off by reminding us that there is a class of technology that lives deep in the cloud stack, as part of the modern equivalent of the once-ubiquitous LAMP stack. It is increasingly developed by companies that offer a basic version of their product as free software and reserve a non-free, more featureful version for paying customers, which is a business model known as "open core". Other companies have taken the free versions of such software and founded profitable cloud-hosting businesses based on it.

To Cheng's mind, the driver for the new licenses has been a perception that the cloud providers aren't playing fair, a point of view he supported with quotes such as the CEO of MongoDB saying: "Once an open source project becomes interesting or popular, it becomes too easy for the cloud vendors to capture all the value and give nothing back to the community". This perception, he says, then gets used by open-core companies as justification to react; some of them have reacted by producing new licenses. The questions that arose in his mind are whether the cloud providers are violating the AGPL (and other free licenses) and whether the users have any moral obligation to contribute back to the project over and above what the license requires.

His problem with the first question is that if companies are currently violating the free licenses on the open-core technologies they're using, then the one thing that won't help at all is a new license, since it presumably will be adhered to no better than its predecessors. As for the second, he believes that the contribution norms are clearly stated in the licenses themselves; the circumstances under which you must share any contributions you make, and the terms under which you must share them, are spelled out in those licenses. He is unpersuaded by claims that there are traditional community norms that require contribution over and above what's stated in the license.

He also thinks that people feeling aggrieved by how open core has worked out for them are overlooking some of their own business decisions. Each of them decided to go with open core, each decided which software would be given away and which kept proprietary, and each decided which license to pick. They made those business decisions, things didn't turn out as expected, and it seems to him that now some of them are trying to move the goalposts.

Some have raised the issue of basic fairness. It is true, Cheng conceded, that open-core companies have often invested a lot in their products. But if those products have become popular, it is substantially because people thought the core was "free". Fairness is not an issue, because fairness is built into the license itself; the license is our expression of what's fair. The real issue is the economic realities of the industry, in support of which assertion Cheng quoted an "(ex-)founder" of RethinkDB, who said:

Open-core didn't work because the space is so crowded with high-quality options that you have to give away enormous amount of functionality for free to get adoption. Given how complex distributed database products are, by the time you get to building a commercial edition you're many years in and very short on cash.

He also reminded us that other industries that have been disrupted, specifically the music recording, film, and newspaper industries, have also at times complained about fairness. When they did so, what they were trying to do was make a play on our values.

Cheng then drilled into the specifics of the licenses listed earlier, and looked at the shortcomings of each. Particular attention was reserved for the Confluent Community License and the SSPL. The former he found to have a lot of the "look and feel" of open-source licenses, even though its own FAQ states that it's not an open-source license; he ascribes the discrepancy to simple, blatant marketing. As for the SSPLv1, he examined the license's land grab of what it defines as "Service Source Code", which must be distributed under the SSPL, noting that it includes vast amounts of code that in copyright terms is in no way derived from the licensed work. It generously exempts the operating system itself, but to Cheng's mind that is because it would otherwise be practically impossible to honor, so nobody would ever use software covered by it. The SSPLv2 is better, but only slightly so, and the changes were motivated by the unfulfilled hope that the Open Source Initiative would approve the license.

Following these debates infuriates him, he said, because they are not changing anyone's behavior. In the course of his work, he gets many inquiries from Facebook people asking him if they can use a particular piece of technology. His normal workflow is such, he said, that if it takes him more than five seconds to work out whether he can use a given piece of technology, he probably won't bother. The only result of this whole exercise in new licenses has been to decrease adoption of the technology covered by them. Specifically, projects governed by the SSPL, the CockroachDB or Confluent licenses, or any license with a Commons Clause addendum are prohibited from any kind of use anywhere at Facebook.

Having earlier warned us to retain spare soft food items for use as projectiles, Cheng grasped the nettle and talked about Facebook's own foray into the world of interesting new licenses. He noted that the company didn't agree with the community's interpretation of the BSD+patents license, but ultimately it didn't matter. Facebook doesn't make any money from React; it works on React to show off its talent and to impress the community. The community's thoughts are important, so when the company had to choose between retaining the community's respect and having a long debate about the new license, relicensing was an easy choice.

That is not the case with the open-core companies; their lifeblood is the sale of their enterprise licenses and they are consequently a lot less responsive to community disapproval. Cheng's particular fear is that the relicensing strategy becomes attractive and that the domino effect takes us from a few additional weird licenses to hundreds of them, most of which are understood by nearly nobody. Open core is here to stay, so he feels it needs rules, and the purpose of his talk (like many of the others on the subject) is to help forge community consensus about those rules. Ultimately, he hopes to work toward an ecosystem where new licenses are not needed, or are clearly not beneficial and so they aren't created.

His suggested guidelines to make it all work are:

  • You can't have your cake and eat it, too: If you choose to make open-core products you must clearly distinguish between the open and closed bits of your offering.
  • Don't confuse people by using licenses as marketing tools: They should be clear statements of what you can do, what you can't do, and what you must do.
  • Be transparent.
  • Seek consultation.
  • Seek technical and business solutions to problems: The legal and, specifically, license-based solutions aren't so great.

There were several interesting questions. One attendee asked whether a provision for maintenance costs could be built into a license to ensure the future for a piece of vital core software. Cheng thought it might be possible, but he couldn't immediately see how and didn't think it was the right way to go. He also particularly disliked the use of trademarks to enforce behavior, because there are traditional ways to use trademarks and free-software users are not using a trademark in any of them.

In summary, license proliferation for bad reasons is something that's upon us. It's good that we appear to be worrying about it and hopefully out of all this heat some light will come to illuminate the community's preferred path for companies undertaking open-core development. It's also worth noting that companies can only switch licenses like this because all the external contributors to their projects are required to either transfer the copyright in their contributions to the project or to license their contributions under terms that permit license changes. Next time someone tells you you have to sign a Contributor License Agreement, it may be worth pausing to wonder what that might mean down the road.

The video of the talk should eventually be up here; according to Cheng the FOSDEM person responsible is in the middle of moving to South America, which is quite reasonably having knock-on effects on the availability of the video.

[We would like to thank LWN's travel sponsor, the Linux Foundation, for travel assistance to Brussels for FOSDEM.]

Comments (1 posted)

5.1 Merge window part 1

By Jonathan Corbet
March 8, 2019
As of this writing, 6,135 non-merge changesets have been pulled into the mainline repository for the 5.1 release. That is approximately halfway through the expected merge-window volume, which is a good time for a summary. A number of important new features have been merged for this release; read on for the details.

Core kernel

  • Support for the ancient a.out executable format has been deprecated, with an eye toward removing it entirely later this year if nobody objects.
  • The BPF verifier has gained the ability to detect and remove dead code from BPF programs.
  • The BPF spinlock patches have been merged, giving BPF programs increased control over concurrency.
  • There is a new prctl() option for controlling speculative execution; PR_SPEC_DISABLE_NOEXEC disables enough speculation to block the speculative store bypass vulnerability, but only until the process calls exec().
  • A whole new set of system calls using 64-bit time values has been added; they are part of the kernel's year-2038 preparation. See this commit for the full list.
  • A new sysctl knob (kernel/sched_energy_aware) can be used to disable energy-aware scheduling. This feature is already disabled on most platforms, but by default it will be turned on for asymmetric systems where it makes sense. There are also two new documents on energy-aware scheduling and the energy model framework.
  • There is a new F_SEAL_FUTURE_WRITE operation for memfd regions that allows the caller to continue to write to the region but prevents any others from doing so. This feature supports an Android use case; see this commit for some more information. A rough user-space sketch follows this list.
  • The timer-events oriented CPU-frequency governor has been merged; it should yield better power-saving results on systems with tickless scheduling enabled.
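
Regarding the F_SEAL_FUTURE_WRITE item above: as a rough idea of how the new seal behaves from user space, here is a short Python sketch. It assumes a 5.1-or-later kernel; os.memfd_create() requires Python 3.8 or later, and the sealing constants are filled in by hand from the Linux UAPI headers in case the fcntl module does not export them, so treat those numeric values as assumptions to verify locally.

    import fcntl
    import mmap
    import os

    # Hand-defined fallbacks for older Pythons (values from the UAPI headers).
    F_ADD_SEALS = getattr(fcntl, "F_ADD_SEALS", 1033)
    F_SEAL_FUTURE_WRITE = getattr(fcntl, "F_SEAL_FUTURE_WRITE", 0x0010)

    fd = os.memfd_create("demo", os.MFD_CLOEXEC | os.MFD_ALLOW_SEALING)
    os.ftruncate(fd, 4096)

    # Map the region writable *before* sealing; this mapping keeps working.
    writable = mmap.mmap(fd, 4096, prot=mmap.PROT_READ | mmap.PROT_WRITE)

    fcntl.fcntl(fd, F_ADD_SEALS, F_SEAL_FUTURE_WRITE)

    writable[:5] = b"hello"   # still allowed through the pre-existing mapping

    try:
        # Any new shared, writable mapping of the sealed region is refused.
        mmap.mmap(fd, 4096, prot=mmap.PROT_READ | mmap.PROT_WRITE)
    except OSError as err:
        print("new writable mapping rejected:", err)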

Filesystems

  • The Btrfs filesystem now supports the use of multiple ZSTD compression levels. See this commit for some information about the feature and the performance characteristics of the various levels.

Hardware support

  • Audio: Cirrus Logic CS4341 and CS35L36 codecs, Rockchip RK3328 audio codecs, NXP pulse density modulation microphone interfaces, Mediatek MT8183 audio platforms, MediaTek MT6358 codecs, Qualcomm WCD9335 codecs, and Ingenic JZ4725B codecs.
  • Industrial I/O: Sensirion SPS30 particulate matter sensors, Plantower PMS7003 particulate matter sensors, Texas Instruments ADS124S08 analog-to-digital converters, Analog Devices ad7768-1 analog-to-digital converters, Analog Devices AD5674R/AD5679R digital-to-analog converters, STMicroelectronics LPS22HH pressure sensors, Nuvoton NPCM analog-to-digital converters, Invensense ICM-20602 motion trackers, Maxim MAX44009 ambient light sensors, and Texas Instruments DAC7612 digital-to-analog converters.
  • Miscellaneous: STMicroelectronics FMC2 NAND controllers, Amlogic Meson NAND flash controllers, ROHM BD70528 power regulators, Maxim MAX77650/77651 power regulators, Loongson-1 interrupt controllers, Broadcom STB reset controllers, OP-TEE-based random number generators, Habana AI accelerators, Mediatek GNSS receivers, and NVIDIA Tegra combined UARTs.
  • Networking: NXP ENETC gigabit Ethernet controller physical and virtual devices, and MediaTek MT7603E (PCIe) and MT76x8 wireless interfaces.
  • SPI: Freescale quad SPI controllers, NXP Flex SPI controllers, and SiFive SPI controllers.
  • USB: Marvell A3700 comphy and UTMI PHYs and Cadence D-PHYs.

Networking

  • There is a new socket option called "SO_BINDTOIFINDEX": "It behaves similar to SO_BINDTODEVICE, but takes a network interface index as argument, rather than the network interface name." See this changelog for more information; a minimal usage sketch appears after this list.
  • As part of the "make WiFi fast" effort, the mac80211 layer has gained an awareness of airtime fairness and the ability to share the available airtime across stations.
  • The year-2038-safe socket timestamp options described in this article have been merged.
  • The new "devlink health" mechanism provides notifications when an interface device has problems. See this merge commit and this documentation patch for details.
  • The mac80211 layer now has support for multiple BSSIDs (MAC addresses, essentially) for wireless devices.
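
Regarding SO_BINDTOIFINDEX above: as a quick illustration of the difference from SO_BINDTODEVICE, here is what using the new option could look like from Python. The socket module does not yet export the constant in interpreters of this vintage, so the numeric value is supplied by hand from the asm-generic UAPI header (an assumption worth checking against your kernel headers), and "eth0" is only a placeholder interface name.

    import socket

    # SO_BINDTOIFINDEX is new in Linux 5.1; fall back to the asm-generic value.
    SO_BINDTOIFINDEX = getattr(socket, "SO_BINDTOIFINDEX", 62)

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    # Unlike SO_BINDTODEVICE, which takes the interface *name* as a string,
    # this option takes the numeric interface index.
    ifindex = socket.if_nametoindex("eth0")
    sock.setsockopt(socket.SOL_SOCKET, SO_BINDTOIFINDEX, ifindex)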

Internal kernel changes

  • The ancient get_ds() function, which originally retrieved the value of the %ds segment on x86 systems, has been removed; all in-tree callers (of which there are few) have been changed to simply use the KERNEL_DS constant instead.
  • Much of the API for dealing with atomic variables is now automatically generated from a set of descriptions; the intent is to provide better consistency across architectures and to make it easier to add instrumentation. This commit gives an overview of how it all works.
  • The definitions of the GFP_* memory-allocation flags have been changed to allow the most frequently used bits to be represented more compactly. According to this commit, that reduces the size of Arm kernels by about 0.1%.
  • Memory compaction has been reworked, resulting in significant improvements in success rates and CPU time required.
  • The new "on-chip interconnect API" provides a generic mechanism for the control of connections within complex peripheral devices; see Documentation/interconnect/interconnect.rst for details.

The 5.1 merge window will probably stay open through March 17, with the final 5.1 release due at the beginning of May. As always, we'll catch up with the rest of the merge-window activity once it has closed; stay tuned.

Comments (27 posted)

Controlling device peer-to-peer access from user space

March 7, 2019

This article was contributed by Marta Rybczyńska

The recent addition of support for direct (peer-to-peer) operations between PCIe devices in the kernel has opened the door for different use cases. The initial work concentrated on in-kernel support and the NVMe subsystem; it also added support for memory regions that can be used for such transfers. Jérôme Glisse recently proposed two extensions that would allow the mapping of those regions into user space and mapping device files between two devices. The resulting discussion surprisingly led to consideration of the future of core kernel structures dealing with memory management.

Some PCIe devices can perform direct data transfers to other devices without involving the CPU; support for these peer-to-peer transactions was added to the kernel for the 4.20 release. The rationale behind the functionality is that, if the data is passed between two devices without modification, there is no need to involve the CPU, which can perform other tasks instead. The peer-to-peer feature was developed to allow Remote Direct Memory Access (RDMA) network interface cards to pass data directly to NVMe drives in the NVMe fabrics subsystem. Using peer-to-peer transfers lowers the memory bandwidth needed (it avoids one copy operation in the standard path from device to system memory, then to another device) and CPU usage (the devices set up the DMAs on their own). While not considered directly in the initial work, graphics processing units (GPUs) and RDMA interfaces have been able to use that functionality in out-of-tree modules for years.

The merged work concentrated on support at the PCIe layer. It included setting up special memory regions and the devices that will export and use those regions. It also provides a way to find out whether the PCIe topology allows peer-to-peer transfers.

The new work by Glisse adds support for peer-to-peer memory regions managed from user space. An important aspect is adding support for heterogeneous memory management (HMM), of which Glisse is the author, to access such memory, but the functionality also covers devices without HMM. The current prototype implementation connects a network card (Mellanox mlx5) to a GPU (AMD); Glisse expects to use it in AMD ROCm RDMA. He mentioned a number of possible use cases, including one device controlling another device's command queue. An example of this situation is a network card accessing a block device command queue so that it can submit storage transactions without the CPU's involvement. Another is direct memory access between two devices, where the memory is not even mapped to the CPU. In this case, the computation on one device runs without interaction with the other one. Examples include a network device monitoring the progress of a device and avoiding use of the system memory, or a network device streaming results from an accelerator.

The patch set in its current state does not include any drivers that actually use the feature. This made it harder for other developers to understand how the code is expected to be used, and led to comments that users are needed before the code can be merged. Similar comments appeared elsewhere in the discussion. Glisse provided some code examples from other branches, but it seems that examples of drivers using the feature will have to appear in future versions of the patch set so that it can be seriously considered for a merge. Part of this is related to the fact that many developers remember the difficulties after HMM was initially merged and would rather avoid a repeat.

Interconnect topologies

Currently, PCIe peer-to-peer transfers are only allowed when two devices are attached to the same bridge. Glisse added helper functions to simplify the checks in drivers and support any other interconnect in the process. In the discussion that followed, Greg Kroah-Hartman commented that there is no peer-to-peer concept in the device model; the implementation only covers PCIe for now, so he thought that the changes were premature. Glisse agreed to concentrate on PCIe for now. It seems likely that the device model will get richer when other interconnects start using the peer-to-peer functionality.

Extending vm_operations_struct

One of the patches added two new callbacks to struct vm_operations_struct, a core kernel structure collecting callbacks for virtual memory area (VMA) operations. The two proposed callbacks are p2p_map() and p2p_unmap(); respectively, they map and unmap a peer-to-peer region. Rather than mapping the region into a process's address space, though, these functions instruct the device to make the regions available for peer-to-peer access. A common use case would be for a user-space process to map a device memory region with mmap(), then pass that region to a second device to set up the peer-to-peer connection. That second device would then call the new methods to manage that mapping.

Logan Gunthorpe asked about the possible use of the existing fault() callback that would map peer-to-peer memory if appropriate. Glisse answered that the two new callbacks should be called by the mmap() callbacks of the exporting device driver. Instead of mapping to a typical process space (as fault() would do), the goal is to map memory between two device drivers. This task may include complex conditions and allow accessing some memory by a peer device, but not all of it.

The resulting memory region works as a connection between the two sides of the peer-to-peer setup. It will not necessarily be available to user space (or even the CPU itself). Glisse explained that, in the GPU case, it is easier to add a specific callback: the exporting device can check if the peer-to-peer operation is possible and then allow this operation, or use main memory instead. Setting up a fault handler, instead, would require numerous additional flags and a fake page table to be filled and then interpreted by the other device, according to Glisse.

Views of device memory

The discussion highlighted some differences in the handling of special memory in the GPU and RDMA subsystems. It started with Glisse explaining the GPU point of view: the driver often updates the device page tables that map memory within the GPU itself, meaning that the CPU's mappings of GPU memory are invalid most of the time. There is a need for a method to tell the exporting driver that another device wants to map one of its memory regions. If possible, the GPU driver should avoid remapping the affected zone while it is mapped. The exporting device must be able to decide whether it wants to authorize that mapping. Glisse noted that he also wants to use the API in case of interconnected GPUs. In this case, the CPU page tables are invalid and the physical addresses are the only meaningful ones; the kernel has no information about what the addresses are. Glisse also gave an overview of the GPU usage of HMM and how things work without HMM.

One significant subthread of the discussion had to do with whether device peer-to-peer memory should be represented in the CPU's memory map with page structures. For hardware where HMM is in use, device memory behaves (mostly) like ordinary memory and, thus, has those structures. Jason Gunthorpe commented, though, that in the case where HMM is not applicable (many RDMA settings, for example), there are no page structures for this memory; he would like things to stay that way. Attempts to use page structures for this memory, he said, always run into trouble.

Christoph Hellwig replied that not having page structures can create even more trouble; some functionalities, like get_user_pages() or direct I/O to the underlying device memory, just do not work without them. Deeper in the discussion, he listed three uses of struct page in the kernel: to keep the physical address, to keep a reference of memory so that it does not go away while it is still needed, and to set the dirty flag on the PTE after writing to that memory. Any solution that avoids struct page would have to solve those problems first, he said.

get_user_pages(), which maps user-space memory into the kernel's address space, is frequently called by drivers to access that memory. Whether it works or not will have a significant impact on how peer-to-peer memory is used. Some developers would like to see this memory act like ordinary memory, with associated page structures, that would be mappable with get_user_pages(). However, peer-to-peer memory is I/O memory that requires special handling, Jason Gunthorpe noted, so it can never look entirely like ordinary system memory. Glisse would rather avoid supporting get_user_pages() for that memory altogether. In the case of GPUs, it is not needed, he noted. Things turn out to be more complicated for the other potential user, the RDMA layer, though. The developers discussed other options like a special version of get_user_pages() for DMA-only mappings.

Jason Gunthorpe commented on the current state of the peer-to-peer implementation, which implements a type of scatter-gather list (SGL) that references DMA memory only — there is no mapping for the CPU. This is different than the standard in-kernel SGLs that hold both the CPU and DMA addresses. The NVMe and RDMA layers got fixes to support the special SGLs, but he fears that some RDMA drivers may still break because they won't understand those specific SGL semantics. Making get_user_pages() work for all those cases will be a big job and there are conflicting requirements, he said. He also suggested promoting the peer-to-peer, DMA-only, scatter-gather lists to general-use kernel structures.

Next steps

The developers have not found a solution to all of the mentioned problems yet. There are arguments for both keeping and avoiding struct page. Using page structures would allow the use of O_DIRECT and other APIs, but would require much additional work and fixing all issues in the process. On the other hand, avoiding struct page will lead to a type of special memory zone that needs to be handled in a particular way. The decision seems to have potentially important consequences.

The choice has not yet been made, and it seems that there will be more discussion needed before there is a solution the developers can agree on. In addition to that, future submissions of this patch set will probably need to include examples of the API usage so that the developers can understand them better. It seems likely that the peer-to-peer memory will be available from user space some time in the future, but there is still important work to be done first.

Comments (2 posted)

Leaderless Debian

By Jonathan Corbet
March 11, 2019
One of the traditional rites of the (northern hemisphere) spring is the election for the Debian project leader. Over a six-week period, interested candidates put their names forward, describe their vision for the project as a whole, answer questions from Debian developers, then wait and watch while the votes come in. But what would happen if Debian were to hold an election and no candidates stepped forward? The Debian project has just found itself in that situation and is trying to figure out what will happen next.

The Debian project scatters various types of authority widely among its members, leaving relatively little for the project leader. As long as they stay within the bounds of Debian policy, individual developers have nearly absolute control over the packages they maintain, for example. Difficult technical disagreements between developers are handled by the project's technical committee. The release managers and FTP masters make the final decisions on what the project will actually ship (and when). The project secretary ensures that the necessary procedures are followed. The policy team handles much of the overall design for the distribution. So, in a sense, there is relatively little leading left for the leader to do.

The roles that do fall to the leader fit into a couple of broad areas; the first of those is representing the project to the rest of the world. The leader gives talks at conferences and manages the project's relationships with other groups and companies. The second role is, to a great extent, administrative; the leader manages the project's money, appoints developers to other roles within the project, and takes care of details that nobody else in the project is responsible for. Leaders are elected to a one-year term; for the last two years, this position has been filled by Chris Lamb. His February "Bits from the DPL" posting gives a good overview of what sorts of tasks the leader is expected to carry out.

The Debian constitution describes the process for electing the leader. Six weeks prior to the end of the current leader's term, a call for candidates goes out. Only those recognized as Debian developers are eligible to run; they get one week to declare their intentions. There follows a three-week campaigning period, then two weeks for developers to cast their votes. This being Debian, there is always a "none of the above" option on the ballot; should this option win, the whole process restarts from the beginning.

This year, the call for nominations was duly sent out by project secretary Kurt Roeckx on March 3. But, as of March 10, no eligible candidates had put their names forward. Lamb has been conspicuous in his absence from the discussion, with the obvious implication that he does not wish to run for a third term. So, it would seem, the nomination period has come to a close and the campaigning period has begun, but there is nobody there to do any campaigning.

This being Debian, the constitution naturally describes what is to happen in this situation: the nomination period is extended for another week. Any Debian developers who procrastinated past the deadline now have another seven days in which to get their nominations in; the new deadline is March 17. Should this deadline also pass without candidates, it will be extended for another week; this loop will repeat indefinitely until somebody gives in and submits their name.

Meanwhile, though, there is another interesting outcome from this lack of candidates: the election of a new leader, whenever it actually happens, will come after the end of Lamb's term. There is no provision for locking the current leader in the office and requiring them to continue carrying out its duties; when the term is done, it's done. So the project is now certain to have a period of time where it has no leader at all. Some developers seem to relish this possibility; one even suggested that a machine-learning system could be placed into that role instead. But, as Joerg Jaspert pointed out: "There is a whole bunch of things going via the leader that is either hard to delegate or impossible to do so". Given enough time without a leader, various aspects of the project's operation could eventually grind to a halt.

The good news is that this possibility, too, has been foreseen in the constitution. In the absence of a project leader, the chair of the technical committee and the project secretary are empowered to make decisions — as long as they are able to agree on what those decisions should be. Since Debian developers are famously an agreeable and non-argumentative bunch, there should be no problem with that aspect of things.

In other words, the project will manage to muddle along for a while without a leader, though various aspects of business could slow down and become more awkward if the current candidate drought persists. One might well wonder, though, why there seems to be nobody who wants to take the helm of this project for a year. The fact that it is an unpaid position requiring a lot of time and travel might have something to do with it. If that were indeed to prove to be part of the problem, Debian might eventually have to consider doing what a number of similar organizations have done and create a paid position to do this work. Such a change would not be easy to make; a look back at the "Dunc-Tank" experiment is a good reminder of just how fraught such discussions can be. But, if the project finds itself struggling to find a leader every year, it's a discussion that may need to happen.

Comments (11 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: Security things in Linux 5.0; CommunityBridge; GCC 9; Season of Docs; sway 1.0; SPI annual report; Quotes; ...
  • Announcements: Newsletters; events; security updates; kernel patches; ...
Next page: Brief items>>

Copyright © 2019, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds