PyTorch and the PyPI supply chain
The PyTorch compromise that happened right at the end of 2022 was rather ugly, but its impact was not widespread—seemingly, at least. The incident does highlight some of the perils of relying on an external "supply chain" for the components that are used to build one's software. It also would appear to be another case of "security researchers" run amok, though perhaps that part of the story is only meant to cover the tracks—or ass—of the perpetrator.
Beyond that, the incident shows that the Python Package Index (PyPI) and the pip package installer act in ways that arguably assisted the compromise. That clearly comes as a surprise to many, though those behaviors are well-known and well-established in the Python Packaging Authority (PyPA) community. There is, at minimum, a need for education on that topic.
Compromise
People (or continuous-integration bots) who installed the nightly build of the PyTorch machine-learning framework using pip between December 25 and 30 got an unwelcome surprise. A malicious binary was installed along with one of PyTorch's dependencies, and it was triggered when that dependency was imported into a PyTorch-using code base. The binary gathers system information (e.g. name servers, host names) and the contents of various interesting files (e.g. /etc/passwd, $HOME/.ssh/*, the first 1000 files in $HOME), then uploads that information to an attacker-controlled server using encrypted DNS queries.
In order to build PyTorch, multiple dependencies of various sorts are required. Some are regular PyPI packages that should be downloaded from that repository, while others are PyTorch-specific packages that should come from the PyTorch nightly repository. A single pip command is used to install from both PyPI and the PyTorch nightly repository given on the command line, but pip does not distinguish between the two repositories; it treats them both as equal possibilities for fulfilling the need for a given package.
If there is a dependency on, say, torchtriton from some other part of PyTorch and there is a package by that name available on PyPI, pip can choose it to install instead of the one by the same name in the PyTorch repository. That is exactly what happened, of course; an attacker registered the torchtriton PyPI package and uploaded a version of that code that functioned exactly the same as the original—except that it added the malicious payload that is executed when it is imported. It is unknown exactly how many sites were actually affected, but the malicious torchtriton package was downloaded from PyPI around 2,800 times, according to a lengthy analysis of the compromise by Tzachi Zorn.
Once the PyTorch project was alerted to the malware at PyPI on December 30, it took immediate steps to fix the problem. The torchtriton package name was removed as a dependency from PyTorch and replaced with pytorch-triton; a placeholder project called pytorch-triton was registered at PyPI so that the problem could not recur. In addition, PyTorch nightly builds that referred to torchtriton as a dependency were removed from the repositories so that any cached versions of the malicious package would not be picked up inadvertently. The PyPI administrators were also alerted and they promptly removed the malicious package. On December 31, the project put out the advisory linked above.
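For anyone wondering whether an environment ever picked up a package under the old name, a minimal sketch along the following lines could help. It only asks the standard library's importlib.metadata whether a distribution called torchtriton is installed; the helper names are illustrative, and a hit does not by itself prove compromise, since the legitimate nightly package used the same name. It merely flags an environment worth a closer look.

```python
from importlib import metadata

# Distribution name taken from the article; extend the list as needed.
SUSPECT_NAMES = ["torchtriton"]


def installed_version(dist_name):
    """Return the installed version of dist_name, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def report(names=SUSPECT_NAMES):
    """Map each name to its installed version (None when not installed)."""
    return {name: installed_version(name) for name in names}


if __name__ == "__main__":
    for name, version in report().items():
        status = f"installed (version {version})" if version else "not installed"
        print(f"{name}: {status}")
```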
The analysis by Zorn (and another by Ax Sharma at BleepingComputer) describes efforts by the perpetrator of the attack to explain their actions. At first, the domain used for DNS lookups in order to exfiltrate the information put up a short text message claiming that the information was gathered simply to identify the companies affected so that they could be alerted. Another, longer message, which can be seen in those articles, was apparently sent to various outlets with similar claims, including that all of the data gathered by the malicious payload had been deleted. It is pretty much impossible to verify one way or the other; it could be truthful and heartfelt, or it could simply be damage control.
Dependency confusion
The type of problem being exploited here is called "dependency confusion"; the technique was popularized by Alex Birsan in 2021, but the pip bug report linked above makes it clear that the problem was known in that community back in 2020. When the --extra-index-url option for pip is used, it consults that index and adds all of the packages it provides to its internal list. When it comes time to install a package, pip chooses the one with the highest version (or highest version that satisfies any version constraints that were specified) regardless of which repository it comes from.
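The selection rule just described can be illustrated with a toy model. This is a simplified sketch, not pip's actual resolver: versions are plain tuples rather than full PEP 440 versions, and the Candidate type, the pick() helper, the index labels, and the version numbers are all made up for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    version: tuple  # simplified: (2, 0, 0) instead of a full PEP 440 version
    index: str      # which repository offered this file


def pick(candidates, name, pinned=None):
    """Choose the highest matching version, ignoring where it came from."""
    matches = [
        c for c in candidates
        if c.name == name and (pinned is None or c.version == pinned)
    ]
    if not matches:
        raise LookupError(f"no candidate for {name}")
    return max(matches, key=lambda c: c.version)


# Hypothetical index contents; the version numbers are invented.
nightly = [Candidate("torchtriton", (2, 0, 0), "pytorch-nightly")]
pypi = [Candidate("torchtriton", (3, 0, 0), "pypi")]

# Both indexes feed one pool, so the higher version wins regardless of origin.
chosen = pick(nightly + pypi, "torchtriton")
print(chosen.index)  # pypi
```

Because the two lists are merged before the comparison, nothing in the model records which index was meant to be authoritative for a given name; that is the gap that dependency confusion exploits.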
PEP 440 ("Version Identification and Dependency Specification") governs how pip chooses which version to install. One might think pinning a dependency to a specific version would be sufficient, but, as Dustin Ingram pointed out in a recent discussion, that is not true. pip and other installers "prefer wheels with more specific tags over less specific tags". That makes it relatively easy for an attacker to shadow even a version-pinned dependency.
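To see why a version pin alone does not help, consider a rough model of tag preference. The sketch below is an assumption-laden simplification: the SUPPORTED_TAGS ordering, the pick_wheel() helper, the index labels, and the 1.2.3 version are all hypothetical, and a real installer computes its own list of compatible tags for the running interpreter and platform.

```python
# Hypothetical preference order for one interpreter and platform,
# most specific first; a real installer computes this list itself.
SUPPORTED_TAGS = [
    "cp310-cp310-manylinux_2_17_x86_64",
    "cp310-abi3-manylinux_2_17_x86_64",
    "py3-none-any",
]


def pick_wheel(wheels, pinned_version):
    """Among wheels matching the pin, return the one with the most specific tag."""
    matches = [
        w for w in wheels
        if w["version"] == pinned_version and w["tag"] in SUPPORTED_TAGS
    ]
    if not matches:
        raise LookupError("no compatible wheel for the pinned version")
    # A lower position in SUPPORTED_TAGS means a more specific tag.
    return min(matches, key=lambda w: SUPPORTED_TAGS.index(w["tag"]))


wheels = [
    {"version": "1.2.3", "tag": "py3-none-any", "index": "private"},
    {"version": "1.2.3", "tag": "cp310-cp310-manylinux_2_17_x86_64", "index": "public"},
]

chosen = pick_wheel(wheels, "1.2.3")
print(chosen["index"])  # public
```

Both wheels satisfy the pin, so the tie is broken by tag specificity rather than by which index the project intended to use.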
As Ingram noted in another message, the way to truly pin a dependency is by specifying the hash values of the binary artifacts to be installed as described in the pip documentation. That thread is interesting in other ways, however.
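The idea behind hash pinning can be sketched in a few lines. pip's hash-checking mode expresses it in a requirements file with --hash=sha256:... entries (optionally enforced with --require-hashes); the code below is not pip's implementation, only an illustration of the check, and the helper names and byte strings are stand-ins.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of an artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, allowed_digests) -> bool:
    """Return True if the digest is pinned; raise ValueError otherwise."""
    digest = sha256_hex(data)
    if digest not in allowed_digests:
        raise ValueError(f"unexpected sha256 digest: {digest}")
    return True


# Stand-in byte strings; real use would read the downloaded wheel from disk.
audited = b"bytes of the wheel that was reviewed"
shadow = b"bytes of a same-named, same-version upload"

# The set of digests recorded when the dependency was reviewed.
pinned = {sha256_hex(audited)}

print(verify_artifact(audited, pinned))  # True
try:
    verify_artifact(shadow, pinned)
except ValueError as exc:
    print("rejected:", exc)
```

Unlike a name or version, the digest is tied to the exact bytes of the file, so a shadowing upload with the same name, version, and even a more specific tag fails the check.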
The thread starts with a request for help in convincing the security administrators at a company to unblock PyPI. Kirk Graham ran into a problem at his company, which had wholesale blocked the repository "because there were '29 malwared malicious modules' at the site". Those modules had long been removed from PyPI but the reputation for unreliability lingered on. Brett Cannon pointed out that there are lots of other places where malicious code can sometimes be obtained:
My first question would be whether they block every project index out there (e.g., npm, crates.io, etc.), as they all have the same problem? Or what about GitHub? I mean where does the line get drawn for protecting you from potentially malicious code?
My follow-up is how do they expect you to do use any open source Python code? If so, how are you supposed to get that code? Straight from the repositories? I mean I know lots of large companies that ban pulling directly from code indexes like PyPI, but then these are large companies with dedicated teams to get the source, store it internally, do their own builds of wheels, etc. If you block access to using what the projects provide you have to be up for doing all the work they provide in getting you those files.
Several in the thread pointed to various services and tools for managing dependencies of open-source components, which might help solve the problem at the company. Graham was clearly frustrated with the situation and his company, but once he found out about PyTorch, he changed his tune to a certain extent:
Over the holidays there was malicious code added to PyTorch module on PyPi. That makes me think our Security Director is right. If there isn't better security from PyPi and GitHub those sites will be blocked by more and more companies. Open Source needs to be more secure. /sigh
That is not an entirely accurate picture of what happened, which was pointed out in the thread, but the larger point still stands. To outsiders it looks like PyTorch itself was compromised on PyPI, when what actually happened is more nuanced than that.
The pip bug report came up in the thread as well. Reading through that report makes it clear that the problem does not lend itself to a simple or straightforward fix. The root of the problem is that people do not understand that using the PyPI repository is not without risks and they fail to fully evaluate what those risks are, and what they mean for their software supply chain. As Paul Moore put it when the bug was resurrected after the Birsan posting in 2021: "But I do think that we're trying to apply a technology solution to a people problem here, and that never goes well :-(".
Much of what Moore and other PyPA developers have to say in the report is worth reading for those interested in the problem. So far, the most straightforward "solution" is to remove the --extra-index-url option entirely, but that has its own set of problems, as Moore noted:
There really is no "good" way of securing --extra-index-url if you look at it that way. Allow pip to look in 2 locations and you have to accept that all of your packages are now being served as securely as the least secure index. And the evidence of the "dependency confusion" article is that people simply aren't aware of that reality. So what the pip developers need to decide is whether our responsibility ends with having documented how multiple indexes work, or whether we should view the ability to have multiple indexes as an "attractive nuisance" and remove it to ensure that people aren't tempted to use it in an insecure manner.
The clamour of voices arguing "this is a security flaw", plus the sheer stress on the maintainers that would be involved in arguing that this isn't our problem, suggests that we should remove the feature. But there's no doubt that it would penalise people who use the ability correctly - and it feels wrong to be penalising those people for the sake of the group who didn't properly assess the risks.
The bug report thread was brought to life again after the PyTorch mess, naturally. Moore describes some concrete steps that could be taken to address the problem, but it still requires someone (or some organization) willing to take on that work, make a proposal, and push it through to completion. So far there has been a lot of talk about the problem, but little in the way of action to fix it.
It really should come as no surprise that grabbing random code from the internet sometimes results in less than ideal outcomes. The flipside of that is that, usually, "sometimes" is extremely rare, which in some ways leads directly to the "attractive nuisance" argument. These kinds of problems are not new and are seemingly not going away anytime soon either. Each time we have an event like this PyTorch compromise, it gives open-source software another black eye, which is perhaps not entirely fair, but also not entirely surprising.