Insecurity and Python pickles
Serialization is the process of transforming Python objects into a sequence of bytes which can be used to recreate a copy of the object later — or on another machine. pickle is Python's native serialization module. It can store complex Python objects, making it an appealing prospect for moving data without having to write custom serialization code. For example, pickle is an integral component of several file formats used for machine learning. However, using pickle to deserialize untrusted files is a major security risk, because doing so can invoke arbitrary Python functions. Consequently, the machine-learning community is working to address the security issues caused by widespread use of pickle.
It has long been clear that pickle can be insecure. LWN covered a PyCon talk ten years ago that described the problems with the format, and the pickle documentation contains the following:
Warning:
The pickle module is not secure. Only unpickle data you trust.
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.
That warning might give the impression that the creation of malicious pickles is difficult, or relies on exploiting flaws in the pickle module, but executing arbitrary code is actually a core part of pickle's design. pickle has supported running Python functions as part of deserializing a stored structure since 1997.
Objects are serialized by pickle as a list of opcodes, which will be executed by a custom virtual machine in order to deserialize the object. The pickle virtual machine is highly restricted, with no ability to execute conditionals or loops. However, it does have the ability to import Python modules and call Python functions, in order to support serializing classes.
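As a rough illustration (the exact opcodes depend on the pickle protocol in use), the standard pickletools module can decode the opcode stream of an ordinary object without running it:

import pickle
import pickletools

# Decode the opcode stream for a plain dict; dis() only reads the opcodes,
# it never executes them, so it is safe to run on any pickle.
pickletools.dis(pickle.dumps({"spam": 1}))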
When writing a Python class, the programmer can define a __reduce__() method that gives pickle the information it needs to store an instance of that class. __reduce__() returns a tuple of information needed to save and restore the object; the first element of the tuple is a callable — a function or a class object — that will be called in order to reconstitute the object. Only the names of callable objects are stored in the pickle, which is why pickle doesn't support serializing anonymous functions.
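As a minimal sketch (the Temperature class is an invented example), a class can implement __reduce__() to name a callable and the arguments to pass to it during unpickling:

import pickle

class Temperature:
    def __init__(self, celsius):
        self.celsius = celsius

    def __reduce__(self):
        # First element: a callable (here the class itself), stored by name.
        # Second element: the arguments passed to it during unpickling.
        return (Temperature, (self.celsius,))

restored = pickle.loads(pickle.dumps(Temperature(21.5)))
print(restored.celsius)  # 21.5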
The ability to customize the pickling of a class is the secret to pickle's ability to store such a wide variety of Python objects. For objects without special requirements, the default object.__reduce__() method — which just stores the object's instance variables — usually suffices. For objects with more complicated requirements, having a hook available to customize pickle's behavior lets the programmer completely control how the object is serialized.
Limiting pickle to not support unnamed callable objects is a deliberate design choice with two advantages: allowing code upgrades, and decreasing the size of pickled objects, both before and after deserialization. The fact that pickle loads classes by name allows a programmer to serialize an object with a custom class, edit their program, and deserialize the object with the new semantics. This also ensures that unpickled objects don't come with an extra copy of their classes (and all the objects that those reference, etc.), which significantly reduces the amount of memory required to store many small unpickled objects.
pickle does support restricting which named callables can be accessed during unpickling, but finding a set of functions to allow without introducing the potential to run arbitrary code can be surprisingly difficult. Python is a highly dynamic language, and Python code is often not written with the security of unpickling in mind — both because security is not a goal of the pickle module, and because programmers often don't need to think about pickling at all.
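The pickle documentation demonstrates this by overriding the find_class() method of a pickle.Unpickler subclass; a minimal sketch along those lines (the allow-list here is only an example) looks like this:

import builtins
import io
import pickle

ALLOWED = {"range", "set", "frozenset", "complex", "slice"}  # example allow-list

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Refuse every global except a small set of harmless built-ins.
        if module == "builtins" and name in ALLOWED:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(data):
    return RestrictedUnpickler(io.BytesIO(data)).load()

print(restricted_loads(pickle.dumps(range(5))))  # range(0, 5)
# A pickle that names os.system() raises UnpicklingError instead of running it.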
A malicious pickle
The pickle documentation gives this example of a malicious pickle:
import pickle
pickle.loads(b"cos\nsystem\n(S'echo hello world'\ntR.")
This pickle imports the os.system() function, and then calls it with "echo hello world" as an argument. This particular example is not terribly malicious; malware using this technique in the real world usually executes Python code to set up a reverse shell, or downloads and executes the next stage of the malware. The built-in pickletools module shows how this byte stream is interpreted as instructions for the pickle machine:
 0: c    GLOBAL     'os system'
11: (    MARK
12: S    STRING     'echo hello world'
32: t    TUPLE      (MARK at 11)
33: R    REDUCE
34: .    STOP
GLOBAL is the instruction used to import functions and classes; REDUCE calls a function with the given arguments.
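The listing above can be reproduced safely, because pickletools.dis() only decodes the opcodes rather than executing them:

import pickletools

# Decoding, not executing: dis() never calls the functions a pickle names.
pickletools.dis(b"cos\nsystem\n(S'echo hello world'\ntR.")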
Widespread use
Because pickle is so convenient, it is used in many different applications. Programs that use pickle to send data to themselves — such as programs that use multiprocessing — mostly have little to worry about on the security front. But it is common, especially in the world of machine learning, to use pickle to share data between programs developed by different people.
There are several directories of machine-learning models, such as Hugging Face, PyTorch Hub, or TensorFlow Hub, that allow users to share the weights of pre-trained models. Since Python is a popular language to use for machine learning, many of these are shared in the form of either raw pickle files, or other formats that have pickled components.
Security researchers have found models on these platforms that embed malware delivered via unpickling. Security company Trail of Bits recently announced an update to its LGPL-licensed tool — fickling — for detecting these kinds of payloads. Fickling disassembles pickle byte streams without executing them, producing a report about suspicious characteristics. It can also recognize polyglots — files that appear to use one file format, but can be interpreted as pickles by other software.
The machine-learning community is certainly aware of these problems. The fact that loading a model is insecure is noted in PyTorch's documentation. Hugging Face, EleutherAI, and Stability AI collaborated to design a new format — called safetensors — for securely sharing machine-learning models. Safetensors files use a JSON header to describe the contained data: the shape of each layer of the model, the numeric format used for the weights, etc. After the header, a safetensors file includes a flat byte-buffer containing the packed weights. Safetensors files can only store model weights without any associated code, making it a much simpler format. The safetensors file format has been audited (also by Trail of Bits), suggesting that it might prove to be a secure alternative.
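As a sketch of what working with the format looks like (this assumes the safetensors and NumPy packages are installed; the file and tensor names are made up), weights can be written and read back without any pickled code, and the JSON header can be inspected directly:

import json
import struct

import numpy as np
from safetensors.numpy import load_file, save_file

# Save two example arrays; the file contains only a JSON header plus raw bytes.
weights = {"layer0.weight": np.zeros((4, 4), dtype=np.float32),
           "layer0.bias": np.zeros(4, dtype=np.float32)}
save_file(weights, "example.safetensors")

restored = load_file("example.safetensors")
print(restored["layer0.weight"].shape)  # (4, 4)

# The first 8 bytes give the header length (little-endian); the header itself
# is JSON describing each tensor's dtype, shape, and byte offsets.
with open("example.safetensors", "rb") as f:
    header_len = struct.unpack("<Q", f.read(8))[0]
    header = json.loads(f.read(header_len))
print({name: (entry["dtype"], entry["shape"])
       for name, entry in header.items() if name != "__metadata__"})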
Even with safetensors becoming the new default format to save models for several libraries, there are still many older pickle-based models in regular use. As with any transition to a new technology, it seems likely that there will be a long tail of pickle-based models.
Hugging Face has started including security warnings on files that contain pickle data, but this information is only visible if users click through to view the files associated with a model, not if they only look at the "model card". Other sources of machine-learning models, such as PyTorch Hub and TensorFlow Hub, merely host pointers to weights stored elsewhere, and therefore do not do even that small check.
pickle's compatibility with many kinds of Python objects and its presence in the standard library make it an attractive choice for developers wishing to quickly share Python objects between programs. Using pickle within a single application can be a good way to simplify communication of complex objects. Despite this, using pickle outside of its specific use case is dangerously insecure. Once pickle has made its way into an ecosystem, it can be difficult to remove, since any alternative will have a hard time providing the same flexibility and ease of use.