Trojan Source and Python
The Trojan Source vulnerabilities have been rippling through various development communities since their disclosure on November 1. The oddities that can arise when handling Unicode, and bidirectional Unicode in particular, in a programming language have led Rust, for example, to check for the problematic code points in strings and comments and, by default, refuse to compile if they are present. Python has chosen a different path, but work is underway to help inform programmers of the kinds of pitfalls that Trojan Source has highlighted.
On the day of the Trojan Source disclosure, Petr Viktorin posted a draft of an informational Python Enhancement Proposal (PEP) to the python-dev mailing list. He noted that the Python security response team had reviewed the report and "decided that it should be handled in code editors, diff viewers, repository frontends and similar software, rather than in the language". He agreed with that decision, in part because there are plenty of other kinds of "gotchas" in Python (and other languages), where readers can be misled, purposely or not.
But there is a need to document these kinds of problems, both for Python developers and for the developers of tools to be used with the language; thus the informational PEP. After some adjustments based on the discussion on the mailing list, Viktorin created PEP 672 ("Unicode-related Security Considerations for Python"). It covers the Trojan Source vulnerabilities and other potentially misleading code from a Python perspective, but, as its "informational" status would imply, it is not a list of ways to mitigate the problem. "This document purposefully does not give any solutions or recommendations: it is rather a list of things to keep in mind."
ASCII
It starts by looking at the ASCII subset of Unicode, which has its own, generally well-known, problem spots. Characters like "0" and "O" or "l" and "1" can look the same depending on the font; in addition, "rn" may be hard to distinguish from "m". Fonts designed for programming typically make it easier to see those differences, but human perception can sometimes still be outwitted:
However, what is “noticeably” different always depends on the context. Humans tend to ignore details in longer identifiers: the variable name accessibi1ity_options can still look indistinguishable from accessibility_options, while they are distinct for the compiler. The same can be said for plain typos: most humans will not notice the typo in responsbility_chain_delegate.
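To the interpreter, of course, such lookalike names are simply different variables. A minimal sketch, using the PEP's lookalike name purely for illustration:

```python
# 'accessibi1ity_options' uses the digit '1' where 'accessibility_options'
# has the letter 'l' -- visually similar in many fonts, but distinct names.
accessibility_options = {"high_contrast": True}
accessibi1ity_options = {}  # a second, unrelated, empty variable

print(accessibility_options == accessibi1ity_options)  # False
```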
Beyond that, the ASCII control codes play a role. For example, NUL (0x0) is treated by CPython as an end-of-line character, but editors may display things differently. Even if the editor highlights the unknown character, putting a NUL at the end of a comment line might be easily misunderstood, as the following example shows:
[...] displaying this example:

    # Don't call this function:
    fire_the_missiles()

as a harmless comment like:

    # Don't call this function:⬛fire_the_missiles()
Backspace, carriage return (without line feed), and escape (ESC) can be used for various visual tricks, particularly when code is output to a terminal. Python allows more than just ASCII in its programs, however; Unicode is legal for identifiers (e.g. function, variable, and class names) as described in PEP 3131 ("Supporting Non-ASCII Identifiers"). But, as PEP 672 notes: "Only 'letters and numbers' are allowed, so while γάτα is a valid Python identifier, 🐱 is not." In addition, non-printing control characters (e.g. the bidirectional overrides used in one of the Trojan Source vulnerabilities) are not allowed in identifiers.
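Those rules are easy to check interactively, since str.isidentifier() applies the same definition the parser uses; a short sketch:

```python
# str.isidentifier() reports whether a string is a valid Python name:
print("γάτα".isidentifier())   # True: Greek letters count as letters
print("🐱".isidentifier())     # False: an emoji is a symbol, not a letter

# Bidirectional control characters are likewise rejected in identifiers:
print("a\u202eb".isidentifier())  # False (contains RIGHT-TO-LEFT OVERRIDE)
```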
Homoglyphs
But the other Trojan Source vulnerability relates to "homoglyphs" (or "confusables" as the PEP calls them). Characters in various alphabets can be similar to, or the same as, those in other languages: "For example, the uppercase versions of the Latin b, Greek β (Beta), and Cyrillic в (Ve) often look identical: B, Β and В, respectively." That can lead to identifiers that look the same but are actually different; there are other oddities as well:

Additionally, some letters can look like non-letters:
- The letter for the Hawaiian ʻokina looks like an apostrophe; ʻHelloʻ is a Python identifier, not a string.
- The East Asian word for ten looks like a plus sign, so 十= 10 is a complete Python statement. (The “十” is a word: “ten” rather than “10”.)
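Both observations are easy to reproduce; a minimal sketch, with variable names chosen only for illustration:

```python
# Latin B, Greek Β (U+0392), and Cyrillic В (U+0412) often render identically,
# yet they are three distinct identifiers as far as Python is concerned:
B = "latin"
Β = "greek"     # GREEK CAPITAL LETTER BETA
В = "cyrillic"  # CYRILLIC CAPITAL LETTER VE
print(B, Β, В)  # latin greek cyrillic

# The CJK ideograph for "ten" is a letter, so this is a complete assignment:
十 = 10
print(十 + 5)   # 15
```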
Though there are symbols that look like letters in another language, symbols are not allowed in Python identifiers, obviating the reverse problem. Another surprising aspect might be in the conversion of strings to numbers in functions such as int() and float(), or even in str.format():
Some scripts include digits that look similar to ASCII ones, but have a different value. For example:
>>> int('৪୨')
42
>>> '{٥}'.format('zero', 'one', 'two', 'three', 'four', 'five')
five
The second example uses the indexing feature of str.format() to pick the Nth value from its arguments; in that case, the value of the number is five, even though it looks vaguely like a zero. Then there are the confusions that can arise from bidirectional text.
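The unicodedata module can be used to see what such digits really are; a short sketch of that kind of inspection:

```python
import unicodedata

# int() accepts decimal digits from any script, even mixed together:
print(int('৪୨'))  # 42

# unicodedata reveals each character's identity and numeric value:
for ch in '৪୨':
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.digit(ch))
```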
Bidirectional text
Code containing identifiers written in right-to-left scripts can be puzzling to those who are used to left-to-right ordering: CPython always processes the code points in their logical order, but the Unicode bidirectional algorithm determines how such code is displayed. For example:
In the statement ערך = 23, the variable ערך is set to the integer 23.
The example above might be clear enough from context for someone reading it who is used to reading left-to-right text, but another of the PEP's examples takes things further:
In the statement قيمة - (ערך ** 2), the value of ערך is squared and then subtracted from قيمة. The opening parenthesis is displayed as ).
Another extended example gets to the heart of the Trojan Source bidirectional vulnerability. It starts by showing the difference a single right-to-left character makes in a line of code, then looks at the effects of the invisible Unicode code points that change or override directionality within a line.
Consider the following code, which assigns a 100-character string to the variable s:

    s = "X" * 100 # "X" is assigned

When the X is replaced by the Hebrew letter א, the line becomes:

    s = "א" * 100 # "א" is assigned

This command still assigns a 100-character string to s, but when displayed as general text following the Bidirectional Algorithm (e.g. in a browser), it appears as s = "א" followed by a comment.
[...] Continuing with the s = "X" example above, in the next example the X is replaced by the Latin x followed or preceded by a right-to-left mark (U+200F). This assigns a 200-character string to s (100 copies of x interspersed with 100 invisible marks), but under Unicode rules for general text, it is rendered as s = "x" followed by an ASCII-only comment:
s = "x" * 100 # "x" is assigned
Readers who normally use left-to-right text may find it interesting to paste some of that code into Python or to try working with it in an editor; the behavior is not intuitive, at least for me. The uniname utility may be useful for peering inside to see the code points.
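The length difference, at least, is easy to verify without fighting the editor; a minimal sketch:

```python
# U+200F (RIGHT-TO-LEFT MARK) is invisible in most renderings but is a real
# character: this string contains 200 code points, not 100.
s = "x\u200f" * 100
print(len(s))        # 200
print(s.count("x"))  # 100
```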
There are other Unicode code points that affect directionality, but the effects of all of them terminate at the end of a paragraph, which is usually interpreted as the end of line by various tools, including Python. Using those normally invisible code points can have wide-ranging effects as seen in Trojan Source and noted in the PEP:
These characters essentially allow arbitrary reordering of the text that follows them. Python only allows them in strings and comments, which does limit their potential (especially in combination with the fact that Python's comments always extend to the end of a line), but it doesn't render them harmless.
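Detecting these characters is exactly the kind of check the security response team expects editors, diff viewers, and linters to perform. A sketch of such a scanner (the function name and the exact set of characters flagged are this article's choices, not anything standardized by Python):

```python
# Explicit directional formatting characters from the Unicode bidirectional
# algorithm, plus the implicit directional marks:
BIDI_CONTROLS = {
    '\u202a', '\u202b', '\u202c', '\u202d', '\u202e',  # LRE, RLE, PDF, LRO, RLO
    '\u2066', '\u2067', '\u2068', '\u2069',            # LRI, RLI, FSI, PDI
    '\u200e', '\u200f',                                # LRM, RLM
}

def find_bidi_controls(source):
    """Return (line, column, codepoint) for each bidi control in source."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), 1):
        for col, ch in enumerate(line):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, col, f"U+{ord(ch):04X}"))
    return hits

print(find_bidi_controls('s = "x\u200f" * 100  # looks harmless'))
# [(1, 6, 'U+200F')]
```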
Normalization
Another topic covered in the PEP is the normalization of Unicode code points for identifiers. In Unicode, there are often several different ways to generate the same "character"; using Unicode equivalence, it is possible to normalize a sequence of code points to produce a canonical representation. There are four normalization forms to choose from, however; Python uses NFKC to normalize all identifiers, but not strings, of course.
There are some interesting consequences stemming from that, which can also be confusing. For example, there are multiple variants of the letter "n" in Unicode, several in mathematical contexts, all of which are normalized to the same value, leading to oddities like:

    >>> xⁿ = 8
    >>> xn
    8

In a followup message, Paul McGuire posted a particularly graphic demonstration of how a simple program can be transformed into something almost unreadable via normalization. Treating strings differently means that functions like getattr() will behave differently than a lookup done directly in the code. An example from the PEP (ab)uses the equivalence of the ligature "ﬁ" with the string "fi" to demonstrate that:
>>> class Test:
...     def ﬁnalize(self):
...         print('OK')
...
>>> Test().ﬁnalize()
OK
>>> Test().finalize()
OK
>>> getattr(Test(), 'ﬁnalize')
Traceback (most recent call last):
  ...
AttributeError: 'Test' object has no attribute 'ﬁnalize'
Similarly, using the import statement to refer to a module in the code will normalize the identifier, but using importlib.import_module() with a string does not. Beyond that, various operating systems and filesystems also do normalization; "On some systems, ﬁnalization.py, finalization.py and FINALIZATION.py are three distinct filenames; on others, some or all of these name the same file."
Reaction
The reaction to the PEP has been quite positive overall, as might be expected. There were some questions about whether it should be a PEP or part of the standard documentation. For now, Viktorin is content to keep it as a PEP, but thinks it may make sense to integrate it into the documentation at some point; "I went with an informational PEP because it's quicker to publish", he said.
The conversation also turned toward changes that could be made to Python to help avoid some of the problems and ambiguities that arise. Several suggestions that might seem reasonable at first blush are too heavy-handed, likely resulting in too many false positives or even effectively banning some widely used scripts (e.g. Cyrillic), as Serhiy Storchaka pointed out. For the most part, it was agreed that these kinds of problems should be detected by linters and other tools that can be configured based on the project and its code base. There may be some appetite for disallowing explicit ASCII control codes in strings and comments, however, as Storchaka suggested:
All control characters except CR, LF, TAB and FF are banned outside comments and string literals. I think it is worth to ban them in comments and string literals too. In string literals you can use backslash-escape sequences, and comments should be human readable, there are no [reasons] to include control characters in them. There is a precedence of emitting warnings for some superficial escapes in strings.
As can be seen here and in the PEP, Viktorin provided a whole cornucopia of things for Python developers of various stripes to consider. While the exercise was motivated by the Trojan Source vulnerabilities, the "problems" are more widespread. There is a fine line between supporting various writing systems used by projects worldwide and discovering oddities—malicious or otherwise—in a particular code base. Developers of tools targeting the Python ecosystem will find much of interest in the PEP.