You should do both. Sanitize your inputs so that they can be safely stored in your data store, and then sanitize your output so it can be safely displayed.
We did this at reddit. We had basic SQL sanitization on the way in, and then a full pass on the way back to the user. The advantage this gave us is that when someone discovered a new way to hack our sanitization, all we had to do was update the output filter and everything was magically safe.
We didn't have to do full database scans to find all the bad data and change it.
Edit: Apparently I shouldn't have simplified "parameterization of SQL" as "sanitize your input". I used the more generic term since I was talking about any kind of data store. But yes, it was of course parameterized.
Most input by humans is Unicode. Most people do not understand Unicode. Don't try to sanitize it beyond checking that it's valid UTF-8 - which hopefully your programming language's string type, HTTP parameter deserializer, or DB engine already does for you.
For example, some people think that stripping ZWNJ, ZWJ or other kinds of spaces is a thing they ought to do, because those characters confuse their markup parsers or can be used to encode hidden information in posts or stuff like that. Guess what: it breaks emoji, Arabic, some Asian languages and a bunch of other things.
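If you must touch the text at all, here's a minimal validate-only sketch in Python (purely illustrative; most frameworks already do this decode step for you):

    def read_text(raw: bytes) -> str:
        # Strict decode: raise on any invalid sequence instead of
        # silently replacing bytes and garbling the user's data.
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            raise ValueError("input is not valid UTF-8")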
If you only sanitize your output and realize you made a mistake, you can easily fix it by changing your output sanitizing algorithm. If you sanitized your input, you threw away data and can't fix your problem.
WTF-8 is for a very specific application—to encode in an 8-bit sequence a 16-bit sequence which is nominally UTF-16 but may contain unpaired surrogates. It’s not used for weird user inputs, instead, it’s used for e.g. interop with the Win32 API.
Are you sure that’s what is happening at Reddit? You shouldn’t need to sanitise your inputs for SQL. Parameterised SQL has been a thing in some languages for two decades now. This really is a long-solved problem by now.
Output is a different matter though but that’s because of rendering content safely down to HTML, JavaScript or JSON (to name a few examples). SQL shouldn’t come into the equation by this point.
This. I'm tired of people implying or outright stating that SQL injection is an input validation problem. Why couldn't you have foo' OR 1=1; as the title of your post? Those are all good characters as far as text entry is concerned.
SQL injection is really a problem of how you pass parameters to your SQL layer. Parametrized queries are the (easy and widely available) solution. If you are concatenating input to your SQL queries, you're doing it wrong.
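To illustrate the difference, a minimal sketch using Python's built-in sqlite3, standing in for any driver with placeholder support:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (title TEXT)")

    title = "foo' OR 1=1;"  # perfectly legitimate text for a title

    # Wrong: concatenation mixes the data into the query language.
    #   conn.execute("INSERT INTO posts (title) VALUES ('" + title + "')")

    # Right: the driver keeps the value separate from the query.
    conn.execute("INSERT INTO posts (title) VALUES (?)", (title,))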
How many people named O'Brien are told they can't sign up, or passwords get rejected because they contain special characters?
It's crazy.
Even if you're using 1990s technology without parameterised queries, it's not like it's impossible to say `insert into users (name, motto) values ('O\'Brien', 'foo\' OR 1=1;')`.
Yep. I've stopped counting the number of websites that force me to use weak passwords. It's crazy that this is still a thing in 2020.
I wish the form controls in browsers came with a green check mark indicating best practices (8+ characters, no filtering) so that people who make websites understand that this is what they should conform to. Not their own misconceptions about password security.
I wish websites would stop enforcing a "password policy". An insecure password should be a choice. If you are so sure you cannot secure your site, leave authentication to a third party provider. All this leads to is zillions of user accounts that are used only once.
I have heard stories about people in Ireland struggling to get certain services because their name contains a Fada, but some of the identity paperwork they have is missing the Fada due to lack of support by computer systems.
I blame PHP. Many webdevs active today started with it, and the standard library's solution to injections was escaping everything half a dozen times just in case. Because PHP being PHP nobody saw any red flags when they implemented a function named "mysql_real_escape_string". Apparently they've deprecated these functions since then, but the damage is done.
But that hasn't been a thing for 15 years or more? PDO was added around 2005, and even before that anyone in their right mind used the mysqli extension for prepared statements. Since 2012 you can't even use the mysql extension without getting a deprecation warning.
And yes, in the '90s PHP's security sucked, but that was nothing PHP-specific; it was just the sentiment of that time. Everyone did it, in all languages. I remember using tons of $dbh->do() in Perl's DBI back then, intentionally avoiding prepared statements for quick and dirty stuff (and most scripts back then were quick and dirty stuff). It's in big part because we were used to building desktop apps and thinking in terms of the security that applied to them, like being careful about your pointers and input string lengths and stack overflows and such. The web was still a pretty new thing.
> But that hasn't been a thing for 15 years or more? PDO was added around 2005
Ex-shared hosting bod here, who had the joy of managing our PHP environments :(
Sadly in the real world, even after the great big (and pointless) act of deprecating and removing the mysql_* library, naive developers (and experienced ones that should've known better) just moved onto mysqli_* or PDO and still used string concatenation with raw inputs, instead of learning how to parameterise their queries.
> naive developers (and experienced ones that should've known better) just moved onto mysqli_* or PDO and still used string concatenation with raw inputs, instead of learning how to parameterise their queries.
The one and only time I argued with someone about PHP, the forum the guy ran got hacked the very next morning and was running a botnet of some sort. I was smug but quiet, and I never really thought about it but the timing makes me wonder if he thought I was involved.
Anyway, there’s a patch release the next day, and somewhere I find the diff. Now I can’t read PHP but I know what string concatenation looks like, especially if someone does a diff on it. I’ll be damned if the diff didn’t fix one SQL string concatenation that was less than five lines from code with the same structure. Scary.
PHP and vulnerable example code is an additional thing. Most people just copypaste from tutorials. For example, the first search result for "PHP mysql example" gives you the wrong example first https://www.w3schools.com/php/php_mysql_select.asp
That doesn't mean the information supplied was accurate. Which, as it happened, it wasn't and this whole thing was really just a massive misunderstanding.
When I said "sanitized input" I was simplifying parameterized SQL in the case of reddit. But since I said "for your datastore" I was being less specific since different datastores require different methods.
Parametrised SQL isn’t sanitising your input. It’s injecting those values at the bytecode level, so your values stay separate from the query language. Calling that process “sanitisation” is, at best, highly misleading.
Also the methods don’t really change across different SQL databases, at least not conceptually. Sure the RDBMS drivers might change but these days that stuff is usually abstracted away into a single framework for SQL. The real significant change would be switching to a NoSQL database but if you’re doing that then it’s not SQL you need to be “sanitising” anyway.
> Calling that process “sanitisation” is, at best, highly misleading.
That's a fair criticism. But that is what I've called it for a long time.
> The real significant change would be switching to a NoSQL database but if you’re doing that then it’s not SQL you need to be “sanitising” anyway.
Right, that's why I started off by saying "Sanitize your inputs so that it can be safely stored in your data store", so that it could apply to any data store.
It's just a terminology argument at this point. But my main point was that you need to still think carefully about how you're going to store your data and do it safely, and then also make things safe on the way out, like the article suggests.
> That's a fair criticism. But that is what I've called it for a long time.
I think many people would agree that "sanitization" is loosely defined and I think that's exactly what led to the misunderstandings that the OP's article is trying to address.
> that's why I started off by saying "Sanitize your inputs so that it can be safely stored in your data store"
That could be interpreted as "escape quotes at the start of the request if you know that you're using a database where quotes have a special meaning", a la PHP magic quotes, which I'm guessing is not what you meant, but it is what the OP is criticizing. The key is that the sanitization (or whatever you want to call it) shouldn't happen until you're ready to insert into the DB; otherwise that data will be coupled with database logic through the whole flow of your app.
You don't "sanitize your inputs" for your datastore. You escape the outputs as you send it to your datastore. (Either through oldschool methods, or parameterization.)
Sanitizing your inputs means changing them permanently. You don't actually want to do that. You want to store exactly what the original value was, but you want to do it safely. When you retrieve it again, it should be the same as it was originally.
If you "sanitize" the input, you won't necessarily have the original value ever again.
Yes. It's the word "sanitize" itself that misleads people. It creates the mindset that input from users is dirty and must be made clean, and "clean" is "safe" to use in any context.
(I've seen the line of thought taken one step further: taking the realization that it's impractical to make strings universally safe for any context—even if you HTML entity-encode it twice, what if a recipient decodes it three times?—and concluding that security is hard and we can only approach it asymptotically, so shrugs XSS-like bugs are normal and unavoidable given finite time & budget.)
If the mindset is more like converting units, it becomes clearer. You can't concatenate HTML with a general Unicode string without converting the string to HTML first, any more than you can add inches and centimeters directly. "Cleaning" the centimeters would make no sense.
> But my main point was that you need to still think carefully about how you're going to store your data...
I think this is true, but doesn't quite reach the point GP is making: speaking correctly is also important. Calling parameterization input sanitization communicates the wrong message. And abstracting the wrong solution to apply it to a different problem isn't all that helpful. You could just as easily encode or hash input to fit the underlying data format without losing data (except in the case of truncation), but that isn't input sanitization, either.
Input sanitization is strictly checking front-end input against a ruleset and rejecting anything that does not comply. This is fundamentally different than dealing with anything thrown at you and handling it gracefully.
That's not how the word is usually used within development in my experience though.
I think most devs think of sanitization as "make X safe", not "see if X is safe, if not reject" since that is usually called validation.
Using hand sanitizer does not remove your hands if they have harmful bacteria. The hands are still there, just cleansed of (some of) the harmful parts.
We should certainly use parameterization always, and this will allow us to save 'Bobby Tables' type input into our databases, but we should acknowledge that there is then the potential risk that some internal, probably non-public-facing program or script, either now or in the future, will contain a bug that leads to its execution.
The spread of natural language processing into systems and analysis tools might increase the scope for this sort of thing.
I agree you can never say never but you also can't sanitise against a risk that hasn't been defined yet simply because you wouldn't know what needs to be sanitised.
For example what if your NLP is bootstrapped from a shell script and your database content has been stripped of SQL but still contains stuff that might be interpreted as $(sub-shells)? Before long you run into a situation where literally no characters are considered safe (eg even alpha characters in the English alphabet are used as tokens in some programming languages and "what if someone builds a script in one of those languages?").
The only sane way to address the unknown is to treat raw strings as "dirty" and follow best practices when handling them (plus all the usual processes to properly test your code before it's used in production). In which case you're back to no longer needing input sanitisation.
My point is not that input sanitization is a solution, but that parameterized database input is not the end of the issue, from a broader, whole-systems security point of view.
Many data types have some concept of well-formedness, and in those cases, there are pragmatic reasons for only accepting well-formed input that go even beyond the security aspect.
I completely agree and I also said this in my original comment you replied to.
This is why I got confused when you said "but we should acknowledge..." (ie thinking you were raising a point other than what myself and others had already acknowledged).
I honestly don't understand what the point you're trying to make is then.
If you're saying people should be aware that handling data safely requires more steps than just parametrised SQL, then yes, I touched on that, as have others, and it's not something anyone is unaware of. Hence why there have been so many high-quality posts discussing the different methods of validation, sanitisation and escaping. So it's a rather strange position to assume when you say "we need to acknowledge", given that's what everyone (including me) has been doing. But it never hurts to be categorical about important points like that, so your original post is still relevant.
If you're making some other point then I've already had two stabs at deciphering it and failed both times. So it's really not clear what that point is.
However if your intention was just trolling me then fine, I bit and you won.
> Sanitize your inputs so that it can be safely stored in your data store,
That's not sanitizing input, that's escaping output - if you subdivide concerns appropriately. The data you're saving into database is the output of your program.
Really, the problem is of language translation. User input is an unstructured blob. SQL, or HTML, are structured languages, with their own semantics. Whenever you cross the language barrier, you need to translate data from one language to the other. Parametrized queries are the usual API to SQL drivers, and they do this for you under the hood, producing a valid SQL query string[0]. When going to HTML+JS, you need to invoke some library (or do translation yourself).
(I really don't like the term "escaping". Translating between languages is more than just sticking slashes in front of double quotes.)
This is why "sanitizing inputs" is a nonsense concept. The problem is of language translation, and you can't translate if you don't know the destination language. A blob correctly sanitized for SQL will not be correctly sanitized for HTML, and an input correctly sanitized for both will look bad in either.
SQL injection and XSS are the same bug. Failure to translate between languages. Usually caused by a pretty stupid but somehow very popular idea - building target language expressions by gluing plain strings together[1].
--
[0] - Sometimes. I remember this is how it worked in the past, but not sure if server RDBMS APIs haven't changed since. For comparison, with SQLite, you're passing the query with placeholders to the SQLite functions, and the parameter values are passed as arguments. The SQLite internals turn this into an executable query, but I'm pretty sure this does not involve a query string with actual parameter values in it ever existing in memory.
[1] - A related source of footguns is using templating engines for web pages. HTML is a tree of nodes, not a plain string. Using a template system is a recipe for XSS problems.
Any actually decent RDBMS isn't stupid enough to first escape parameters, then parse the query string to find placeholders, then do a bunch of string concatenation, and then run the concatenated string through a second parser. It is really simpler and more robust to parse the query string once and grab the actual data value from the parameter array whenever a placeholder is found.
However, a lot of client side libraries are cheating and embed the parameters into the query string before passing it to the server instead of implementing the proper parts of the protocol. This is about as safe as doing plain string concatenation in the first place. I don't trust the library authors to actually get this right.
There's value in doing both in certain cases (as long as you're definitely escaping output), but I'd be careful about the motivation. You say: "Sanitize your inputs so that it can be safely stored in your data store" -- but it's safe to store any string in your database as long as you escape/encode it correctly (i.e., use parameterized queries).
For example, what if someone posts a legitimate comment on reddit helping someone with SQL syntax for deleting tables and includes "DROP TABLE users" -- did you sanitize that away?
What does the “basic SQL sanitization on the way in” consist of and what does it do for you?
The article makes sense to me; if I’m just storing strings in a database, I don’t see why they would need to be sanitized at rest, even if they contain malicious SQL code. Only when I actually come to use those strings for some purpose.
But what if at some point somewhere down the line someone forgets to sanitize the output? Surely better procedure to sanitize at both ends. Nobody is perfect.
You're thinking about this as if data can be in one of two states – untrusted or sanitised. This is not the case.
When you output arbitrary data, you need to encode it in a way that is suitable for that context. These contexts might be:
- Generating a web page.
- Including in a JSON response from an API.
- Sending an email.
- Storing in an SQL database.
These all use different formats / protocols that use different syntax to encode data. How you correctly encode data for one of them is different to how you correctly encode data for another of them. There is no method of taking untrusted data and "sanitising" it so that it is correct for all of them. What works for one will break for the rest.
If you want to handle arbitrary data correctly and safely, store it as-is and when the time comes to use it, encode it appropriately for the context you are using it in. Where possible, use tools and systems that get it right by default instead of requiring developers to remember to encode correctly, e.g. generate HTML with templating engines that encode data as HTML by default, and use parameterised queries with SQL.
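A rough illustration of "one value, many encodings", using only Python standard-library calls:

    import html, json, urllib.parse

    value = "O'Brien <script>alert(1)</script>"

    html.escape(value)         # for an HTML text node
    json.dumps(value)          # for a JSON string value
    urllib.parse.quote(value)  # for a URL component
    # For SQL there is no string-level escaping step at all; pass the
    # raw value as a bound parameter:
    #   cursor.execute("INSERT INTO users (name) VALUES (?)", (value,))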
> generate HTML with templating engines that encode data as HTML by default
Don't, unless you're sure the templating engine actually parses the HTML into a tree of nodes before interpolating and re-emitting it. Otherwise it's likely someone will interpolate something in an improper context, e.g. inside <script> or <style> block.
This is completely the wrong attitude. Code should be correct, not fail-safe. All of the safety hatches that people tend to introduce make the code less predictable and eventual problems tend to arise far away from where they originated, making them difficult to diagnose. What is the result of this sanitization? Now we have some undefined, changing internal string format running around in our application, and possibly multiple undefined and changing internal string formats, where it is also unclear at what point a string is supposed to be in what format. If something is an arbitrary string it should be allowed to be an arbitrary string and the things handling that should escape it the appropriate way. The article is correct.
I’m confused because I agree with this comment, but your other comments mentioned:
> "sanitizing input" is plain nonsense
> "Unsafe input" is not a thing.
Have you ever used hand sanitizer? The point of hand sanitizer is to reject infectious diseases outright.... you know, so they don’t get IN your body. You seem to have adopted a narrowly defined sense of sanitization which does not include “mercilessly discard/destroy”.
> The point of hand sanitizer is to reject infectious diseases outright.... you know, so they don’t get IN your body.
There is no such thing as "infectious input data".
> You seem to have adopted a narrowly defined sense of sanitization which does not include “mercilessly discard/destroy”.
None of the dictionaries I just checked support such a definition. They are all about "changing something to be more sane/sanitary/pleasant/acceptable/...".
Also, mind you, a hand sanitizer doesn't destroy your hand, it destroys microbes, in order to make your hand sanitary, so as to enable you to continue using your hand instead of rejecting/discarding it. Which is exactly the kind of thing you should not ever do with input data.
No, it doesn't. That is what "accept anything" programming leads to.
The idea that "accepting all the inputs" somehow gives you an advantage is an illusion: if the semantics of some input are not well-defined, then the only thing you gain by accepting it anyway is hard-to-debug interoperability problems and vulnerabilities. When some input is not well-defined according to the spec, then your interpretation is just a random guess, and the next developer will make a different random guess as to what that input means, and so an interoperability problem and potential vulnerability is born. If you reject the invalid input, you will notice the error and thus fix the source of the invalid input to produce input for which the semantics are actually well-defined.
This sounds good in theory, but I'll give a counterexample.
Requirement: Name input box.
Implementation: We'll sanitize the input by rejecting any characters likely to be dangerous if mishandled, like single quotes, or anything else we don't immediately imagine to be useful. If a character turns out to be needed later, that's no problem. We'll just change the list.
Security audit: Passes
Later customer complaint: I can't sign up! — J. O'Brien
Dev team: Sorry, too bad. We'd have to re-audit everything and possibly modify code to allow your last name, because there might be code somewhere that relies on the original sanitization for security. That was the point of sanitizing on input, after all. If you want to sign up, it would be easiest for us if you would just change your name.
I think you misunderstood my point. I am not saying that you should reject valid (that is: semantically meaningful) input, but that if you are confronted with semantically meaningless input, you should reject it rather than garble it so that it gains some random meaning.
So:
Name input field, value "J. O'Brien": accept
JSON parameter, value "{foo:bar}": reject
The context was the idea that you should gracefully accept bad input. If your code considers "J. O'Brien" bad input for a name, then that's the problem, not that it doesn't accept bad input.
Yes, I completely agree in the above case. The JSON input has a well-defined format and input validation should reject it outright.
The issue is that when developers hear they should "reject bad input" in order to avoid vulnerabilities, they often interpret it as a call to reject any user input that isn't already known to be good. Since user inputs are often free text, like the name field, they wind up forbidding any input they hadn't specifically imagined, which doesn't align with any particular recipient's actual data requirement. It creates false-negative edge cases while only providing illusory help against vulnerabilities.
I mean, I generally agree, but I think it's already problematic to frame it as "user input that isn't already known to be good". Because "J. O'Brien" is known to be good. The problem is that anyone thinks in the first place that some semantically meaningful input value for some reason is not good.
You can't sanitize for output at input time, as the sanitization that needs to be applied is different for HTML, JS and JSON. You don't know that at input time.
Well, use libraries/frameworks that ENFORCE sanitizing and make outputting raw content the exceptional case.
Examples.
PHP: Using mysql_escape_string is a no-no - you will forget to add it one day. With parametrized queries you won't write unsafe SQL.
.NET Core - Outputting to HTML by default only emits those chars which are in a predefined Unicode range. All other chars are converted to HTML entities. If you want to output raw, you must explicitly use @Html.Raw https://docs.microsoft.com/en-us/aspnet/core/mvc/views/razor...
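Jinja2 in Python can be set up the same way; a sketch (note that autoescaping has to be switched on here, a bare Environment() doesn't enable it):

    from jinja2 import Environment

    env = Environment(autoescape=True)   # escaping becomes the default
    tmpl = env.from_string("<p>{{ comment }}</p>")
    print(tmpl.render(comment="<script>alert(1)</script>"))
    # -> <p>&lt;script&gt;alert(1)&lt;/script&gt;</p>

    # Raw output must be requested explicitly, analogous to @Html.Raw:
    #   {{ comment | safe }}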
This doesn't pass the sniff test. If someone can forget to sanitize the output, someone can also forget to sanitize the input. The most important things are to understand where the content is used, use the appropriate output encoding/escaping, have rigorous tests to ensure your expectation correctly escapes nasty strings, and that you keep the output escaping code up to date to protect against novel attacks and new browser/app features.
I worked at a social media company with one of the largest text-based user-content-stores in the world at the time. Some of the features had input-side encoding and some had output-side encoding. I was there ~10 years after the bad practice of input-side encoding started and it very quickly became too cumbersome to know exactly which fields were encoded with what encoding (and I mean both character encoding and htmlentities / specialchars / specific character stripping / etc). We started getting ridiculous bugs like passwords could not contain '&' characters or logins would fail matching what we had in the DB.
It's not about being perfect. That will never happen. It's about storing exactly what the user submitted (if it is accepted by the POST submission logic) and to correctly encode the output for the correct security context (HTML, XML, JSON, html entities, html attribute, script tag, styles/stylesheet, urls, uploaded filename / file contents, filesystem injection, command injection, etc). These all have different rules. You can unintentionally open yourself to a vulnerability in one if you only expect the output to be displayed in HTML.
I'm no fan of sanitizing inputs (transforming unsafe input into safe input and then storing it), because, as you say, someone will find a way to circumvent that.
What I always do is exactly specify what is allowed in any input, by parsing and schema validation. If it is HTML, I run an HTML parser to validate accepted tags & attributes. If it is plain text, I validate that there is no HTML in it, etc.
If the input fails the filter then you deny the request.
This has the advantage that you always know what data structures you are storing in the database and that will make future data migrations much easier.
Drawback is that if your filter is too strict then you deny a valid request; however, it is easier to loosen a filter later than to migrate unwanted/unknown data that you accidentally accepted.
Stored input is also part of your database schema.
And of course, always escape output even if you know the data is "safe".
> Apparently I shouldn't have simplified "parameterization of SQL" as "sanitize your input"
It's not a simplification, it's incorrect. An SQL query is output. You are sending data out of your application code, via the SQL library driver.
It may seem like I'm splitting hairs here, but I've seen this distinction misunderstood in this way often enough in situations that severely compromise security.
Some people think of "output" as purely the end product of application flow, and all I/O that happens in between is somehow lumped together as indistinct "input". There's a reason I/O has two separate letters, and the distinction is crucial for securing your application.
TL;DR:
values passed to an I/O function (like stdout, file.write(), response.write(), db.query(), etc.) are all output; those returned from an I/O function (like stdin, file reads, db query results or requestObject.getQueryParam()) are input. Sanitize the former, NOT the latter.
Validate the latter (ideally; though I'd say this is more about stability than security).
That sounds like a broken datastore API. In a properly designed API, you don't need to escape anything, because the API implementation ensures your data doesn't get read as code.
It depends on the perspective, in case of SQL I would argue that sanitizing the input is the same as escaping the output, because the query you are sending to the database is the output.
Escaping the output, however, is a term that implies you are doing it right, while sanitizing the input could also mean you just replace("DROP", ""), etc. (My last name is Dropmann, I know what I am talking about.)
The difference is where it's done. "Sanitizing the input" implies that it happens when the value is read, so that all uses of the value are stuck with a single result. "Escaping the output", in your example, would happen in the database or its driver, for parameterized queries. HTML output of the same value in the same request would be escaped differently within a function that builds HTML output.
Can you explain to me why reddit feels like it is held together with duct tape? IMHO, it has the most problems with site uptime and basic functionality of any major site on the net. I am always getting search problems, site unavailable, or some other such glitch with it. I can't believe you guys just don't know what you're doing, so what does the present setup offer that is worth this shitty performance?
> always getting search problems, site unavailable, or some other such glitch
This sounds like my spouse saying "you always...". It's obviously not "always"; I think you'll get a better reply when you quantify it. For example, in the last month, how many times did you get a search problem (which was it), a site unavailable or something else (what was it).
It's about once per session for me. Seriously, compared to every other major site I know of, it is in a class of its own for flaky UX. At least one of the things I describe above per afternoon, let's say. Often many more than one if it is having serious problems. As for the issues with you and your spouse, I will just say it sounds like there is an opportunity for improved communication there. Best of luck.
Are you implying that Reddit is concatenating SQL queries instead of parameterizing properly? Because based on the reliability of the website I'm not surprised by this, but it's still hilarious.
Now, you have amended your comment to really say something very different (how is parameterized SQL 'basic' anything when it's actually the correct, complete solution to the problem?).
But in any case, this still suggests a complete misunderstanding of the point of that blog post. As far as that blog post's point is concerned, the SQL database is an output of your program. And the whole point is that you need to escape/encode all outputs correctly, but you should not ever sanitize anything.
Because, as others have said, parameterized SQL is a long-standing solution, so I consider it pretty basic; but I still consider it a form of sanitization, and one that not everyone uses despite how long it has been around.
I think it is totally fair to interpret the article as saying that the database is a program output. And if you interpret it that way, what I said doesn't make sense.
Yeah, I "LOL" at these type of One True Way™ proclamatory headlines.
What I don't understand is the lack of using proper escaping functions when generating SQL. Templating SQL without escaping is the surest way to a SQL injection.
--- Bad
"SELECT * FROM USERS WHERE name = '%s'" % (name)
--- Good
"SELECT * FROM USERS WHERE name = '%s'" % sq_esc(name)
# where sq_esc() doesn't add the outer ', but escapes anything that needs it
And to defensive coding:
0. Sanitize input. Always, always.
1. Assert pre-condition invariants.*
2. Process.
3. Assert post-condition invariants.*
4. Generate correct output by understanding the output domain.
* Unit tests, smoke tests, integration tests and code-coverage alone are insufficient to cover complex code paths. Fuzzing with asserted invariants is a good way to shake the dust out of hairy code.
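A toy sketch of steps 1-3 (the function and its rules are hypothetical, purely to show where the invariants sit):

    def apply_discount(price_cents: int, percent: int) -> int:
        # 1. Pre-condition invariants
        assert price_cents >= 0, "price must be non-negative"
        assert 0 <= percent <= 100, "discount must be a percentage"
        # 2. Process
        result = price_cents * (100 - percent) // 100
        # 3. Post-condition invariants
        assert 0 <= result <= price_cents, "discount may never raise the price"
        return result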
So, essentially he is saying:
1) go ahead and accept the risky thing from your users, and store it right there in your database, but
2) make sure that you remember, in every single place in your code where you read that out of the database, to treat it properly, and
3) make sure that every other programmer, now or at any time in the future, remembers to do this also, in any code they write which reads user input out of the database and puts it on the screen.
What a bad idea. Don't leave landmines there for other maintainers of the code to step on. Especially because the other maintainer may actually be you, six months or a year from now.
That simply does not work. You can't sanitise, escape and reproduce correctly all at the same time.
Say you run a blog. I post a comment saying "But in this case, B<A!"
This is clearly dangerous input! But it is also exactly what I wanted to say. How do you sanitise this? Change < to &lt; in the database? Now you have to remember to NOT escape that again when outputting! And you have to make sure that, say, your text resources in your UI are all also escaped the exact same way, or you have to remember to escape them DIFFERENTLY than user-provided input.
Or maybe you "sanitise" by stripping out dangerous characters like "<". Now you have broken my comment.
The only strategy that is at all maintainable is to store the comment as received, and to escape on output. Anything else is massively fragile or broken.
> You can't sanitise, escape and reproduce correctly all at the same time.
That's why you do them at different times...
Let's go to the example:
> This is clearly dangerous input!
That's not clear at all. There is a set of values allowed for a comment, and this one is probably within them, while, for example, an empty value usually isn't, nor is an invalid UTF-8 sequence. This one should pass sanitization as is.
> Change < to &lt; in the database?
You escape it when converting into HTML. It's not the same as sanitization.
If your rules say comments will be truncated to 1000 chars, you do it on the input. If your rules say all prices are in dollars, but your frontend accepts other currencies, you convert on the input, and overstaff your customer support.
Honestly, those names mean a lot of different stuff to different people. It's not good that there are so many; it's more a consequence of how widespread bad practices are.
For people who think of the terms like I do (sanitize means modify, validate means accept/reject): if your rule is "comments will be 1000 chars or less", then the validate reasoning says reject the 1001-char comment, while the sanitize reasoning says truncate it to 1000 chars.
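In code the difference is a single line each way (hypothetical helpers):

    MAX_LEN = 1000

    def validate_comment(text: str) -> str:
        if len(text) > MAX_LEN:
            raise ValueError("comment too long")  # reject: the user keeps their data
        return text

    def sanitize_comment(text: str) -> str:
        return text[:MAX_LEN]  # modify: data is silently thrown away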
Maybe I'm overengineering, but couldn't you store the sanitized version as the normal value, and also store and make publicly available the original unsanitized value in an ominously and obviously named key (say, dangerouslyUnsanitizedValue) that happens to be easily greppable/lintable?
Plain text can contain anything and it shall be treated as such, it is that simple.
As for security, don't assume everything in your database came from a trusted source. Maybe there are remains from an old version of your code that didn't sanitize, maybe you improperly used admin tools that bypassed checks.
How would you determine which value to display? It seems to suffer from the same issue where if you display the sanitized value then the comment is still missing necessary characters, but if you use the unsanitized value then your application will be vulnerable to XSS.
In most cases, that would be overengineering, but it is an entirely plausible solution if you happen to have a case where you need to allow the user to enter things like angle brackets, and for some reason you cannot escape them.
Of course not. The fact that “<“ is risky isn’t part of the string, it’s part of the output format (HTML).
If you were to write that string to JSON or CSV, you would have to special-case double quotes. In POSIX shell, asterisks and question marks need special attention, etc.
You should sanitize the input when possible, so that numbers are really numbers, strings are really strings, slugs and similar are cleaned... But of course you can't clean text so that it will be safe when displayed. After all, `<` is only problematic if you are displaying the text as HTML, which, while common, is not a given.
When displaying anything, you should however use a _framework_ that doesn't allow you to display anything that would not be safe (unless you use some function with "UNSAFE" or "DANGEROUS" in its name). For example React does that, and others too.
There are many different kinds of attacks and the less leeway an attacker has, the safer you are. So sanitize both, input and output.
The solution: don't try to reproduce input exactly. That's a weird thing to want in general anyhow - what exactly are you reproducing so exactly? Text? Including markup? How about some animation thrown in? Maybe interactivity? Hey, let's just reproduce arbitrary executable code accurately?
The whole point of sane sanitization is that you don't need to reproduce all that stuff exactly. Pick a small domain, and reproduce that. Often, it's OK to reproduce approximately; e.g. not worrying about things like retaining multiple consecutive whitespaces, or perhaps leading/trailing whitespace, or whatever.
The point of sanitization is to make it easier not to make a dangerous mistake accidentally. If you have an input that needs to support layout, that's a pain. But if you can live with just text - so much the better. If you do need to support markup; then I don't see the wisdom in sanitizing it late; that's just asking for bugs to lead to security issues.
Frankly the whole tradeoff is nonsensical. These aren't mutually exclusive alternatives, and don't even really address the same issues. Yes, you should sanitize (and validate) your input. And you also need to escape output as appropriate.
If the point is that it's not wise to skip escaping because you "know" the input is safe due to sanitization - then sure, while theoretically sometimes sound, that's practically a nasty bug waiting to happen. Don't do that, sure.
No; that's just you picking an absurd example rather than being practical.
Pick a reasonable domain for each input field, considering what kind of input is useful, and what kind of usage in output (i.e. plain text output is likely much less risky than rich text). There's rarely a reason to ban < in plain text; but retaining stuff like zero-width joiners or rtl-ltr transitions is likely less valuable, and potentially an issue in things like usernames or email addresses (because they make it trivial to make apparently identical usernames). Similarly, if you're storing a telephone number and want to retain spaces - are you going to retain nul chars too?
Not all input should allow arbitrary plain text. I'd guess most don't, and lots of input is at least rich text nowadays (not to mention images and other media - you think it's a good idea to just reproduce an arbitrary image exactly?).
No, there is nothing to remember. Every decent HTML templating engine these days will handle all free form text as unsafe and escape it automatically. The same for SQL libraries.
Input sanitization doesn't work, because it doesn't know what is dangerous and what is not dangerous. That depends completely on the output domain, and at the point where the inputs are received, the output domain is often unknown. Data can flow through many layers of business logic and then be passed to an SQL query, an HTML templating engine or anything else.
If you don't consider database strings to be free form text when constructing HTML, then there's a good chance there will be vulnerabilities anyway, regardless of whether any sanitization has been applied.
This has absolutely nothing to do with shitty validation rules. It has to do with what you're outputting it as.
If you're outputting it to HTML, commas are fine. If you're outputting it to CSV, commas are bad. And your validation rules suck if you don't allow commas in any text field because it might be output to CSV someday.
Unicode RTL override can be dangerous in filenames (in the sense of confusing humans, not computers). It is necessary to preserve in content management systems dealing with bidi content.
Of course you can try to have different validations on your model, but then you need to make sure to know all output domains on the model level instead of doing it when handing over to the view.
Sadly, there is no "the sanitization". JSON, SQL, HTML, CSS, and URI (and the future formats not invented yet) all require different escape schemes, so while you can indeed filter out anything that can be interpreted as an escape sequence in any of those formats, that's not something you can always do.
Instead, render your data properly (yes, that includes escaping whatever you're outputting).
Like they're BEGGING you to never use it but understand sometimes it may be necessary or at least is needed to avoid worse hacks to achieve the same function.
I've taken to using prefixes like DANGEROUS or UNSAFE in various parts of our codebase to better indicate to the user where extra caution is needed.
Or treat your database as a user and don't blindly accept what it gives you without validating it.
At a previous job I implanted a user record that would have given me admin credentials on the next db migration (which happened pretty regularly) because the developer of the migration tool said "Why should I not trust this data, it came from the database". It was "sanitized" by the app for its intended use, but not for the ETL tool.
If you sanitize your inputs you automatically create the assumption that the database is "safe", but you also have to sanitize it for every potential future usecase that the data might be used for, which is not clear when you are writing the sanitization code. Can you foresee every type of use the data will have in the future? can you know every ETL step that will be written 5, 10 years down the line? If not, it's safer to treat the data as untrusted, and if you are going to do that anyway it's a whole lot easier to just not sanitize since you will otherwise deal with double-sanitization or double-escaping.
"The risky thing" here is a text string. So yes, you should be able to accept an arbitrary string and store it in your database without shooting yourself in the foot, and the same when you read it out of your database and put it in HTML, or JSON, or CSV, or XML, or ASN.1, or protobufs, or in another database.
Handling escaping in HTML or REST API is web dev 101, this shouldn't be controversial.
Whenever you say, think, or imply something starting with "So from now on we just have to remember...", you're really saying "Let's decide this part of the system will always keep breaking!".
> So, essentially he is saying: 1) go ahead and accept the risky thing from your users, and store it right there in your database
Yes. It's hard to know what "risky" is when you're taking input. You don't have the context of its use. What if someone is discussing html in a comment and wants to refer to an example? Or the same for SQL. You're going to be using a heuristic to try and "guess" this, and you're forever going to be trading off annoying your users with security, which is never a good place to be in.
> 2) make sure that you remember, in every single place in your code where you read that out of the database, to treat it properly (and 3 more of the same)
No. You simply do not use unsafe methods of mixing this data with your output. Use any remotely modern markup templating system which has a way of tracking the escaped status of text and auto-escaping it when necessary. Use an ORM or at least a database connector which has inbuilt parameter escaping support. Do not just do regular string formatting with this kind of stuff, and don't hire the sort of people who do.
> Don't leave landmines there for other maintainers of the code to step on
You don't know what is a landmine until you know the context of its use.
It is much easier to secure a perimeter at one point and know that everything on one side of it is safe and everything on the other side of it is not. The only practical place to put this perimeter is the point of use/output/whatever you want to call it. Having several different places where this data gets mangled, and lacking clarity about exactly which pieces of data are safe and which are not at any one point, is a recipe for disaster.
"But we'll also escape everything on output" - this leads to double-escaping and weird bespoke hacks to work around the resulting artefacts, which themselves will likely open up holes.
And this is not even touching on the idea of a data field's safeness "living with" the data. "Field xyz is safe, we sanitize it on input" - cool, but we only started sanitizing the input in September 2019, so any fields from before that are unsafe.
Not only can you easily see every place in the code that is using this construct, but the framework provides a warning.
I definitely don't think it's bad to sanitize HTML content. But in general MOST text in a web app should be just text, rendered with whatever HTML it contains escaped. In very few places should the web application give users rich-text (aka HTML) access. In any place that the application does that, sanitization should be used.
No, nobody has to remember anything. You use prepared statements for your database, you use autoescaping templates for the output, and it just happens automatically. I wrote about this four years ago: https://palant.de/2016/03/02/why-you-should-go-with-secure-b...
On the other hand, when you "sanitize" input you get immensely complex code. Whenever you get some data in, you have to remember to sanitize it. You have to know in advance how that data is going to be used and what might be problematic there. Worse yet, as your usage of that data changes (or your understanding of the problems), the data sanitization has to change as well - both for existing data in the database and everywhere where this data comes in.
I've seen codebases doing this, where it's impossible to tell whether a particular piece of code is a vulnerability without looking up tons of context. That's the minefield for other maintainers. Don't do this.
It's interesting how so much of programming is knowing about the existence of libraries and not trying to rebuild things that already exist. I'm sure there are thousands of people out there who sanitize inputs and outputs naively and don't know about great libs like DomPurify.
I know I'm guilty of trying to build something from first principles only to google it after banging my head against edge cases and finding a ready-made library or util that with a tiny bit of finesse or modification does the job.
TBF, most of these libraries are not easily discovered. You have to luck into the chance that the library author used similar words to your search query. Barring major stuff (authentication, databases etc), you will rarely know a library exists. Search engines are still limited by the language of humans. Unless and until you note down every possible library you come across for future reference, this problem is here to stay.
Maybe a language having a vast standard library won’t suffer from this problem but it will definitely have other problems.
yeah, creating a personal search db is time consuming and kinda impossible... a while ago everyone was coming up with bookmark managers that kinda sorta could function as a personal search db, but it still required a lot of customization. Also they were all cloud based, and didn't really function offline.
I don't think there is a golden rule for this stuff besides having a very strict CSP that only allows same-site resources and no eval/inline with reporting to see when someone has tried to do something bad.
One of the most useful skills I've developed over the years is a very strong instinct for "someone else will have solved this already in an open source library", combined with intuition on exactly how they would have described it so I can find it with search.
I wrote a very similar piece six or seven years ago. People responded here that I was just making semantic arguments or otherwise fell over themselves to ignore the point I was trying to make.
It is good to see from the responses here that we've learnt absolutely nothing.
Agreed, it seems what this article (along with most of the comments) is dancing around is the idea of a stronger type system. String != string, e.g. UserText vs SqlStatement. Using explicit conversion methods between those types helps clarify the actual “boundaries” of your system, or rather the independent parts and their individual boundaries. Joel Spolsky’s article illustrates the problem well.
The problem stems from our simplistic type systems, which reflect a value’s storage class (“array of bytes”) rather than its semantic type (“raw input from user”, “sql safe TEXT literal”). Once your type system can differentiate between those two, it can help you identify where conversion between the two (aka “escaping” or “encoding”) is necessary. Then the problem of “dangerous string” disappears, because there is no more string: if it’s a UserText, it can’t be concatenated with a SqlStatement without conversion. Just like an “int” for example, or an array.
Anyway I’m just rehashing Spolsky’s article, poorly. Don’t let my inaccurate summary reflect negatively on his point :p
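For what it's worth, a rough Python sketch of the idea (all names invented; the point is that the only route from one type to the other is an explicit conversion a type checker can enforce):

    import html

    class UserText:
        """Raw, untrusted text straight from the user."""
        def __init__(self, value: str):
            self.value = value

    class Html:
        """A fragment already correctly encoded for an HTML text context."""
        def __init__(self, markup: str):
            self.markup = markup

    def to_html(text: UserText) -> Html:
        # The sole doorway between the two types; mixing UserText into
        # Html any other way fails type checking.
        return Html(html.escape(text.value))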
Arrrghh, it's 2020 and developers are still too dumb to understand the difference between a string (list of characters) and a tree (sql, html, ...).
No, the solution is not to "sanitize" input, output or both! The solution is to not use the same type for text and trees!
We should rid the world of brain-damaged templating solutions like jinja, go template and similar garbage that pretends your SQL or HTML is some flat string that just needs a bit of extra contextual "escaping" magic [1].
If you interpolate into a proper AST there is no problem and you need no "escaping". For efficiency reasons you may not want to literally interpolate into an actual AST and then serialize all of it to a string again just to send it down a socket in the next line, but that's just an optimization.
Bourne shell fucked this up as well (for an easier case of interpolation, since there is basically no nesting) and it remains a constant source of severe bugs and security holes in shell scripting. By contrast, lisp has been doing this right for literally decades.
[1] Yes, I get it's "convenient" to have a "universal" solution that works for any type of file. But html and sql in particular are easily important enough to have a correct solution and it's not hard, it's not inherently slower and it's way more convenient in any real sense than puzzling about several rube-goldberg sanitization schemes, running some stupid security linter and paying $$$ to pen-testers for hunting down this crap.
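To make the AST point concrete with nothing but Python's standard library (ElementTree standing in here for a real HTML5 tree builder):

    import xml.etree.ElementTree as ET

    node = ET.Element("div", attrib={"class": "comment"})
    node.text = "But in this case, B<A! <script>alert(1)</script>"
    print(ET.tostring(node, encoding="unicode"))
    # -> <div class="comment">But in this case, B&lt;A! &lt;script&gt;alert(1)&lt;/script&gt;</div>

The text was interpolated into a tree, and correct escaping fell out of serialization for free; at no point did the markup exist as an unescaped flat string.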
I think this stuff is hilarious too, especially in the era of webpages that are really Javascript apps fed by a JSON API. It is not at all hard to have your JS app directly operate on DOM nodes without using markup at all. Create the DIVs and whatever else you want, create text nodes and shove them in there, done. Zero possibility of injection attacks because you told the browser "this is plain text, don't parse it" instead of making it guess.
I hand-rolled a library to do this back in like 2011. It took less than a day. The only downside was that you had to write the markup in JS code instead of templatized HTML, but it wasn't even that hard with a bit of syntactic sugar. It's fast too - creating DOM nodes directly is much easier than parsing HTML to create DOM nodes.
Parser support is non-existent in the standard libraries of most if not all languages. Every language I know can parse regular languages at best. Parsing HTML and SQL and manipulating the resulting tree is not the first solution developers think of.
We should be able to look up some RFC, give the EBNF grammar to a library and get a parser out of it. In order to do that today, we need to use ancient parser generator tools. Why? A parse(grammar, input) -> tree function would be easier to use. The Earley algorithm can receive a grammar as input.
Well, I'm not some prefix-fanatic, but much of the problem would not exist in the first place if we had just used some sexp style syntax for HTML, it would be more pleasant to edit, and faster and much easier to parse for both humans and machines to boot. Another billion dollar mistake.
So I feel a bit ambivalent about attempts to lower the costs of pushing out more over complex grammars into the world. When have you last used in earnest a non-sexp/internal DSL for something like build systems that didn't engender in you an occasional urge to visit physical violence on its creators? But what I'd unambiguously like to see is easier parsing of "sane" languages and the death of perl-style regexps.
Still, my guess would be: 95% of the trouble comes from two languages: html (including js, css, svg etc., unfortunately) and sql. And most of the remaining 5% from bash :) So just dealing with those three would make a big dent.
Also things are not quite so bleak as you make them out to be: jsx is much saner and any established mainstream language has a conforming html5 parser these days (sadly, to do it properly you also want something that deals with the various other languages that get munged into html: css, javascript, gimped xml and here the situation is less good). SQL is thornier (and has many wildly different dialects), but unless you need dynamic queries, parameterized queries are available everywhere.
In fact a 1% effort/80% of the benefits approach is to not bother with parsing at all and to just use different types for e.g. HTML and text: interpolate HTML into HTML via plain string interpolation, and text (i.e. plain strings) into HTML as escaped strings.
You can validate that all input is valid UTF-8. But the moment you start to 'sanitize' non-UTF-8 input into UTF-8 you are in trouble. It's best to notify the user that the input validation failed and that you don't accept the input.
As for XSS I think browsers could have done more to fix this. For example add some tag like <unsafe> or <sandbox> for part of HTML that cannot have access to cookies and javascript on the page and disables any active components, like iframes and objects. Developers can use them to renderer rich content provided by users. Right now you can do this with iframes and CORS only, but that's too heavy to implement. These tags could have their own CORS limits for example.
Why do I think it's a browser problem? The security of the output needs to be re-reviewed every time browser features are added, and only at the browser level can it stay up to date with all new features.
It does, but only on the whole document. The problem is that by default it allows everything, and people are too lazy to find out how it works and set it up right. Also you need to open every third party separately, which could work badly for ads. If there were a sandbox, you could allow only what you need for a particular part. I understand that this concept looks complex and more like an iframe. Basically, a lot of ad content right now is rendered in iframes without src, which are kind of sandboxes in this case.
You don't "sanitize" input, you "sane-itize" input. The whole point of checking the input is to make sure it's valid, not to try to scrape away cruft you don't think is valid.
Example: phone numbers. Is the input you accepted a phone number? Well, if you were just "sanitizing input", you might pass it through some generic "sanitizing function" that just checks if it had "malicious characters" or something and strip those out. But what you should actually be checking is is this a phone number? By making sure the input is what it's supposed to be, you not only gain a better security posture, but you improve your program's reliability by making it operate in the way you expect it to.
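A minimal sketch of that mindset (the format rule here is invented and deliberately loose):

    import re

    # Hypothetical rule: a North American number with optional punctuation.
    PHONE_RE = re.compile(r"^\+?1?[\s.-]?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$")

    def parse_phone(raw: str) -> str:
        if not PHONE_RE.match(raw):
            raise ValueError("not a phone number")  # reject, don't strip
        return "".join(ch for ch in raw if ch.isdigit())  # store canonical form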
Some input fields like "give me a random block of text and I'll store it" are very hard to validate, so for those fields you can encode them as Base64 for storage, and at output time decide how to format them safely.
Also consider wrappers for any functions which receive input from a user. Perl's Taint Mode (https://en.wikipedia.org/wiki/Taint_checking) is a global way to enforce this; for languages that don't have a Taint Mode, you'll have to implement it yourself.
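A rough DIY version of that idea (all names hypothetical) is a wrapper type that refuses to behave like a string until the value has passed a check:

    class Tainted:
        """User input that has not been validated yet."""
        def __init__(self, value: str):
            self._value = value

        def validate(self, check) -> str:
            # The only way to extract the raw string is through a check.
            if not check(self._value):
                raise ValueError("input failed validation")
            return self._value

        def __str__(self):
            raise TypeError("refusing to use tainted input directly")

    ok = Tainted("Alice").validate(str.isalpha)   # returns "Alice"
    bad = Tainted("Robert'); DROP TABLE users;--")
    bad.validate(str.isalpha)                     # raises ValueError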
Some people really go overboard with this sort of validation, however. I've had to argue with vendors who didn't believe email addresses can end in something other than ".com" or ".org".
1. Sanitize input only when you actually need/want to, and at least to a degree that makes it safe to put in a database (e.g. through prepared statements)
2. Always validate input
3. Always escape output (unless you have a reason not to).
No, it's terrible advice, as it only causes unnecessary interoperability problems and vulnerabilities. There is no reason why anyone should need to send invalid input to your program, and it is never a better idea to make every consumer more complex to deal with broken input than to make the one producer create non-broken output.
The only robustness to invalid input you should have is that you should not fall over when you encounter broken input, but simply reject it.
Both TCP, whose spec Postel wrote, and HTML follow this principle, so it seems to have its merits.
You know what followed your principle? XHTML ... and the arguments were the same: it's not well formed, just reject it; why would you ever accept broken input?
Sure, that makes parsing faster and simpler, and yet what actually works and is robust in the real world is HTML ...
That HTML is a platform with an extraordinary security track record? No one has ever exploited all the ambiguities that result from the incoherent mess that is the web?
Or is it that we never had any interoperability problems with HTML? All browsers always reliably rendered websites consistently? "This website is optimized for IE" never happened?
How isn't that just the best example to support my point?
As for TCP ... how is it relevant that Postel wrote the spec? Does that mean that the vulnerabilities in TCP never happened? Or are you saying that modern TCP implementations try to accept any crap whatsoever? (No, they don't, of course they don't, people have actually learned that that's a bad idea.)
People seem to prefer web sites that render inconsistently rather than not at all because of one little issue in the markup. It is more robust to render something rather than nothing and is one big reason XHTML was abandoned.
Yes a system that no one uses is more secure than one everybody does.
Postel's Law is literally in the TCP RFC [1], don't you think that makes it relevant?
> People seem to prefer web sites that render inconsistently rather than not at all because of one little issue in the markup.
Except those are not the alternatives. The alternatives are consistently rendered websites or inconsistently rendered websites. If browsers had strictly enforced HTML syntax from the beginning, no one would ever have built websites with "little issues in the markup".
IP stacks do not accept randomly misformatted IP packets. The result is obviously not that you constantly encounter internet services that you cannot access because your IP stack is picky about broken IP packets; the result is that no one ever sends you broken IP packets.
> It is more robust to render something rather than nothing
No, it just isn't. You are just looking at a very small part of the consequences of this implementation strategy that indeed happens to be positive, but completely ignoring the big picture of all the externalities and other indirect damage that result from it.
> and is one big reason XHTML was abandoned.
Erm ... no? The reason why XHTML was abandoned was because people are incompetent at writing software, and there existed an alternative that allowed them to keep their idiotic practices, including all the vulnerabilities and interoperability problems that result from those, so that's what people did.
> Yes a system that no one uses is more secure than one everybody does.
How does that follow? And what does that have to do with anything?
> Postel's Law is literally in the TCP RFC [1], don't you think that makes it relevant?
Relevant ... for what?
>Except those are not the alternatives. The alternatives are consistently rendered websites or inconsistently rendered websites. If browsers had strictly enforced HTML syntax from the beginning, no one would ever have built websites with "little issues in the markup".
That's not reality. If everyone got perfectly formed input we wouldn't be having this debate; the reality is that it occurs, so what do you do: reject it, or accept it and try to do something with it? XHTML simply rejects malformed markup and you get a blank page; HTML tries to make sense of it and render something.
>IP stacks do not accept randomly misformatted IP packets. The result is obviously not that you constantly encounter internet services that you cannot access because your IP stack is picky about broken IP packets; the result is that no one ever sends you broken IP packets.
So you never heard of ECN? The ECN bits being set was technically incorrect depending on how pedantic you were in the interpretation, and some stacks rejected packets if the bits weren't set to zero. Due to the robustness principle, most stacks ignored these bits, allowing others to use them for ECN and allowing a graceful update to the spec. The stacks that took your stance and rejected them were simply roadblocks to adoption.
>No, it just isn't. You are just looking at a very small part of the consequences of this implementation strategy that indeed happens to be positive, but completely ignoring the big picture of all the externalities and other indirect damage that result from it.
I'm not ignoring anything, I am just pointing out reality: the real world is messy, and the stacks that try to keep working under messy conditions seem to be prevailing. It's not pretty and I don't deny the issues that arise, but here we are, communicating on the largest, most successful computer network ever built, using a protocol and a markup language built with Postel's law in mind.
>Erm ... no? The reason why XHTML was abandoned was because people are incompetent at writing software, and there existed an alternative that allowed them to keep their idiotic practices, including all the vulnerabilities and interoperability problems that result from those, so that's what people did.
I think most who know the history would disagree with this opinion [1]. It was obvious to me at the time why XHTML would fail, even though I thought it the cleaner solution; I realized that's what was holding it back. It was much better to see your page come up with maybe a weird rendering artifact than to have the browser render nothing and throw an error because some small part was malformed.
>How does that follow? And what does that have to do with anything?
Because complaining about security vulnerabilities found in some of the most used software in the world, while comparing it to something that no one uses, doesn't help your point.
>Relevant ... for what?
Uh, gee, I don't know, maybe Postel's law is kinda relevant when discussing TCP because Postel wrote the spec, you know, like what you asked in the post before? What kind of game are you playing here?
> That's not reality. If everyone got perfectly formed input we wouldn't be having this debate;
Erm ... you do understand that, you know, there is feedback involved in this? That I am obviously not saying that no one would ever have typed broken HTML into a file if browsers had rejected broken HTML from the start?
I mean, it's even the norm for implementations of other computer languages to be rather strict about syntax, and it doesn't hinder their popularity with the same audience. The exact same people who produce garbage HTML do so using Perl or PHP or Ruby or ... whatever. And whatever you otherwise think about those languages, none of them will just make shit up when there is a syntax error in your program, they will simply reject it. And no, that does not mean that I am claiming that no one has ever made a syntactical mistake when writing code in those languages. But, you know, people are actually capable of fixing those mistakes when they are pointed out to them.
> So you never heard of ECN? The ECN bits being set was technically incorrect depending on how pedantic you were in the interpretation, and some stacks rejected packets if the bits weren't set to zero. Due to the robustness principle, most stacks ignored these bits, allowing others to use them for ECN and allowing a graceful update to the spec. The stacks that took your stance and rejected them were simply roadblocks to adoption.
Erm ... what? That's almost fractally wrong!?
None of the ECN problem was one of pedantry, it was simply one of a broken specification, namely the TCP specification. "Reserved for future use. Must be zero." is simply a bad specification. If you specify an extension mechanism, you have to always specify how the extension mechanism is supposed to work. What you call the pedantic interpretation is a perfectly valid interpretation of what the text says. You are just looking at it in hindsight, with the idea that it's supposed to support the operation of ECN, and then it's obviously a problem--but people who implemented TCP stuff before there was ECN could not possibly know that that is how people would expect to use this if the TCP specification doesn't specify that. There is nothing wrong with extension mechanisms that work by having the recipient discard messages with flags it doesn't know. That's just not what ECN chose to do, but that is kinda ECN's fault. You might just as well have ended up with a situation where someone would have tried to build an extension that assumes that recipients discard segments with unknown flags, and everyone would have been pointing fingers at those who chose to ignore the flags instead, and how they were pedantic to ignore the flags just because the specification does not explicitly say that such segments are invalid. It's just an accident of history that most implementations chose to ignore unknown flags, and therefore people now point to the exception, without any basis other than them being the majority.
Also, obviously, the "robustness principle" did not allow for a graceful update to the spec. The fact that a graceful update was not possible is the whole reason why you mentioned ECN at all. And that is not necessarily a result of failing to follow the robustness principle, as the robustness principle really doesn't tell you anything useful. All you can do with it is to point at things in hindsight and say "if everyone had built this the same way, then things would be compatible now!" But the robustness principle is useless for actually achieving that. For any format specification, there is an almost infinite number of ways you can deviate from the specification where humans could look at any individual one of those deviations and come to an agreement as to how that deviating message could reasonably be interpreted. And any one of those deviations could in principle be implemented as part of the corresponding parser. But implementing a parser that "correctly" interprets all of those possible deviations is at the very least a major undertaking, and usually even impossible due to contradictions between various deviations when they appear in combination.
And that is why hindsight is misleading: In hindsight, you only see one particular (small set of) deviation(s) causing interoperability problems, and it would almost always have been possible to make every parser coherently interpret those deviations just fine; if everyone had done that, then you would not have any interoperability problems. But that isn't the perspective of someone who initially builds the implementation. They can only strictly follow the spec (which works perfectly if everyone does so and the spec isn't broken), or increase the complexity of and effort required for their implementation by an order of magnitude or more to accept close to anything that could happen (which no one does, for obvious reasons), or implement a random selection of deviations they like (which then leads to interoperability problems and the view in hindsight that everyone else could easily have done the same, which, of course, they couldn't, because they couldn't know what others were doing). Of course, there is a simple solution to that last approach: If you want to implement deviations from the agreed-upon spec but don't want to risk creating interoperability problems, you could get together with all the other implementers and talk about which deviations everyone is going to implement. But obviously, that's just the first approach in disguise: after you have agreed on the deviations, they aren't deviations anymore; you have simply created a new spec, and everyone then strictly follows that new spec.
Essentially, what is happening here is that you see one interpretation of something that the spec doesn't actually specify as obvious. And then you claim that the solution to interoperability problems is that everyone does the obvious thing. But you fail to recognize that the whole problem we are trying to solve with specifications in the first place is that what seems obvious is different for different people. Which is why this (a) cannot work and (b) obviously in practice does not work. You cannot solve the problem of people having different approaches by simply saying "they should just all have the same approach" while at the same time saying that the methods we use to create agreement (i.e., specifications) should not be taken too seriously.
> I'm not ignoring anything, I am just pointing out reality: the real world is messy, and the stacks that try to keep working under messy conditions seem to be prevailing. It's not pretty and I don't deny the issues that arise, but here we are, communicating on the largest, most successful computer network ever built, using a protocol and a markup language built with Postel's law in mind.
Then your points are just irrelevant? I never said that broken systems cannot be successful, did I? Yes, there clearly are evolutionary advantages to externalizing costs, and taking risks can pay off. But there are also other parties who have to pay those externalized costs, and taking risks can also end in catastrophe. Externalizing costs is still an asshole move (and is generally frowned upon by society when people understand that that is what is happening), and whether the risks taken by the web, for example, have actually paid off is far from obvious.
Also, possibly all of this was built with Postel's law in mind. But what I would be interested in is whether that was to our benefit. Just because something was a factor in creating a certain overall positive situation does not mean that therefore that factor made that situation better than if it hadn't been there. In particular, evolutionary success does not mean that a different approach would not have produced a better result.
> I think most who know the history would disagree with this opinion [1]. It was obvious to me at the time why XHTML would fail, even though I thought it the cleaner solution; I realized that's what was holding it back. It was much better to see your page come up with maybe a weird rendering artifact than to have the browser render nothing and throw an error because some small part was malformed.
How does that contradict what I said? Yes, it was obvious that XHTML would fail due to the massive incompetence of developers ... your point being?!
> Uh, gee, I don't know, maybe Postel's law is kinda relevant when discussing TCP because Postel wrote the spec, you know, like what you asked in the post before? What kind of game are you playing here?
I am not sure what kind of game you are playing, but I had the impression like you were trying to make a point and not just state the historical fact that that's where Postel formulated the "robustness principle". Yeah, I agree, that's what he did. And it was a bad idea.
>And whatever you otherwise think about those languages, none of them will just make shit up when there is a syntax error in your program, they will simply reject it. And no, that does not mean that I am claiming that no one has ever made a syntactical mistake when writing code in those languages. But, you know, people are actually capable of fixing those mistakes when they are pointed out to them.
Except it's pretty common now for programming languages to add quality-of-life changes that loosen some of the strict parsing rules, such as trailing commas or optional semicolons. Same with whitespace: many languages don't pay much attention to it, and then you have a formatter that is strict about it (gofmt). This is Postel's law in action: liberal acceptance, strict output. The alternative is strict adherence to whitespace and no need for a formatter; just have the compiler reject it and put the burden on the programmer.
>Also, obviously, the "robustness principle" did not allow for a graceful update to the spec.
Again, your opinion is not shared historically; ECN is held up as an example of the robustness principle having been followed in most stacks, with the few problem ones that did not follow it causing some issues [1].
>Then your points are just irrelevant?
Then your points are just irrelevant? We can play this game forever. Just the fact that you are using HTML, not XHTML, with TCP underneath, to write these posts should make some relevant point that you can't seem to see.
>Yes, it was obvious that XHTML would fail due to the massive incompetence of developers ... your point being?!
Or, more likely, all these developers weren't incompetent, myself included; when given the choice, the strictness of XHTML lost to the liberalness of HTML, proving Postel's law again. Messy and robust won over clean and fragile again. That's the point, get it?
>I am not sure what kind of game you are playing, but I had the impression like you were trying to make a point and not just state the historical fact that that's where Postel formulated the "robustness principle". Yeah, I agree, that's what he did. And it was a bad idea.
>As for TCP ... how is it relevant that Postel wrote the spec? Does that mean that the vulnerabilities in TCP never happened? Or are you saying that modern TCP implementations try to accept any crap whatsoever? (No, they don't, of course they don't, people have actually learned that that's a bad idea.)
Going back to your original question, since you're having a hard time connecting the dots: Postel wrote the spec for TCP and put his law in it as guidance. ECN was developed taking advantage of that principle, and most stacks accepted the malformed packets because of it. There are other examples of this [2]; TCP is complicated, and if stacks didn't follow Postel's law they would never get anything done on the internet.
> Except it's pretty common now for programming languages to add quality-of-life changes that loosen some of the strict parsing rules, such as trailing commas or optional semicolons. Same with whitespace: many languages don't pay much attention to it, and then you have a formatter that is strict about it (gofmt). This is Postel's law in action: liberal acceptance, strict output.
Erm ... no, it's obviously not? Or at least not in a way that is relevant to this discussion. I am obviously not objecting to specifying languages that give you a lot of freedom in how you format things, so what is the point of bringing up that you could interpret the robustness principle to mean just that? I am obviously objecting to accepting input that does not conform to the respective relevant specification, and the fact that making languages more flexible in their formatting is often useful has no relevance to that whatsoever.
You interpret some term to mean a broad range of things, I point out that one of those things is a bad idea, and your defense is that one of the other things is good ... how is that even an argument? How does that change that what I pointed out is a bad idea?
> The alternative is strict adherence to whitespace and no need for a formatter; just have the compiler reject it and put the burden on the programmer.
No, the alternative is strict adherence to the language specification. Or, really, it's not an alternative at all, because there is zero contradiction between specifying a language with flexible whitespace grammar (or separator grammar or whatever) and then strictly enforcing that grammar (and thus obviously avoiding interoperability problems).
> Again, your opinion is not shared historically; ECN is held up as an example of the robustness principle having been followed in most stacks, with the few problem ones that did not follow it causing some issues [1].
In other words: Your position is unfalsifiable? If there are no interoperability problems due to everyone interpreting messages identically, then that is obviously due to the robustness principle, and if there are interoperability problems because implementations deviate in how they interpret messages, then that is also obviously a success of the robustness principle? Is there any scenario where the robustness principle would not count as successful?
> Then your points are just irrelevant? We can play this game forever. Just the fact that you are using HTML, not XHTML, with TCP underneath, to write these posts should make some relevant point that you can't seem to see.
How is the fact that I am using something in any way relevant to the question of whether an alternative would have avoided interoperability problems and vulnerabilities?
> Or, more likely, all these developers weren't incompetent, myself included; when given the choice, the strictness of XHTML lost to the liberalness of HTML, proving Postel's law again. Messy and robust won over clean and fragile again. That's the point, get it?
How is it relevant that HTML won? How do you connect from "technology X won over technology Y" to "therefore, technology Y would not have had fewer interoperability problems and vulnerabilities than technology X"?
Why do you answer every question as to technical properties of a technology with "it lost" or "it won" while completely failing to say anything at all about the technical property being discussed?
NOONE DENIES THAT HTML WON OVER XHTML.
Also, it seems you almost completely ignored the central explanation of my previous post, simply to repeat your previous points as if I never had said anything. I am happy to read your explanation as to where my analysis is wrong, but I am completely uninterested in reading over and over points that I repeatedly explained why I don't agree with them with no insight at all into how my reasoning is wrong.
It's good wisdom but beware the counter-argument. Being liberal in what you accept causes your implementation to have to maintain a wide range of implicitly defined odd inputs in perpetuity and leads to a form of lock-in based on handling of edge-cases. HTML and browser implementations that allowed all sorts of garbage behaviors are a good example of this. The robustness principle works, but it does come with a cost, like all things.
Problem with this kind of description is that input and output are often two names for the same thing. One processing element's output is the next one's input.
Basically, don't substitute HTML cruft onto text, except at that stage in processing when that text is just about to be inserted into HTML.
Don't do HTML-ization prematurely.
You wouldn't encode and store data in Base64 just in case it might be needed that way in some future processing step.
A good point, but it is obscured by a terrible choice of vocabulary: in this article "sanitizing input" actually means specifically trying to sanitize input of the whole system as soon as it is received (quite impossible, given the open-ended and conflicting nature of sanitization needs) and "escaping output" actually means sanitizing the input of a specific subsystem for a specific purpose.
Given a faithfully persisted and assumed unsafe original text, SQL query builders can turn everything into SQL strings or die trying, XML parsers can check entity expansion size and other traps, HTML generation templates can introduce fancy markup to surround arbitrary text, XML generation templates can escape input wholesale in CDATA sections, and so on. It's the traditional principle of separation of concerns.
I think the title here is misleading. The title is, "Don’t try to sanitize input. Escape output."
The article itself is only talking about sanitizing user input to "prevent cross-site scripting attacks". He later on does require input checking: "Input sanitization is usually a bad idea, but input validation is a good thing... by all means validate it and return an error if it’s invalid."
It's vital for secure programs to check their inputs and minimize what they will accept. I do think it's a good idea to reject "&" and "<" when you can.
But I also agree that in most cases you can't completely forbid all inputs that have HTML metacharacters. In the case of cross-site scripting, the best countermeasure is output escaping. Many modern frameworks do output escaping by default; Rails (for example) has done it for years (in Rails, any "normal" string is automatically escaped when sent back out as HTML). A good reason to prefer one framework over another is because it has secure defaults; if your framework doesn't escape by default, you should consider using a better framework.
You can't depend on just one thing to suddenly make your software secure. You need to validate inputs so that only valid inputs are accepted into your program. You need to escape output, because there are often legitimate input characters that must be escaped. You need to prefer tools that have safe defaults. Use tools to scan your results, to find what you missed. It's not rocket science, but it does require a set of approaches; there is no silver bullet for making secure software.
Escaping output does not always work when, e.g., you have thousands of integrated systems and don't control any of them or their upgrades.
If you don't filter malicious inputs, they will forever live in your database and it takes one bad release of some reporting tool somewhere for your users to become vulnerable.
If you rely on sanitizing before storing, you can end up with data in your database that somehow missed being sanitized, or is maliciously entered in your database.
You must escape the outputs, no matter how hard you try to sanitize the inputs. Losing "control" of any integrated systems means your system is vulnerable, even if only to someone at a terminal typing things into the DB manually.
Of course you must escape your outputs, add defense-in-depth layers you don't even need yet, track how data flows in your application, patch every known vulnerability, sandbox every single thing, implement capability-based security, and more. I didn't mean otherwise.
Filtering input is not sufficient. But it is not optional.
Yes, it does. The only exception is if there is no way to represent a given value in the language that you speak to another component, but then you either need to reject the request or sanitize on output (if you can be reasonably sure that doing so won't break the semantics of the information that you are passing on).
I've worked on systems so shitty that some input was triple-escaped so that it would trickle down three unescape layers in old buggy clients deployed to the field.
The thing is, there's no wisdom to take away from this monstrosity. It's just old shitty legacy ad-hoc code. The grandparent post thinks there is.
That you might need to break every rule in the book to integrate shit-tier software doesn't need to be said when talking about software best practices.
Escaping input is of course a path to nowhere, because you never know in what kind of context your data will be displayed. So you cannot guess the proper escaping rules.
Escaping data on input is a novice mistake. Not keeping unfiltered data, and rejecting data that doesn't conform to some rules so you know exactly what to expect, gets you somewhere.
You can bake checks and constraints into your database model too if it is needed.
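For example, with SQLite from Python (a sketch; the schema is invented), the database itself refuses rows that break the rules:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE users (
            name  TEXT NOT NULL CHECK (length(name) BETWEEN 1 AND 100),
            phone TEXT NOT NULL CHECK (phone NOT GLOB '*[^0-9+]*')
        )
    """)
    db.execute("INSERT INTO users VALUES (?, ?)", ("O'Hara", "+15551234567"))
    # Raises sqlite3.IntegrityError: the empty name violates the CHECK constraint.
    db.execute("INSERT INTO users VALUES (?, ?)", ("", "+15551234567"))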
> Incidentally, the mother in the xkcd comic says, “I hope you’ve learned to sanitize your database inputs.” Which is somewhat confusing, but I’ll give Randall the benefit of the doubt and assume he meant “escape your database parameters”.
What a strangely roundabout way of saying that programming advice found in a webcomic may be actually wrong.
(To be fair, that Xkcd comic was a product of its time, when ‘sanitisation’ was all the rage.)
Maybe sanitizing is the wrong word for what I'm doing. For example, I need to strip marketing/tracking information from URLs before saving them, or else someone coming from Google will have a different URL than someone coming from FB, and then the comments won't load.
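That kind of normalization is fine precisely because it isn't security "sanitizing"; it's defining a canonical form. A sketch with Python's stdlib (the parameter list is an assumption; adjust it to whatever trackers you actually see):

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    TRACKING_PARAMS = {"fbclid", "gclid"}  # plus anything starting with "utm_"

    def canonicalize(url: str) -> str:
        parts = urlsplit(url)
        query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                 if k not in TRACKING_PARAMS and not k.startswith("utm_")]
        return urlunsplit(parts._replace(query=urlencode(query)))

    print(canonicalize("https://example.com/post?id=42&utm_source=fb&fbclid=abc"))
    # https://example.com/post?id=42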
Always use prepared statements to set query parameters. This will handle 99.9% of all query use-cases. Constructing dynamic SQL from user input is a fool's game.
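In Python's stdlib sqlite3, for instance, the parameterized form is barely more typing than concatenation, and hostile input stays an inert value:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (first_name TEXT, last_name TEXT)")
    db.execute("INSERT INTO users VALUES (?, ?)", ("Bobby", "Tables"))

    last_name = "Tables'; DROP TABLE users; --"  # hostile input is just data
    rows = db.execute(
        "SELECT first_name FROM users WHERE last_name = ?", (last_name,)
    ).fetchall()
    # rows == [] and the table survives; nothing was spliced into the SQL text.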
Don't filter input. Instead, prevent certain characters from being typed into text elements in the first place. This is a user-experience problem, not a software problem. The software can reject a "name" that doesn't pass the front-end validations, but it doesn't need to do any more than that.
Of course, this argument does not extend beyond a "name" field to more complex fields. But more complex fields are less susceptible to introducing UX problems if certain characters are sanitized.
The article misstates what 'sanitizing inputs' means.
I agree with posters who recommend passing data as parameters to methods that don't require sanitized input (e.g. stored procedures or KeyValue APIs).
Also, sanitizing input means transforming input so you retain the original content, but without escape or control characters. Sanitizing input does not mean throwing part of the input away (except when you know it is meaningless in your context, e.g. spaces at the end of a name).
If you take arguments to some sub-system (an example are database keys like the id of an entity instance), then you need to sanitize input.
Anyway, today I learnt something. If you have free-form data like text it makes sense not to sanitize it because in this case sanitizing depends on the output domain. For example < is dangerous for HTML and ' is dangerous for SQL, and so on.
A method I have generally found useful is to make a whitelist of safe characters (something like alphanumeric, comma, dot and space), and escape everything else. You might escape a bunch of stuff that technically didn't need escaping, but the method is simple, rock solid, and doesn't mangle anyone's names.
My name contains Ø, and I'm guessing I would not be able to enter that with your method. I would consider it mangling my name if I had to write o or oe.
This kind of thinking is how your users end up getting emails from your buggy service like "Hello &#216;stein &amp; friends, ...", and your JSON API consumers encounter the same silly output.
Don't escape input. Escape based on output. Escaping doesn't mean anything until you've also specified an output format. It's not always HTML.
You are grossly misrepresenting my post, I have said nothing about whether the escaping should be applied to input or output, please edit or delete your comment.
These two issues have been relevant for over 20 years, longer than today's college grads have been alive. I find it fascinating that new blog posts explaining these two pitfalls are still being written on a regular basis. And there are probably a million blogs with these same two examples going back decades. This is an epic re-post in spirit.
I'm tempted to ask, "Why hasn't this been fixed yet?" Where "fixed" means, "Something new programmers just starting off their careers don't have to jam into their brains?"
I know the expected answer will be: it's an abstraction of a more complex problem of understanding data and how it is used, and we can talk about how JS and PHP have added native functions to construct custom code to address this problem (hack cough cough hack).
But these two cases in particular stick in my craw because no fundamental solution has presented itself yet.
So I ask again (and I've been asking this since 2004-ish):
Why do these two examples still persist? Why do the frameworks not eliminate them by construction? This is such a repeated pattern, why is it even there?
It boils down to two things. One is library/tool design typically makes it too easy to make user input be in-band with execution. The other is that most tutorials/guides only show you how to do things in-band.
An example of a tutorial for SQL:
SELECT first_name FROM users WHERE last_name = 'Smith'
They then have an exercise to hook this query to a text box in the program, where through omission, the programmer is guided to use string concatenation to build the query.
If from the first SQL statement to the last that a programmer saw was parameterized, it would be much harder for them to reach for string concatenation.
Most modern web development frameworks make it very hard to insert un-escaped text into the DOM. You have to go out of your way to introduce an XSS vulnerability in your web application with one, and most of the tutorials and documentation about the framework warn you about using the raw HTML functionality.
Another way to look at it is that the out-of-band way of doing things is typically perceived as either lower in performance, harder to do and/or less elegant (eg: C-style strings vs pascal strings).
I consider anything with user input that is done in-band (eg: escaping is a fix) to be doomed to fail. This is similar in idea to the cryptographic doom principle where decryption before authenticating the message is ultimately doomed to failure.
> If from the first SQL statement to the last that a programmer saw was parameterized, it would be much harder for them to reach for string concatenation.
I dunno -- I've been doing C programming for 30-ish years now, but just learned SQL about a year ago. Every man page I looked at, as well as every stackoverflow question, emphasized the importance of using parametrized queries. And IIRC in Python, "only execute a single statement" is enabled by default; if you want to execute multiple statements, you have to use a different call. So even if you somehow manage to forget to parameterize your queries, you'll still be safe from Little Bobby Tables.
Do SQL injection attacks still actually happen? How is it possible?
Making sure your query only executes a single statement is not enough to prevent SQL injection (depending on how you concatenate the query); the attacker just has to use the surrounding context to get or set the data they need to escalate their permissions (e.g. if it's SELECT stuff FROM table, you might be able to inject such that your query replaces stuff, and then you can select whatever the querying user has access to).
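A runnable sketch of that (table names invented): the injected payload is itself a single SELECT, so a single-statement limit doesn't help at all.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE posts (id INTEGER, title TEXT)")
    db.execute("CREATE TABLE users (password_hash TEXT)")
    db.execute("INSERT INTO posts VALUES (1, 'hello')")
    db.execute("INSERT INTO users VALUES ('c0ffee...')")

    user_input = "0 UNION SELECT password_hash FROM users"
    # Still one statement, so sqlite3's single-statement rule is satisfied:
    leaked = db.execute("SELECT title FROM posts WHERE id = " + user_input).fetchall()
    print(leaked)  # [('c0ffee...',)] -- password hashes instead of post titles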
2019:
For its "State of the Internet" report, Akamai analyzed data gathered from users of its Web application firewall technology between November 2017 and March 2019. The exercise shows that SQL injection (SQLi) now represents nearly two-thirds (65.1%) of all Web application attacks. That's up sharply from the 44% of Web application layer attacks that SQLi represented just two years ago.
> The exercise shows that SQL injection (SQLi) now represents nearly two-thirds (65.1%) of all Web application attacks.
Are you sure that you don't actually mean "weird strings sent to web applications" when you write "web application attacks"? Sending a weird string to web applications is not an attack in any particularly relevant sense unless there is an actual vulnerability--other than when akamai wants to sell you snake-oil against all the danger of weird strings.
A lot of websites and applications are built for one purpose and then get forgotten; I'm speaking especially of people doing it in their free time for fun, or of small company projects.
> The other is that most tutorials/guides only show you how to do things in-band.
Learning SQL is completely different from learning how to use SQL in whatever programming language/framework/library you may end up using. Learning how to safely interface with an RDBMS is going to be entirely specific to the stack you’re using for the rest of your application.
> Why do the frameworks not eliminate them by construction?
They do. Other than the pure PHP example, which simply predates modern approaches to web security and is somewhat of an intentional misuse, modern templating engines (which the article also mentions) default to escaped output. That still means new devs have to be aware of the mechanism and not go out of their way to shoot themselves in the foot by bypassing it, which I guess explains the blog posts. I honestly don't think it's that bad; that's just part of any generation of developers learning the basics of their domain. To me, XSS still being in the OWASP top 10 was always more an indication that we suck at training (for the basic stack and for security-minded development) than some conceptual failure of the frameworks we use.
There are plenty of "fixes by construction" out there; that doesn't eliminate new devs not using them, or experienced folks making an error every once in a while.
That makes me wonder what kind of training companies require. How many companies hire based on DIY examples in interviews and think "ok, this new hire knows enough", rather than run the risk of essentially re-training the 90% a hire already knows for the sake of the 10% that is critical knowledge?
I don't have a sense of what dev training looks like across the industry.
The company I'm at now requires basic security training every year. TBH it kind of sucks at showing solutions to these kinds of problems, but at least it makes people aware of the risks. I think it might be a PCI compliance thing, but I'm not sure.
I feel like that's a hard problem that goes beyond hiring these days; I would love an answer to that as well. My personal approach for junior positions has mostly been to hire rather selectively (when I can) to get people who at least recognize when they might lack knowledge in a certain area, team them up with someone during the onboarding period, and keep somewhat strict code-review policies, at least in the beginning.
Not stipulating this training for all new hires is a symptom of my own aversion to most classroom settings, though; I've had quite a few developers who enjoyed getting this style of training after they indicated they wanted it down the road. I personally wouldn't have enjoyed the 90% retraining scenario (the monetary loss that implies aside). I've found training on specific aspects with a bit of practical engagement to be more effective; e.g. there are great and engaging courses that teach basic web security. Not that these are always up to date or that trainees retain everything, but it gets them into the right mindset to be aware of issues.
But of course, even with an approach that works 100% of the time, these days that doesn't guarantee that your dependencies or outsourced code production are up to the same standard.
tl;dr is "I don't know either" I guess but maybe you can take something away from it.
> (hack cough cough hack)... there is no fundamental solution presenting yet.
Because people think those hacks are fundamental solutions (see: this blog title).
But really, the fundamental solution is finally at long last treating programming as a form of engineering.
> I know the expected answer will be: it's an abstraction of a more complex problem of understanding data and how it is used... Why do the frameworks not eliminate them by construction?
Because in any non-trivial system there are always edge cases, and attackers will find the edge cases. This is why XSS persists even as template engines have taken over. "filter output" is not a panacea. Nothing can replace carefully thinking about the entire range of possible inputs and their related outputs.
But instead of educating programmers to think carefully about how to specify and design robust systems, the software industry repeats gang-of-four-style mantras like "escape output". Even while admitting those solutions don't work universally and offering "get security review" as some sort of universal fix.
It's interesting that single-page apps actually have a benefit here. If you generate the DOM with code, you can just assign anything you like to el.textContent, and you won't need to muck around with sanitization libraries and edge cases.
Basically the same principle as using parametrized SQL queries.
Part of the problem is the use of a general "String" data types in many languages. Libraries that deal with SQL or HTML or anything similar shouldn't use String in their APIs. Instead they ought to have more specific "EscapedString" and "UnescapedString" types so that there's no ambiguity about which is which.
While I agree about String, EscapedString conflates rendering/output with the data model, which is the core of the issue. Application developers should not have to touch escaping, or text rendering, on their own for formats like xml, json, sql, etc.
There should be no xml built by concatenation; instead use a DOM plus a proper renderer/transformer/writer. Same for SQL: prepared statements plus bindings...
Our programming languages suck at providing useful types.
"String" is a structural data type. "SQL query" and "HTML snippet" and "regular expression" and "user-entered text" are semantic types which can be stored in strings, but are all quite distinct in meaning and usage.
You shouldn't be allowed (by the language's type system) to pass user-entered text to a SQL query function, without perhaps first calling a function with a scary name like "convert_raw_unsafe_text_to_query". A string is not a string is not a string. Or make a DOM-for-SQL so we never have to touch syntactic strings.
(It's exactly the same problem as units of measure. 5.0 feet is not the same type as 5.0 meters, and you shouldn't be able to add 5.0 + 5.0 if you didn't declare they have matching units, or define a way to convert as necessary. Numeric types in most languages don't have associated units, either, unfortunately.)
Hungarian notation tried to partially solve this, by giving up on the built-in type system and using variable names to encode intent. That solution is ugly so it's been abandoned, and it's the wrong place to solve it, anyway.
Programming languages today don't provide appropriate abstract data types for strings, or make it easy to define your own. Popular libraries for SQL/HTML/regex/etc don't require special string types. Since there's no standard types, it'd be a pain for any users who need to use more than one library, too.
We need either one popular language to do this (which others might then copy), or two popular libraries (a coalition). It also needs a catchy name for this style of programming, to help shame old languages/libraries that don't support it.
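A sketch of what that could look like in Python with mypy-style checking (all names invented except the scary-sounding converter, which the comment above suggests):

    from typing import NewType

    UserText = NewType("UserText", str)   # text/plain straight from a human
    SqlQuery = NewType("SqlQuery", str)   # a complete, trusted SQL statement

    def run_query(db, query: SqlQuery, params: tuple = ()):
        return db.execute(query, params)

    def convert_raw_unsafe_text_to_query(s: str) -> SqlQuery:
        # The scary name is the point: every call site is an audit marker.
        return SqlQuery(s)

    name = UserText("O'Hara")
    # run_query(db, name)                                          # a type checker rejects this
    # run_query(db, convert_raw_unsafe_text_to_query("SELECT 1"))  # allowed, and greppable

Python only enforces this at type-check time rather than at runtime, but the same trick with opaque wrapper classes works in any language with user-defined types.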
React does this (at least the escaping of output part). Bunch of PHP frameworks do it too.
But I think it's just a natural part of the power of computers, except most people don't think in Lisp ("any" text could be turned into code), they think in Java (I have this static, rigid, compiled code, and that's the only thing that runs).
> Why do these two examples still persist? Why do the frameworks not eliminate them by construction? This is such a repeated pattern, why is it even there?
If I can type a query into a SQL prompt but your framework won't let me put it in there, I am first going to conclude that your framework is broken. No matter how good the reasoning is for why you did it.
Worse yet, it sometimes is broken. Smart databases understand that whether they should use a particular index depends on the value that is passed in. Where using the index for the most common value is a huge performance penalty, and failing to use it for the rest is likewise. The only way to get good performance is to pass in the value for the case where it has to not use the index. (You can parametrize the rest, at least in Oracle. But the special one has to be passed hard-coded in the string so that the optimizer sees it.) This is a rare case but when it comes up, I really care.
If your framework won't let me fix a performance problem that I know how to fix, I'm going to switch frameworks.
And even worse, parsing SQL is more complex than you think. If your super safe framework doesn't agree with the database it can reject valid SQL or fail to provide the safety that it thought it did.
As an example, in PostgreSQL I can use $$ as a quote mark. This is super convenient for stored procedures. If your super-safe framework doesn't let me do that because it thinks it is a syntax error or recognizes it as unsafe, I will switch to something that can let me write stored procedures. If your super-safe framework doesn't recognize that it is a quote mark, then it isn't offering protection. If your super-safe framework tries to analyze it correctly, you're now attempting to analyze run-time strings that I am building inside of the database in a Turing complete language. Good luck with that. (Hint, Turing proved that it is an impossible task.)
Now I'm admittedly in the 0.1% of people using these tools. However others trust me to know what tools to recommend. So experts like me have an outsized impact.
Just recognize that you're holding SQL to an unfair standard here. You wouldn't reject an HTTP framework because you can't paste raw HTTP into it. You wouldn't reject an IMAP framework because you can't paste raw IMAP to it.
You're requiring that the maintenance hatch be the front door. It should be no surprise that such a design results in lots of people accidentally breaking things.
As you say, fewer than 1 in 1000 people have your needs. Why would you recommend a tool whose features are more dangerous than useful for them?
I am holding SQL to the same standard that I would hold, say, a web framework. Simple things should be simple.
Injecting stuff into a dynamic protocol is inherently harder than injecting text into a text document. A text framework that doesn't accept text is going to be a fail.
> As you say, fewer than 1 in 1000 people have your needs.
My needs 99% of the time are not that unusual. What puts me in the 0.1% among general developers is the level of knowledge that I have about weird edge cases and how databases work on the inside.
> Why would you recommend a tool whose features are more dangerous than useful for them?
Your question presumes the answer to a question that I think you are wrong on.
My very first point was that if I can type it into a SQL prompt, I need to be able to put it into my database.
For someone who is just learning, this convenience is essential. And any tool that complicates their life by forcing them to learn a bunch of stuff before they can do the very simplest thing is a barrier to learning. A barrier that they are likely to solve by finding a tool that makes the simple thing simple. They will only learn about the gotchas down the road.
Case in point. Back in the mid-90s someone wrote a bunch of CGI scripts to make personal home pages easy to write. In fact that is what it was called. Personal Home Page / Forms Interpreter. It accidentally turned into a language that, after several rewrites, is now known as PHP.
When I first encountered it in the early 2000s, every competent developer that I knew (myself included) said, "This is poorly designed crap that will cause a lot of problems." We were right. However it was poorly designed CONVENIENT crap. Convenience won.
Taking input and transforming it into output is the fundamental act of programming.
New developers often struggle with the fundamentals and will usually only test the input they expect.
Someone else has to intentionally give you bad input before you realise that's a thing people will do and something you need to think about.
It doesn't help that most tutorials focus on getting output (yay results!) rather than focusing on how to get consistent transformation of input to output. The result is a lot of tutorials that focus on getting something done and forget / assume the fundamentals.
I haven’t used Elm, but I’ve read that it uses strong typing to distinguish between escaped and non-escaped data. That sounds like a good general solution to the problem, as the compiler will prevent you from using unescaped data in a dangerous context.
Any sort of language that allows you to define custom types (e.g., objects) and type-hint parameters allows you to do this. You can accomplish this same thing in PHP even (the type checking is at runtime, but same idea).
Types are not restricted to just a description of how the data is represented in the computer, otherwise we would need nothing but primitives.
When you perform calculations with physical measurements containing units you don't simply throw away all of the type information while performing calculations -- you perform the same operations on the units both as part of the answer and as an essential check that you've done the right thing. You should do the same thing with your data.
Even the Joel article makes what's arguably a mistake: he says that input from users is "unsafe" and must be escaped on output, while strings from elsewhere shouldn't. That may avoid security exploits, but it still results in incorrect output when a predefined value really does need to be escaped.
The issue isn't whether a value originated from the user. It's the units/data type, as you said, such as plain text vs. HTML.
I've heard that some Haskell frameworks do this as well.
I heavily use Java's Servlet framework, and the blatant spraying of Strings everywhere is astounding in this age. I understand that backwards compatibility is an issue, but one could have added another API beside it for optional use and deprecated the current one.
I imagine all Haskell frameworks do; the ones I've tried surely do. Haskellers are accustomed to mixing string types, since the default String type is inefficient (a linked list of Char) and most import a library providing immutable ~Pascal strings. And since this is such a common occurrence, the syntax has support for multiple string literal types. An application developer can literally create their own. It's also trivial to create a new type in Haskell with minimal runtime cost, so it's pretty harmless.
I think if you don't have an easy way of creating string literals in the type you want, the developers will at some point reach for the deprecated api, and at that point you're just requiring good hygiene. Which is exactly what you're trying to stop. Language support is critical in being able to get away with this.
>the developers will at some point reach for the deprecated api
Agreed, however you can have organizational measures to prevent this (a build time check). And of course a change in the framework must be accompanied by decent conversion libraries (I don't think this is different in Haskell).
Have they been relevant? I haven't heard the ‘sanitise your inputs’ advice in years. In fact, the take in this blog post seems to be the predominant one.
Well, maybe it's still common in PHP, I don't know. I haven't touched that either in a while.
The problem with escaping is that you need to know what you're escaping for; escaping is needed almost everywhere (SQL, URL encoding, HTML, JSON, YAML), and double escaping can break the content.
Only escaping output has one significant disadvantage. Say that you are escaping &. You'll get &amp;. Your user then wants to edit the text. You save the edited text. Now when you escape and output it again, you get &amp;amp;. Rinse and repeat.
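The round trip is easy to reproduce (Python sketch); storing the raw text and escaping only at render time is what breaks the cycle:

    import html

    text = "fish & chips"
    once = html.escape(text)   # 'fish &amp; chips'
    # If the edit form is filled with the *escaped* text and saved as-is:
    twice = html.escape(once)  # 'fish &amp;amp; chips' -- rinse and repeat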
Wow, I feel really stupid now. That means I have misunderstood how it works and have avoided it needlessly till now (not that I do much web app development).
Bad advice. Input is the boundary of your system. Always protect the boundary of your system orders of magnitude more than its internals. That's like programming basics.
And what about buffer overflows makes particular data inherently(!) dangerous? As in, what would be an example of data where it is impossible to make it non-dangerous by fixing the buffer overflow that it exploits?
not interested in debating semantics. data is turned from input streams of bytes into coherent structures in your system. the process of turning data into coherent structures is input sanitization. sanitization or validation or conformation - you can argue what to call it and where to use it, but if you don't do it, your internal bugs will be exploitable externally. good luck.
> the process of turning data into coherent structures is input sanitization.
No, that is parsing.
> sanitization or validation or conformation - you can argue how to call it and where to use it
Sanitization is "change to make acceptable", validation is "check for conformance and reject what is non-conforming". The first one is practiced by tons of developers and it's a terrible idea, the latter is what you should do. But also, the latter is never about "dangerous" input, it's only about meaningless input.
A database query is an output. Anything your code generates and sends to something else, whether that's a web server answering a user request, an API call to an external service, or an internal request on the server, is an output of your code.
No matter how good your input sanitization is, you still wouldn't ever send an unescaped query to a database, right? That's because the query is an output.
...so the blog post boils down to "sanitize all inputs that don't get piped to /dev/null"; also, there are some good libraries that will do that for you (...by escaping outputs... but oh, btw, those only work sometimes of course, and in other cases, be careful?).
In other words, for the love of god please do sanitize your inputs.
"Sanitize inputs" means modifying the input before you even know where it's going. It's fine for stuff like normalizing user input (eg: "strip leading and trailing spaces") but should not be used to combat things like SQL injection or XSS.
For issues like SQL injection and XSS you should escape on output. Outputting HTML? HTML escape, or better yet: use templating framework that does it by default. Outputting to SQL? SQL escape, or better yet use prepared statements and pass in your arguments using an API that escapes by default.
In the "sanitize inputs" approach to handling these situations you can't store "O'Hara <3 Sue" as a value, because you need to "sanitize" the apostrophe for SQL and the less-than for HTML. In the "escape outputs" approach, you have "O'"Hara <3 Sue" in your SQL, and "O'Hara <3 Sue" in your HTML, and the user's input is preserved.
> "Sanitize inputs" means modifying the input before you even know where it's going.
Okay.
That's not how I've ever used that term or seen it used. Prepared statements are a form of input sanitation. HTML purifiers are a form of input sanitation. Maybe this lingo is specific to PHP-land?
In any case, "You need to know the semantics of the sink in order to know what to do with an untrusted source" seems like an obvious truism not worth writing about.
"You need to know the semantics of the sink in order to know what to do with an untrusted source" seems like an obvious truism not worth writing about.
Given how often developers get it wrong, I don't think it's written about enough.
Also, you say "untrusted source" here. Whether you trust the source or not is irrelevant. You should still be escaping the output where you use data from it in order to make sure your outputs are safe - the source could be compromised, or broken, or sending something valid that you didn't expect. Maybe this isn't quite so obvious after all.
You've probably not been around in the jolly days of PHP automatically adding quotes to all $_GET parameters and stuff like that, before it was even known where the data would be passed to, lol. Be glad.
> That's not how I've ever used that term or seen it used.
That's the terminology being used by the document under discussion.
Honestly, I think what causes a lot of people to get it wrong, is that they don't understand the distinction between input filtering and output escaping. They see them as the same thing, and so they use them interchangeably.
> Prepared statements are a form of input sanitation.
No. Input sanitization involves removing "bad" stuff from the input. For example, you remove the "'" in "O'Hara" so that it doesn't mess up your SQL, but you end up storing "OHara" in the DB.
Output escaping (which prepared statements fall under) removes nothing. Instead, characters that happen to be special are escaped so that they are treated as literal characters, and not as special characters. The DB gets the user's original input: "O'Hara"
> HTML purifiers are a form of input sanitation.
I assume you mean HTML sanitization (https://en.wikipedia.org/wiki/HTML_sanitization). In which case, usually, yes. Note that there's a difference here because you're removing part of the input, not doing a lossless transformation as with escaping.
Another way to think about the difference is whether you're doing type conversion or not. When escaping for SQL, you're converting from text/plain to SQL. When escaping for embedding in HTML, you're converting from text/plain to text/html.
When you do input sanitization instead, you aren't changing the type, you're just making certain values impossible. For HTML sanitization, this means turning stuff like "<em>safe</em> <script>unsafe()</script>" into "<em>safe</em> ". Both are text/html, but the latter has been "sanitized".
In this case, input sanitization makes sense, as long as you have a universal concept of what "safe" means, and as long as your input was actually HTML.
The place where people mess up is in thinking that they need to "sanitize their inputs" in anticipation of something downstream using that same string as a different type. In the HTML example, this would be taking a text/plain string, like "I <3 HTML", and stripping out "bad" characters to turn it into "I 3 HTML".
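A quick PHP sketch of the two operations (strip_tags stands in here for a real sanitizer, only for illustration):

    // Escaping: lossless conversion from text/plain to text/html.
    echo htmlspecialchars('I <3 HTML', ENT_QUOTES, 'UTF-8');
    // markup: I &lt;3 HTML -- renders as: I <3 HTML

    // Sanitization: lossy filtering *within* text/html. strip_tags is a
    // crude stand-in; a real sanitizer (e.g. HTML Purifier) would drop
    // the script body too, not just the tags.
    echo strip_tags('<em>safe</em> <script>unsafe()</script>', '<em>');
    // -> <em>safe</em> unsafe()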
> Maybe this lingo is specific to PHP-land?
I've never used PHP, so I wouldn't know.
> In any case, "You need to know the semantics of the sink in order to know what to do with an untrusted source" seems like an obvious truism not worth writing about.
In practice, that doesn't seem to be the case. Almost every time someone says "sanitize your inputs" in response to an XSS or SQL injection exploit, they're getting it wrong.
What does "sanitize inputs" even mean? What do you do with a backslash? What do you do with a "? What do you do with weird unicode? What properties does your "sanitized" input actually have?
The meaning of "sane" depends on where you're sending it to. A backslash is a perfectly reasonable character, for instance. Put it in the wrong place in a SQL string and you have bad news. Put a ' in the wrong place in a shell command, sometimes nothing bad happens, other times you get pwned.
The right way to escape strange characters is different if you're sending it to an SQL engine, or writing it into a JSON string, or into some HTML, etc.
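For instance, a sketch of the same string headed for three sinks (`posts` is a made-up table, $db an assumed mysqli connection):

    $input = 'He said "hi" <b> \\';
    $id = 1;

    // SQL sink: let the driver escape
    $stmt = $db->prepare('UPDATE posts SET body = ? WHERE id = ?');
    $stmt->bind_param('si', $input, $id);
    $stmt->execute();

    // HTML sink
    echo htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    // JSON sink: quotes and backslashes get escaped per JSON rules
    echo json_encode($input);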
> Another example of this kind of thing is SQL injection, an attack that’s closely related to cross-site scripting. NaiveSite is powered by MySQL, and it finds users like so:
>
> $query = "SELECT FROM users WHERE name = '{$name}'"*
>
> When a boy named Robert'); DROP TABLE users; comes along, NaiveSite’s entire user database is deleted. Oops!
Also from the article:
> And of course use your SQL engine’s parameterized query features so it properly escapes variables when building SQL:
>
> $stmt = $db->prepare('SELECT * FROM users WHERE name = ?');
> $stmt->bind_param('s', $name);
And more from the article:
> The parallel for SQL injection might be if you’re building a data charting tool that allows users to enter arbitrary SQL queries. You might want to allow them to enter SELECT queries but not data-modification queries. In these cases you’re best off using a proper SQL parser (like this one) to ensure it’s a well-formed SELECT query – but doing this correctly is not trivial, so be sure to get security review.
> What you should do, to avoid problems, is quite simple: whenever you embed a string within foreign code, you must escape it, according to the rules of that language. For example, if you embed a string in some SQL targeting MySQL, you must escape the string with MySQL's function for this purpose (mysqli_real_escape_string). (Or, in case of databases, using prepared statements are a better approach, when possible.)
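In PHP/mysqli terms, a sketch of the two options described there (assuming a mysqli connection $db and a $name value):

    // Option 1: escape the string per MySQL's rules at the point of use
    $safe = mysqli_real_escape_string($db, $name);
    $query = "SELECT * FROM users WHERE name = '{$safe}'";

    // Option 2 (better): prepared statement; escaping can't be forgotten
    $stmt = $db->prepare('SELECT * FROM users WHERE name = ?');
    $stmt->bind_param('s', $name);
    $stmt->execute();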
What is the exact (philosophical?) distinction between data and code? When you extract the host name from an email address in order to send an email, you are interpreting a string. We could call this process executing a program that outputs the host name.
I suspect the opposite is true; the parent comment you're replying to is making subtle and insightful commentary on the nature of code and the futility of suggesting that inputs ought not be "code."
Any sufficiently complex program can be viewed as an interpreter for its inputs. Input into a calculator program is code which programs an equation. Input into a word processor is code which programs a document. Input into a video game is code which programs a real time simulation. Input into a compiler is code which programs an executable. These are all different types of executable code sequences.
I am a theoretical computer scientist. I can appreciate the insightfulness (on the surface) of that commentary.
However, the fact that modern computers can be exploited due to architectural and engineering decisions (eg memory unsafety) does not mean a separation between code and data is not possible.
In fact, it is precisely a hot topic how to cheaply bend current practices back to that model given the rampant amount of vulnerabilities in the wild.
My comment was not limited to the realm of hardware, ISAs and microcode. It was much more general.
If you never treat data as code, you can only do uninteresting things. My example was the email address. The instant you look into the "black box" of the string, you are starting to treat the string as an executable structure. An email address' raison d'etre is to provide exactly that: an address to send an email to. You cannot do that without looking into it.
Now, from here we can discuss safe and unsafe ways of doing that. You could use string splits or what not, or you could use a parser combinator library. Doing the latter will make it easy to see that parsing and executing a program is not that different from parsing an email into an AST, (user, hostname), and then treating that as a higher order program (ie. we need to specialize with a message before we can execute it as a "send email" program).
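A toy version of that email "program" in PHP (only a sketch; the real address grammar in RFC 5322 is far hairier):

    // Parse the address into a tiny AST: (user, host).
    function parseEmail(string $raw): array {
        $at = strrpos($raw, '@');
        if ($at === false || $at === 0 || $at === strlen($raw) - 1) {
            throw new InvalidArgumentException('not an email address');
        }
        return [substr($raw, 0, $at), substr($raw, $at + 1)];
    }

    [$user, $host] = parseEmail('ada@example.com');
    // $host is the piece the "send email" program gets specialized with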
No, he made a good point. It's a matter of perspective: what is considered data in one situation can be considered code in other situations. Hell, they even created the NX bit to prevent malicious memory space from becoming executable.
It depends on context. The question is like asking "what's the difference between 'x + x' and '5'"; the former is 'code', an algorithm with placeholders for data; the latter is data.
In another context, though, the function 'f(x) = x + x' might itself be data. Within a given context, however, the answer is generally unambiguous: data is what is acted upon by code.
(The breaking of this distinction is one of the reasons self modifying code is Bad.)
This reminds me of the trick of reprogramming a computer game by exploiting side-effect bugs via controller sequences. There is a YouTube video somewhere where somebody writes an entirely new game simply using input sequences on the controller of an existing game. There were bugs that wrote values to various memory locations, and by being exceptionally clever you could write assembly code. I wish I could find a link to it...
There's lots of examples, but a famous one is this video by Sethbling where he uses a controller as opposed to a TAS tool: https://youtu.be/hB6eY73sLV0
No, the fact that current archs encode data and code in the same memory space and there are vulnerabilities does not mean the separation is not possible.
I think he is wrong on handling code as input and visible output (for sites like StackOverflow). No need to filter such input. Escaping your strings will handle that as well. The code <tag/>, for example, will be escaped to &lt;tag/&gt;, appearing in the rendered page as <tag/> (but not _interpreted_ as a tag).
Actually, rather than sanitize input, I would recommend whitelist and reject in most cases. ID should be an integer, but you get a string with spaces around an integer? - error out. There's html in your text input? Error out.
In the vast majority of cases you control the client (web form) - anything "surprising" will then be an error - or worse; malicious.
In the case of a json service, if the client doesn't submit valid json for your schema / api - error out.
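A sketch of that reject-don't-repair approach in PHP (the field names are invented for the example):

    // ID must be an integer: validate and error out, don't "fix" it.
    $id = filter_var($_GET['id'] ?? '', FILTER_VALIDATE_INT);
    if ($id === false) {
        http_response_code(400);
        exit('invalid id');
    }

    // Plain-text field: anything tag-shaped is surprising, so reject.
    // (strip_tags as a crude detector, good enough for a sketch.)
    $name = $_POST['name'] ?? '';
    if ($name !== strip_tags($name)) {
        http_response_code(400);
        exit('invalid name');
    }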
I once registered for a forum to ask a question, but they had it configured so new users couldn't submit URLs, probably to deal with spam. Their solution was to reject posts that contained what looked like URLs which means any time you don't put a space after a period, it's probably a URL. Like "That is fine.Pizza is good." -> http://fine.pizza detected, post rejected
But it gets worse. I had a code block in my post and it was also detecting URLs in my code.
Good point. I guess I'm in agreement with tfa here - if special entities are ok, escape output. I don't think I want to allow a greater than sign in a "first name" field, but it might be perfectly valid in a text field. And if that field is text, not hypertext, escaping output seems sane. Maybe only where html is expected - say in the html part of an email, while the text part just uses a proper encoding.
TFA touches on this with validation - validation is ux improvement - help the user submit only valid data. At that point there should be no proper use-case for invalid data to be submitted - hence reject, not sanitize and insert.
It seems like this depends on your types? If you are storing user input to a database using a text field, it's best to assume the field can contain arbitrary text, since the database allows that. But if you're storing the input into a number field then it must be parsed as a number and you can assume it's a number. If you constrained the number field to a certain range then you can assume the number is in that range.
Storing arbitrary text is common, so we usually need to know how to render it correctly. There are fancier types and constraints, though.
Each piece of code should enforce its own contracts explicitly across communication boundaries. If your backend relies on x input being parseable as a number, it shouldn't assume that it is just because you know that it ultimately came from a number picker in an HTML form - it should check for itself. This is defensive not just for security, but so that when you change something later and screw it up you get actual error messages instead of silent breakage.
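For example (a hypothetical handler; the range is invented):

    // Re-check the contract at the boundary, even though the HTML form
    // uses a number picker with the same limits upstream.
    function readQuantity(array $request): int {
        $qty = filter_var($request['qty'] ?? null, FILTER_VALIDATE_INT,
            ['options' => ['min_range' => 1, 'max_range' => 999]]);
        if ($qty === false) {
            throw new UnexpectedValueException('qty must be an integer in 1..999');
        }
        return $qty;
    }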
You should still sanitize the output, IMO. The code in question has no ties to the DB schema (usually) and if that changes for some reason you have no means (usually) to figure out where you have to change the code.
What is horrible advice is telling people never to sanitise input but then forgetting to switch the focus to what should be done instead. Too much time is spent justifying the headline vs. explaining what should be done instead and why it is more effective.
Instead prevent "injections" by using innerText instead of innerHTML and parameterize SQL queries instead of concatenating strings.
But you always want to sanity-check user input! Ever wondered why the average age of your user base was so high, only to discover that some users claim they are several million years old? You don't want to sanitize, you want to sanit(y)ize: people write their e-mail address in the street-address field and vice versa.
The examples are very outdated. In PHP you would convert the offending characters to HTML entities (at input, yes... input). It then only needs to be filtered one time.
So the SMS message I send from my service would look like this?
Alert: Bob &amp; Jane O&#039;Brien just commented on your recent post
If you store HTML in your database, now you have to convert from that to whatever you're outputting to later. And you haven't covered the dangerous characters for other contexts.
Just store whatever the user sent you. When outputting to various formats, convert accordingly.
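Sketched in PHP ($comment is the raw text as fetched from the database; sendSms() and the number are hypothetical):

    $comment = "Bob & Jane O'Brien just commented on your recent post";

    // HTML destination: escape now
    echo htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');

    // SMS destination: plain text, no HTML entities wanted
    $recipient = '+15550100';        // made-up number
    sendSms($recipient, $comment);   // hypothetical SMS helper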
This whole article is based on a DB row being used for everything. That is just not reality. From a cost point of view, if you're gonna just use that data a couple of ways, it is pointless to constantly convert it coming out. Even then, if that was the case, convert from entities, no more costly than what you are saying.
This feels like a distinction without a difference.
Escaping outputs is just one way of sanitizing inputs. Sometimes it works. Sometimes it doesn't. The author of this post even realizes that their prognostication is not general and then offers the advice to "be sure to get security review"...
At the end of the day, you need to make sure that any untrusted source is treated in a safe way by every sink and does not otherwise interfere with system specs (e.g., mangling user output). Whether that happens at line 5 (where the input is read) or line 155 (where the command is generated) doesn't really matter. Or, to be more precise, it is determined by whatever design patterns the framework developer chose.
What matters at the end of the day is that command injection isn't possible and the system's specs (including UI/UX specs) are respected.
Crucially, both input and output constraints are informed by the nature of both the source and the sink. Hence the existence of libraries like DOMPurify and HTMLPurifier, which consider one very particular type of sink. Sometimes you will write code in domains where others haven't written excellent libraries but where sanitization (of either input or output) is needed. E.g., embedded systems.
I'd replace the author's advice with "carefully specify the semantics of your sources and sinks", which is ultimately what the author's actual advice (basically, "use trusted libraries and, when not, be sure to get security review") boils down to.
Not really, no. Output filtering is done in the context of a specific output domain. Input sanitization isn't; the developer who builds sanitization has to guess at all the possible output domains.
"Filter outputs not inputs" is a very old appsec truism.
I think the confusion comes from that not everybody thinks of "passing data to the database layer" as an output, but only an input to the next layer. If you think of this input as an output from the previous layer, then your advice makes perfect sense. But I don't think everyone thinks that way so it might help to clarify what "output" means in this context.
Output filtering is input sanitization. wtf is it that you think you are filtering? Inputs!
> the developer who builds sanitization has to guess at all the possible output domains.
No they don't. They need to carefully understand/document all the places input might be used and ensure no command injections are possible. In some cases (e.g., web apps, where everything is a string) that works relatively well...
Until, of course, you're the one writing the input sanitization logic in the HTML purifier / prepared statements generator. And those code bases do have occasional CVEs. So, random PHP dev can put faith in a library but the system itself never gets away from having to sanitize input!
Output filtering has the complementary problem -- you need to understand every possible input. That's not always trivial like it is in PHP-based websites. Think about e.g. an embedded system sanitizing potentially adversarial time series data (what does that mean / how do you detect it? Harder, right?). Or a compiler. The blog post author even points this out: "...In these cases you’re best off using a proper SQL parser (like this one) to ensure it’s a well-formed SELECT query – but doing this correctly is not trivial, so be sure to get security review."
Ultimately, "Filter outputs not inputs" is incomplete advice that kinda sorta works well for the most part in web apps. The correct advice is, again, "carefully specify the semantics of your sources and sinks".
> Output filtering has the complementary problem -- you need to understand every possible input.
No, you simply need to understand the encoding rules of the sink. Which is precisely why "sanitizing input" is plain nonsense: Whether a particular unescaped character has some meta character function is not a property of the character, but of the output language, so you can not possibly "sanitize input" in any meaningful sense, unless you mean by that "randomly garble the input".
When I talk to developers about this, I use database storage as an example. There may be computations behind the scenes that mangle the nicely input-sanitized database contents: concatenation with other values, string manipulation, data from some other system. Thus, data that was sanitized upon input is now questionable for output.
This is well-intentioned, but leads to a false sense of security, and sometimes mangles perfectly good input.
And in some applications, for example, ones that must process data in a forensic environment, any change to the input is prohibited.
Thus, the only useful way to think about this is that the contents of the database is toxic and must be sanitized on output. Simply working with the input gives the programmer no useful idea about what is in the database when it comes time to output it.
Frameworks these days help significantly with providing tools to properly parameterize SQL. However, it is unlikely that they handle all the cases. Consider an example where user input from a web page is used to build a column name or table name. This isn't covered by frameworks. That needs to be carefully processed in the code.
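One common approach there is an explicit allowlist for identifiers. A sketch (column and table names are invented; $db is an assumed mysqli connection):

    // Parameterization covers values, not identifiers. Allowlist the
    // identifier explicitly before it goes anywhere near the query.
    $allowed = ['created_at', 'title', 'score'];
    $col = $_GET['sort'] ?? 'created_at';
    if (!in_array($col, $allowed, true)) {
        http_response_code(400);
        exit('invalid sort column');
    }
    $limit = 20;
    $stmt = $db->prepare("SELECT * FROM posts ORDER BY {$col} DESC LIMIT ?");
    $stmt->bind_param('i', $limit);
    $stmt->execute();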
>Ultimately, "Filter outputs not inputs" is incomplete advice that kinda sorta works well for the most part in web apps. The correct advice is, again, "carefully specify the semantics of your sources and sinks".
It is in fact the primary advice that should be followed.
So sanitization of input is a good idea, but if output is not properly encoded, somebody else is likely to profit.
Sorry, this still seems like a terribly hacky way to think about code.
Again, if you write a template engine or a SQL engine, the code the library's developer writes to determine how holes are safely filled is literally sanitizing input! You never get away from sanitizing inputs, you just do it further from the source and closer to the sink.
> So sanitization of input is a good idea
Right. "Don’t try to sanitize input" is bad advice. Also, the whole point of escaping outputs is that you don't trust inputs. Escaping outputs is done to sanitize inputs.
If by "sanitize input" you mean "add some backslashes to $_GET values like it's 1995", well, I guess, point taken. But then, the actually good advice should be "step back learn how to think more systematically about your code", not "escape outputs instead of inputs!"
> I'd replace the author's advice with "carefully specify the semantics of your sources and sinks",
I think that's the abstraction, but the author is presenting it in a way that requires repeating frequently simply because new programmers arrive ready to do damage every day, and the two forms of input sanitization are a great intro into how the Real World (tm) conspires against you.