You should do both. Sanitize your inputs so that they can be safely stored in your data store, and then sanitize your output so it can be safely displayed.
We did this at reddit. We had basic SQL sanitization on the way in, and then a full pass on the way back to the user. The advantage this gave us is that when someone discovered a new way to hack our sanitization, all we had to do was update the output filter and everything was magically safe.
We didn't have to do full database scans to find all the bad data and change it.
Edit: Apparently I shouldn't have simplified "parameterization of SQL" as "sanitize your input". I used the more generic term since I was talking about any kind of data store. But yes, it was of course parameterized.
Most input by humans is Unicode. Most people do not understand Unicode. Don't try to sanitize it beyond checking that it's valid UTF-8 - which hopefully your programming language's string type, HTTP parameter deserializer, or DB engine already does for you.
For example, some people think that stripping ZWNJ, ZWJ or other kinds of spaces is a thing they ought to do, because those characters confuse their markup parsers or can be used to encode hidden information in posts or stuff like that. Guess what: it breaks emoji, Arabic, some Asian languages and a bunch of other things.
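If you must touch the text at all, here's a minimal validate-only sketch in Python (purely illustrative; most frameworks already do this decode step for you):

    def read_text(raw: bytes) -> str:
        # Strict decode: raise on any invalid sequence instead of
        # silently replacing bytes and garbling the user's data.
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            raise ValueError("input is not valid UTF-8")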
If you only sanitize your output and realize you made a mistake, you can easily fix it by changing your output sanitizing algorithm. If you sanitized your input, you threw away data and can't fix your problem.
WTF-8 is for a very specific application—to encode in an 8-bit sequence a 16-bit sequence which is nominally UTF-16 but may contain unpaired surrogates. It’s not used for weird user inputs, instead, it’s used for e.g. interop with the Win32 API.
Are you sure that’s what is happening at Reddit? You shouldn’t need to sanitise your inputs for SQL. Parameterised SQL has been a thing in some languages for two decades now. This really is a long-solved problem by now.
Output is a different matter though but that’s because of rendering content safely down to HTML, JavaScript or JSON (to name a few examples). SQL shouldn’t come into the equation by this point.
This. I'm tired of people implying or outright stating that SQL injection is an input validation problem. Why couldn't you have foo' OR 1=1; as the title of your post? Those are all good characters as far as text entry is concerned.
SQL injection is really a problem of how you pass parameters to your SQL layer. Parametrized queries are the (easy and widely available) solution. If you are concatenating input to your SQL queries, you're doing it wrong.
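To illustrate the difference, a minimal sketch using Python's built-in sqlite3, standing in for any driver with placeholder support:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (title TEXT)")

    title = "foo' OR 1=1;"  # perfectly legitimate text for a title

    # Wrong: concatenation mixes the data into the query language.
    #   conn.execute("INSERT INTO posts (title) VALUES ('" + title + "')")

    # Right: the driver keeps the value separate from the query.
    conn.execute("INSERT INTO posts (title) VALUES (?)", (title,))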
How many people named O'Brien are told they can't sign up, or passwords get rejected because they contain special characters?
It's crazy.
Even if you're using 1990s technology without parameterised queries, it's not like it's impossible to say `insert into users (name, motto) values ('O\'Brien', 'foo\' OR 1=1;')`.
Yep. I've stopped counting the number of websites that force me to use weak passwords. It's crazy that this is still a thing in 2020.
I wish the form controls in browsers came with a green check mark indicating best practices (8+ characters, no filtering) so that people who make websites understand that this is what they should conform to. Not their own misconceptions about password security.
I wish websites would stop enforcing a "password policy". An insecure password should be a choice. If you are so sure you cannot secure your site, leave authentication to a third party provider. All this leads to is zillions of user accounts that are used only once.
I have heard stories about people in Ireland struggling to get certain services because their name contains a Fada, but some of the identity paperwork they have is missing the Fada due to lack of support by computer systems.
I blame PHP. Many webdevs active today started with it, and the standard library's solution to injections was escaping everything half a dozen times just in case. Because PHP being PHP nobody saw any red flags when they implemented a function named "mysql_real_escape_string". Apparently they've deprecated these functions since then, but the damage is done.
But that hasn't been a thing for 15 years or more? PDO was added around 2005, and even before that anyone in their right mind used the mysqli extension for prepared statements. Since 2012 you can't even use the mysql extension without getting a deprecation warning.
And yes, in the '90s PHP's security sucked, but that was nothing PHP-specific; it was just the sentiment of that time. Everyone did it, in all languages. I remember using tons of $dbh->do() in Perl's DBI back then, intentionally avoiding prepared statements for quick and dirty stuff (and most scripts back then were quick and dirty stuff). It's in big part because we were used to building desktop apps and thinking in terms of the security that applied to them, like being careful about your pointers and input string lengths and stack overflows and such. The web was still a pretty new thing.
> But that hasn't been a thing for 15 years or more? PDO was added around 2005
Ex-shared hosting bod here, who had the joy of managing our PHP environments :(
Sadly in the real world, even after the great big (and pointless) act of deprecating and removing the mysql_* library, naive developers (and experienced ones that should've known better) just moved onto mysqli_* or PDO and still used string concatenation with raw inputs, instead of learning how to parameterise their queries.
> naive developers (and experienced ones that should've known better) just moved onto mysqli_* or PDO and still used string concatenation with raw inputs, instead of learning how to parameterise their queries.
The one and only time I argued with someone about PHP, the forum the guy ran got hacked the very next morning and was running a botnet of some sort. I was smug but quiet, and I never really thought about it but the timing makes me wonder if he thought I was involved.
Anyway, there’s a patch release the next day, and somewhere I find the diff. Now I can’t read PHP but I know what string concatenation looks like, especially if someone does a diff on it. I’ll be damned if the diff didn’t fix one SQL string concatenation that was less than five lines from code with the same structure. Scary.
PHP and vulnerable example code is an additional thing. Most people just copypaste from tutorials. For example, the first search result for "PHP mysql example" gives you the wrong example first https://www.w3schools.com/php/php_mysql_select.asp
That doesn't mean the information supplied was accurate. Which, as it happened, it wasn't and this whole thing was really just a massive misunderstanding.
When I said "sanitized input" I was simplifying parameterized SQL in the case of reddit. But since I said "for your datastore" I was being less specific since different datastores require different methods.
Parametrised SQL isn’t sanitising your input. It’s injecting those values at the bytecode level, so your values stay separate from the query language. Calling that process “sanitisation” is, at best, highly misleading.
Also the methods don’t really change across different SQL databases, at least not conceptually. Sure the RDBMS drivers might change but these days that stuff is usually abstracted away into a single framework for SQL. The real significant change would be switching to a NoSQL database but if you’re doing that then it’s not SQL you need to be “sanitising” anyway.
> Calling that process “sanitisation” is, at best, highly misleading.
That's a fair criticism. But that is what I've called it for a long time.
> The real significant change would be switching to a NoSQL database but if you’re doing that then it’s not SQL you need to be “sanitising” anyway.
Right, that's why I started off by saying "Sanitize your inputs so that it can be safely stored in your data store", so that it could apply to any data store.
It's just a terminology argument at this point. But my main point was that you need to still think carefully about how you're going to store your data and do it safely, and then also make things safe on the way out, like the article suggests.
> That's a fair criticism. But that is what I've called it for a long time.
I think many people would agree that "sanitization" is loosely defined and I think that's exactly what led to the misunderstandings that the OP's article is trying to address.
> that's why I started off by saying "Sanitize your inputs so that it can be safely stored in your data store"
That could be interpreted as "escape quotes at the start of the request if you know that you're using a database where quotes have a special meaning", a la PHP magic quotes, which I'm guessing is not what you meant, but it is what the OP is criticizing. The key is that the sanitization (or whatever you want to call it) shouldn't happen until you're ready to insert into the DB; otherwise that data will be coupled with database logic through the whole flow of your app.
You don't "sanitize your inputs" for your datastore. You escape the outputs as you send it to your datastore. (Either through oldschool methods, or parameterization.)
Sanitizing your inputs means changing them permanently. You don't actually want to do that. You want to store exactly what the original value was, but you want to do it safely. When you retrieve it again, it should be the same as it was originally.
If you "sanitize" the input, you won't necessarily have the original value ever again.
Yes. It's the word "sanitize" itself that misleads people. It creates the mindset that input from users is dirty and must be made clean, and "clean" is "safe" to use in any context.
(I've seen the line of thought taken one step further: taking the realization that it's impractical to make strings universally safe for any context—even if you HTML entity-encode it twice, what if a recipient decodes it three times?—and concluding that security is hard and we can only approach it asymptotically, so shrugs XSS-like bugs are normal and unavoidable given finite time & budget.)
If the mindset is more like converting units, it becomes clearer. You can't concatenate HTML with a general Unicode string without converting the string to HTML first, any more than you can add inches and centimeters directly. "Cleaning" the centimeters would make no sense.
> But my main point was that you need to still think carefully about how you're going to store your data...
I think this is true, but doesn't quite reach the point GP is making: speaking correctly is also important. Calling parameterization input sanitization communicates the wrong message. And abstracting the wrong solution to apply it to a different problem isn't all that helpful. You could just as easily encode or hash input to fit the underlying data format without losing data (except in the case of truncation), but that isn't input sanitization, either.
Input sanitization is strictly checking front-end input against a ruleset and rejecting anything that does not comply. This is fundamentally different than dealing with anything thrown at you and handling it gracefully.
That's not how the word is usually used within development in my experience though.
I think most devs think of sanitization as "make X safe", not "see if X is safe, if not reject" since that is usually called validation.
Using hand sanitizer does not remove your hands if they have harmful bacteria. The hands are still there, just cleansed of (some of) the harmful parts.
We should certainly use parameterization always, and this will allow us to save 'Bobby Tables' type input into our databases, but we should acknowledge that there is then the potential risk that some internal, probably non-public-facing program or script, either now or in the future, will contain a bug that leads to its execution.
The spread of natural language processing into systems and analysis tools might increase the scope for this sort of thing.
I agree you can never say never but you also can't sanitise against a risk that hasn't been defined yet simply because you wouldn't know what needs to be sanitised.
For example what if your NLP is bootstrapped from a shell script and your database content has been stripped of SQL but still contains stuff that might be interpreted as $(sub-shells)? Before long you run into a situation where literally no characters are considered safe (eg even alpha characters in the English alphabet are used as tokens in some programming languages and "what if someone builds a script in one of those languages?").
The only sane way to address the unknown is to treat raw strings as "dirty" and follow best practices when handling them (plus all the usual processes to properly test your code before it's used in production). In which case you're back to no longer needing input sanitisation.
My point is not that input sanitization is a solution, but that parameterized database input is not the end of the issue, from a broader, whole-systems security point of view.
Many data types have some concept of well-formedness, and in those cases, there are pragmatic reasons for only accepting well-formed input that go even beyond the security aspect.
I completely agree and I also said this in my original comment you replied to.
This is why I got confused when you said "but we should acknowledge..." (ie thinking you were raising a point other than what myself and others had already acknowledged).
I honestly don't understand what the point you're trying to make is then.
If you're saying people should be aware that handling data safely requires more steps than just parametrised SQL, then yes, I touched on that, as have others, and it's not something anyone is unaware of. Hence why there have been so many high-quality posts discussing the different methods of validation, sanitisation and escaping. So it's a rather strange position to assume when you say "we need to acknowledge", given that's what everyone (including me) has been doing. But it never hurts to be categorical about important points like that, so your original post is still relevant.
If you're making some other point then I've already had two stabs at deciphering it and failed both times. So it's really not clear what that point is.
However if your intention was just trolling me then fine, I bit and you won.
> Sanitize your inputs so that it can be safely stored in your data store,
That's not sanitizing input, that's escaping output - if you subdivide concerns appropriately. The data you're saving into database is the output of your program.
Really, the problem is of language translation. User input is an unstructured blob. SQL, or HTML, are structured languages, with their own semantics. Whenever you cross the language barrier, you need to translate data from one language to the other. Parametrized queries are the usual API to SQL drivers, and they do this for you under the hood, producing a valid SQL query string[0]. When going to HTML+JS, you need to invoke some library (or do translation yourself).
(I really don't like the term "escaping". Translating between languages is more than just sticking slashes in front of double quotes.)
This is why "sanitizing inputs" is a nonsense concept. The problem is of language translation, and you can't translate if you don't know the destination language. A blob correctly sanitized for SQL will not be correctly sanitized for HTML, and an input correctly sanitized for both will look bad in either.
SQL injection and XSS are the same bug. Failure to translate between languages. Usually caused by a pretty stupid but somehow very popular idea - building target language expressions by gluing plain strings together[1].
--
[0] - Sometimes. I remember this is how it worked in the past, but not sure if server RDBMS APIs haven't changed since. For comparison, with SQLite, you're passing the query with placeholders to the SQLite functions, and the parameter values are passed as arguments. The SQLite internals turn this into an executable query, but I'm pretty sure this does not involve a query string with actual parameter values in it ever existing in memory.
[1] - A related source of footguns is using templating engines for web pages. HTML is a tree of nodes, not a plain string. Using a template system is a recipe for XSS problems.
Any actually decent RDBMS isn't stupid enough to first escape parameters, then parse the query string to find placeholders, then do a bunch of string concatenation, and then run the concatenated string through a second parser. It is really simpler and more robust to parse the query string once and grab the actual data value from the parameter array whenever a placeholder is found.
However, a lot of client side libraries are cheating and embed the parameters into the query string before passing it to the server instead of implementing the proper parts of the protocol. This is about as safe as doing plain string concatenation in the first place. I don't trust the library authors to actually get this right.
There's value in doing both in certain cases (as long as you're definitely escaping output), but I'd be careful about the motivation. You say: "Sanitize your inputs so that it can be safely stored in your data store" -- but it's safe to store any string in your database as long as you escape/encode it correctly (i.e., use parameterized queries).
For example, what if someone posts a legitimate comment on reddit helping someone with SQL syntax for deleting tables and includes "DROP TABLE users" -- did you sanitize that away?
What does the “basic SQL sanitization on the way in” consist of and what does it do for you?
The article makes sense to me; if I’m just storing strings in a database, I don’t see why they would need to be sanitized at rest, even if they contain malicious SQL code. Only when I actually come to use those strings for some purpose.
But what if at some point somewhere down the line someone forgets to sanitize the output? Surely better procedure to sanitize at both ends. Nobody is perfect.
You're thinking about this as if data can be in one of two states – untrusted or sanitised. This is not the case.
When you output arbitrary data, you need to encode it in a way that is suitable for that context. These contexts might be:
- Generating a web page.
- Including in a JSON response from an API.
- Sending an email.
- Storing in an SQL database.
These all use different formats / protocols that use different syntax to encode data. How you correctly encode data for one of them is different to how you correctly encode data for another of them. There is no method of taking untrusted data and "sanitising" it so that it is correct for all of them. What works for one will break for the rest.
If you want to handle arbitrary data correctly and safely, store it as-is and when the time comes to use it, encode it appropriately for the context you are using it in. Where possible, use tools and systems that get it right by default instead of requiring developers to remember to encode correctly, e.g. generate HTML with templating engines that encode data as HTML by default, and use parameterised queries with SQL.
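A rough illustration of "one value, many encodings", using only Python standard-library calls:

    import html, json, urllib.parse

    value = "O'Brien <script>alert(1)</script>"

    html.escape(value)         # for an HTML text node
    json.dumps(value)          # for a JSON string value
    urllib.parse.quote(value)  # for a URL component
    # For SQL there is no string-level escaping step at all; pass the
    # raw value as a bound parameter:
    #   cursor.execute("INSERT INTO users (name) VALUES (?)", (value,))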
> generate HTML with templating engines that encode data as HTML by default
Don't, unless you're sure the templating engine actually parses the HTML into a tree of nodes before interpolating and re-emitting it. Otherwise it's likely someone will interpolate something in an improper context, e.g. inside <script> or <style> block.
This is completely the wrong attitude. Code should be correct, not fail-safe. All of the safety hatches that people tend to introduce make the code less predictable and eventual problems tend to arise far away from where they originated, making them difficult to diagnose. What is the result of this sanitization? Now we have some undefined, changing internal string format running around in our application, and possibly multiple undefined and changing internal string formats, where it is also unclear at what point a string is supposed to be in what format. If something is an arbitrary string it should be allowed to be an arbitrary string and the things handling that should escape it the appropriate way. The article is correct.
I’m confused because I agree with this comment, but your other comments mentioned:
> "sanitizing input" is plain nonsense
> "Unsafe input" is not a thing.
Have you ever used hand sanitizer? The point of hand sanitizer is to reject infectious diseases outright.... you know, so they don’t get IN your body. You seem to have adopted a narrowly defined sense of sanitization which does not include “mercilessly discard/destroy”.
> The point of hand sanitizer is to reject infectious diseases outright.... you know, so they don’t get IN your body.
There is no such thing as "infectious input data".
> You seem to have adopted a narrowly defined sense of sanitization which does not include “mercilessly discard/destroy”.
None of the dictionaries I just checked support such a definition. They are all about "changing something to be more sane/sanitary/pleasant/acceptable/...".
Also, mind you, a hand sanitizer doesn't destroy your hand, it destroys microbes, in order to make your hand sanitary, so as to enable you to continue using your hand instead of rejecting/discarding it. Which is exactly the kind of thing you should not ever do with input data.
No, it doesn't. That is what "accept anything" programming leads to.
The idea that "accepting all the inputs" somehow gives you an advantage is an illusion: if the semantics of some input are not well-defined, then the only thing you gain by accepting it anyway is hard-to-debug interoperability problems and vulnerabilities. When some input is not well-defined according to the spec, then your interpretation is just a random guess, and the next developer will make a different random guess as to what that input means, and so an interoperability problem and potential vulnerability is born. If you reject the invalid input, you will notice the error and thus fix the source of the invalid input to produce input for which the semantics are actually well-defined.
This sounds good in theory, but I'll give a counterexample.
Requirement: Name input box.
Implementation: We'll sanitize the input by rejecting any characters likely to be dangerous if mishandled, like single quotes, or anything else we don't immediately imagine to be useful. If a character turns out to be needed later, that's no problem. We'll just change the list.
Security audit: Passes
Later customer complaint: I can't sign up! — J. O'Brien
Dev team: Sorry, too bad. We'd have to re-audit everything and possibly modify code to allow your last name, because there might be code somewhere that relies on the original sanitization for security. That was the point of sanitizing on input, after all. If you want to sign up, it would be easiest for us if you would just change your name.
I think you misunderstood my point. I am not saying that you should reject valid (that is: semantically meaningful) input, but that if you are confronted with semantically meaningless input, you should reject it rather than garble it so that it gains some random meaning.
So:
Name input field, value "J. O'Brien": accept
JSON parameter, value "{foo:bar}": reject
The context was the idea that you should gracefully accept bad input. If your code considers "J. O'Brien" bad input for a name, then that's the problem, not that it doesn't accept bad input.
Yes, I completely agree in the above case. The JSON input has a well-defined format and input validation should reject it outright.
The issue is that when developers hear they should "reject bad input" in order to avoid vulnerabilities, they often interpret it as a call to reject any user input that isn't already known to be good. Since user inputs are often free text, like the name field, they wind up forbidding any input they hadn't specifically imagined, which doesn't align with any particular recipient's actual data requirement. It creates false-negative edge cases while only providing illusory help against vulnerabilities.
I mean, I generally agree, but I think it's already problematic to frame it as "user input that isn't already known to be good". Because "J. O'Brien" is known to be good. The problem is that anyone thinks in the first place that some semantically meaningful input value for some reason is not good.
You can't sanitize for output at input time, as the sanitization that needs to be applied is different for HTML, JS and JSON. You don't know that at input time.
Well, use libraries/frameworks that ENFORCE sanitizing and make outputting raw content the exceptional case.
Examples.
PHP: Using mysql_escape_string is a no-no - you will forget to add it one day. With parametrized queries you won't write unsafe SQL.
.NET Core - Outputting to HTML by default only emits those chars which are in a predefined Unicode range. All other chars are converted to HTML entities. If you want to output raw, you must explicitly use @Html.Raw https://docs.microsoft.com/en-us/aspnet/core/mvc/views/razor...
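Jinja2 in Python can be set up the same way; a sketch (note that autoescaping has to be switched on here, a bare Environment() doesn't enable it):

    from jinja2 import Environment

    env = Environment(autoescape=True)   # escaping becomes the default
    tmpl = env.from_string("<p>{{ comment }}</p>")
    print(tmpl.render(comment="<script>alert(1)</script>"))
    # -> <p>&lt;script&gt;alert(1)&lt;/script&gt;</p>

    # Raw output must be requested explicitly, analogous to @Html.Raw:
    #   {{ comment | safe }}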
This doesn't pass the sniff test. If someone can forget to sanitize the output, someone can also forget to sanitize the input. The most important things are to understand where the content is used, use the appropriate output encoding/escaping, have rigorous tests to ensure your expectation correctly escapes nasty strings, and that you keep the output escaping code up to date to protect against novel attacks and new browser/app features.
I worked at a social media company with one of the largest text-based user-content-stores in the world at the time. Some of the features had input-side encoding and some had output-side encoding. I was there ~10 years after the bad practice of input-side encoding started and it very quickly became too cumbersome to know exactly which fields were encoded with what encoding (and I mean both character encoding and htmlentities / specialchars / specific character stripping / etc). We started getting ridiculous bugs like passwords could not contain '&' characters or logins would fail matching what we had in the DB.
It's not about being perfect. That will never happen. It's about storing exactly what the user submitted (if it is accepted by the POST submission logic) and to correctly encode the output for the correct security context (HTML, XML, JSON, html entities, html attribute, script tag, styles/stylesheet, urls, uploaded filename / file contents, filesystem injection, command injection, etc). These all have different rules. You can unintentionally open yourself to a vulnerability in one if you only expect the output to be displayed in HTML.
I'm no fan of sanitizing inputs (transforming unsafe input into safe input and then storing it), because, as you say, someone will find a way to circumvent that.
What I always do is exactly specify what is allowed in any input, by parsing and schema validation. If it is HTML, I run an HTML parser to validate accepted tags & attributes. If it is plain text, I validate that there is no HTML in it, etc.
If the input fails the filter then you deny the request.
This has the advantage that you always know what data structures you are storing in the database and that will make future data migrations much easier.
Drawback is that if your filter is too strict then you deny a valid request; however, it is easier to loosen a filter later than to migrate unwanted/unknown data that you accidentally accepted.
Stored input is also part of your database schema.
And of course, always escape output even if you know the data is "safe".
> Apparently I shouldn't have simplified "parameterization of SQL" as "sanitize your input"
It's not a simplification, it's incorrect. An SQL query is output. You are sending data out of your application code, via the SQL library driver.
It may seem like I'm splitting hairs here, but I've seen this distinction misunderstood in this way often enough in situations that severely compromise security.
Some people think of "output" as purely the end product of application flow, and all I/O that happens in between is somehow lumped together as indistinct "input". There's a reason I/O has two separate letters, and the distinction is crucial for securing your application.
TL;DR:
values passed to an I/O function (like stdout, file.write(), response.write(), db.query(), etc.) are all output; those returned from an I/O function (like stdin, file reads, db query results or requestObject.getQueryParam()) are input. Sanitize the former, NOT the latter.
Validate the latter (ideally; though I'd say this is more about stability than security).
That sounds like a broken datastore API. In a properly designed API, you don't need to escape anything, because the API implementation ensures your data doesn't get read as code.
It depends on the perspective, in case of SQL I would argue that sanitizing the input is the same as escaping the output, because the query you are sending to the database is the output.
Escaping the output, however, is a term that implies you are doing it right, while sanitizing the input could also mean you just replace("DROP", ""), etc. (My last name is Dropmann, I know what I am talking about.)
The difference is where it's done. "Sanitizing the input" implies that it happens when the value is read, so that all uses of the value are stuck with a single result. "Escaping the output", in your example, would happen in the database or its driver, for parameterized queries. HTML output of the same value in the same request would be escaped differently within a function that builds HTML output.
Can you explain to me why reddit feels like it is held together with duct tape? IMHO, it has the most problems with site uptime and basic functionality of any major site on the net. I am always getting search problems, site unavailable, or some other such glitch with it. I can't believe you guys just don't know what you're doing, so what does the present setup offer that is worth this shitty performance?
> always getting search problems, site unavailable, or some other such glitch
This sounds like my spouse saying "you always...". It's obviously not "always"; I think you'll get a better reply when you quantify it. For example, in the last month, how many times did you get a search problem (which was it), a site unavailable or something else (what was it).
It's about once per session for me. Seriously, compared to every other major site I know of, it is in a class of its own for flaky UX. At least one of the things I describe above per afternoon, let's say. Often many more than one if it is having serious problems. As for the issues with you and your spouse, I will just say it sounds like there is an opportunity for improved communication there. Best of luck.
Are you implying that Reddit is concatenating SQL queries instead of parameterizing properly? Because based on the reliability of the website I'm not surprised by this, but it's still hilarious.
Now, you have amended your comment to really say something very different (how is parameterized SQL 'basic' anything when it's actually the correct, complete solution to the problem?).
But in any case, this still suggests a complete misunderstanding of the point of that blog post. As far as that blog post's point is concerned, the SQL database is an output of your program. And the whole point is that you need to escape/encode all outputs correctly, but you should not ever sanitize anything.
Because, as others have said, parameterized SQL is a long-standing solution, so I consider it pretty basic; but I still consider it a form of sanitization, and one that not everyone uses despite how long it has been around.
I think it is totally fair to interpret the article as saying that the database is a program output. And if you interpret it that way, what I said doesn't make sense.
Yeah, I "LOL" at these type of One True Way™ proclamatory headlines.
What I don't understand is the lack of using proper escaping functions when generating SQL. Templating SQL without escaping is the surest way to a SQL injection.
--- Bad
"SELECT * FROM USERS WHERE name = '%s'" % (name)
--- Good
"SELECT * FROM USERS WHERE name = '%s'" % sq_esc(name)
# where sq_esc() doesn't add the outer ', but escapes anything that needs it
And to defensive coding:
0. Sanitize input. Always, always.
1. Assert pre-condition invariants.*
2. Process.
3. Assert post-condition invariants.*
4. Generate correct output by understanding the output domain.
* Unit tests, smoke tests, integration tests and code-coverage alone are insufficient to cover complex code paths. Fuzzing with asserted invariants is a good way to shake the dust out of hairy code.
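A toy sketch of steps 1-3 (the function and its rules are hypothetical, purely to show where the invariants sit):

    def apply_discount(price_cents: int, percent: int) -> int:
        # 1. Pre-condition invariants
        assert price_cents >= 0, "price must be non-negative"
        assert 0 <= percent <= 100, "discount must be a percentage"
        # 2. Process
        result = price_cents * (100 - percent) // 100
        # 3. Post-condition invariants
        assert 0 <= result <= price_cents, "discount may never raise the price"
        return result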
So, essentially he is saying:
1) go ahead and accept the risky thing from your users, and store it right there in your database, but
2) make sure that you remember, in every single place in your code where you read that out of the database, to treat it properly, and
3) make sure that every other programmer, now or at any time in the future, remembers to do this also, in any code they write which reads user input out of the database and puts it on the screen.
What a bad idea. Don't leave landmines there for other maintainers of the code to step on. Especially because the other maintainer may actually be you, six months or a year from now.
That simply does not work. You can't sanitise, escape and reproduce correctly all at the same time.
Say you run a blog. I post a comment saying "But in this case, B<A!"
This is clearly dangerous input! But it is also exactly what I wanted to say. How do you sanitise this? Change < to &lt; in the database? Now you have to remember to NOT escape that again when outputting! And you have to make sure that, say, your text resources in your UI are all also escaped the exact same way, or you have to remember to escape them DIFFERENTLY than user-provided input.
Or maybe you "sanitise" by stripping out dangerous characters like "<". Now you have broken my comment.
The only strategy that is at all maintainable is to store the comment as received, and to escape on output. Anything else is massively fragile or broken.
> You can't sanitise, escape and reproduce correctly all at the same time.
That's why you do them at different times...
Let's go to the example:
> This is clearly dangerous input!
That's not clear at all. There is a set of values allowed for a comment, and this one is probably within them, while, for example, an empty value usually isn't, nor is an invalid UTF-8 sequence. This one should pass sanitization as is.
> Change < to &lt; in the database?
You escape it when converting into HTML. It's not the same as sanitization.
If your rules say comments will be truncated to 1000 chars, you do it on the input. If your rules say all prices are in dollars, but your frontend accepts other currencies, you convert on the input, and overstaff your customer support.
Honestly, those names mean a lot of different stuff to different people. It's not good that there are so many; it's more a consequence of how widespread bad practices are.
For people who think of the terms like I do (sanitize means modify, validate means accept/reject): if your rule is "comments will be 1000 chars or less", then the validate reasoning says reject the 1001-char comment, while the sanitize reasoning says truncate it to 1000 chars.
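In code the difference is a single line each way (hypothetical helpers):

    MAX_LEN = 1000

    def validate_comment(text: str) -> str:
        if len(text) > MAX_LEN:
            raise ValueError("comment too long")  # reject: the user keeps their data
        return text

    def sanitize_comment(text: str) -> str:
        return text[:MAX_LEN]  # modify: data is silently thrown away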
Maybe I'm overengineering, but couldn't you store the sanitized version as the normal value, and also store and make publicly available the original unsanitized value in an ominously and obviously named key (say, dangerouslyUnsanitizedValue) that happens to be easily greppable/lintable?
Plain text can contain anything and it shall be treated as such, it is that simple.
As for security, don't assume everything in your database came from a trusted source. Maybe there are remains from an old version of your code that didn't sanitize, maybe you improperly used admin tools that bypassed checks.
How would you determine which value to display? It seems to suffer from the same issue where if you display the sanitized value then the comment is still missing necessary characters, but if you use the unsanitized value then your application will be vulnerable to XSS.
In most cases, that would be overengineering, but it is an entirely plausible solution if you happen to have a case where you need to allow the user to enter things like angle brackets, and for some reason you cannot escape them.
Of course not. The fact that “<“ is risky isn’t part of the string, it’s part of the output format (HTML).
If you were to write that string to JSON or CSV, you would have to special-case double quotes. In POSIX shell, asterisks and question marks need special attention, etc.
You should sanitize the input when possible, so that numbers are really numbers, strings are really strings, slugs and similar are cleaned... But of course you can't clean text so that it will be safe when displayed. After all, `<` is only problematic if you are displaying the text as HTML, which, while common, is not a given.
When displaying anything, you should however use a _framework_ that doesn't allow you to display anything that would not be safe (unless you use some function with "UNSAFE" or "DANGEROUS" in its name). For example React does that, and others too.
There are many different kinds of attacks and the less leeway an attacker has, the safer you are. So sanitize both, input and output.
The solution: don't try to reproduce input exactly. That's a weird thing to want in general anyhow - what exactly are you reproducing so exactly? Text? Including markup? How about some animation thrown in? Maybe interactivity? Hey, let's just reproduce arbitrary executable code accurately?
The whole point of sane sanitization is that you don't need to reproduce all that stuff exactly. Pick a small domain, and reproduce that. Often, it's OK to reproduce approximately; e.g. not worrying about things like retaining multiple consecutive whitespaces, or perhaps leading/trailing whitespace, or whatever.
The point of sanitization is to make it easier not to make a dangerous mistake accidentally. If you have an input that needs to support layout, that's a pain. But if you can live with just text - so much the better. If you do need to support markup; then I don't see the wisdom in sanitizing it late; that's just asking for bugs to lead to security issues.
Frankly the whole tradeoff is nonsensical. These aren't mutually exclusive alternatives, and don't even really address the same issues. Yes, you should sanitize (and validate) your input. And you also need to escape output as appropriate.
If the point is that it's not wise to skip escaping because you "know" the input is safe due to sanitization - then sure, while theoretically sometimes sound, that's practically a nasty bug waiting to happen. Don't do that, sure.
No; that's just you picking an absurd example rather than being practical.
Pick a reasonable domain for each input field, considering what kind of input is useful, and what kind of usage in output (i.e. plain text output is likely much less risky than rich text). There's rarely a reason to ban < in plain text; but retaining stuff like zero-width joiners or rtl-ltr transitions is likely less valuable, and potentially an issue in things like usernames or email addresses (because they make it trivial to make apparently identical usernames). Similarly, if you're storing a telephone number and want to retain spaces - are you going to retain nul chars too?
Not all input should allow arbitrary plain text. I'd guess most don't, and lots of input is at least rich text nowadays (not to mention images and other media - you think it's a good idea to just reproduce an arbitrary image exactly?).
No, there is nothing to remember. Every decent HTML templating engine these days will handle all free form text as unsafe and escape it automatically. The same for SQL libraries.
Input sanitization doesn't work, because it doesn't know what is dangerous and what is not dangerous. That depends completely on the output domain, and at the point where the inputs are received, the output domain is often unknown. Data can flow through many layers of business logic and then be passed to an SQL query, an HTML templating engine or anything else.
If you don't consider database strings to be free form text when constructing HTML, then there's a good chance there will be vulnerabilities anyway, regardless of whether any sanitization has been applied.
This has absolutely nothing to do with shitty validation rules. It has to do with what you're outputting it as.
If you're outputting it to HTML, commas are fine. If you're outputting it to CSV, commas are bad. And your validation rules suck if you don't allow commas in any text field because it might be output to CSV someday.
Unicode RTL override can be dangerous in filenames (in the sense of confusing humans, not computers). It is necessary to preserve in content management systems dealing with bidi content.
Of course you can try to have different validations on your model, but then you need to make sure to know all output domains on the model level instead of doing it when handing over to the view.
Sadly, there is no "the sanitization". JSON, SQL, HTML, CSS, and URI (and the future formats not invented yet) all require different escape schemes, so while you can indeed filter out anything that can be interpreted as an escape sequence in any of those formats, that's not something you can always do.
Instead, render your data properly (yes, that includes escaping whatever you're outputting).
Like they're BEGGING you to never use it but understand sometimes it may be necessary or at least is needed to avoid worse hacks to achieve the same function.
I've taken to using prefixes like DANGEROUS or UNSAFE in various parts of our codebase to better indicate to the user where extra caution is needed.
Or treat your database as a user and don't blindly accept what it gives you without validating it.
At a previous job I implanted a user record that would have given me admin credentials on the next db migration (which happened pretty regularly) because the developer of the migration tool said "Why should I not trust this data, it came from the database". It was "sanitized" by the app for its intended use, but not for the ETL tool.
If you sanitize your inputs you automatically create the assumption that the database is "safe", but you also have to sanitize it for every potential future usecase that the data might be used for, which is not clear when you are writing the sanitization code. Can you foresee every type of use the data will have in the future? can you know every ETL step that will be written 5, 10 years down the line? If not, it's safer to treat the data as untrusted, and if you are going to do that anyway it's a whole lot easier to just not sanitize since you will otherwise deal with double-sanitization or double-escaping.
"The risky thing" here is a text string. So yes, you should be able to accept an arbitrary string and store it in your database without shooting yourself in the foot, and the same when you read it out of your database and put it in HTML, or JSON, or CSV, or XML, or ASN.1, or protobufs, or in another database.
Handling escaping in HTML or REST API is web dev 101, this shouldn't be controversial.
Whenever you say, think, or imply something starting with "So from now on we just have to remember...", you're really saying "Let's decide this part of the system will always keep breaking!".
> So, essentially he is saying: 1) go ahead and accept the risky thing from your users, and store it right there in your database
Yes. It's hard to know what "risky" is when you're taking input. You don't have the context of its use. What if someone is discussing html in a comment and wants to refer to an example? Or the same for SQL. You're going to be using a heuristic to try and "guess" this, and you're forever going to be trading off annoying your users with security, which is never a good place to be in.
> 2) make sure that you remember, in every single place in your code where you read that out of the database, to treat it properly (and 3 more of the same)
No. You simply do not use unsafe methods of mixing this data with your output. Use any remotely modern markup templating system which has a way of tracking the escaped status of text and auto-escaping it when necessary. Use an ORM or at least a database connector which has inbuilt parameter escaping support. Do not just do regular string formatting with this kind of stuff, and don't hire the sort of people who do.
> Don't leave landmines there for other maintainers of the code to step on
You don't know what is a landmine until you know the context of its use.
It is much easier to secure a perimeter at one point and know that everything on one side of it is safe and everything on the other side of it is not. The only practical place to put this perimeter is the point of use/output/whatever you want to call it. Having several different places where this data gets mangled, and lacking clarity about exactly which pieces of data are safe and which are not at any one point, is a recipe for disaster.
"But we'll also escape everything on output" - this leads to double-escaping and weird bespoke hacks to work around the resulting artefacts, which themselves will likely open up holes.
And this is not even touching on the idea of a data field's safeness "living with" the data. "Field xyz is safe, we sanitize it on input" - cool, but we only started sanitizing the input in September 2019, so any fields from before that are unsafe.
Not only can you easily see every place in the code that is using this construct, but the framework provides a warning.
I definitely don't think it's bad to sanitize HTML content. But in general MOST text in a web app should be just text, rendered with whatever HTML it contains escaped. In very few places should the web application give users rich-text (aka HTML) access. In any place that the application does that, sanitization should be used.
No, nobody has to remember anything. You use prepared statements for your database, you use autoescaping templates for the output, and it just happens automatically. I wrote about this four years ago: https://palant.de/2016/03/02/why-you-should-go-with-secure-b...
On the other hand, when you "sanitize" input you get immensely complex code. Whenever you get some data in, you have to remember to sanitize it. You have to know in advance how that data is going to be used and what might be problematic there. Worse yet, as your usage of that data changes (or your understanding of the problems), the data sanitization has to change as well - both for existing data in the database and everywhere where this data comes in.
I've seen codebases doing this, where it's impossible to tell whether a particular piece of code is a vulnerability without looking up tons of context. That's the minefield for other maintainers. Don't do this.
It's interesting how so much of programming is knowing about the existence of libraries and not trying to rebuild things that already exist. I'm sure there are thousands of people out there who sanitize inputs and outputs naively and don't know about great libs like DomPurify.
I know I'm guilty of trying to build something from first principles only to google it after banging my head against edge cases and finding a ready-made library or util that with a tiny bit of finesse or modification does the job.
TBF, most of these libraries are not easily discovered. You have to luck into the chance that the library author used similar words to your search query. Barring major stuff (authentication, databases etc), you will rarely know a library exists. Search engines are still limited by the language of humans. Unless and until you note down every possible library you come across for future reference, this problem is here to stay.
Maybe a language having a vast standard library won’t suffer from this problem but it will definitely have other problems.
yeah, creating a personal search db is time consuming and kinda impossible... a while ago everyone was coming up with bookmark managers that kinda sorta could function as a personal search db, but it still required a lot of customization. Also they were all cloud based, and didn't really function offline.
I don't think there is a golden rule for this stuff besides having a very strict CSP that only allows same-site resources and no eval/inline with reporting to see when someone has tried to do something bad.
One of the most useful skills I've developed over the years is a very strong instinct for "someone else will have solved this already in an open source library", combined with intuition on exactly how they would have described it so I can find it with search.
I wrote a very similar piece six or seven years ago. People responded here that I was just making semantic arguments or otherwise fell over themselves to ignore the point I was trying to make.
It is good to see from the responses here that we've learnt absolutely nothing.
Agreed, it seems what this article (along with most of the comments) is dancing around is the idea of a stronger type system. String != string, e.g. UserText vs SqlStatement. Using explicit conversion methods between those types helps clarify the actual “boundaries” of your system, or rather the independent parts and their individual boundaries. Joel Spolsky’s article illustrates the problem well.
The problem stems from our simplistic type systems, which reflect a value’s storage class (“array of bytes”) rather than its semantic type (“raw input from user”, “sql safe TEXT literal”). Once your type system can differentiate between those two, it can help you identify where conversion between the two (aka “escaping” or “encoding”) is necessary. Then the problem of “dangerous string” disappears, because there is no more string: if it’s a UserText, it can’t be concatenated with a SqlStatement without conversion. Just like an “int” for example, or an array.
Anyway I’m just rehashing Spolsky’s article, poorly. Don’t let my inaccurate summary reflect negatively on his point :p
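For what it's worth, a rough Python sketch of the idea (all names invented; the point is that the only route from one type to the other is an explicit conversion a type checker can enforce):

    import html

    class UserText:
        """Raw, untrusted text straight from the user."""
        def __init__(self, value: str):
            self.value = value

    class Html:
        """A fragment already correctly encoded for an HTML text context."""
        def __init__(self, markup: str):
            self.markup = markup

    def to_html(text: UserText) -> Html:
        # The sole doorway between the two types; mixing UserText into
        # Html any other way fails type checking.
        return Html(html.escape(text.value))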
Arrrghh, it's 2020 and developers are still too dumb to understand the difference between a string (list of characters) and a tree (sql, html, ...).
No, the solution is not to "sanitize" input, output or both! The solution is to not use the same type for text and trees!
We should rid the world of brain-damaged templating solutions like jinja, go template and similar garbage that pretends your SQL or HTML is some flat string that just needs a bit of extra contextual "escaping" magic [1].
If you interpolate into a proper AST there is no problem and you need no "escaping". For efficiency reasons you may not want to literally interpolate into an actual AST and then serialize all of it to a string again just to send it down a socket in the next line, but that's just an optimization.
Bourne shell fucked this up as well (for an easier case of interpolation, since there is basically no nesting) and it remains a constant source of severe bugs and security holes in shell scripting. By contrast, lisp has been doing this right for literally decades.
[1] Yes, I get it's "convenient" to have a "universal" solution that works for any type of file. But html and sql in particular are easily important enough to have a correct solution and it's not hard, it's not inherently slower and it's way more convenient in any real sense than puzzling about several rube-goldberg sanitization schemes, running some stupid security linter and paying $$$ to pen-testers for hunting down this crap.
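To make the AST point concrete with nothing but Python's standard library (ElementTree standing in here for a real HTML5 tree builder):

    import xml.etree.ElementTree as ET

    node = ET.Element("div", attrib={"class": "comment"})
    node.text = "But in this case, B<A! <script>alert(1)</script>"
    print(ET.tostring(node, encoding="unicode"))
    # -> <div class="comment">But in this case, B&lt;A! &lt;script&gt;alert(1)&lt;/script&gt;</div>

The text was interpolated into a tree, and correct escaping fell out of serialization for free; at no point did the markup exist as an unescaped flat string.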
I think this stuff is hilarious too, especially in the era of webpages that are really Javascript apps fed by a JSON API. It is not at all hard to have your JS app directly operate on DOM nodes without using markup at all. Create the DIVs and whatever else you want, create text nodes and shove them in there, done. Zero possibility of injection attacks because you told the browser "this is plain text, don't parse it" instead of making it guess.
I hand-rolled a library to do this back in like 2011. It took less than a day. The only downside was that you had to write the markup in JS code instead of templatized HTML, but it wasn't even that hard with a bit of syntactic sugar. It's fast too - creating DOM nodes directly is much easier than parsing HTML to create DOM nodes.
Parser support is non-existent in the standard libraries of most if not all languages. Every language I know can parse regular languages at best. Parsing HTML and SQL and manipulating the resulting tree is not the first solution developers think of.
We should be able to look up some RFC, give the EBNF grammar to a library and get a parser out of it. In order to do that today, we need to use ancient parser generator tools. Why? A parse(grammar, input) -> tree function would be easier to use. The Earley algorithm can receive a grammar as input.
Well, I'm not some prefix-fanatic, but much of the problem would not exist in the first place if we had just used some sexp style syntax for HTML, it would be more pleasant to edit, and faster and much easier to parse for both humans and machines to boot. Another billion dollar mistake.
So I feel a bit ambivalent about attempts to lower the costs of pushing out more over complex grammars into the world. When have you last used in earnest a non-sexp/internal DSL for something like build systems that didn't engender in you an occasional urge to visit physical violence on its creators? But what I'd unambiguously like to see is easier parsing of "sane" languages and the death of perl-style regexps.
Still, my guess would be: 95% of the trouble comes from two languages: html (including js, css, svg etc., unfortunately) and sql. And most of the remaining 5% from bash :) So just dealing with those three would make a big dent.
Also things are not quite so bleak as you make them out to be: jsx is much saner and any established mainstream language has a conforming html5 parser these days (sadly, to do it properly you also want something that deals with the various other languages that get munged into html: css, javascript, gimped xml and here the situation is less good). SQL is thornier (and has many wildly different dialects), but unless you need dynamic queries, parameterized queries are available everywhere.
In fact a 1% effort/80% of the benefits approach is to not bother with parsing at all and to just use different types for e.g. HTML and text: interpolate HTML into HTML via plain string interpolation, and text (i.e. plain strings) into HTML as escaped strings.
You can validate that all input is valid UTF-8. But the moment you start to 'sanitize' non-UTF-8 input into UTF-8 you are in trouble. It's best to notify the user that the input validation failed and that you don't accept the input.
As for XSS I think browsers could have done more to fix this. For example add some tag like <unsafe> or <sandbox> for part of HTML that cannot have access to cookies and javascript on the page and disables any active components, like iframes and objects. Developers can use them to renderer rich content provided by users. Right now you can do this with iframes and CORS only, but that's too heavy to implement. These tags could have their own CORS limits for example.
Why do I think it's a browser problem? The security of the output needs to be re-reviewed every time browser features are added, and only at the browser level can it stay up to date with all new features.
It does, but only on the whole document. The problem is that by default it allows everything, and people are too lazy to find out how it works and set it up right. Also you need to open every third party separately, which could work badly for ads. If there were a sandbox, you could allow only what you need for a particular part. I understand that this concept looks complex and more like an iframe. Basically, a lot of ad content right now is rendered in iframes without src, which are kind of sandboxes in this case.
You don't "sanitize" input, you "sane-itize" input. The whole point of checking the input is to make sure it's valid, not to try to scrape away cruft you don't think is valid.
Example: phone numbers. Is the input you accepted a phone number? Well, if you were just "sanitizing input", you might pass it through some generic "sanitizing function" that just checks if it had "malicious characters" or something and strip those out. But what you should actually be checking is is this a phone number? By making sure the input is what it's supposed to be, you not only gain a better security posture, but you improve your program's reliability by making it operate in the way you expect it to.
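A minimal sketch of that mindset (the format rule here is invented and deliberately loose):

    import re

    # Hypothetical rule: a North American number with optional punctuation.
    PHONE_RE = re.compile(r"^\+?1?[\s.-]?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$")

    def parse_phone(raw: str) -> str:
        if not PHONE_RE.match(raw):
            raise ValueError("not a phone number")  # reject, don't strip
        return "".join(ch for ch in raw if ch.isdigit())  # store canonical form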
Some input fields like "give me a random block of text and I'll store it" are very hard to validate, so for those fields you can encode them as Base64 for storage, and at output time decide how to format them safely.
Also consider wrappers for any functions which receive input from a user. Perl's Taint Mode (https://en.wikipedia.org/wiki/Taint_checking) is a global way to enforce this; for languages that don't have a Taint Mode, you'll have to implement it yourself.
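A rough DIY version of that idea (all names hypothetical) is a wrapper type that refuses to behave like a string until the value has passed a check:

    class Tainted:
        """User input that has not been validated yet."""
        def __init__(self, value: str):
            self._value = value

        def validate(self, check) -> str:
            # The only way to extract the raw string is through a check.
            if not check(self._value):
                raise ValueError("input failed validation")
            return self._value

        def __str__(self):
            raise TypeError("refusing to use tainted input directly")

    ok = Tainted("Alice").validate(str.isalpha)   # returns "Alice"
    bad = Tainted("Robert'); DROP TABLE users;--")
    bad.validate(str.isalpha)                     # raises ValueError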
Some people really go overboard with this sort of validation, however. I've had to argue with vendors who didn't believe email addresses can end in something other than ".com" or ".org".
1. Sanitize input only when you actually need/want to, and at least to a degree that makes it safe to put in a database (e.g. through prepared statements)
2. Always validate input
3. Always escape output (unless you have a reason not to).
No, it's terrible advice, as it only causes unnecessary interoperability problems and vulnerabilities. There is no reason why anyone should need to send invalid input to your program, and it is never a better idea to make every consumer more complex to deal with broken input than to make the one producer create non-broken output.
The only robustness to invalid input you should have is that you should not fall over when you encounter broken input, but simply reject it.
Both TCP, whose spec Postel wrote, and HTML follow this principle, so it seems to have its merits.
You know what followed your principle? XHTML ... and the arguments were the same: it's not well formed, just reject it; why would you ever accept broken input?
Sure, that makes parsing faster and simpler, and yet what actually works and is robust in the real world is HTML ...
That HTML is a platform with an extraordinary security track record? No one has ever exploited all the ambiguities that result from the incoherent mess that is the web?
Or is it that we never had any interoperability problems with HTML? All browsers always reliably rendered websites consistently? "This website is optimized for IE" never happened?
How isn't that just the best example to support my point?
As for TCP ... how is it relevant that Postel wrote the spec? Does that mean that the vulnerabilities in TCP never happened? Or are you saying that modern TCP implementations try to accept any crap whatsoever? (No, they don't, of course they don't, people have actually learned that that's a bad idea.)
People seem to prefer web sites that render inconsistently rather than not at all because of one little issue in the markup. It is more robust to render something rather than nothing and is one big reason XHTML was abandoned.
Yes a system that no one uses is more secure than one everybody does.
Postel's Law is literally in the TCP RFC [1], don't you think that makes it relevant?
> People seem to prefer web sites that render inconsistently rather than not at all because of one little issue in the markup.
Except those are not the alternatives. The alternatives are consistently rendered websites or inconsistently rendered websites. If browsers had strictly enforced HTML syntax from the beginning, no one would ever have built websites with "little issues in the markup".
IP stacks do not accept randomly misformatted IP packets. The result is obviously not that you constantly encounter internet services that you cannot access because your IP stack is picky about broken IP packets; the result is that no one ever sends you broken IP packets.
> It is more robust to render something rather than nothing
No, it just isn't. You are just looking at a very small part of the consequences of this implementation strategy that indeed happens to be positive, but completely ignoring the big picture of all the externalities and other indirect damage that result from it.
> and is one big reason XHTML was abandoned.
Erm ... no? The reason why XHTML was abandoned was because people are incompetent at writing software, and there existed an alternative that allowed them to keep their idiotic practices, including all the vulnerabilities and interoperability problems that result from those, so that's what people did.
> Yes a system that no one uses is more secure than one everybody does.
How does that follow? And what does that have to do with anything?
> Postel's Law is literally in the TCP RFC [1], don't you think that makes it relevant?
Relevant ... for what?
>Except those are not the alternatives. The alternatives are consistently rendered websites or inconsistently rendered websites. If browsers had strictly enforced HTML syntax from the beginning, no one would ever have built websites with "little issues in the markup".
That's not reality. If everyone got perfectly formed input we wouldn't be having this debate; the reality is that it occurs, so what do you do: reject it, or accept it and try to do something with it? XHTML simply rejects malformed markup and you get a blank page; HTML tries to make sense of it and render something.
>IP stacks do not accept randomly misformatted IP packets. The result is obviously not that you constantly encounter internet services that you cannot access because your IP stack is picky about broken IP packets; the result is that no one ever sends you broken IP packets.
So you never heard of ECN? The ECN bits being set was technically incorrect depending on how pedantic you were in the interpretation, and some stacks rejected packets if the bits weren't set to zero. Due to the robustness principle, most stacks ignored these bits, allowing others to use them for ECN and allowing a graceful update to the spec. The stacks that took your stance and rejected them were simply roadblocks to adoption.
>No, it just isn't. You are just looking at a very small part of the consequences of this implementation strategy that indeed happens to be positive, but completely ignoring the big picture of all the externalities and other indirect damage that result from it.
I'm not ignoring anything, I am just pointing out reality: the real world is messy, and the stacks that try to keep working under messy conditions seem to be prevailing. It's not pretty and I don't deny the issues that arise, but here we are, communicating on the largest, most successful computer network ever built, using a protocol and a markup language built with Postel's law in mind.
>Erm ... no? The reason why XHTML was abandoned was because people are incompetent at writing software, and there existed an alternative that allowed them to keep their idiotic practices, including all the vulnerabilities and interoperability problems that result from those, so that's what people did.
I think most who know the history would disagree with this opinion [1]. It was obvious to me at the time why XHTML would fail, even though I thought it the cleaner solution; I realized that's what was holding it back. It was much better to see your page come up with maybe a weird rendering artifact than to have the browser render nothing and throw an error because some small part was malformed.
>How does that follow? And what does that have to do with anything?
Because complaining about security vulnerabilities found in some of the most used software in the world, while comparing it to something that no one uses, doesn't help your point.
>Relevant ... for what?
Uh, gee, I don't know, maybe Postel's law is kinda relevant when discussing TCP because Postel wrote the spec, you know, like what you asked in the post before? What kind of game are you playing here?
> That's not reality. If everyone got perfectly formed input we wouldn't be having this debate;
Erm ... you do understand that, you know, there is feedback involved in this? That I am obviously not saying that no one would ever have typed broken HTML into a file if browsers had rejected broken HTML from the start?
I mean, it's even the norm for implementations of other computer languages to be rather strict about syntax, and it doesn't hinder their popularity with the same audience. The exact same people who produce garbage HTML do so using Perl or PHP or Ruby or ... whatever. And whatever you otherwise think about those languages, none of them will just make shit up when there is a syntax error in your program, they will simply reject it. And no, that does not mean that I am claiming that no one has ever made a syntactical mistake when writing code in those languages. But, you know, people are actually capable of fixing those mistakes when they are pointed out to them.
> So you never heard of ECN? The ECN bits being set was technically incorrect depending on how pedantic you were in the interpretation, and some stacks rejected packets if the bits weren't set to zero. Due to the robustness principle, most stacks ignored these bits, allowing others to use them for ECN and allowing a graceful update to the spec. The stacks that took your stance and rejected them were simply roadblocks to adoption.
Erm ... what? That's almost fractally wrong!?
None of the ECN problem was one of pedantry, it was simply one of a broken specification, namely the TCP specification. "Reserved for future use. Must be zero." is simply a bad specification. If you specify an extension mechanism, you have to always specify how the extension mechanism is supposed to work. What you call the pedantic interpretation is a perfectly valid interpretation of what the text says. You are just looking at it in hindsight, with the idea that it's supposed to support the operation of ECN, and then it's obviously a problem--but people who implemented TCP stuff before there was ECN could not possibly know that that is how people would expect to use this if the TCP specification doesn't specify that. There is nothing wrong with extension mechanisms that work by having the recipient discard messages with flags it doesn't know. That's just not what ECN chose to do, but that is kinda ECN's fault. You might just as well have ended up with a situation where someone would have tried to build an extension that assumes that recipients discard segments with unknown flags, and everyone would have been pointing fingers at those who chose to ignore the flags instead, and how they were pedantic to ignore the flags just because the specification does not explicitly say that such segments are invalid. It's just an accident of history that most implementations chose to ignore unknown flags, and therefore people now point to the exception, without any basis other than them being the majority.
Also, obviously, the "robustness principle" did not allow for a graceful update to the spec. The fact that a graceful update was not possible is the whole reason why you mentioned ECN at all. And that is not necessarily a result of failing to follow the robustness principle, as the robustness principle really doesn't tell you anything useful. All you can do with it is to point at things in hindsight and say "if everyone had built this the same way, then things would be compatible now!" But the robustness principle is useless for actually achieving that. For any format specification, there is an almost infinite number of ways you can deviate from the specification where humans could look at any individual one of those deviations and come to an agreement as to how that deviating message could reasonably be interpreted. And any one of those deviations could in principle be implemented as part of the corresponding parser. But implementing a parser that "correctly" interprets all of those possible deviations is at the very least a major undertaking, and usually even impossible due to contradictions between various deviations when they appear in combination.
And that is why hindsight is misleading: In hindsight, you only see one particular (small set of) deviation(s) causing interoperability problems, and it would almost always have been possible to make every parser coherently interpret those deviations just fine; if everyone had done that, then you would not have any interoperability problems. But that isn't the perspective of someone who initially builds the implementation. They can only strictly follow the spec (which works perfectly if everyone does so and the spec isn't broken), or increase the complexity of and effort required for their implementation by an order of magnitude or more to accept close to anything that could happen (which no one does, for obvious reasons), or implement a random selection of deviations they like (which then leads to interoperability problems and the view in hindsight that everyone else could easily have done the same, which, of course, they couldn't, because they couldn't know what others were doing). Of course, there is a simple solution to that last approach: If you want to implement deviations from the agreed-upon spec but don't want to risk creating interoperability problems, you could get together with all the other implementers and talk about which deviations everyone is going to implement. But obviously, that's just the first approach in disguise: after you have agreed on the deviations, they aren't deviations anymore; you have simply created a new spec, and everyone then strictly follows that new spec.
Essentially, what is happening here is that you see one interpretation of something that the spec doesn't actually specify as obvious. And then you claim that the solution to interoperability problems is that everyone does the obvious thing. But you fail to recognize that the whole problem we are trying to solve with specifications in the first place is that what seems obvious is different for different people. Which is why this (a) cannot work and (b) obviously in practice does not work. You cannot solve the problem of people having different approaches by simply saying "they should just all have the same approach" while at the same time saying that the methods we use to create agreement (i.e., specifications) should not be taken too seriously.
> I'm not ignoring anything, I am just pointing out reality: the real world is messy, and the stacks that try to keep working under messy conditions seem to be prevailing. It's not pretty and I don't deny the issues that arise, but here we are, communicating on the largest, most successful computer network ever built, using a protocol and a markup language built with Postel's law in mind.
Then your points are just irrelevant? I never said that broken systems cannot be successful, did I? Yes, there clearly are evolutionary advantages to externalizing costs, and taking risks can pay off. But there are also other parties who have to pay those externalized costs, and taking risks can also end in catastrophe. Externalizing costs is still an asshole move (and is generally frowned upon by society when people understand that that is what is happening), and whether the risks taken by the web, for example, have actually paid off is far from obvious.
Also, possibly all of this was built with Postel's law in mind. But what I would be interested in is whether that was to our benefit. Just because something was a factor in creating a certain overall positive situation does not mean that therefore that factor made that situation better than if it hadn't been there. In particular, evolutionary success does not mean that a different approach would not have produced a better result.
> I think most who know the history would disagree with this opinion [1]. It was obvious to me at the time why XHTML would fail, even though I thought it the cleaner solution; I realized that's what was holding it back. It was much better to see your page come up with maybe a weird rendering artifact than to have the browser render nothing and throw an error because some small part was malformed.
How does that contradict what I said? Yes, it was obvious that XHTML would fail due to the massive incompetence of developers ... your point being?!
> Uh, gee, I don't know, maybe Postel's law is kinda relevant when discussing TCP because Postel wrote the spec, you know, like what you asked in the post before? What kind of game are you playing here?
I am not sure what kind of game you are playing, but I had the impression like you were trying to make a point and not just state the historical fact that that's where Postel formulated the "robustness principle". Yeah, I agree, that's what he did. And it was a bad idea.
>And whatever you otherwise think about those languages, none of them will just make shit up when there is a syntax error in your program, they will simply reject it. And no, that does not mean that I am claiming that no one has ever made a syntactical mistake when writing code in those languages. But, you know, people are actually capable of fixing those mistakes when they are pointed out to them.
Except it's pretty common now for programming languages to add quality-of-life changes that loosen some of the strict parsing rules, such as trailing commas or optional semicolons. Same with whitespace: many languages don't pay much attention to it, and then you have a formatter that is strict about it (gofmt). This is Postel's law in action: liberal acceptance, strict output. The alternative is strict adherence to whitespace and no need for a formatter; just have the compiler reject it and put the burden on the programmer.
>Also, obviously, the "robustness principle" did not allow for a graceful update to the spec.
Again, your opinion is not shared historically; ECN is held up as an example of the robustness principle having been followed in most stacks, with the few problem ones that did not follow it causing some issues [1].
>Then your points are just irrelevant?
Then your points are just irrelevant? We can play this game forever. Just the fact that you are using HTML, not XHTML, with TCP underneath, to write these posts should make some relevant point that you can't seem to see.
>Yes, it was obvious that XHTML would fail due to the massive incompetence of developers ... your point being?!
Or, more likely, all these developers weren't incompetent, myself included; when given the choice, the strictness of XHTML lost to the liberalness of HTML, proving Postel's law again. Messy and robust won over clean and fragile again. That's the point, get it?
>I am not sure what kind of game you are playing, but I had the impression like you were trying to make a point and not just state the historical fact that that's where Postel formulated the "robustness principle". Yeah, I agree, that's what he did. And it was a bad idea.
>As for TCP ... how is it relevant that Postel wrote the spec? Does that mean that the vulnerabilities in TCP never happened? Or are you saying that modern TCP implementations try to accept any crap whatsoever? (No, they don't, of course they don't, people have actually learned that that's a bad idea.)
Going back to your original question, since you're having a hard time connecting the dots: Postel wrote the spec for TCP and put his law in it as guidance. ECN was developed taking advantage of that principle, and most stacks accepted the malformed packets because of it. There are other examples of this [2]; TCP is complicated, and if stacks didn't follow Postel's law they would never get anything done on the internet.
> Except it's pretty common now for programming languages to add quality-of-life changes that loosen some of the strict parsing rules, such as trailing commas or optional semicolons. Same with whitespace: many languages don't pay much attention to it, and then you have a formatter that is strict about it (gofmt). This is Postel's law in action: liberal acceptance, strict output.
Erm ... no, it's obviously not? Or at least not in a way that is relevant to this discussion. I am obviously not objecting to specifying languages that give you a lot of freedom in how you format things, so what is the point of bringing up that you could interpret the robustness principle to mean just that? I am obviously objecting to accepting input that does not conform to the respective relevant specification, and the fact that making languages more flexible in their formatting is often useful has no relevance to that whatsoever.
You interpret some term to mean a broad range of things, I point out that one of those things is a bad idea, and your defense is that one of the other things is good ... how is that even an argument? How does that change that what I pointed out is a bad idea?
> The alternative is strict adherence to whitespace and no need for a formatter; just have the compiler reject it and put the burden on the programmer.
No, the alternative is strict adherence to the language specification. Or, really, it's not an alternative at all, because there is zero contradiction between specifying a language with flexible whitespace grammar (or separator grammar or whatever) and then strictly enforcing that grammar (and thus obviously avoiding interoperability problems).
> Again, your opinion is not shared historically; ECN is held up as an example of the robustness principle having been followed in most stacks, with the few problem ones that did not follow it causing some issues [1].
In other words: Your position is unfalsifiable? If there are no interoperability problems due to everyone interpreting messages identically, then that is obviously due to the robustness principle, and if there are interoperability problems because implementations deviate in how they interpret messages, then that is also obviously a success of the robustness principle? Is there any scenario where the robustness principle would not count as successful?
> Then your points are just irrelevant? We can play this game forever. Just the fact that you are using HTML, not XHTML, with TCP underneath, to write these posts should make some relevant point that you can't seem to see.
How is the fact that I am using something in any way relevant to the question of whether an alternative would have avoided interoperability problems and vulnerabilities?
> Or, more likely, all these developers weren't incompetent, myself included; when given the choice, the strictness of XHTML lost to the liberalness of HTML, proving Postel's law again. Messy and robust won over clean and fragile again. That's the point, get it?
How is it relevant that HTML won? How do you connect from "technology X won over technology Y" to "therefore, technology Y would not have had fewer interoperability problems and vulnerabilities than technology X"?
Why do you answer every question as to technical properties of a technology with "it lost" or "it won" while completely failing to say anything at all about the technical property being discussed?
NOONE DENIES THAT HTML WON OVER XHTML.
Also, it seems you almost completely ignored the central explanation of my previous post, simply to repeat your previous points as if I never had said anything. I am happy to read your explanation as to where my analysis is wrong, but I am completely uninterested in reading over and over points that I repeatedly explained why I don't agree with them with no insight at all into how my reasoning is wrong.
It's good wisdom but beware the counter-argument. Being liberal in what you accept causes your implementation to have to maintain a wide range of implicitly defined odd inputs in perpetuity and leads to a form of lock-in based on handling of edge-cases. HTML and browser implementations that allowed all sorts of garbage behaviors are a good example of this. The robustness principle works, but it does come with a cost, like all things.
Problem with this kind of description is that input and output are often two names for the same thing. One processing element's output is the next one's input.
Basically, don't substitute HTML cruft onto text, except at that stage in processing when that text is just about to be inserted into HTML.
Don't do HTML-ization prematurely.
You wouldn't encode and store data in Base64 just in case it might be needed that way in some future processing step.
A good point, but it is obscured by a terrible choice of vocabulary: in this article "sanitizing input" actually means specifically trying to sanitize input of the whole system as soon as it is received (quite impossible, given the open-ended and conflicting nature of sanitization needs) and "escaping output" actually means sanitizing the input of a specific subsystem for a specific purpose.
Given a faithfully persisted and assumed unsafe original text, SQL query builders can turn everything into SQL strings or die trying, XML parsers can check entity expansion size and other traps, HTML generation templates can introduce fancy markup to surround arbitrary text, XML generation templates can escape input wholesale in CDATA sections, and so on. It's the traditional principle of separation of concerns.
I think the title here is misleading. The title is, "Don’t try to sanitize input. Escape output."
The article itself is only talking about sanitizing user input to "prevent cross-site scripting attacks". He later on does require input checking: "Input sanitization is usually a bad idea, but input validation is a good thing... by all means validate it and return an error if it’s invalid."
It's vital for secure programs to check their inputs and minimize what they will accept. I do think it's a good idea to reject "&" and "<" when you can.
But I also agree that in most cases you can't completely forbid all inputs that have HTML metacharacters. In the case of cross-site scripting, the best countermeasure is output escaping. Many modern frameworks do output escaping by default; Rails (for example) has done it for years (in Rails, any "normal" string is automatically escaped when sent back out as HTML). A good reason to prefer one framework over another is because it has secure defaults; if your framework doesn't escape by default, you should consider using a better framework.
You can't depend on just one thing to suddenly make your software secure. You need to validate inputs so that only valid inputs are accepted into your program. You need to escape output, because there are often legitimate input characters that must be escaped. You need to prefer tools that have safe defaults. Use tools to scan your results, to find what you missed. It's not rocket science, but it does require a set of approaches; there is no silver bullet for making secure software.
Escaping output does not always work when, e.g., you have thousands of integrated systems and don't control any of them or their upgrades.
If you don't filter malicious inputs, they will forever live in your database and it takes one bad release of some reporting tool somewhere for your users to become vulnerable.
If you rely on sanitizing before storing, you can end up with data in your database that somehow missed being sanitized, or is maliciously entered in your database.
You must escape the outputs, no matter how hard you try to sanitize the inputs. Losing "control" of any integrated systems means your system is vulnerable, even if only to someone at a terminal typing things into the DB manually.
Of course you must escape your outputs, add defense-in-depth layers you don't even need yet, track how data flows in your application, patch every known vulnerability, sandbox every single thing, implement capability-based security, and more. I didn't mean otherwise.
Filtering input is not sufficient. But it is not optional.
Yes, it does. The only exception is if there is no way to represent a given value in the language that you speak to another component, but then you either need to reject the request or sanitize on output (if you can be reasonably sure that doing so won't break the semantics of the information that you are passing on).
I've worked on systems so shitty that some input was triple-escaped so that it would trickle down three unescape layers in old buggy clients deployed to the field.
The thing is, there's no wisdom to take away from this monstrosity. It's just old shitty legacy ad-hoc code. The grandparent post thinks there is.
That you might need to break every rule in the book to integrate shit-tier software doesn't need to be said when talking about software best practices.
Escaping input is of course a path to nowhere, because you never know in what kind of context your data will be displayed. So you cannot guess the proper escaping rules.
Escaping data on input is a novice mistake. Not keeping unfiltered data, and rejecting data that doesn't conform to some rules so you know exactly what to expect, gets you somewhere.
You can bake checks and constraints into your database model too if it is needed.
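For example, with SQLite from Python (a sketch; the schema is invented), the database itself refuses rows that break the rules:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE users (
            name  TEXT NOT NULL CHECK (length(name) BETWEEN 1 AND 100),
            phone TEXT NOT NULL CHECK (phone NOT GLOB '*[^0-9+]*')
        )
    """)
    db.execute("INSERT INTO users VALUES (?, ?)", ("O'Hara", "+15551234567"))
    # Raises sqlite3.IntegrityError: the empty name violates the CHECK constraint.
    db.execute("INSERT INTO users VALUES (?, ?)", ("", "+15551234567"))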
> Incidentally, the mother in the xkcd comic says, “I hope you’ve learned to sanitize your database inputs.” Which is somewhat confusing, but I’ll give Randall the benefit of the doubt and assume he meant “escape your database parameters”.
What a strangely roundabout way of saying that programming advice found in a webcomic may be actually wrong.
(To be fair, that Xkcd comic was a product of its time, when ‘sanitisation’ was all the rage.)
Maybe sanitizing is the wrong word for what I'm doing. For example, I need to strip marketing/tracking information from URLs before saving them, or else someone coming from Google will have a different URL than someone coming from FB, and then the comments won't load.
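That kind of normalization is fine precisely because it isn't security "sanitizing"; it's defining a canonical form. A sketch with Python's stdlib (the parameter list is an assumption; adjust it to whatever trackers you actually see):

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    TRACKING_PARAMS = {"fbclid", "gclid"}  # plus anything starting with "utm_"

    def canonicalize(url: str) -> str:
        parts = urlsplit(url)
        query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                 if k not in TRACKING_PARAMS and not k.startswith("utm_")]
        return urlunsplit(parts._replace(query=urlencode(query)))

    print(canonicalize("https://example.com/post?id=42&utm_source=fb&fbclid=abc"))
    # https://example.com/post?id=42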
Always use prepared statements to set query parameters. This will handle 99.9% of all query use-cases. Constructing dynamic SQL from user input is a fool's game.
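In Python's stdlib sqlite3, for instance, the parameterized form is barely more typing than concatenation, and hostile input stays an inert value:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (first_name TEXT, last_name TEXT)")
    db.execute("INSERT INTO users VALUES (?, ?)", ("Bobby", "Tables"))

    last_name = "Tables'; DROP TABLE users; --"  # hostile input is just data
    rows = db.execute(
        "SELECT first_name FROM users WHERE last_name = ?", (last_name,)
    ).fetchall()
    # rows == [] and the table survives; nothing was spliced into the SQL text.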
Don't filter input. Instead, prevent certain characters from being typed into text elements in the first place. This is a user-experience problem, not a software problem. The software can reject a "name" that doesn't pass the front-end validations, but it doesn't need to do any more than that.
Of course, this argument does not extend beyond a "name" field to more complex fields. But more complex fields are less susceptible to introducing UX problems if certain characters are sanitized.
The article misstates what 'sanitizing inputs' means.
I agree with posters who recommend passing data as parameters to methods that don't require sanitized input (e.g. stored procedures or KeyValue APIs).
Also, sanitizing input means transforming input so you retain the original content, but without escape or control characters. Sanitizing input does not mean throwing part of the input away (except when you know it is meaningless in your context, e.g. spaces at the end of a name).
If you take arguments to some sub-system (an example are database keys like the id of an entity instance), then you need to sanitize input.
Anyway, today I learnt something. If you have free-form data like text it makes sense not to sanitize it because in this case sanitizing depends on the output domain. For example < is dangerous for HTML and ' is dangerous for SQL, and so on.
A method I have generally found useful is to make a whitelist of safe characters (something like alphanumeric, comma, dot and space), and escape everything else. You might escape a bunch of stuff that technically didn't need escaping, but the method is simple, rock solid, and doesn't mangle anyone's names.
My name contains Ø, and I'm guessing I would not be able to enter that with your method. I would consider it mangling my name if I had to write o or oe.
This kind of thinking is how your users end up getting emails from your buggy service like "Hello &#216;stein &amp; friends, ...", and your JSON API consumers encounter the same silly output.
Don't escape input. Escape based on output. Escaping doesn't mean anything until you've also specified an output format. It's not always HTML.
You are grossly misrepresenting my post, I have said nothing about whether the escaping should be applied to input or output, please edit or delete your comment.
These two issues have been relevant for over 20 years, longer than today's college grads have been alive. I find it fascinating that new blog posts explaining these two pitfalls are still being written on a regular basis. And there are probably a million blogs with these same two examples going back decades. This is an epic re-post in spirit.
I'm tempted to ask, "Why hasn't this been fixed yet?" Where "fixed" means, "Something new programmers just starting off their careers don't have to jam into their brains?"
I know the expected answer will be: it's an abstraction of a more complex problem of understanding data and how it is used, and we can talk about how JS and PHP have added native functions to construct custom code to address this problem (hack cough cough hack).
But these two cases in particular stick in my craw because no fundamental solution has presented itself yet.
So I ask again (and I've been asking this since 2004-ish):
Why do these two examples still persist? Why do the frameworks not eliminate them by construction? This is such a repeated pattern, why is it even there?
It boils down to two things. One is library/tool design typically makes it too easy to make user input be in-band with execution. The other is that most tutorials/guides only show you how to do things in-band.
An example of a tutorial for SQL:
SELECT first_name FROM users WHERE last_name = 'Smith'
They then have an exercise to hook this query to a text box in the program, where through omission, the programmer is guided to use string concatenation to build the query.
If from the first SQL statement to the last that a programmer saw was parameterized, it would be much harder for them to reach for string concatenation.
Most modern web development frameworks make it very hard to insert un-escaped text into the DOM. You have to go out of your way to introduce an XSS vulnerability in your web application with one, and most of the tutorials and documentation about the framework warn you about using the raw HTML functionality.
Another way to look at it is that the out-of-band way of doing things is typically perceived as either lower in performance, harder to do and/or less elegant (eg: C-style strings vs pascal strings).
I consider anything with user input that is done in-band (eg: escaping is a fix) to be doomed to fail. This is similar in idea to the cryptographic doom principle where decryption before authenticating the message is ultimately doomed to failure.
> If from the first SQL statement to the last that a programmer saw was parameterized, it would be much harder for them to reach for string concatenation.
I dunno -- I've been doing C programming for 30-ish years now, but just learned SQL about a year ago. Every man page I looked at, as well as every stackoverflow question, emphasized the importance of using parametrized queries. And IIRC in Python, "only execute a single statement" is enabled by default; if you want to execute multiple statements, you have to use a different call. So even if you somehow manage to forget to parameterize your queries, you'll still be safe from Little Bobby Tables.
Do SQL injection attacks still actually happen? How is it possible?
Making sure your query only executes a single statement is not enough to prevent SQL injection (depending on how you concatenate the query); the attacker just has to use the surrounding context to get or set the data they need to escalate their permissions (e.g. if it's SELECT stuff FROM table, you might be able to inject such that your query replaces stuff, and then you can select whatever the querying user has access to).
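A runnable sketch of that (table names invented): the injected payload is itself a single SELECT, so a single-statement limit doesn't help at all.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE posts (id INTEGER, title TEXT)")
    db.execute("CREATE TABLE users (password_hash TEXT)")
    db.execute("INSERT INTO posts VALUES (1, 'hello')")
    db.execute("INSERT INTO users VALUES ('c0ffee...')")

    user_input = "0 UNION SELECT password_hash FROM users"
    # Still one statement, so sqlite3's single-statement rule is satisfied:
    leaked = db.execute("SELECT title FROM posts WHERE id = " + user_input).fetchall()
    print(leaked)  # [('c0ffee...',)] -- password hashes instead of post titles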
2019:
For its "State of the Internet" report, Akamai analyzed data gathered from users of its Web application firewall technology between November 2017 and March 2019. The exercise shows that SQL injection (SQLi) now represents nearly two-thirds (65.1%) of all Web application attacks. That's up sharply from the 44% of Web application layer attacks that SQLi represented just two years ago.
> The exercise shows that SQL injection (SQLi) now represents nearly two-thirds (65.1%) of all Web application attacks.
Are you sure that you don't actually mean "weird strings sent to web applications" when you write "web application attacks"? Sending a weird string to web applications is not an attack in any particularly relevant sense unless there is an actual vulnerability--other than when akamai wants to sell you snake-oil against all the danger of weird strings.
A lot of websites and applications are built for one purpose and then get forgotten; I'm speaking especially of people doing it in their free time for fun, or of small company projects.
> The other is that most tutorials/guides only show you how to do things in-band.
Learning SQL is completely different from learning how to use SQL in whatever programming language/framework/library you may end up using. Learning how to safely interface with an RDBMS is going to be entirely specific to the stack you’re using for the rest of your application.
> Why do the frameworks not eliminate them by construction?
They do. Other than the pure PHP example, which simply predates modern approaches to web security and is somewhat of an intentional misuse, modern templating engines (which the article also mentions) default to escaped output. That still means new devs have to be aware of the mechanism and not go out of their way to shoot themselves in the foot by bypassing it, which I guess explains the blog posts. I honestly don't think it's that bad; that's just part of any generation of developers learning the basics of their domain. To me, XSS still being in the OWASP top 10 was always more an indication that we suck at training (for the basic stack and for security-minded development) than some conceptual failure of the frameworks we use.
There are plenty of "fixes by construction" out there; that doesn't eliminate new devs not using them, or experienced folks making an error every once in a while.
That makes me wonder what kind of training companies require. How many companies hire based on DIY examples in interviews and think "ok, this new hire knows enough", rather than run the risk of essentially re-training the 90% a hire already knows for the sake of the 10% that is critical knowledge?
I don't have a sense of what dev training looks like across the industry.
The company I'm at now requires basic security training every year. TBH it kind of sucks at showing solutions to these kinds of problems, but at least it makes people aware of the risks. I think it might be a PCI compliance thing, but I'm not sure.
I feel like that's a hard problem that goes beyond hiring these days; I would love an answer to that as well. My personal approach for junior positions has mostly been to hire rather selectively (when I can) to get people who at least recognize when they might lack knowledge in a certain area, team them up with someone during the onboarding period, and keep somewhat strict code-review policies, at least in the beginning.
Not stipulating this training for all new hires is a symptom of my own aversion to most classroom settings, though; I've had quite a few developers who enjoyed getting this style of training after they indicated they wanted it down the road. I personally wouldn't have enjoyed the 90% retraining scenario (the monetary loss that implies aside). I've found training on specific aspects with a bit of practical engagement to be more effective; e.g. there are great and engaging courses that teach basic web security. Not that these are always up to date or that trainees retain everything, but it gets them into the right mindset to be aware of issues.
But of course, even with an approach that works 100% of the time, these days that doesn't guarantee that your dependencies or outsourced code production are up to the same standard.
tl;dr is "I don't know either" I guess but maybe you can take something away from it.
> (hack cough cough hack)... there is no fundamental solution presenting yet.
Because people think those hacks are fundamental solutions (see: this blog title).
But really, the fundamental solution is finally at long last treating programming as a form of engineering.
> I know the expected answer will be: it's an abstraction of a more complex problem of understanding data and how it is used... Why do the frameworks not eliminate them by construction?
Because in any non-trivial system there are always edge cases, and attackers will find the edge cases. This is why XSS persists even as template engines have taken over. "filter output" is not a panacea. Nothing can replace carefully thinking about the entire range of possible inputs and their related outputs.
But instead of educating programmers to think carefully about how to specify and design robust systems, the software industry repeats gang-of-four-style mantras like "escape output". Even while admitting those solutions don't work universally and offering "get security review" as some sort of universal fix.
It's interesting that single-page apps actually have a benefit here. If you generate the DOM with code, you can just assign anything you like to el.textContent, and you won't need to muck around with sanitization libraries and edge cases.
Basically the same principle as using parametrized SQL queries.
Part of the problem is the use of a general "String" data types in many languages. Libraries that deal with SQL or HTML or anything similar shouldn't use String in their APIs. Instead they ought to have more specific "EscapedString" and "UnescapedString" types so that there's no ambiguity about which is which.
While I agree about String, EscapedString conflates rendering/output with the data model, which is the core of the issue. Application developers should not have to touch escaping, or text rendering, on their own for formats like xml, json, sql, etc.
There should be no xml built by concatenation; instead use a DOM plus a proper renderer/transformer/writer. Same for SQL: prepared statements plus bindings...
Our programming languages suck at providing useful types.
"String" is a structural data type. "SQL query" and "HTML snippet" and "regular expression" and "user-entered text" are semantic types which can be stored in strings, but are all quite distinct in meaning and usage.
You shouldn't be allowed (by the language's type system) to pass user-entered text to a SQL query function, without perhaps first calling a function with a scary name like "convert_raw_unsafe_text_to_query". A string is not a string is not a string. Or make a DOM-for-SQL so we never have to touch syntactic strings.
(It's exactly the same problem as units of measure. 5.0 feet is not the same type as 5.0 meters, and you shouldn't be able to add 5.0 + 5.0 if you didn't declare they have matching units, or define a way to convert as necessary. Numeric types in most languages don't have associated units, either, unfortunately.)
Hungarian notation tried to partially solve this, by giving up on the built-in type system and using variable names to encode intent. That solution is ugly so it's been abandoned, and it's the wrong place to solve it, anyway.
Programming languages today don't provide appropriate abstract data types for strings, or make it easy to define your own. Popular libraries for SQL/HTML/regex/etc don't require special string types. Since there's no standard types, it'd be a pain for any users who need to use more than one library, too.
We need either one popular language to do this (which others might then copy), or two popular libraries (a coalition). It also needs a catchy name for this style of programming, to help shame old languages/libraries that don't support it.
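A sketch of what that could look like in Python with mypy-style checking (all names invented except the scary-sounding converter, which the comment above suggests):

    from typing import NewType

    UserText = NewType("UserText", str)   # text/plain straight from a human
    SqlQuery = NewType("SqlQuery", str)   # a complete, trusted SQL statement

    def run_query(db, query: SqlQuery, params: tuple = ()):
        return db.execute(query, params)

    def convert_raw_unsafe_text_to_query(s: str) -> SqlQuery:
        # The scary name is the point: every call site is an audit marker.
        return SqlQuery(s)

    name = UserText("O'Hara")
    # run_query(db, name)                                          # a type checker rejects this
    # run_query(db, convert_raw_unsafe_text_to_query("SELECT 1"))  # allowed, and greppable

Python only enforces this at type-check time rather than at runtime, but the same trick with opaque wrapper classes works in any language with user-defined types.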
React does this (at least the escaping of output part). Bunch of PHP frameworks do it too.
But I think it's just a natural part of the power of computers, except most people don't think in Lisp ("any" text could be turned into code), they think in Java (I have this static, rigid, compiled code, and that's the only thing that runs).
> Why do these two examples still persist? Why do the frameworks not eliminate them by construction? This is such a repeated pattern, why is it even there?
If I can type a query into a SQL prompt but your framework won't let me put it in there, I am first going to conclude that your framework is broken. No matter how good the reasoning is for why you did it.
Worse yet, it sometimes is broken. Smart databases understand that whether they should use a particular index depends on the value that is passed in. Where using the index for the most common value is a huge performance penalty, and failing to use it for the rest is likewise. The only way to get good performance is to pass in the value for the case where it has to not use the index. (You can parametrize the rest, at least in Oracle. But the special one has to be passed hard-coded in the string so that the optimizer sees it.) This is a rare case but when it comes up, I really care.
If your framework won't let me fix a performance problem that I know how to fix, I'm going to switch frameworks.
And even worse, parsing SQL is more complex than you think. If your super safe framework doesn't agree with the database it can reject valid SQL or fail to provide the safety that it thought it did.
As an example, in PostgreSQL I can use $$ as a quote mark. This is super convenient for stored procedures. If your super-safe framework doesn't let me do that because it thinks it is a syntax error or recognizes it as unsafe, I will switch to something that can let me write stored procedures. If your super-safe framework doesn't recognize that it is a quote mark, then it isn't offering protection. If your super-safe framework tries to analyze it correctly, you're now attempting to analyze run-time strings that I am building inside of the database in a Turing complete language. Good luck with that. (Hint, Turing proved that it is an impossible task.)
Now I'm admittedly in the 0.1% of people using these tools. However others trust me to know what tools to recommend. So experts like me have an outsized impact.
Just recognize that you're holding SQL to an unfair standard here. You wouldn't reject an HTTP framework because you can't paste raw HTTP into it. You wouldn't reject an IMAP framework because you can't paste raw IMAP to it.
You're requiring that the maintenance hatch be the front door. It should be no surprise that such a design results in lots of people accidentally breaking things.
As you say, fewer than 1 in 1000 people have your needs. Why would you recommend a tool whose features are more dangerous than useful for them?
I am holding SQL to the same standard that I would hold, say, a web framework. Simple things should be simple.
Injecting stuff into a dynamic protocol is inherently harder than injecting text into a text document. A text framework that doesn't accept text is going to be a fail.
> As you say, fewer than 1 in 1000 people have your needs.
My needs 99% of the time are not that unusual. What puts me in the 0.1% among general developers is the level of knowledge that I have about weird edge cases and how databases work on the inside.
> Why would you recommend a tool whose features are more dangerous than useful for them?
Your question presumes the answer to a question that I think you are wrong on.
My very first point was that if I can type it into a SQL prompt, I need to be able to put it into my database.
For someone who is just learning, this convenience is essential. And any tool that complicates their life by forcing them to learn a bunch of stuff before they can do the very simplest thing is a barrier to learning. A barrier that they are likely to solve by finding a tool that makes the simple thing simple. They will only learn about the gotchas down the road.
Case in point. Back in the mid-90s someone wrote a bunch of CGI scripts to make personal home pages easy to write. In fact that is what it was called. Personal Home Page / Forms Interpreter. It accidentally turned into a language that, after several rewrites, is now known as PHP.
When I first encountered it in the early 2000s, every competent developer that I knew (myself included) said, "This is poorly designed crap that will cause a lot of problems." We were right. However it was poorly designed CONVENIENT crap. Convenience won.
Taking input and transforming it into output is the fundamental act of programming.
New developers often struggle with the fundamentals and will usually only test the input they expect.
Someone else has to intentionally give you bad input before you realise that's a thing people will do and something you need to think about.
It doesn't help that most tutorials focus on getting output (yay results!) rather than focusing on how to get consistent transformation of input to output. The result is a lot of tutorials that focus on getting something done and forget / assume the fundamentals.
I haven’t used Elm, but I’ve read that it uses strong typing to distinguish between escaped and non-escaped data. That sounds like a good general solution to the problem, as the compiler will prevent you from using unescaped data in a dangerous context.
Any sort of language that allows you to define custom types (e.g., objects) and type-hint parameters allows you to do this. You can accomplish this same thing in PHP even (the type checking is at runtime, but same idea).
Types are not restricted to just a description of how the data is represented in the computer, otherwise we would need nothing but primitives.
When you perform calculations with physical measurements containing units you don't simply throw away all of the type information while performing calculations -- you perform the same operations on the units both as part of the answer and as an essential check that you've done the right thing. You should do the same thing with your data.
Even the Joel article makes what's arguably a mistake: he says that input from users is "unsafe" and must be escaped on output, while strings from elsewhere shouldn't. That may avoid security exploits, but it still results in incorrect output when a predefined value really does need to be escaped.
The issue isn't whether a value originated from the user. It's the units/data type, as you said, such as plain text vs. HTML.
I've heard that some Haskell frameworks do this as well.
I heavily use Java's Servlet framework, and the blatant spraying of Strings everywhere is astounding in this age. I understand that backwards compatibility is an issue, but one could have added another API beside it for optional use and deprecated the current one.
I imagine all Haskell frameworks do; the ones I've tried surely do. Haskellers are accustomed to mixing string types, since the default String type is inefficient (a linked list of Char) and most import a library providing immutable ~Pascal strings. And since this is such a common occurrence, the syntax has support for multiple string literal types. An application developer can literally create their own. It's also trivial to create a new type in Haskell with minimal runtime cost, so it's pretty harmless.
I think if you don't have an easy way of creating string literals in the type you want, the developers will at some point reach for the deprecated api, and at that point you're just requiring good hygiene. Which is exactly what you're trying to stop. Language support is critical in being able to get away with this.
>the developers will at some point reach for the deprecated api
Agreed, however you can have organizational measures to prevent this (a build time check). And of course a change in the framework must be accompanied by decent conversion libraries (I don't think this is different in Haskell).
Have they been relevant? I haven't heard the ‘sanitise your inputs’ advice in years. In fact, the take in this blog post seems to be the predominant one.
Well, maybe it's still common in PHP, I don't know. I haven't touched that either in a while.
The problem with escaping is that you need to know what you're escaping for; escaping is needed almost everywhere (SQL, URL encoding, HTML, JSON, YAML), and double escaping can break the content.
Only escaping output has one significant disadvantage. Say that you are escaping &. You'll get &amp;. Your user then wants to edit the text. You save the edited text. Now when you escape and output it again, you get &amp;amp;. Rinse and repeat.
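The round trip is easy to reproduce (Python sketch); storing the raw text and escaping only at render time is what breaks the cycle:

    import html

    text = "fish & chips"
    once = html.escape(text)   # 'fish &amp; chips'
    # If the edit form is filled with the *escaped* text and saved as-is:
    twice = html.escape(once)  # 'fish &amp;amp; chips' -- rinse and repeat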
Wow, I feel really stupid now. That means I have misunderstood how it works and have avoided it needlessly till now (not that I do much web app development).
Bad advice. Input is the boundary of your system. Always protect the boundary of your system orders of magnitude more than its internals. That's like programming basics.
And what about buffer overflows makes particular data inherently(!) dangerous? As in, what would be an example of data where it is impossible to make it non-dangerous by fixing the buffer overflow that it exploits?
not interested in debating semantics. data is turned from input streams of bytes into coherent structures in your system. the process of turning data into coherent structures is input sanitization. sanitization or validation or conformation - you can argue what to call it and where to use it, but if you don't do it, your internal bugs will be exploitable externally. good luck.
> the process of turning data into coherent structures is input sanitization.
No, that is parsing.
> sanitization or validation or conformation - you can argue how to call it and where to use it
Sanitization is "change to make acceptable", validation is "check for conformance and reject what is non-conforming". The first one is practiced by tons of developers and it's a terrible idea, the latter is what you should do. But also, the latter is never about "dangerous" input, it's only about meaningless input.
A database query is an output. Anything your code generates and sends to something else, whether that's a web server answering a user request, an API call to an external service, or an internal request on the server, is an output of your code.
No matter how good your input sanitization is, you still wouldn't ever send an unescaped query to a database, right? That's because the query is an output.
...so the blog post boils down to "sanitize all inputs that don't get piped to /dev/null"; also, there are some good libraries that will do that for you (...by escaping outputs... but oh, btw, those only work sometimes of course, and in other cases, be careful?).
In other words, for the love of god please do sanitize your inputs.
"Sanitize inputs" means modifying the input before you even know where it's going. It's fine for stuff like normalizing user input (eg: "strip leading and trailing spaces") but should not be used to combat things like SQL injection or XSS.
For issues like SQL injection and XSS you should escape on output. Outputting HTML? HTML escape, or better yet: use templating framework that does it by default. Outputting to SQL? SQL escape, or better yet use prepared statements and pass in your arguments using an API that escapes by default.
In the "sanitize inputs" approach to handling these situations you can't store "O'Hara <3 Sue" as a value, because you need to "sanitize" the apostrophe for SQL and the less-than for HTML. In the "escape outputs" approach, you have "O'"Hara <3 Sue" in your SQL, and "O'Hara <3 Sue" in your HTML, and the user's input is preserved.
> "Sanitize inputs" means modifying the input before you even know where it's going.
Okay.
That's not how I've ever used that term or seen it used. Prepared statements are a form of input sanitation. HTML purifiers are a form of input sanitation. Maybe this lingo is specific to PHP-land?
In any case, "You need to know the semantics of the sink in order to know what to do with an untrusted source" seems like an obvious truism not worth writing about.
"You need to know the semantics of the sink in order to know what to do with an untrusted source" seems like an obvious truism not worth writing about.
Given how often developers get it wrong, I don't think it's written about enough.
Also, you say "untrusted source" here. Whether you trust the source or not is irrelevant. You should still be escaping the output where you use data from it in order to make sure your outputs are safe - the source could be compromised, or broken, or sending something valid that you didn't expect. Maybe this isn't quite so obvious after all.
You've probably not been around in the jolly days of PHP automatically adding quotes to all $_GET parameters and stuff like that, before it was even known where the data would be passed to, lol. Be glad.
> That's not how I've ever used that term or seen it used.
That's the terminology being used by the document under discussion.
Honestly, I think what causes a lot of people to get it wrong, is that they don't understand the distinction between input filtering and output escaping. They see them as the same thing, and so they use them interchangeably.
> Prepared statements are a form of input sanitation.
No. Input sanitization involves removing "bad" stuff from the input. For example, you remove the "'" in "O'Hara" so that it doesn't mess up your SQL, but you end up storing "OHara" in the DB.
Output escaping (which prepared statements fall under) removes nothing. Instead, characters that happen to be special are escaped so that they are treated as literal characters, and not as special characters. The DB gets the user's original input: "O'Hara"
> HTML purifiers are a form of input sanitation.
I assume you mean HTML sanitization (https://en.wikipedia.org/wiki/HTML_sanitization). In which case, usually, yes. Note that there's a difference here because you're removing part of the input, not doing a lossless transformation as with escaping.
Another way to think about the difference is whether you're doing type conversion or not. When escaping for SQL, you're converting from text/plain to SQL. When escaping for embedding in HTML, you're converting from text/plain to text/html.
When you do input sanitization instead, you aren't changing the type, you're just making certain values impossible. For HTML sanitization, this means turning stuff like "<em>safe</em> <script>unsafe()</script>" into "<em>safe</em> ". Both are text/html, but the latter has been "sanitized".
In this case, input sanitization makes sense, as long as you have a universal concept of what "safe" means, and as long as your input was actually HTML.
The place where people mess up is in thinking that they need to "sanitize their inputs" in anticipation of something downstream using that same string as a different type. In the HTML example, this would be taking a text/plain string, like "I <3 HTML", and stripping out "bad" characters to turn it into "I 3 HTML".
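A quick PHP sketch of the two operations (strip_tags stands in here for a real sanitizer, only for illustration):

    // Escaping: lossless conversion from text/plain to text/html.
    echo htmlspecialchars('I <3 HTML', ENT_QUOTES, 'UTF-8');
    // markup: I &lt;3 HTML -- renders as: I <3 HTML

    // Sanitization: lossy filtering *within* text/html. strip_tags is a
    // crude stand-in; a real sanitizer (e.g. HTML Purifier) would drop
    // the script body too, not just the tags.
    echo strip_tags('<em>safe</em> <script>unsafe()</script>', '<em>');
    // -> <em>safe</em> unsafe()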
> Maybe this lingo is specific to PHP-land?
I've never used PHP, so I wouldn't know.
> In any case, "You need to know the semantics of the sink in order to know what to do with an untrusted source" seems like an obvious truism not worth writing about.
In practice, that doesn't seem to be the case. Almost every time someone says "sanitize your inputs" in response to an XSS or SQL injection exploit, they're getting it wrong.
What does "sanitize inputs" even mean? What do you do with a backslash? What do you do with a "? What do you do with weird unicode? What properties does your "sanitized" input actually have?
The meaning of "sane" depends on where you're sending it to. A backslash is a perfectly reasonable character, for instance. Put it in the wrong place in a SQL string and you have bad news. Put a ' in the wrong place in a shell command, sometimes nothing bad happens, other times you get pwned.
The right way to escape strange characters is different if you're sending it to an SQL engine, or writing it into a JSON string, or into some HTML, etc.
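For instance, a sketch of the same string headed for three sinks (`posts` is a made-up table, $db an assumed mysqli connection):

    $input = 'He said "hi" <b> \\';
    $id = 1;

    // SQL sink: let the driver escape
    $stmt = $db->prepare('UPDATE posts SET body = ? WHERE id = ?');
    $stmt->bind_param('si', $input, $id);
    $stmt->execute();

    // HTML sink
    echo htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    // JSON sink: quotes and backslashes get escaped per JSON rules
    echo json_encode($input);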
> Another example of this kind of thing is SQL injection, an attack that’s closely related to cross-site scripting. NaiveSite is powered by MySQL, and it finds users like so:
>
> $query = "SELECT FROM users WHERE name = '{$name}'"*
>
> When a boy named Robert'); DROP TABLE users; comes along, NaiveSite’s entire user database is deleted. Oops!
Also from the article:
> And of course use your SQL engine’s parameterized query features so it properly escapes variables when building SQL:
>
> $stmt = $db->prepare('SELECT * FROM users WHERE name = ?');
> $stmt->bind_param('s', $name);
And more from the article:
> The parallel for SQL injection might be if you’re building a data charting tool that allows users to enter arbitrary SQL queries. You might want to allow them to enter SELECT queries but not data-modification queries. In these cases you’re best off using a proper SQL parser (like this one) to ensure it’s a well-formed SELECT query – but doing this correctly is not trivial, so be sure to get security review.
> What you should do, to avoid problems, is quite simple: whenever you embed a string within foreign code, you must escape it, according to the rules of that language. For example, if you embed a string in some SQL targeting MySQL, you must escape the string with MySQL's function for this purpose (mysqli_real_escape_string). (Or, in case of databases, using prepared statements are a better approach, when possible.)
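In PHP/mysqli terms, a sketch of the two options described there (assuming a mysqli connection $db and a $name value):

    // Option 1: escape the string per MySQL's rules at the point of use
    $safe = mysqli_real_escape_string($db, $name);
    $query = "SELECT * FROM users WHERE name = '{$safe}'";

    // Option 2 (better): prepared statement; escaping can't be forgotten
    $stmt = $db->prepare('SELECT * FROM users WHERE name = ?');
    $stmt->bind_param('s', $name);
    $stmt->execute();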
What is the exact (philosophical?) distinction between data and code? When you extract the host name from an email address in order to send an email, you are interpreting a string. We could call this process executing a program that outputs the host name.
I suspect the opposite is true; the parent comment you're replying to is making subtle and insightful commentary on the nature of code and the futility of suggesting that inputs ought not be "code."
Any sufficiently complex program can be viewed as an interpreter for its inputs. Input into a calculator program is code which programs an equation. Input into a word processor is code which programs a document. Input into a video game is code which programs a real time simulation. Input into a compiler is code which programs an executable. These are all different types of executable code sequences.
I am a theoretical computer scientist. I can appreciate the insightfulness (on the surface) of that commentary.
However, the fact that modern computers can be exploited due to architectural and engineering decisions (eg memory unsafety) does not mean a separation between code and data is not possible.
In fact, it is precisely a hot topic how to cheaply bend current practices back to that model given the rampant amount of vulnerabilities in the wild.
My comment was not limited to the realm of hardware, ISAs and microcode. It was much more general.
If you never treat data as code, you can only do uninteresting things. My example was the email address. The instant you look into the "black box" of the string, you are starting to treat the string as an executable structure. An email address' raison d'etre is to provide exactly that: an address to send an email to. You cannot do that without looking into it.
Now, from here we can discuss safe and unsafe ways of doing that. You could use string splits or what not, or you could use a parser combinator library. Doing the latter will make it easy to see that parsing and executing a program is not that different from parsing an email into an AST, (user, hostname), and then treating that as a higher order program (ie. we need to specialize with a message before we can execute it as a "send email" program).
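A toy version of that email "program" in PHP (only a sketch; the real address grammar in RFC 5322 is far hairier):

    // Parse the address into a tiny AST: (user, host).
    function parseEmail(string $raw): array {
        $at = strrpos($raw, '@');
        if ($at === false || $at === 0 || $at === strlen($raw) - 1) {
            throw new InvalidArgumentException('not an email address');
        }
        return [substr($raw, 0, $at), substr($raw, $at + 1)];
    }

    [$user, $host] = parseEmail('ada@example.com');
    // $host is the piece the "send email" program gets specialized with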
No, he made a good point. It's a matter of perspective: what is considered data in one situation can be considered code in other situations. Hell, they even created the NX bit to prevent malicious memory space from becoming executable.
It depends on context. The question is like asking "what's the difference between 'x + x' and '5'"; the former is 'code', an algorithm with placeholders for data; the latter is data.
In another context, though, the function 'f(x) = x + x' might itself be data. Within a given context, however, the answer is generally unambiguous: data is what is acted upon by code.
(The breaking of this distinction is one of the reasons self modifying code is Bad.)
This reminds me of the trick of reprogramming a computer game by exploiting side-effect bugs via controller sequences. There is a YouTube video somewhere where somebody writes an entirely new game simply using input sequences on the controller of an existing game. There were bugs that wrote values to various memory locations, and by being exceptionally clever you could write assembly code. I wish I could find a link to it...
There's lots of examples, but a famous one is this video by Sethbling where he uses a controller as opposed to a TAS tool: https://youtu.be/hB6eY73sLV0
No, the fact that current archs encode data and code in the same memory space and there are vulnerabilities does not mean the separation is not possible.
I think he is wrong on handling code as input and visible output (for sites like StackOverflow). No need to filter such input. Escaping your strings will handle that as well. The code <tag/>, for example, will be escaped to &lt;tag/&gt;, appearing in the rendered page as <tag/> (but not _interpreted_ as a tag).
Actually, rather than sanitize input, I would recommend whitelist and reject in most cases. ID should be an integer, but you get a string with spaces around an integer? - error out. There's html in your text input? Error out.
In the vast majority of cases you control the client (web form) - anything "surprising" will then be an error - or worse; malicious.
In the case of a json service, if the client doesn't submit valid json for your schema / api - error out.
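A sketch of that reject-don't-repair approach in PHP (the field names are invented for the example):

    // ID must be an integer: validate and error out, don't "fix" it.
    $id = filter_var($_GET['id'] ?? '', FILTER_VALIDATE_INT);
    if ($id === false) {
        http_response_code(400);
        exit('invalid id');
    }

    // Plain-text field: anything tag-shaped is surprising, so reject.
    // (strip_tags as a crude detector, good enough for a sketch.)
    $name = $_POST['name'] ?? '';
    if ($name !== strip_tags($name)) {
        http_response_code(400);
        exit('invalid name');
    }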
I once registered for a forum to ask a question, but they had it configured so new users couldn't submit URLs, probably to deal with spam. Their solution was to reject posts that contained what looked like URLs which means any time you don't put a space after a period, it's probably a URL. Like "That is fine.Pizza is good." -> http://fine.pizza detected, post rejected
But it gets worse. I had a code block in my post and it was also detecting URLs in my code.
Good point. I guess I'm in agreement with tfa here - if special entities are ok, escape output. I don't think I want to allow a greater than sign in a "first name" field, but it might be perfectly valid in a text field. And if that field is text, not hypertext, escaping output seems sane. Maybe only where html is expected - say in the html part of an email, while the text part just uses a proper encoding.
TFA touches on this with validation - validation is ux improvement - help the user submit only valid data. At that point there should be no proper use-case for invalid data to be submitted - hence reject, not sanitize and insert.
It seems like this depends on your types? If you are storing user input to a database using a text field, it's best to assume the field can contain arbitrary text, since the database allows that. But if you're storing the input into a number field then it must be parsed as a number and you can assume it's a number. If you constrained the number field to a certain range then you can assume the number is in that range.
Storing arbitrary text is common, so we usually need to know how to render it correctly. There are fancier types and constraints, though.
Each piece of code should enforce its own contracts explicitly across communication boundaries. If your backend relies on x input being parseable as a number, it shouldn't assume that it is just because you know that it ultimately came from a number picker in an HTML form - it should check for itself. This is defensive not just for security, but so that when you change something later and screw it up you get actual error messages instead of silent breakage.
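For example (a hypothetical handler; the range is invented):

    // Re-check the contract at the boundary, even though the HTML form
    // uses a number picker with the same limits upstream.
    function readQuantity(array $request): int {
        $qty = filter_var($request['qty'] ?? null, FILTER_VALIDATE_INT,
            ['options' => ['min_range' => 1, 'max_range' => 999]]);
        if ($qty === false) {
            throw new UnexpectedValueException('qty must be an integer in 1..999');
        }
        return $qty;
    }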
You should still sanitize the output, IMO. The code in question has no ties to the DB schema (usually) and if that changes for some reason you have no means (usually) to figure out where you have to change the code.
What is horrible advice is telling people never to sanitise input but then forgetting to switch the focus to what should be done instead. Too much time is spent justifying the headline vs. explaining what should be done instead and why it is more effective.
Instead prevent "injections" by using innerText instead of innerHTML and parameterize SQL queries instead of concatenating strings.
But you always want to sanity-check user input! Ever wondered why the average age of your user base was so high, only to discover that some users claim they are several million years old? You don't want to sanitize, you want to sanit(y)ize: people write their e-mail address in the street-address field and vice versa.
The examples are very outdated. In PHP you would convert the offending characters to HTML entities (at input, yes... input). It then only needs to be filtered one time.
So the SMS message I send from my service would look like this?
Alert: Bob &amp; Jane O&#039;Brien just commented on your recent post
If you store HTML in your database, now you have to convert from that to whatever you're outputting to later. And you haven't covered the dangerous characters for other contexts.
Just store whatever the user sent you. When outputting to various formats, convert accordingly.
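Sketched in PHP ($comment is the raw text as fetched from the database; sendSms() and the number are hypothetical):

    $comment = "Bob & Jane O'Brien just commented on your recent post";

    // HTML destination: escape now
    echo htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');

    // SMS destination: plain text, no HTML entities wanted
    $recipient = '+15550100';        // made-up number
    sendSms($recipient, $comment);   // hypothetical SMS helper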
This whole article is based on a DB row being used for everything. That is just not reality. From a cost point of view, if you're gonna just use that data a couple of ways, it is pointless to constantly convert it coming out. Even then, if that was the case, convert from entities, no more costly than what you are saying.
This feels like a distinction without a difference.
Escaping outputs is just one way of sanitizing inputs. Sometimes it works. Sometimes it doesn't. The author of this post even realizes that their prognostication is not general and then offers the advice to "be sure to get security review"...
At the end of the day, you need to make sure that any untrusted source is treated in a safe way by every sink and does not otherwise interfere with system specs (e.g., mangling user output). Whether that happens at line 5 (where the input is read) or line 155 (where the command is generated) doesn't really matter. Or, to be more precise, it is determined by whatever design patterns the framework developer chose.
What matters at the end of the day is that command injection isn't possible and the system's specs (including UI/UX specs) are respected.
Crucially, both input and output constraints are informed by the nature of both the source and the sink. Hence the existence of libraries like DOMPurify and HTMLPurifier, which consider one very particular type of sink. Sometimes you will write code in domains where others haven't written excellent libraries but where sanitization (of either input or output) is needed. E.g., embedded systems.
I'd replace the author's advice with "carefully specify the semantics of your sources and sinks", which is ultimately what the author's actual advice (basically, "use trusted libraries and, when not, be sure to get security review") boils down to.
Not really, no. Output filtering is done in the context of a specific output domain. Input sanitization isn't; the developer who builds sanitization has to guess at all the possible output domains.
"Filter outputs not inputs" is a very old appsec truism.
I think the confusion comes from that not everybody thinks of "passing data to the database layer" as an output, but only an input to the next layer. If you think of this input as an output from the previous layer, then your advice makes perfect sense. But I don't think everyone thinks that way so it might help to clarify what "output" means in this context.
Output filtering is input sanitization. wtf is it that you think you are filtering? Inputs!
> the developer who builds sanitization has to guess at all the possible output domains.
No they don't. They need to carefully understand/document all the places input might be used and ensure no command injections are possible. In some cases (e.g., web apps, where everything is a string) that works relatively well...
Until, of course, you're the one writing the input sanitization logic in the HTML purifier / prepared statements generator. And those code bases do have occasional CVEs. So, random PHP dev can put faith in a library but the system itself never gets away from having to sanitize input!
Output filtering has the complementary problem -- you need to understand every possible input. That's not always trivial like it is in PHP-based websites. Think about e.g. an embedded system sanitizing potentially adversarial time series data (what does that mean / how do you detect it? Harder, right?). Or a compiler. The blog post author even points this out: "...In these cases you’re best off using a proper SQL parser (like this one) to ensure it’s a well-formed SELECT query – but doing this correctly is not trivial, so be sure to get security review."
Ultimately, "Filter outputs not inputs" is incomplete advice that kinda sorta works well for the most part in web apps. The correct advice is, again, "carefully specify the semantics of your sources and sinks".
> Output filtering has the complementary problem -- you need to understand every possible input.
No, you simply need to understand the encoding rules of the sink. Which is precisely why "sanitizing input" is plain nonsense: Whether a particular unescaped character has some meta character function is not a property of the character, but of the output language, so you can not possibly "sanitize input" in any meaningful sense, unless you mean by that "randomly garble the input".
When I talk to developers about this, I use database storage as an example. There may be computations behind the scenes that mangle the nicely input-sanitized database contents: concatenation with other values, string manipulation, data from some other system. Thus, data that was sanitized upon input is now questionable for output.
This is well-intentioned, but leads to a false sense of security, and sometimes mangles perfectly good input.
And in some applications, for example, ones that must process data in a forensic environment, any change to the input is prohibited.
Thus, the only useful way to think about this is that the contents of the database is toxic and must be sanitized on output. Simply working with the input gives the programmer no useful idea about what is in the database when it comes time to output it.
Frameworks these days help significantly with providing tools to properly parameterize SQL. However, it is unlikely that they handle all the cases. Consider an example where user input from a web page is used to build a column name or table name. This isn't covered by frameworks. That needs to be carefully processed in the code.
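One common approach there is an explicit allowlist for identifiers. A sketch (column and table names are invented; $db is an assumed mysqli connection):

    // Parameterization covers values, not identifiers. Allowlist the
    // identifier explicitly before it goes anywhere near the query.
    $allowed = ['created_at', 'title', 'score'];
    $col = $_GET['sort'] ?? 'created_at';
    if (!in_array($col, $allowed, true)) {
        http_response_code(400);
        exit('invalid sort column');
    }
    $limit = 20;
    $stmt = $db->prepare("SELECT * FROM posts ORDER BY {$col} DESC LIMIT ?");
    $stmt->bind_param('i', $limit);
    $stmt->execute();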
>Ultimately, "Filter outputs not inputs" is incomplete advice that kinda sorta works well for the most part in web apps. The correct advice is, again, "carefully specify the semantics of your sources and sinks".
It is in fact the primary advice that should be followed.
So sanitization of input is a good idea, but if output is not properly encoded, somebody else is likely to profit.
Sorry, this still seems like a terribly hacky way to think about code.
Again, if you write a template engine or a SQL engine, the code the library's developer writes to determine how holes are safely filled is literally sanitizing input! You never get away from sanitizing inputs, you just do it further from the source and closer to the sink.
> So sanitization of input is a good idea
Right. "Don’t try to sanitize input" is bad advice. Also, the whole point of escaping outputs is that you don't trust inputs. Escaping outputs is done to sanitize inputs.
If by "sanitize input" you mean "add some backslashes to $_GET values like it's 1995", well, I guess, point taken. But then, the actually good advice should be "step back learn how to think more systematically about your code", not "escape outputs instead of inputs!"
> I'd replace the author's advice with "carefully specify the semantics of your sources and sinks",
I think that's the abstraction, but the author is presenting it in a way that requires repeating frequently simply because new programmers arrive ready to do damage every day, and the two forms of input sanitization are a great intro into how the Real World (tm) conspires against you.