
Arrrghh, it's 2020 and developers are still too dumb to understand the difference between a string (list of characters) and a tree (sql, html, ...).

No, the solution is not to "sanitize" input, output or both! The solution is to not use the same type for text and trees!

We should rid the world of brain-damaged templating solutions like jinja, go template and similar garbage that pretends your SQL or HTML is some flat string that just needs a bit of extra contextual "escaping" magic [1].

If you interpolate into a proper AST there is no problem and you need no "escaping". For efficiency reasons you may not want to literally interpolate into an actual AST and then serialize all of it to a string again just to write it to a socket on the next line, but that's just an optimization.

Bourne shell fucked this up too (for an easier case of interpolation, since there is basically no nesting), and it remains a constant source of severe bugs and security holes in shell scripting. By contrast, lisp has been doing this right for literally decades:

    `(html (div ,some-string-to-interpolate ,@a-list-of-inline-elements-to-splice))
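For readers who don't speak lisp, here is a rough Python analogue of the quasiquote above. It is only a sketch: `Element`, `el` and `serialize` are made-up names, and attributes, void elements and so on are ignored. The point is that strings enter the tree as text nodes, and escaping happens exactly once, as a detail of serialization.

```python
import html
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Element:
    """A tree node: a tag name plus child nodes (hypothetical type)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


# A node is either an element or a plain string, which always means text.
Node = Union[Element, str]


def el(tag: str, *children: Node) -> Element:
    """Build an element; *children plays the role of lisp's ,@ splice."""
    return Element(tag, list(children))


def serialize(node: Node) -> str:
    """Turn the tree into markup. Text is escaped here and nowhere else."""
    if isinstance(node, str):
        return html.escape(node, quote=False)
    inner = "".join(serialize(child) for child in node.children)
    return f"<{node.tag}>{inner}</{node.tag}>"


some_string = "<script>alert(1)</script>"
inline_elements = [el("em", "a"), el("b", "b")]
tree = el("html", el("div", some_string, *inline_elements))
print(serialize(tree))
```

The hostile string cannot become a `script` element, because nothing ever parses it as markup.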

[1] Yes, I get it's "convenient" to have a "universal" solution that works for any type of file. But html and sql in particular are easily important enough to have a correct solution. It's not hard, it's not inherently slower, and it's way more convenient in any real sense than puzzling over several Rube Goldberg sanitization schemes, running some stupid security linter and paying $$$ to pen-testers for hunting down this crap.


I think this stuff is hilarious too, especially in the era of webpages that are really Javascript apps fed by a JSON API. It is not at all hard to have your JS app directly operate on DOM nodes without using markup at all. Create the DIVs and whatever else you want, create text nodes and shove them in there, done. Zero possibility of injection attacks because you told the browser "this is plain text, don't parse it" instead of making it guess.

I hand-rolled a library to do this back in like 2011. It took less than a day. The only downside was that you had to write the markup in JS code instead of templatized HTML, but it wasn't even that hard with a bit of syntactic sugar. It's fast too: creating DOM nodes directly is much less work than parsing HTML to create DOM nodes.
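To make the "create nodes, create text nodes, shove them in" idea concrete outside a browser, here is a small sketch using Python's standard `xml.dom.minidom`, whose `createElement` / `createTextNode` / `appendChild` calls mirror the browser DOM API. The function name `build_comment` and the markup shape are invented for illustration; this is not the library described above.

```python
from xml.dom.minidom import Document


def build_comment(author: str, body: str) -> str:
    """Build <div class="comment"> from untrusted strings, without parsing any markup."""
    doc = Document()

    div = doc.createElement("div")
    div.setAttribute("class", "comment")

    strong = doc.createElement("strong")
    # createTextNode says "this is plain text"; it is never parsed as markup.
    strong.appendChild(doc.createTextNode(author))
    div.appendChild(strong)

    paragraph = doc.createElement("p")
    paragraph.appendChild(doc.createTextNode(body))
    div.appendChild(paragraph)

    # Serialization escapes text nodes on the way out.
    return div.toxml()


rendered = build_comment("mallory", "<script>alert(1)</script>")
print(rendered)
```

The hostile body ends up as the text content of the `p` element, which is what the "don't make the browser guess" argument asks for.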


Parser support is non-existent in most if not all languages. Every language I know can parse regular languages out of the box at best. Parsing HTML or SQL and manipulating the resulting tree is not the first solution developers think of.

We should be able to look up some RFC, give the EBNF grammar to a library and get a parser out of it. In order to do that today, we need to use ancient parser generator tools. Why? A parse(grammar, input) -> tree function would be easier to use. The Earley algorithm can receive a grammar as input.
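As a sketch of what the `parse(grammar, input) -> tree` shape could look like, here is a tiny grammar-as-data parser in Python. To be clear, this is not Earley: it is a naive backtracking top-down interpreter that loops forever on left-recursive rules and can take exponential time, and it is only meant to show the interface. The grammar encoding (a dict of alternatives), the extra `start` argument and the tuple-based tree are all assumptions made for the example.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# nonterminal -> list of alternatives; each alternative is a list of symbols.
# Any symbol that is not a key of the grammar is treated as a terminal token.
Grammar = Dict[str, List[List[str]]]


def _match_symbol(grammar: Grammar, symbol: str, tokens: List[str], pos: int) -> Iterator[Tuple[object, int]]:
    """Yield (tree, next_pos) for every way `symbol` can match at `pos`."""
    if symbol in grammar:
        for alternative in grammar[symbol]:
            for children, end in _match_sequence(grammar, alternative, tokens, pos):
                yield (symbol, children), end
    elif pos < len(tokens) and tokens[pos] == symbol:
        yield symbol, pos + 1


def _match_sequence(grammar: Grammar, symbols: List[str], tokens: List[str], pos: int) -> Iterator[Tuple[list, int]]:
    """Yield (children, next_pos) for every way the symbol sequence can match."""
    if not symbols:
        yield [], pos
        return
    for first, mid in _match_symbol(grammar, symbols[0], tokens, pos):
        for rest, end in _match_sequence(grammar, symbols[1:], tokens, mid):
            yield [first] + rest, end


def parse(grammar: Grammar, start: str, tokens: List[str]) -> Optional[object]:
    """Return the first tree that consumes all tokens, or None if there is none."""
    for tree, end in _match_symbol(grammar, start, tokens, 0):
        if end == len(tokens):
            return tree
    return None


GRAMMAR: Grammar = {
    "expr": [["term", "+", "expr"], ["term"]],
    "term": [["(", "expr", ")"], ["x"]],
}

print(parse(GRAMMAR, "expr", list("x+(x+x)")))
```

A real version would swap the two helper functions for an Earley (or similar) chart parser while keeping the same call signature.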

Related: http://langsec.org/


Well, I'm not some prefix-fanatic, but much of the problem would not exist in the first place if we had just used some sexp-style syntax for HTML. It would be more pleasant to edit, and faster and much easier to parse for both humans and machines to boot. Another billion dollar mistake.

So I feel a bit ambivalent about attempts to lower the cost of pushing more overcomplex grammars out into the world. When did you last use, in earnest, a non-sexp/internal DSL for something like a build system that didn't engender in you an occasional urge to visit physical violence on its creators? But what I'd unambiguously like to see is easier parsing of "sane" languages and the death of perl-style regexps.

Still, my guess would be that 95% of the trouble comes from two sources: html (including js, css, svg etc., unfortunately) and sql. And most of the remaining 5% comes from bash :) So just dealing with those three would make a big dent.

Also things are not quite so bleak as you make them out to be: jsx is much saner and any established mainstream language has a conforming html5 parser these days (sadly, to do it properly you also want something that deals with the various other languages that get munged into html: css, javascript, gimped xml and here the situation is less good). SQL is thornier (and has many wildly different dialects), but unless you need dynamic queries, parameterized queries are available everywhere.
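For the SQL half, a minimal parameterized-query sketch with Python's standard `sqlite3` module. The `users` table and the `find_user` helper are invented for the example; the `?` placeholder style is sqlite3's, and other drivers use different markers.

```python
import sqlite3


def find_user(conn: sqlite3.Connection, name: str) -> list:
    """Look up users by exact name; `name` is passed as data, never as SQL text."""
    cursor = conn.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

# With string concatenation this input would match every row.
hostile = "alice' OR '1'='1"
print(find_user(conn, "alice"))
print(find_user(conn, hostile))
```

The query text and the value travel separately, so the hostile input is just a name that matches nobody.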

In fact, a 1% effort/80% of the benefits approach is to not bother with parsing at all and to just use different types for e.g. html and text: interpolating html into html is plain string interpolation, while interpolating text (i.e. plain strings) into html escapes the string first.
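A minimal sketch of that two-type approach in Python, assuming a hypothetical `Html` wrapper and `render` helper (neither is a real library API). Anything wrapped in `Html` is interpolated as-is, and everything else is treated as text and escaped with the standard `html.escape`.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class Html:
    """Markup that is already safe to emit (hypothetical wrapper type)."""
    markup: str


def to_html(value: object) -> Html:
    """Html passes through unchanged; anything else is text and gets escaped."""
    if isinstance(value, Html):
        return value
    return Html(html.escape(str(value)))


def render(template: str, **values: object) -> Html:
    """Fill a trusted template literal; the caller never decides about escaping.

    Assumption: `template` is a literal written by the developer, not user input.
    """
    safe = {key: to_html(value).markup for key, value in values.items()}
    return Html(template.format(**safe))


greeting = Html("<b>Hi</b>")   # html into html: interpolated as-is
name = "Tom & <Jerry>"         # text into html: escaped
page = render("<p>{greeting} {name}</p>", greeting=greeting, name=name)
print(page.markup)
```

Forgetting to escape stops being possible by accident: a plain `str` is escaped by default, and skipping that requires writing `Html(...)` explicitly.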



