Arrrghh, it's 2020 and developers are still too dumb to understand the differenc...

breischl · on Feb 27, 2020

I think this stuff is hilarious too, especially in the era of webpages that are really JavaScript apps fed by a JSON API. It is not at all hard to have your JS app operate directly on DOM nodes without using markup at all. Create the DIVs and whatever else you want, create text nodes and shove them in there, done. Zero possibility of injection attacks, because you told the browser "this is plain text, don't parse it" instead of making it guess.

I hand-rolled a library to do this back in like 2011. It took less than a day. The only downside was that you had to write the markup in JS code instead of templatized HTML, but it wasn't even that hard with a bit of syntactic sugar. It's fast too - creating DOM nodes directly is much less work for the browser than parsing HTML to create the same DOM nodes.
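
A minimal sketch of the approach described above, with Python's `xml.dom.minidom` standing in for the browser DOM (an assumption made only so the example runs outside a browser; the original library was JavaScript and is not shown in the thread). The `build_comment` helper and the sample strings are hypothetical. The point is that user input goes into a text node, so the serializer escapes it and it is never parsed as markup:

```python
from xml.dom.minidom import Document


def build_comment(author, body):
    """Build <div class="comment"><b>author</b><p>body</p></div> node by node."""
    doc = Document()
    div = doc.createElement("div")
    div.setAttribute("class", "comment")

    name = doc.createElement("b")
    # Text nodes hold plain text; nothing in them is interpreted as markup.
    name.appendChild(doc.createTextNode(author))
    div.appendChild(name)

    para = doc.createElement("p")
    para.appendChild(doc.createTextNode(body))
    div.appendChild(para)
    return div


# Hypothetical hostile input: it ends up as escaped text, not as a script element.
markup = build_comment("mallory", "<script>alert(1)</script>").toxml()
print(markup)
```

In a browser the equivalent calls would be `document.createElement` and `document.createTextNode`, and no serialization step is needed at all.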

matheusmoreira · on Feb 27, 2020

Parser support is nonexistent in most, if not all, languages. Every language I know is able to parse regular languages at best. Parsing HTML or SQL and manipulating the resulting tree is not the first solution developers think of.

We should be able to look up some RFC, give the EBNF grammar to a library and get a parser out of it. In order to do that today, we need to use ancient parser generator tools. Why? A parse(grammar, input) -> tree function would be easier to use. The Earley algorithm can receive a grammar as input.
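
To make the `parse(grammar, input) -> tree` idea concrete, here is a rough sketch of an Earley-style parser that takes the grammar as plain data. Everything about the interface is an assumption for illustration: the grammar is a dict from nonterminal to a list of right-hand sides (not real EBNF), a start symbol is passed explicitly, the input is already tokenized, empty productions are not supported, and only the first derivation found is returned.

```python
def parse(grammar, start, tokens):
    """Earley parse. Returns a nested tuple tree (head, child, ...) or None.

    Assumptions: symbols that are not keys of `grammar` are terminals,
    and no production has an empty right-hand side.
    """
    n = len(tokens)
    # chart[i] maps an item (head, body, dot, origin) to the child trees
    # collected so far for the first derivation found.
    chart = [dict() for _ in range(n + 1)]
    for body in grammar[start]:
        chart[0][(start, body, 0, 0)] = ()

    for i in range(n + 1):
        agenda = list(chart[i])
        while agenda:
            item = agenda.pop()
            head, body, dot, origin = item
            children = chart[i][item]
            if dot < len(body):
                sym = body[dot]
                if sym in grammar:
                    # Predict: expect every production of sym starting here.
                    for alt in grammar[sym]:
                        new = (sym, alt, 0, i)
                        if new not in chart[i]:
                            chart[i][new] = ()
                            agenda.append(new)
                elif i < n and tokens[i] == sym:
                    # Scan: the terminal matches the next token.
                    new = (head, body, dot + 1, origin)
                    chart[i + 1].setdefault(new, children + (sym,))
            else:
                # Complete: advance every item that was waiting for `head`.
                tree = (head,) + children
                for parent, pchildren in list(chart[origin].items()):
                    phead, pbody, pdot, porigin = parent
                    if pdot < len(pbody) and pbody[pdot] == head:
                        new = (phead, pbody, pdot + 1, porigin)
                        if new not in chart[i]:
                            chart[i][new] = pchildren + (tree,)
                            agenda.append(new)

    for (head, body, dot, origin), children in chart[n].items():
        if head == start and dot == len(body) and origin == 0:
            return (head,) + children
    return None


# Hypothetical toy grammar: arithmetic with the usual precedence.
grammar = {
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("n",), ("(", "E", ")")],
}
tree = parse(grammar, "E", "n + n * n".split())
```

Note that the grammar is left-recursive, which this style of parser handles without any rewriting; that is part of why Earley is a natural fit for a "hand it any grammar" library.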

Related: http://langsec.org/

patrec · on Feb 27, 2020

Well, I'm not some prefix-fanatic, but much of the problem would not exist in the first place if we had just used some sexp-style syntax for HTML. It would be more pleasant to edit, and faster and much easier to parse for both humans and machines to boot. Another billion-dollar mistake.

So I feel a bit ambivalent about attempts to lower the cost of pushing more overcomplex grammars out into the world. When did you last use, in earnest, a non-sexp/internal DSL for something like a build system that didn't engender in you an occasional urge to visit physical violence on its creators? But what I'd unambiguously like to see is easier parsing of "sane" languages and the death of perl-style regexps.

Still, my guess would be that 95% of the trouble comes from two sources: HTML (including JS, CSS, SVG etc., unfortunately) and SQL. And most of the remaining 5% comes from bash :) So just dealing with those three would make a big dent.

Also, things are not quite as bleak as you make them out to be: JSX is much saner, and any established mainstream language has a conforming HTML5 parser these days. (Sadly, to do it properly you also want something that deals with the various other languages that get munged into HTML, i.e. CSS, JavaScript and gimped XML, and here the situation is less good.) SQL is thornier and has many wildly different dialects, but unless you need dynamic queries, parameterized queries are available everywhere.
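
For the SQL half, a minimal sketch of a parameterized query using Python's standard `sqlite3` module. The table, column and hostile value are made up for illustration. The value travels separately from the SQL text, so it is stored and compared as data and never parsed as SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

# Hypothetical hostile input that would break a string-concatenated query.
hostile = "Robert'); DROP TABLE users;--"

# The ? placeholder is filled in by the driver, not by string formatting.
conn.execute("INSERT INTO users (name) VALUES (?)", (hostile,))
rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (hostile,)
).fetchall()
print(rows)
```

Placeholders only cover values; table names, column names and other structural parts of a dynamic query still need a different approach.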

In fact, a 1%-of-the-effort/80%-of-the-benefit approach is to not bother with parsing at all and to just use different types for e.g. HTML and text: interpolating HTML into HTML is plain string interpolation, while interpolating text (i.e. plain strings) into HTML inserts the escaped string.
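
A small sketch of the two-types idea, assuming a marker class `Html` and a helper `render` (both hypothetical names) and assuming the template string itself is trusted. Plain `str` values are escaped with the standard `html.escape`; values already typed as `Html` are inserted unchanged:

```python
from html import escape


class Html(str):
    """Marker type: a string that is already safe markup."""


def render(template, *parts):
    """Fill a trusted template; escape anything that is not already Html."""
    safe = [p if isinstance(p, Html) else escape(str(p)) for p in parts]
    return Html(template.format(*safe))


# Plain text is escaped on the way in.
as_text = render("<p>{}</p>", "<b>hi</b>")
# Markup that is already typed as Html is interpolated as-is.
as_html = render("<div>{}</div>", Html("<b>hi</b>"))
# Results are Html themselves, so they nest without double escaping.
nested = render("<section>{}</section>", as_text)
```

One design choice worth noting: ordinary string operations on an `Html` value return a plain `str`, so anything built by ad hoc concatenation falls back to being escaped, which is the safe direction to fail in.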