fix: Reject parse trees with missing nodes #587

wetneb · 2025-09-11T18:19:34+02:00

wetneb commented

2025-09-11 18:19:34 +02:00

Fixes #554.
I had to adjust a bunch of test cases which cannot parse anymore.
The fact that our JSON parser doesn't accept trailing commas is a
concern, I think we should migrate to a more accepting one, such as
https://github.com/Joakker/tree-sitter-json5.

Fixes #554. I had to adjust a bunch of test cases which cannot parse anymore. The fact that our JSON parser doesn't accept trailing commas is a concern, I think we should migrate to a more accepting one, such as https://github.com/Joakker/tree-sitter-json5.

wetneb added 1 commit

2025-09-11 18:19:36 +02:00

fix: Reject parse trees with missing nodes

/ test (pull_request) Successful in 42s

Details

a55b46ce07

Fixes #554.
I had to adjust a bunch of test cases which cannot parse anymore.
The fact that our JSON parser doesn't accept trailing commas is a
concern, I think we should migrate to a more accepting one, such as
https://github.com/Joakker/tree-sitter-json5.

ada4a commented

2025-09-11 18:27:37 +02:00

The fact that our JSON parser doesn't accept trailing commas is a
concern

May I ask why? That is consistent with the spec, after all

> The fact that our JSON parser doesn't accept trailing commas is a concern May I ask why? That [is consistent with the spec](https://stackoverflow.com/a/201856), after all

wetneb commented

2025-09-11 18:30:03 +02:00

I think the variants without commas at the end are widespread enough that it would be worth supporting it.

ada4a reviewed

2025-09-11 18:52:53 +02:00

ada4a left a comment

LGTM, just some questions regarding the error message

src/ast.rs Outdated

					
				@ -329,0 +330,4 @@

				            let full_range = node.range();

				            return Err(format!(

				                "parse error at {}:{}, expected a missing token: {}",

ada4a commented

2025-09-11 18:47:46 +02:00

I think having the node kind in backticks would help readibility, WDYT?

Also, "expecting" sounds a bit wrong here I think? If anything, we "encountered" a missing node. But given that we output the kind of the missing node, maybe phrasing it like "parse error at {pos}, a missing {kind} node" would work? Not really sure about this one though

I think having the node kind in backticks would help readibility, WDYT? Also, "expecting" sounds a bit wrong here I think? If anything, we "encountered" a missing node. But given that we output the kind of the missing node, maybe phrasing it like "parse error at {pos}, a missing `{kind}` node" would work? Not really sure about this one though

wetneb commented

2025-09-11 18:56:42 +02:00

Well, if you look at it from mergiraf's internal point of view, yes, we are encountering a "MISSING token" returned by the tree-sitter parser. But from a user perspective, I would rather describe the problem as "we expect this bit of syntax at this position and it's not there". I'm trying to formulate the error message in a way that makes more sense to the user.

wetneb commented

2025-09-11 18:58:06 +02:00

So maybe just "expected {}" would do, the missing part is sort of redundant I would say.

So maybe just `"expected `{}`"` would do, the missing part is sort of redundant I would say.

ada4a commented

2025-09-11 19:01:31 +02:00

So maybe just "expected {}" would do, the missing part is sort of redundant I would say.

Oh, that does sound better! Yeah, let's go with that

> So maybe just `"expected `{}`"` would do, the missing part is sort of redundant I would say. Oh, that does sound better! Yeah, let's go with that

ada4a marked this conversation as resolved

src/ast.rs Outdated

					
				@ -1241,0 +1252,4 @@

				        let ctx = ctx();

				        let lang_profile = LangProfile::detect_from_filename("test.java")

				            .expect("Could not load the Java lang profile");

				        let parse = AstNode::parse("class Test {", lang_profile, &ctx.arena, &ctx.ref_arena);

ada4a commented

2025-09-11 18:51:01 +02:00

nit: I think at this point we could as well use Arena::new() directly here..

nit: I think at this point we could as well use `Arena::new()` directly here..

ada4a marked this conversation as resolved

src/ast/bundle_comments.rs Outdated

					
				@ -570,3 +570,3 @@

				    fn dont_bundle_into_delims() {

				        let ctx = ctx();

				        let source = "(/* this is a comment */)";

				        let source = "fn test(/* this is a comment */) {}";

ada4a commented

2025-09-11 18:34:52 +02:00

nit: I think I remember this being parsed as an expression_statement, so adding a semicolon after the tuple would've probably sufficed? Though your fix works just as well of course

nit: I think I remember this being parsed as an `expression_statement`, so adding a semicolon after the tuple would've probably sufficed? Though your fix works just as well of course

ada4a marked this conversation as resolved

ada4a commented

2025-09-11 19:00:07 +02:00

I think the variants without commas at the end are widespread enough that it would be worth supporting it.

Sorry, do you mean variants with commas? I honestly have no idea, but I imagine that if someone was consciouy opting into the JSON5 grammar, they would the file the .json5 ending, or something like that? Because if they went with .json, it'd just be a matter of time before some program with a conformant parser would reject their file?

> I think the variants without commas at the end are widespread enough that it would be worth supporting it. Sorry, do you mean variants _with_ commas? I honestly have no idea, but I imagine that if someone was consciouy opting into the JSON5 grammar, they would the file the `.json5` ending, or something like that? Because if they went with `.json`, it'd just be a matter of time before some program with a conformant parser would reject their file?

wetneb commented

2025-09-11 20:32:31 +02:00

Yes sorry, I mean with trailing commas… I think in general it's good if our parsers err more on the accepting side (which is often the case for syntax-highlighting grammars), so that we are able to deal with slightly broken input.
Another reason to switch to that parser would be, for instance, to better support JSON snippets in markdown which could have comments, for documentation purposes, even though it's not valid JSON:

# Hello

This is a nice markdown file explaining some things:
```json
{
     "key": 3, // this value is really important
     "foo": 4,
}
```
and the documentation goes on.

You can see how Forgejo itself accepts to highlight the comment even though I specified that this sample is injson, not json5.

Anyway, I don't have strong feelings about it, but I wouldn't do it in this PR anyway.
I would be excited to merge this PR because it's blocking the addition of a test case I wrote for the minimizer, which would enable us to merge that one too… ^^

Yes sorry, I mean with trailing commas… I think in general it's good if our parsers err more on the accepting side (which is often the case for syntax-highlighting grammars), so that we are able to deal with slightly broken input. Another reason to switch to that parser would be, for instance, to better support JSON snippets in markdown which could have comments, for documentation purposes, even though it's not valid JSON: ````md # Hello This is a nice markdown file explaining some things: ```json { "key": 3, // this value is really important "foo": 4, } ``` and the documentation goes on. ```` You can see how Forgejo itself accepts to highlight the comment even though I specified that this sample is in`json`, not `json5`. Anyway, I don't have strong feelings about it, but I wouldn't do it in this PR anyway. I would be excited to merge this PR because it's blocking the addition of a test case I wrote for the minimizer, which would enable us to merge that one too… ^^

ada4a commented

2025-09-11 20:35:19 +02:00

Anyway, I don't have strong feelings about it, but I wouldn't do it in this PR anyway.
I would be excited to merge this PR because it's blocking the addition of a test case I wrote for the minimizer, which would enable us to merge that one too… ^^

Of course, go on! (after the error message thing is done^^)

> Anyway, I don't have strong feelings about it, but I wouldn't do it in this PR anyway. > I would be excited to merge this PR because it's blocking the addition of a test case I wrote for the minimizer, which would enable us to merge that one too… ^^ Of course, go on! (after the error message thing is done^^)

😆 1

wetneb force-pushed 554-missing-nodes from a55b46ce07 to 9fdae959aa

2025-09-11 21:06:59 +02:00

Compare

wetneb commented

2025-09-11 21:11:15 +02:00

Ah yes sorry about that, it looks like my attention span is getting dangerously small… To my defense, I went into a succession of nested rabbit holes:

I want to get the minimizer PR merged, so I need at least test for that
the minimizer is too aggressive because it accepts to parse things where some tokens are missing, so let's fix #554
to understand where the missing tokens are, I need to run the tree-sitter playground on some grammars
but the tree-sitter playground is broken, so I opened https://github.com/tree-sitter/tree-sitter/issues/4815
then investigate and file https://github.com/tree-sitter/tree-sitter.github.io/pull/17

it's rare that it gets that bad! :-D

Ah yes sorry about that, it looks like my attention span is getting dangerously small… To my defense, I went into a succession of nested rabbit holes: - I want to get the minimizer PR merged, so I need at least test for that - the minimizer is too aggressive because it accepts to parse things where some tokens are missing, so let's fix #554 - to understand where the missing tokens are, I need to run the tree-sitter playground on some grammars - but the tree-sitter playground is broken, so I opened https://github.com/tree-sitter/tree-sitter/issues/4815 - then investigate and file https://github.com/tree-sitter/tree-sitter.github.io/pull/17 it's rare that it gets that bad! :-D

😆 1

wetneb force-pushed 554-missing-nodes from 9fdae959aa to 68845ede84

2025-09-11 21:20:10 +02:00

Compare

ada4a approved these changes

2025-09-11 22:12:32 +02:00

ada4a left a comment

LGTM, thank you!

wetneb merged commit 43d4c41c30 into main

2025-09-11 23:01:33 +02:00

wetneb referenced this pull request from a commit

2025-09-11 23:01:33 +02:00