fix: Reject parse trees with missing nodes #587

Merged
wetneb merged 1 commit from 554-missing-nodes into main 2025-09-11 23:01:33 +02:00
Owner

Fixes #554.
I had to adjust a bunch of test cases which cannot parse anymore.
The fact that our JSON parser doesn't accept trailing commas is a
concern, I think we should migrate to a more accepting one, such as
https://github.com/Joakker/tree-sitter-json5.

Fixes #554. I had to adjust a bunch of test cases which cannot parse anymore. The fact that our JSON parser doesn't accept trailing commas is a concern, I think we should migrate to a more accepting one, such as https://github.com/Joakker/tree-sitter-json5.
fix: Reject parse trees with missing nodes
All checks were successful
/ test (pull_request) Successful in 42s
a55b46ce07
Fixes #554.
I had to adjust a bunch of test cases which cannot parse anymore.
The fact that our JSON parser doesn't accept trailing commas is a
concern, I think we should migrate to a more accepting one, such as
https://github.com/Joakker/tree-sitter-json5.
Owner

The fact that our JSON parser doesn't accept trailing commas is a
concern

May I ask why? That is consistent with the spec, after all

> The fact that our JSON parser doesn't accept trailing commas is a concern May I ask why? That [is consistent with the spec](https://stackoverflow.com/a/201856), after all
Author
Owner

I think the variants without commas at the end are widespread enough that it would be worth supporting it.

I think the variants without commas at the end are widespread enough that it would be worth supporting it.
ada4a left a comment
Owner

LGTM, just some questions regarding the error message

LGTM, just some questions regarding the error message
src/ast.rs Outdated
@ -329,0 +330,4 @@
let full_range = node.range();
return Err(format!(
"parse error at {}:{}, expected a missing token: {}",
Owner

I think having the node kind in backticks would help readibility, WDYT?

Also, "expecting" sounds a bit wrong here I think? If anything, we "encountered" a missing node. But given that we output the kind of the missing node, maybe phrasing it like "parse error at {pos}, a missing {kind} node" would work? Not really sure about this one though

I think having the node kind in backticks would help readibility, WDYT? Also, "expecting" sounds a bit wrong here I think? If anything, we "encountered" a missing node. But given that we output the kind of the missing node, maybe phrasing it like "parse error at {pos}, a missing `{kind}` node" would work? Not really sure about this one though
Author
Owner

Well, if you look at it from mergiraf's internal point of view, yes, we are encountering a "MISSING token" returned by the tree-sitter parser. But from a user perspective, I would rather describe the problem as "we expect this bit of syntax at this position and it's not there". I'm trying to formulate the error message in a way that makes more sense to the user.

Well, if you look at it from mergiraf's internal point of view, yes, we are encountering a "MISSING token" returned by the tree-sitter parser. But from a user perspective, I would rather describe the problem as "we expect this bit of syntax at this position and it's not there". I'm trying to formulate the error message in a way that makes more sense to the user.
Author
Owner

So maybe just "expected {}" would do, the missing part is sort of redundant I would say.

So maybe just `"expected `{}`"` would do, the missing part is sort of redundant I would say.
Owner

So maybe just "expected {}" would do, the missing part is sort of redundant I would say.

Oh, that does sound better! Yeah, let's go with that

> So maybe just `"expected `{}`"` would do, the missing part is sort of redundant I would say. Oh, that does sound better! Yeah, let's go with that
ada4a marked this conversation as resolved
src/ast.rs Outdated
@ -1241,0 +1252,4 @@
let ctx = ctx();
let lang_profile = LangProfile::detect_from_filename("test.java")
.expect("Could not load the Java lang profile");
let parse = AstNode::parse("class Test {", lang_profile, &ctx.arena, &ctx.ref_arena);
Owner

nit: I think at this point we could as well use Arena::new() directly here..

nit: I think at this point we could as well use `Arena::new()` directly here..
ada4a marked this conversation as resolved
@ -570,3 +570,3 @@
fn dont_bundle_into_delims() {
let ctx = ctx();
let source = "(/* this is a comment */)";
let source = "fn test(/* this is a comment */) {}";
Owner

nit: I think I remember this being parsed as an expression_statement, so adding a semicolon after the tuple would've probably sufficed? Though your fix works just as well of course

nit: I think I remember this being parsed as an `expression_statement`, so adding a semicolon after the tuple would've probably sufficed? Though your fix works just as well of course
ada4a marked this conversation as resolved
Owner

I think the variants without commas at the end are widespread enough that it would be worth supporting it.

Sorry, do you mean variants with commas? I honestly have no idea, but I imagine that if someone was consciouy opting into the JSON5 grammar, they would the file the .json5 ending, or something like that? Because if they went with .json, it'd just be a matter of time before some program with a conformant parser would reject their file?

> I think the variants without commas at the end are widespread enough that it would be worth supporting it. Sorry, do you mean variants _with_ commas? I honestly have no idea, but I imagine that if someone was consciouy opting into the JSON5 grammar, they would the file the `.json5` ending, or something like that? Because if they went with `.json`, it'd just be a matter of time before some program with a conformant parser would reject their file?
Author
Owner

Yes sorry, I mean with trailing commas… I think in general it's good if our parsers err more on the accepting side (which is often the case for syntax-highlighting grammars), so that we are able to deal with slightly broken input.
Another reason to switch to that parser would be, for instance, to better support JSON snippets in markdown which could have comments, for documentation purposes, even though it's not valid JSON:

# Hello

This is a nice markdown file explaining some things:
```json
{
     "key": 3, // this value is really important
     "foo": 4,
}
```
and the documentation goes on.

You can see how Forgejo itself accepts to highlight the comment even though I specified that this sample is injson, not json5.

Anyway, I don't have strong feelings about it, but I wouldn't do it in this PR anyway.
I would be excited to merge this PR because it's blocking the addition of a test case I wrote for the minimizer, which would enable us to merge that one too… ^^

Yes sorry, I mean with trailing commas… I think in general it's good if our parsers err more on the accepting side (which is often the case for syntax-highlighting grammars), so that we are able to deal with slightly broken input. Another reason to switch to that parser would be, for instance, to better support JSON snippets in markdown which could have comments, for documentation purposes, even though it's not valid JSON: ````md # Hello This is a nice markdown file explaining some things: ```json { "key": 3, // this value is really important "foo": 4, } ``` and the documentation goes on. ```` You can see how Forgejo itself accepts to highlight the comment even though I specified that this sample is in`json`, not `json5`. Anyway, I don't have strong feelings about it, but I wouldn't do it in this PR anyway. I would be excited to merge this PR because it's blocking the addition of a test case I wrote for the minimizer, which would enable us to merge that one too… ^^
Owner

Anyway, I don't have strong feelings about it, but I wouldn't do it in this PR anyway.
I would be excited to merge this PR because it's blocking the addition of a test case I wrote for the minimizer, which would enable us to merge that one too… ^^

Of course, go on! (after the error message thing is done^^)

> Anyway, I don't have strong feelings about it, but I wouldn't do it in this PR anyway. > I would be excited to merge this PR because it's blocking the addition of a test case I wrote for the minimizer, which would enable us to merge that one too… ^^ Of course, go on! (after the error message thing is done^^)
wetneb force-pushed 554-missing-nodes from a55b46ce07 to 9fdae959aa 2025-09-11 21:06:59 +02:00 Compare
Author
Owner

Ah yes sorry about that, it looks like my attention span is getting dangerously small… To my defense, I went into a succession of nested rabbit holes:

it's rare that it gets that bad! :-D

Ah yes sorry about that, it looks like my attention span is getting dangerously small… To my defense, I went into a succession of nested rabbit holes: - I want to get the minimizer PR merged, so I need at least test for that - the minimizer is too aggressive because it accepts to parse things where some tokens are missing, so let's fix #554 - to understand where the missing tokens are, I need to run the tree-sitter playground on some grammars - but the tree-sitter playground is broken, so I opened https://github.com/tree-sitter/tree-sitter/issues/4815 - then investigate and file https://github.com/tree-sitter/tree-sitter.github.io/pull/17 it's rare that it gets that bad! :-D
wetneb force-pushed 554-missing-nodes from 9fdae959aa to 68845ede84 2025-09-11 21:20:10 +02:00 Compare
ada4a approved these changes 2025-09-11 22:12:32 +02:00
ada4a left a comment
Owner

LGTM, thank you!

LGTM, thank you!
wetneb merged commit 43d4c41c30 into main 2025-09-11 23:01:33 +02:00
ada4a deleted branch 554-missing-nodes 2025-09-17 13:51:14 +02:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: mergiraf/mergiraf#587
No description provided.