test(supported_langs): check that commutative parents' children all have signatures defined #507

Open
ada4a wants to merge 3 commits from comm-parents-check-have-all-sigs into main
Owner

for #489

for #489
add todo
Some checks failed
/ test (pull_request) Failing after 36s
8cc06cabf2
ada4a force-pushed comm-parents-check-have-all-sigs from 8cc06cabf2 to 76fb1168e8 2025-07-14 21:57:10 +02:00 Compare
@ -167,0 +181,4 @@
// `field_declaration` above -- is there a way to reuse it instead?
Field("declarator"),
Field("name"),
],
Owner

I'm surprised by the three components of this signature. I would expect that it's not possible to declare two records with the same name, so vec![Field("name")] would be enough, no?

I'm surprised by the three components of this signature. I would expect that it's not possible to declare two records with the same name, so `vec![Field("name")]` would be enough, no?
Author
Owner

I don't really know if I'm being honest. I took this from method_declaration, which does need it because of overloading, but I guess structs/records/enums don't?

I don't really know if I'm being honest. I took this from `method_declaration`, which does need it because of overloading, but I guess structs/records/enums don't?
Owner

Yes, not only they don't need it, but adding it will prevent the signatures to be equal if the records have different fields.

For instance, in a case like this:

public record Person (String name, String address) {}

public record Person (String firstName, String lastName, String address) {}

we do want to flag those as being duplicates, because the Java compiler will not accept that.
With the signature definition you have, those two declarations would get different signatures, so we wouldn't detect that they are in conflict.

Yes, not only they don't need it, but adding it will prevent the signatures to be equal if the records have different fields. For instance, in a case like this: ```java public record Person (String name, String address) {} public record Person (String firstName, String lastName, String address) {} ``` we do want to flag those as being duplicates, because the Java compiler will not accept that. With the signature definition you have, those two declarations would get different signatures, so we wouldn't detect that they are in conflict.
Author
Owner

You're totally right, I'll go fix it

You're totally right, I'll go fix it
ada4a marked this conversation as resolved
ada4a force-pushed comm-parents-check-have-all-sigs from 76fb1168e8 to d38fde0970 2025-07-14 22:14:07 +02:00 Compare
Author
Owner

In Rust (I decided to work on a language I actually know something about^^), let_declarations are marked as children of commutative parents of both declaration_list (e.g. inside an impl block) and source_file, but I don't think either of those is even allowed to include let-declarations?

I think the reason we have this is that in the node-types.json, we have both declaration_list and source_file list _declaration_statement as one of the possible child types, and the latter in turn has let_declaration as one of its kinds. So this is probably an oversimplification on the grammar side, which we need not follow?

In Rust (I decided to work on a language I actually know something about^^), `let_declaration`s are marked as children of commutative parents of both `declaration_list` (e.g. inside an impl block) and `source_file`, but I don't think either of those is even allowed to include let-declarations? I think the reason we have this is that in the `node-types.json`, we have both `declaration_list` and `source_file` list `_declaration_statement` as one of the possible child types, and the latter in turn has `let_declaration` as one of its kinds. So this is probably an oversimplification on the grammar side, which we need not follow?
Owner

That's a good question. I think it's indeed a simplification. One could try making the grammar tighter and seeing if it still passes the test suite (but it's probably not worth adding more complexity to the parser for the sake of introducing this restriction).

Removing let_declarations from the children groups would probably make sense.

That's a good question. I think it's indeed a simplification. One could try making the grammar tighter and seeing if it still passes the test suite (but it's probably not worth adding more complexity to the parser for the sake of introducing this restriction). Removing `let_declaration`s from the children groups would probably make sense.
not actually a commutative child
WIP: add Rust/impl_item
Some checks failed
/ test (pull_request) Failing after 33s
26ca1386ac
need to decide what the signature should be
@ -262,9 +274,9 @@ pub static SUPPORTED_LANGUAGES: LazyLock<Vec<LangProfile>> = LazyLock::new(|| {
"type_item",
"function_item",
"function_signature_item",
"impl_item",
Author
Owner

looks like this:

impl Foo
where
    Foo: Copy,
{
    fn foo() -> f64 {}
    const TRUE: bool = true;
}

parses like this:

└source_file Commutative
  └impl_item
    ├impl
    ├type: identifier Foo
    ├where_clause Commutative
    │ ├where
    │ ├where_predicate Signature [[Foo], [: Copy]] // this is WIP, ignore for now
    │ │ └ ...
    │ └,
    └body: declaration_list Commutative
      ├{
      ├function_item Signature [[foo]]
      │ └ ...
      ├const_item Signature [[TRUE]]
      │ └ ...
      └}

Firstly, this can very well be marked as a commutative child, since impl blocks are reorderable.

Secondly, if we do add them as a commutative child, what would their signature be?

Theoretically, nothing stops one from having two impl blocks for the same type in the same file -- though normally one would either:

  • create two blocks only if they have different where-bounds (e.g. HashMap::new has no bounds, but HashMap::insert requires the key to be hashable, so you put them in different blocks)
  • create two blocks with the same (missing) bounds, and get a Clippy warning (though not by default apparently)

So in order to detect identical impl blocks, we would need to consider all of:

  • item that the impl is for
  • where bounds
  • all the items defined inside? this one sounds especially daunting
  • apparently, impl Trait for Foo is also an impl_item?.. In that case, we'd also need to look at the type field, where Trait is stored in this case. Although in that case, having two impl X for Y for the same (X,Y) is disallowed
looks like this: ```rs impl Foo where Foo: Copy, { fn foo() -> f64 {} const TRUE: bool = true; } ``` parses like this: ``` └source_file Commutative └impl_item ├impl ├type: identifier Foo ├where_clause Commutative │ ├where │ ├where_predicate Signature [[Foo], [: Copy]] // this is WIP, ignore for now │ │ └ ... │ └, └body: declaration_list Commutative ├{ ├function_item Signature [[foo]] │ └ ... ├const_item Signature [[TRUE]] │ └ ... └} ``` Firstly, this can very well be marked as a commutative child, since impl blocks are reorderable. Secondly, if we do add them as a commutative child, what would their signature be? Theoretically, nothing stops one from having two impl blocks for the same type in the same file -- though normally one would either: - create two blocks only if they have different where-bounds (e.g. `HashMap::new` has no bounds, but `HashMap::insert` requires the key to be hashable, so you put them in different blocks) - create two blocks with the same (missing) bounds, and get a [Clippy warning](https://rust-lang.github.io/rust-clippy/master/index.html#multiple_inherent_impl) (though not by default apparently) So in order to detect _identical_ impl blocks, we would need to consider all of: - item that the impl is for - where bounds - all the items defined inside? this one sounds especially daunting - apparently, `impl Trait for Foo` is also an `impl_item`?.. In that case, we'd also need to look at the `type` field, where `Trait` is stored in this case. Although in that case, having two `impl X for Y` for the same (X,Y) _is_ disallowed
Owner

That's a tricky case indeed. I guess we could change the grammar so that it parses impl Trait for Foo as a different node type. Then, for that node type, we could define a signature which bundles up:

  • the trait
  • the type
  • the where_clause if any

And those all together should be unique in a given file, no?

For the general impl Foo { … } case, we could perhaps also just use vec![vec![]] as signature… (and maybe it's also a sensible temporary solution for impl Trait for Foo in the meantime)

That's a tricky case indeed. I guess we could change the grammar so that it parses `impl Trait for Foo` as a different node type. Then, for that node type, we could define a signature which bundles up: * the trait * the type * the where_clause if any And those all together should be unique in a given file, no? For the general `impl Foo { … }` case, we could perhaps also just use `vec![vec![]]` as signature… (and maybe it's also a sensible temporary solution for `impl Trait for Foo` in the meantime)
Author
Owner

I guess we could change the grammar so that it parses impl Trait for Foo as a different node type

I mean I understand why the grammar might not want to concern itself with that -- highlighting and stuff don't really change just because of the Trait for part. But if it would be possible to reuse "impl_body" in both impl_item and "impl_trait_item", then it could probably work I think. Though interestingly enough, not even the compiler distinguishes between the two, neither in the AST nor in the HIR.

All of this makes me think that maybe we should just restrict ourselves to vec![vec![]] after all..

> I guess we could change the grammar so that it parses impl Trait for Foo as a different node type I mean I understand why _the grammar_ might not want to concern itself with that -- highlighting and stuff don't really change just because of the `Trait for` part. But if it would be possible to reuse "`impl_body`" in both `impl_item` and "`impl_trait_item`", then it could probably work I think. Though interestingly enough, not even the compiler distinguishes between the two, neither [in the AST](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_ast/ast/struct.Impl.html) nor [in the HIR](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_hir/hir/struct.Impl.html). All of this makes me think that maybe we should just restrict ourselves to `vec![vec![]]` after all..
ada4a marked this conversation as resolved
ada4a force-pushed comm-parents-check-have-all-sigs from 26ca1386ac to fdf564ce76 2025-07-15 20:01:37 +02:00 Compare
@ -332,2 +344,4 @@
// source_file
signature("use_declaration", vec![vec![Field("argument")]]),
signature("extern_crate_declaration", vec![vec![Field("name")]]),
signature("foreign_mod_item", todo!()),
Author
Owner

this is similar to impl_item, in that one can have multiple extern "C" { ... } items in the source code, and the only real way to find out if they are identical is to go through all the functions defined inside the block (and the foreign ABI convention ("C" here) as well, of course)

this is similar to `impl_item`, in that one can have multiple `extern "C" { ... }` items in the source code, and the only real way to find out if they are identical is to go through all the functions defined inside the block (and the [foreign ABI convention](https://doc.rust-lang.org/nomicon/ffi.html#foreign-calling-conventions) (`"C"` here) as well, of course)
Owner

Then maybe use vec![vec![]] as signature? It means the whole node is treated as a signature, which would at least prevent isomorphic elements from being added.

Then maybe use `vec![vec![]]` as signature? It means the whole node is treated as a signature, which would at least prevent isomorphic elements from being added.
Author
Owner

makes sense!

makes sense!
ada4a marked this conversation as resolved
Author
Owner

The reason I push these not-quite-ready commits is so that we could have one discussion thread for each signature (since I feel like there are many--different--non-trivial cases) -- of course, I would've preferred to be able to open threads here directly...

The reason I push these not-quite-ready commits is so that we could have one discussion thread for each signature (since I feel like there are many--different--non-trivial cases) -- of course, I would've preferred to be able to open threads here directly...
@ -348,0 +362,4 @@
// where_clause
signature(
"where_predicate",
vec![vec![Field("left")], vec![Field("bounds")]],
Author
Owner

here's an example of its use:

the where clause in this:

fn foo<T>()
where
    T: Sized,
    T: Send,
{
}

gets parsed as this:

└where_clause Commutative
  ├where
  ├where_predicate
  │ ├left: identifier T
  │ └bounds: trait_bounds Commutative
  │   ├:
  │   └identifier Sized Signature [[Sized]]
  ├,
  ├where_predicate
  │ ├left: identifier T
  │ └bounds: trait_bounds Commutative
  │   ├:
  │   └identifier Send Signature [[Send]]
  └,

and we're asked to provide signatures for the where_predicates.

I initially went with the approach of left + (list of all identifiers in bound). But then I saw that use_declaration uses the simpler approach, where it creates the signature from just argument, which is everything that follows use. And so I thought where_predicate could do the same for the right half of its signature.

Although I'm not quite sure that the argument thing is safe to do in the first place? I think the following two use declarations:

use foo::{Bar, Baz};
use foo::{
    Bar,
    Baz,
};

would technically have different arguments (and thus different signatures) because of the different formatting? At least Ast::ascii_tree does show the signatures as with the formatting respected

here's an example of its use: the where clause in this: ```rs fn foo<T>() where T: Sized, T: Send, { } ``` gets parsed as this: ``` └where_clause Commutative ├where ├where_predicate │ ├left: identifier T │ └bounds: trait_bounds Commutative │ ├: │ └identifier Sized Signature [[Sized]] ├, ├where_predicate │ ├left: identifier T │ └bounds: trait_bounds Commutative │ ├: │ └identifier Send Signature [[Send]] └, ``` and we're asked to provide signatures for the `where_predicate`s. I initially went with the approach of `left` + (list of all `identifier`s in `bound`). But then I saw that `use_declaration` uses the simpler approach, where it creates the signature from just `argument`, which is everything that follows `use`. And so I thought `where_predicate` could do the same for the right half of its signature. Although I'm not _quite_ sure that the `argument` thing is safe to do in the first place? I think the following two use declarations: ```rs use foo::{Bar, Baz}; use foo::{ Bar, Baz, }; ``` would _technically_ have different `argument`s (and thus different signatures) because of the different formatting? At least `Ast::ascii_tree` does show the signatures as with the formatting respected
Owner

What you did looks good to me. It's true that it will fail to identify bounds that are slightly different, but it's the best we can do given the system we have.

Concerning your example about use statements above, the deciding factor that would make the signatures different is the trailing comma (the formatting differences shouldn't make a difference since nodes in signatures are treated up to isomorphism).

What you did looks good to me. It's true that it will fail to identify bounds that are slightly different, but it's the best we can do given the system we have. Concerning your example about `use` statements above, the deciding factor that would make the signatures different is the trailing comma (the formatting differences shouldn't make a difference since nodes in signatures are treated up to isomorphism).
ada4a marked this conversation as resolved
ada4a force-pushed comm-parents-check-have-all-sigs from fdf564ce76 to abc6d55858 2025-07-23 16:04:53 +02:00 Compare
ada4a force-pushed comm-parents-check-have-all-sigs from abc6d55858 to e7abb14911 2025-07-23 16:06:25 +02:00 Compare
ada4a force-pushed comm-parents-check-have-all-sigs from e7abb14911 to 0bb42380ca 2025-07-26 20:29:19 +02:00 Compare
ada4a force-pushed comm-parents-check-have-all-sigs from 0bb42380ca to a5098e3472 2025-07-26 20:32:42 +02:00 Compare
add the rest of Java signatures
Some checks failed
/ test (pull_request) Failing after 36s
45b8c682f2
WIP: add Kotlin/annotation
Some checks failed
/ test (pull_request) Failing after 34s
b63c22a5bd
@ -239,1 +282,4 @@
signature("delegation_specifier", vec![vec![]]),
// modifiers
signature(
"annotation",
Author
Owner

So the Kotlin grammar is a bit annoying -- it seems to only ever use ChildTypes, which in the particular case of annotation causes the following problem:

An annotation normally looks like this:

@Anno

but it can optionally include a use site target, for example:

@field:Anno

And the way the grammar defines this use_site_target is as follows: 3dea6dfa9c/grammar.js (L534-L537), so what we get from parsing is something like this:

annotation Signature [[Anno]]
 ├@
 ├use_site_target
 │ ├field
 │ └:
 └user_type
   └identifier Anno

so we'd need to add an alternative sigdef path for all possible variants of use_site_target (similar to #540)

Or, much more simply, change the sigdef to be vec![vec![]], since there aren't that many moving parts there anyway...

Well, not really, there's also this.. thing: https://kotlinlang.org/docs/annotations.html#arrays-as-annotation-parameters, so this:

@AnnWithArrayMethod(names = ["abc", "foo", "bar"])

gets parsed to this:

annotation Signature [[]]
 ├@
 └constructor_invocation
   ├user_type
   │ └identifier AnnWithArrayMethod
   └value_arguments
     ├(
     ├value_argument
     │ ├identifier names
     │ ├=
     │ └collection_literal
     │   ├[
     │   ├string_literal
     │   │ ├"
     │   │ ├string_content abc
     │   │ └"
     │   ├,
     │   ├string_literal
     │   │ ├"
     │   │ ├string_content foo
     │   │ └"
     │   ├,
     │   ├string_literal
     │   │ ├"
     │   │ ├string_content bar
     │   │ └"
     │   └]
     └)

and so doesn't even pass into the sigdef I currently have, but also wouldn't quite work with the naive vec![vec![]], because I think the parameters aren't important as far as the duplicate signature check goes?

Although it looks like none of this is important, because there's a meta-attribute that can allow an attribute to be used multiple times: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.annotation/-repeatable/

This is a bit too much, I'll go touch some grass for a bit..

So the Kotlin grammar is a bit annoying -- it seems to only ever use `ChildType`s, which in the particular case of `annotation` causes the following problem: An annotation normally looks like this: ```kt @Anno ``` but it can optionally include a [use site target](https://kotlinlang.org/docs/annotations.html#annotation-use-site-targets), for example: ```kt @field:Anno ``` And the way the grammar defines this `use_site_target` is as follows: https://github.com/tree-sitter-grammars/tree-sitter-kotlin/blob/3dea6dfa9c0129deb7c4315afbda806c85c41667/grammar.js#L534-L537, so what we get from parsing is something like this: ``` annotation Signature [[Anno]] ├@ ├use_site_target │ ├field │ └: └user_type └identifier Anno ``` so we'd need to add an alternative sigdef path for all possible variants of `use_site_target` (similar to #540) Or, much more simply, change the sigdef to be `vec![vec![]]`, since there aren't that many moving parts there anyway... Well, not really, there's also this.. thing: https://kotlinlang.org/docs/annotations.html#arrays-as-annotation-parameters, so this: ```kt @AnnWithArrayMethod(names = ["abc", "foo", "bar"]) ``` gets parsed to this: ``` annotation Signature [[]] ├@ └constructor_invocation ├user_type │ └identifier AnnWithArrayMethod └value_arguments ├( ├value_argument │ ├identifier names │ ├= │ └collection_literal │ ├[ │ ├string_literal │ │ ├" │ │ ├string_content abc │ │ └" │ ├, │ ├string_literal │ │ ├" │ │ ├string_content foo │ │ └" │ ├, │ ├string_literal │ │ ├" │ │ ├string_content bar │ │ └" │ └] └) ``` and so doesn't even pass into the sigdef I currently have, but also wouldn't quite work with the naive `vec![vec![]]`, because I think the parameters aren't important as far as the duplicate signature check goes? Although it looks like none of this is important, because there's a meta-attribute that can allow an attribute to be used multiple times: https://kotlinlang.org/api/core/kotlin-stdlib/kotlin.annotation/-repeatable/ This is a bit too much, I'll go touch some grass for a bit..
Author
Owner

decided to go with plain vec![vec![]], just like we do for annotations in Java

decided to go with plain `vec![vec![]]`, just like we do for annotations in Java
ada4a force-pushed comm-parents-check-have-all-sigs from b63c22a5bd to 0957e3a0a2 2025-08-01 23:36:11 +02:00 Compare
add Go/import_declaration
Some checks failed
/ test (pull_request) Failing after 31s
2bb1b592eb
Author
Owner

I would propose we merge what we've got so far (except the test of course), and then continue? Especially if you're planning to do a release soon-ish

I would propose we merge what we've got so far (except the test of course), and then continue? Especially if you're planning to do a release soon-ish
Owner

Makes sense!

Makes sense!
ada4a force-pushed comm-parents-check-have-all-sigs from 2bb1b592eb to e0ff006144 2025-08-14 19:37:26 +02:00 Compare
Author
Owner

Ok the extraction thing was a bit awkward.. Let's do this another way:

  1. in this PR, add #[ignore = "not all signatures are added yet"] to the test, and merge it
  2. add further signatures in separate PRs -- they can temporarily remove the #[ignore] for testing, but shouldn't include that change in the final version

Alternatively, we could do my favorite thing: a UI test! For that, we could:

  1. don't add the #[ignore] to the test
  2. instead, save the current list of node types with missing signatures to a file
  3. in the test, compare that list with the one we get from the test

This has the advantage that, when someone adds a new language, and doesn't add all the necessary signatures, they'll get a test failure that shows them which signatures are missing in particular.

In both cases, we might want to open a tracking issue for all the remaining signatures. There, for each language with missing signatures, we could tag the people who added that particular language, with the hope that it'd be easier for them to add those since they're more familiar with the particular language's semantics than.. me.

What do you think?

Ok the extraction thing was a bit awkward.. Let's do this another way: 1. in this PR, add `#[ignore = "not all signatures are added yet"]` to the test, and merge it 2. add further signatures in separate PRs -- they can temporarily remove the `#[ignore]` for testing, but shouldn't include that change in the final version Alternatively, we could do my favorite thing: a UI test! For that, we could: 1. don't add the `#[ignore]` to the test 2. instead, save the current list of node types with missing signatures to a file 3. in the test, compare that list with the one we get from the test This has the advantage that, when someone adds a new language, and doesn't add all the necessary signatures, they'll get a test failure that shows them which signatures are missing in particular. In both cases, we might want to open a tracking issue for all the remaining signatures. There, for each language with missing signatures, we could tag the people who added that particular language, with the hope that it'd be easier for them to add those since they're more familiar with the particular language's semantics than.. me. What do you think?
Owner

Sorry, I completely missed that last message…
Both solutions look good to me.
On the long term I would like to list all possible children of a commutative parent (not just the ones that are mentioned in the language profile) to do this check. But it's much more involved so we clearly shouldn't let that hold this PR back.

Sorry, I completely missed that last message… Both solutions look good to me. On the long term I would like to list all possible children of a commutative parent (not just the ones that are mentioned in the language profile) to do this check. But it's much more involved so we clearly shouldn't let that hold this PR back.
Some checks failed
/ test (pull_request) Failing after 39s
This pull request has changes conflicting with the target branch.
  • src/supported_langs.rs
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin comm-parents-check-have-all-sigs:comm-parents-check-have-all-sigs
git switch comm-parents-check-have-all-sigs
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: mergiraf/mergiraf#507
No description provided.