chore: Generate our own node ids instead of reusing tree-sitter's #393

Merged
wetneb merged 4 commits from wetneb/mergiraf:own_node_ids into main 2025-05-17 15:29:37 +02:00
Owner

This is in preparation for mixing up nodes coming from different tree-sitter trees (for #5), which could potentially share ids as they would come from different parsing calls.
This also has the benefit of making those node ids more predictable and visually shorter, which should generally help in debugging.

This is in preparation for mixing up nodes coming from different tree-sitter trees (for #5), which could potentially share ids as they would come from different parsing calls. This also has the benefit of making those node ids more predictable and visually shorter, which should generally help in debugging.
Owner

With this change in place, maybe we could expand this test a little?
5d4b5787c6/src/visualizer.rs (L200)

With this change in place, maybe we could expand this test a little? https://codeberg.org/mergiraf/mergiraf/src/commit/5d4b5787c68f6279c78333ccbff748179cae0c37/src/visualizer.rs#L200
wetneb force-pushed own_node_ids from 5d4b5787c6 to 0d190294a9 2025-05-16 14:52:19 +02:00 Compare
Add explicit contents in visualization test case
All checks were successful
/ test (pull_request) Successful in 1m1s
55e4d1b8ae
@ -179,3 +188,3 @@
field_name: None,
byte_range: start_position..start_position + trimmed.len(),
id: 2 * start_position + 1, // start_position is known to be unique among virtual lines
id: *next_node_id,
Owner

this looks like something we might accidentally mess up in a future refactoring... is it okay if I add a couple of tests for the whole id incrementing logic?

this looks like something we might accidentally mess up in a future refactoring... is it okay if I add a couple of tests for the whole id incrementing logic?
Author
Owner

Sure, of course!

Sure, of course!
ada4a marked this conversation as resolved
@ -199,2 +199,3 @@
fs::read_to_string(&target_path).expect("Could not read the generated graph.dot file");
assert!(contents.contains("subgraph ")); // yes, not a very great assertion… but node ids are all unstable!
let expected_contents = r##"graph matching {
Owner

the raw string thing is very handy!

the raw string thing is very handy!
ada4a marked this conversation as resolved
Owner

While working on the test, I noticed that the description of the id field is now outdated. Could you please update it?

While working on the test, I noticed that the description of the `id` field is now outdated. Could you please update it?
Update id field description
All checks were successful
/ test (pull_request) Successful in 58s
9bad90ccac
Owner

So I made a test, not sure that we'll use it exactly as it is now. But while working on it, I made this:

#[test]
fn node_ids_multiline_escape() {
    let ctx = ctx();

    let src = r#"let foo = "line 1\
    line 2
    line 3";"#;
    let tree = ctx.parse_rust(src);

    let root = tree.root();
    assert_eq!(root.id, 14);
    
    let let_declaration = root[0];
    assert_eq!(let_declaration.id, 13);
    
    let [let_kw, foo, eq, str_literal, semicolon] = let_declaration[..] else {
        unreachable!()
    };
    assert_eq!(let_kw.id, 1);
    assert_eq!(foo.id, 2);
    assert_eq!(eq.id, 3);
    assert_eq!(str_literal.id, 11);
    assert_eq!(semicolon.id, 12);

    let [quote_open, line1, newline1, lines2and3, quote_close] = str_literal[..] else {
        unreachable!()
    };
    assert_eq!(quote_open.id, 4);
    assert_eq!(line1.source, "line 1");
    assert_eq!(line1.id, 5);
    assert_eq!(newline1.source, "\\");
    assert_eq!(newline1.id, 6);
    assert_eq!(lines2and3.source, "line 2\nline 3");
    assert_eq!(lines2and3.id, 9);
    assert_eq!(quote_close.id, 10);

    let [line2, line3] = lines2and3[..] else {
        unreachable!()
    };
    assert_eq!(line2.source, "line 2");
    assert_eq!(line2.id, 7);
    assert_eq!(line3.source, "line 3");
    assert_eq!(line3.id, 8);
}

One can see that the resulting tree structure ended up being a bit weird, but the ASCII tree kind of shows why that is:

└let_declaration
  ├let
  ├pattern: identifier foo
  ├=
  ├value: string_literal
  │ ├"
  │ ├string_content line 1
  │ ├escape_sequence \
  │ ├string_content
  │ └"
  └;

so one level of nesting is created because the grammar splits the string on the escape sentence, and another we create ourselves, when splitting the second string_content, which corresponds to line 2\nline3, into multiple lines. This can be something to keep in mind when working on improved handling of multiline string literals, since even though we don't split line 1 into a separate line ourselves, it's still ends up effectively split away by the grammar.

That aside, here's the more normal-looking version of the test:

#[test]
fn node_ids_multiline() {
    let ctx = ctx();
    
    let src = r#"let foo = "line 1
line 2
line 3";"#;
    let tree = ctx.parse_rust(src);

    let root = tree.root();
    assert_eq!(root.id, 13);
    
    let let_declaration = root[0];
    assert_eq!(let_declaration.id, 12);

    let [let_kw, foo, eq, str_literal, semicolon] = let_declaration[..] else {
        unreachable!()
    };
    assert_eq!(let_kw.id, 1);
    assert_eq!(foo.id, 2);
    assert_eq!(eq.id, 3);
    assert_eq!(str_literal.id, 10);
    assert_eq!(semicolon.id, 11);

    let [quote_open, lines, quote_close] = str_literal[..] else {
        unreachable!()
    };
    assert_eq!(quote_open.id, 4);
    assert_eq!(lines.id, 8);
    assert_eq!(quote_close.id, 9);

    let [line1, line2, line3] = lines[..] else {
        unreachable!()
    };
    assert_eq!(line1.source, "line 1");
    assert_eq!(line1.id, 5);
    assert_eq!(line2.source, "line 2");
    assert_eq!(line2.id, 6);
    assert_eq!(line3.source, "line 3");
    assert_eq!(line3.id, 7);
}

As can be seen, removing the \ made the whole tree a bit more understandable.

The reason I wasn't sure we want to keep the test like is that the particular mapping of ids to nodes doesn't feel like something we want to guarantee. Since we're only really interested in the ids being unique, I think this much simpler test would be enough:

#[test]
fn node_ids_all_unique() {
    let ctx = ctx();

    let src = r#"let foo = "line 1
line 2
line 3";"#;
    let tree = ctx.parse_rust(src);

    let ids = tree.dfs().map(|n| n.id).collect_vec();
    let len = ids.len();
    assert!(ids.iter().all_unique());
    assert_eq!(*ids.iter().max().unwrap(), len);
}

the max thing would be just to ensure that we actually use all 13 (=len) available ids, not skipping anything -- but even that requirement is not strictly necessary

So I made _a_ test, not sure that we'll use it exactly as it is now. But while working on it, I made this: ```rs #[test] fn node_ids_multiline_escape() { let ctx = ctx(); let src = r#"let foo = "line 1\ line 2 line 3";"#; let tree = ctx.parse_rust(src); let root = tree.root(); assert_eq!(root.id, 14); let let_declaration = root[0]; assert_eq!(let_declaration.id, 13); let [let_kw, foo, eq, str_literal, semicolon] = let_declaration[..] else { unreachable!() }; assert_eq!(let_kw.id, 1); assert_eq!(foo.id, 2); assert_eq!(eq.id, 3); assert_eq!(str_literal.id, 11); assert_eq!(semicolon.id, 12); let [quote_open, line1, newline1, lines2and3, quote_close] = str_literal[..] else { unreachable!() }; assert_eq!(quote_open.id, 4); assert_eq!(line1.source, "line 1"); assert_eq!(line1.id, 5); assert_eq!(newline1.source, "\\"); assert_eq!(newline1.id, 6); assert_eq!(lines2and3.source, "line 2\nline 3"); assert_eq!(lines2and3.id, 9); assert_eq!(quote_close.id, 10); let [line2, line3] = lines2and3[..] else { unreachable!() }; assert_eq!(line2.source, "line 2"); assert_eq!(line2.id, 7); assert_eq!(line3.source, "line 3"); assert_eq!(line3.id, 8); } ``` One can see that the resulting tree structure ended up being a bit weird, but the ASCII tree kind of shows why that is: ``` └let_declaration ├let ├pattern: identifier foo ├= ├value: string_literal │ ├" │ ├string_content line 1 │ ├escape_sequence \ │ ├string_content │ └" └; ``` so one level of nesting is created because the grammar splits the string on the escape sentence, and another we create ourselves, when splitting the second `string_content`, which corresponds to `line 2\nline3`, into multiple lines. This can be something to keep in mind when working on improved handling of multiline string literals, since even though _we_ don't split `line 1` into a separate line ourselves, it's still ends up effectively split away by the grammar. That aside, here's the more normal-looking version of the test: ```rs #[test] fn node_ids_multiline() { let ctx = ctx(); let src = r#"let foo = "line 1 line 2 line 3";"#; let tree = ctx.parse_rust(src); let root = tree.root(); assert_eq!(root.id, 13); let let_declaration = root[0]; assert_eq!(let_declaration.id, 12); let [let_kw, foo, eq, str_literal, semicolon] = let_declaration[..] else { unreachable!() }; assert_eq!(let_kw.id, 1); assert_eq!(foo.id, 2); assert_eq!(eq.id, 3); assert_eq!(str_literal.id, 10); assert_eq!(semicolon.id, 11); let [quote_open, lines, quote_close] = str_literal[..] else { unreachable!() }; assert_eq!(quote_open.id, 4); assert_eq!(lines.id, 8); assert_eq!(quote_close.id, 9); let [line1, line2, line3] = lines[..] else { unreachable!() }; assert_eq!(line1.source, "line 1"); assert_eq!(line1.id, 5); assert_eq!(line2.source, "line 2"); assert_eq!(line2.id, 6); assert_eq!(line3.source, "line 3"); assert_eq!(line3.id, 7); } ``` As can be seen, removing the `\` made the whole tree a bit more understandable. The reason I wasn't sure we want to keep the test like is that the particular mapping of ids to nodes doesn't feel like something we want to guarantee. Since we're only really interested in the ids being unique, I think this much simpler test would be enough: ```rs #[test] fn node_ids_all_unique() { let ctx = ctx(); let src = r#"let foo = "line 1 line 2 line 3";"#; let tree = ctx.parse_rust(src); let ids = tree.dfs().map(|n| n.id).collect_vec(); let len = ids.len(); assert!(ids.iter().all_unique()); assert_eq!(*ids.iter().max().unwrap(), len); } ``` the `max` thing would be just to ensure that we actually use all 13 (=`len`) available ids, not skipping anything -- but even that requirement is not strictly necessary
Author
Owner

Great! My preference goes to the second test, because I agree what matters here is the uniqueness of the ids, not their particular values. I wouldn't mind having the other one too, as long as we feel empowered to change or remove it later on if needed.
Are you able to commit them on my branch already or should I do it?

Great! My preference goes to the second test, because I agree what matters here is the uniqueness of the ids, not their particular values. I wouldn't mind having the other one too, as long as we feel empowered to change or remove it later on if needed. Are you able to commit them on my branch already or should I do it?
Owner

Well not commit directly, but I'll open a PR, one sec

Well not commit directly, but I'll open a PR, one sec
add tests
All checks were successful
/ test (pull_request) Successful in 57s
1838ae8630
ada4a approved these changes 2025-05-17 15:05:50 +02:00
wetneb merged commit cd55cdae01 into main 2025-05-17 15:29:37 +02:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: mergiraf/mergiraf#393
No description provided.