chore: Generate our own node ids instead of reusing tree-sitter's #393

wetneb · 2025-05-16T07:26:23+02:00

wetneb commented

2025-05-16 07:26:23 +02:00

This is in preparation for mixing up nodes coming from different tree-sitter trees (for #5), which could potentially share ids as they would come from different parsing calls.
This also has the benefit of making those node ids more predictable and visually shorter, which should generally help in debugging.

This is in preparation for mixing up nodes coming from different tree-sitter trees (for #5), which could potentially share ids as they would come from different parsing calls. This also has the benefit of making those node ids more predictable and visually shorter, which should generally help in debugging.

ada4a commented

2025-05-16 08:00:37 +02:00

With this change in place, maybe we could expand this test a little?
5d4b5787c6/src/visualizer.rs (L200)

With this change in place, maybe we could expand this test a little? https://codeberg.org/mergiraf/mergiraf/src/commit/5d4b5787c68f6279c78333ccbff748179cae0c37/src/visualizer.rs#L200

👍 1

wetneb force-pushed own_node_ids from 5d4b5787c6 to 0d190294a9

2025-05-16 14:52:19 +02:00

Compare

wetneb added 1 commit

2025-05-16 14:52:29 +02:00

Add explicit contents in visualization test case

/ test (pull_request) Successful in 1m1s

Details

55e4d1b8ae

ada4a reviewed

2025-05-16 15:06:23 +02:00

src/ast.rs

					
				@ -179,3 +188,3 @@

				                    field_name: None,

				                    byte_range: start_position..start_position + trimmed.len(),

				                    id: 2 * start_position + 1, // start_position is known to be unique among virtual lines

				                    id: *next_node_id,

ada4a commented

2025-05-16 15:06:19 +02:00

this looks like something we might accidentally mess up in a future refactoring... is it okay if I add a couple of tests for the whole id incrementing logic?

wetneb commented

2025-05-16 22:15:22 +02:00

Sure, of course!

ada4a marked this conversation as resolved

src/visualizer.rs

					
				@ -199,2 +199,3 @@

				            fs::read_to_string(&target_path).expect("Could not read the generated graph.dot file");

				        assert!(contents.contains("subgraph ")); // yes, not a very great assertion… but node ids are all unstable!

				        let expected_contents = r##"graph matching {

ada4a commented

2025-05-16 15:03:32 +02:00

the raw string thing is very handy!

ada4a marked this conversation as resolved

ada4a commented

2025-05-17 13:00:09 +02:00

While working on the test, I noticed that the description of the id field is now outdated. Could you please update it?

While working on the test, I noticed that the description of the `id` field is now outdated. Could you please update it?

👍 1

wetneb added 1 commit

2025-05-17 13:43:08 +02:00

Update id field description

/ test (pull_request) Successful in 58s

Details

9bad90ccac

ada4a commented

2025-05-17 13:47:47 +02:00

So I made a test, not sure that we'll use it exactly as it is now. But while working on it, I made this:

#[test]
fn node_ids_multiline_escape() {
    let ctx = ctx();

    let src = r#"let foo = "line 1\
    line 2
    line 3";"#;
    let tree = ctx.parse_rust(src);

    let root = tree.root();
    assert_eq!(root.id, 14);
    
    let let_declaration = root[0];
    assert_eq!(let_declaration.id, 13);
    
    let [let_kw, foo, eq, str_literal, semicolon] = let_declaration[..] else {
        unreachable!()
    };
    assert_eq!(let_kw.id, 1);
    assert_eq!(foo.id, 2);
    assert_eq!(eq.id, 3);
    assert_eq!(str_literal.id, 11);
    assert_eq!(semicolon.id, 12);

    let [quote_open, line1, newline1, lines2and3, quote_close] = str_literal[..] else {
        unreachable!()
    };
    assert_eq!(quote_open.id, 4);
    assert_eq!(line1.source, "line 1");
    assert_eq!(line1.id, 5);
    assert_eq!(newline1.source, "\\");
    assert_eq!(newline1.id, 6);
    assert_eq!(lines2and3.source, "line 2\nline 3");
    assert_eq!(lines2and3.id, 9);
    assert_eq!(quote_close.id, 10);

    let [line2, line3] = lines2and3[..] else {
        unreachable!()
    };
    assert_eq!(line2.source, "line 2");
    assert_eq!(line2.id, 7);
    assert_eq!(line3.source, "line 3");
    assert_eq!(line3.id, 8);
}

One can see that the resulting tree structure ended up being a bit weird, but the ASCII tree kind of shows why that is:

└let_declaration
  ├let
  ├pattern: identifier foo
  ├=
  ├value: string_literal
  │ ├"
  │ ├string_content line 1
  │ ├escape_sequence \
  │ ├string_content
  │ └"
  └;

so one level of nesting is created because the grammar splits the string on the escape sentence, and another we create ourselves, when splitting the second string_content, which corresponds to line 2\nline3, into multiple lines. This can be something to keep in mind when working on improved handling of multiline string literals, since even though we don't split line 1 into a separate line ourselves, it's still ends up effectively split away by the grammar.

That aside, here's the more normal-looking version of the test:

#[test]
fn node_ids_multiline() {
    let ctx = ctx();
    
    let src = r#"let foo = "line 1
line 2
line 3";"#;
    let tree = ctx.parse_rust(src);

    let root = tree.root();
    assert_eq!(root.id, 13);
    
    let let_declaration = root[0];
    assert_eq!(let_declaration.id, 12);

    let [let_kw, foo, eq, str_literal, semicolon] = let_declaration[..] else {
        unreachable!()
    };
    assert_eq!(let_kw.id, 1);
    assert_eq!(foo.id, 2);
    assert_eq!(eq.id, 3);
    assert_eq!(str_literal.id, 10);
    assert_eq!(semicolon.id, 11);

    let [quote_open, lines, quote_close] = str_literal[..] else {
        unreachable!()
    };
    assert_eq!(quote_open.id, 4);
    assert_eq!(lines.id, 8);
    assert_eq!(quote_close.id, 9);

    let [line1, line2, line3] = lines[..] else {
        unreachable!()
    };
    assert_eq!(line1.source, "line 1");
    assert_eq!(line1.id, 5);
    assert_eq!(line2.source, "line 2");
    assert_eq!(line2.id, 6);
    assert_eq!(line3.source, "line 3");
    assert_eq!(line3.id, 7);
}

As can be seen, removing the \ made the whole tree a bit more understandable.

The reason I wasn't sure we want to keep the test like is that the particular mapping of ids to nodes doesn't feel like something we want to guarantee. Since we're only really interested in the ids being unique, I think this much simpler test would be enough:

#[test]
fn node_ids_all_unique() {
    let ctx = ctx();

    let src = r#"let foo = "line 1
line 2
line 3";"#;
    let tree = ctx.parse_rust(src);

    let ids = tree.dfs().map(|n| n.id).collect_vec();
    let len = ids.len();
    assert!(ids.iter().all_unique());
    assert_eq!(*ids.iter().max().unwrap(), len);
}

the max thing would be just to ensure that we actually use all 13 (=len) available ids, not skipping anything -- but even that requirement is not strictly necessary

So I made _a_ test, not sure that we'll use it exactly as it is now. But while working on it, I made this: ```rs #[test] fn node_ids_multiline_escape() { let ctx = ctx(); let src = r#"let foo = "line 1\ line 2 line 3";"#; let tree = ctx.parse_rust(src); let root = tree.root(); assert_eq!(root.id, 14); let let_declaration = root[0]; assert_eq!(let_declaration.id, 13); let [let_kw, foo, eq, str_literal, semicolon] = let_declaration[..] else { unreachable!() }; assert_eq!(let_kw.id, 1); assert_eq!(foo.id, 2); assert_eq!(eq.id, 3); assert_eq!(str_literal.id, 11); assert_eq!(semicolon.id, 12); let [quote_open, line1, newline1, lines2and3, quote_close] = str_literal[..] else { unreachable!() }; assert_eq!(quote_open.id, 4); assert_eq!(line1.source, "line 1"); assert_eq!(line1.id, 5); assert_eq!(newline1.source, "\\"); assert_eq!(newline1.id, 6); assert_eq!(lines2and3.source, "line 2\nline 3"); assert_eq!(lines2and3.id, 9); assert_eq!(quote_close.id, 10); let [line2, line3] = lines2and3[..] else { unreachable!() }; assert_eq!(line2.source, "line 2"); assert_eq!(line2.id, 7); assert_eq!(line3.source, "line 3"); assert_eq!(line3.id, 8); } ``` One can see that the resulting tree structure ended up being a bit weird, but the ASCII tree kind of shows why that is: ``` └let_declaration ├let ├pattern: identifier foo ├= ├value: string_literal │ ├" │ ├string_content line 1 │ ├escape_sequence \ │ ├string_content │ └" └; ``` so one level of nesting is created because the grammar splits the string on the escape sentence, and another we create ourselves, when splitting the second `string_content`, which corresponds to `line 2\nline3`, into multiple lines. This can be something to keep in mind when working on improved handling of multiline string literals, since even though _we_ don't split `line 1` into a separate line ourselves, it's still ends up effectively split away by the grammar. That aside, here's the more normal-looking version of the test: ```rs #[test] fn node_ids_multiline() { let ctx = ctx(); let src = r#"let foo = "line 1 line 2 line 3";"#; let tree = ctx.parse_rust(src); let root = tree.root(); assert_eq!(root.id, 13); let let_declaration = root[0]; assert_eq!(let_declaration.id, 12); let [let_kw, foo, eq, str_literal, semicolon] = let_declaration[..] else { unreachable!() }; assert_eq!(let_kw.id, 1); assert_eq!(foo.id, 2); assert_eq!(eq.id, 3); assert_eq!(str_literal.id, 10); assert_eq!(semicolon.id, 11); let [quote_open, lines, quote_close] = str_literal[..] else { unreachable!() }; assert_eq!(quote_open.id, 4); assert_eq!(lines.id, 8); assert_eq!(quote_close.id, 9); let [line1, line2, line3] = lines[..] else { unreachable!() }; assert_eq!(line1.source, "line 1"); assert_eq!(line1.id, 5); assert_eq!(line2.source, "line 2"); assert_eq!(line2.id, 6); assert_eq!(line3.source, "line 3"); assert_eq!(line3.id, 7); } ``` As can be seen, removing the `\` made the whole tree a bit more understandable. The reason I wasn't sure we want to keep the test like is that the particular mapping of ids to nodes doesn't feel like something we want to guarantee. Since we're only really interested in the ids being unique, I think this much simpler test would be enough: ```rs #[test] fn node_ids_all_unique() { let ctx = ctx(); let src = r#"let foo = "line 1 line 2 line 3";"#; let tree = ctx.parse_rust(src); let ids = tree.dfs().map(|n| n.id).collect_vec(); let len = ids.len(); assert!(ids.iter().all_unique()); assert_eq!(*ids.iter().max().unwrap(), len); } ``` the `max` thing would be just to ensure that we actually use all 13 (=`len`) available ids, not skipping anything -- but even that requirement is not strictly necessary

wetneb commented

2025-05-17 13:50:55 +02:00

Great! My preference goes to the second test, because I agree what matters here is the uniqueness of the ids, not their particular values. I wouldn't mind having the other one too, as long as we feel empowered to change or remove it later on if needed.
Are you able to commit them on my branch already or should I do it?

Great! My preference goes to the second test, because I agree what matters here is the uniqueness of the ids, not their particular values. I wouldn't mind having the other one too, as long as we feel empowered to change or remove it later on if needed. Are you able to commit them on my branch already or should I do it?

ada4a commented

2025-05-17 13:57:36 +02:00

Well not commit directly, but I'll open a PR, one sec

wetneb added 1 commit

2025-05-17 14:22:38 +02:00

add tests

/ test (pull_request) Successful in 57s

Details

1838ae8630

ada4a approved these changes

2025-05-17 15:05:50 +02:00

wetneb merged commit cd55cdae01 into main