fix: Check that the merged text is syntactically valid and consistent with the merged tree #368

wetneb · 2025-05-06T09:49:04+02:00

wetneb commented

2025-05-06 09:49:04 +02:00

Fixes #320.

Instead of just parsing the text obtained in the final rendered merge, this goes a step further and checks that the corresponding trees are "isomorphic" to the merged tree.

The isomorphism check is rather optimistic, as it gives up on sections of the merged tree that are obtained using line-based merges. To be able to also check the validity of those, we'd need to be able to parse snippets of text with a given node type as target, which is an open feature request in tree-sitter: https://github.com/tree-sitter/tree-sitter/issues/711.
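To make "optimistic" concrete, here is a minimal sketch of such a check on toy types. `Node`, `Merged` and `leaf` are hypothetical stand-ins, not mergiraf's actual `AstNode` or `MergedTree`, and the exact rule for line-based regions is an assumption made for illustration.

```rust
/// Toy stand-in for a parsed syntax tree node (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub kind: &'static str,
    pub children: Vec<Node>,
}

/// Toy stand-in for a merged tree (hypothetical type).
#[derive(Debug, Clone)]
pub enum Merged {
    /// A subtree taken verbatim from one revision.
    Exact(Node),
    /// A node whose children were merged individually.
    Mixed {
        kind: &'static str,
        children: Vec<Merged>,
    },
    /// A region produced by a line-based merge: its structure is unknown.
    LineBased,
}

impl Merged {
    /// Optimistic check: returns `false` only when a mismatch is observed.
    pub fn isomorphic_to(&self, other: &Node) -> bool {
        match self {
            Merged::Exact(node) => node == other,
            // Assumption: give up on line-based regions and accept them.
            Merged::LineBased => true,
            Merged::Mixed { kind, children } => {
                if *kind != other.kind {
                    return false;
                }
                // A line-based child may expand to any number of nodes,
                // so the children cannot be aligned: accept optimistically.
                if children.iter().any(|c| matches!(c, Merged::LineBased)) {
                    return true;
                }
                children.len() == other.children.len()
                    && children
                        .iter()
                        .zip(&other.children)
                        .all(|(merged, parsed)| merged.isomorphic_to(parsed))
            }
        }
    }
}

/// Convenience constructor for a childless node.
pub fn leaf(kind: &'static str) -> Node {
    Node { kind, children: vec![] }
}
```

In this sketch a merged node with a line-based child is accepted whatever the re-parsed tree looks like below it, which is the gap that parsing a snippet with a target node type would close.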


wetneb added 6 commits

2025-05-06 09:49:05 +02:00

MergedText::reconstruct_revision 90ba1c2a7f

Introduce MergeResult::from_merged_text 343c235eb4

Refactor solve.rs to use from_parsed_merge f418fd2631

Count conflicts on merged text 5196248951

Check isomorphism of rendered merge and merge tree dfaf33803e

Add integration test c99df7031c

/ test (pull_request) Successful in 59s

wetneb referenced this pull request

2025-05-06 09:49:28 +02:00

WIP: fix: Check the syntactic validity of merges #321

ada4a requested changes

2025-05-06 20:43:38 +02:00

Dismissed

ada4a left a comment

mostly stylistic nits

I like the idea of having a README, and linking to the corresponding issue:)


src/merged_text.rs

					
				@ -138,0 +166,4 @@

				            .iter()

				            .map(|section| match section {

				                MergeSection::Merged(contents) => contents.as_ref(),

				                MergeSection::Conflict { left, base, right } => match revision {

ada4a commented

2025-05-06 19:59:40 +02:00

misc: I feel like we have this logic of "get a particular revision from a conflict" repeated a lot of times throughout the codebase. I wonder if it would be worth it to add a struct called Conflict which would look something like this:

struct Conflict<T> {
    base: T,
    left: T,
    right: T,
}

impl<T> Conflict<T> {
    fn at(&self, rev: Revision) -> &T {
        match rev {
            Revision::Base => &self.base,
            Revision::Left => &self.left,
            Revision::Right => &self.right,
        }
    }
}

and use it inside all the different enums that have Conflict as a variant:

enum Tree {
    /* other variants */
    Conflict(Conflict)
}

One argument against doing this would be that sometimes we want to do something with all sides, not just one.
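A possible way to cover the "all sides" case as well, sketched under assumptions: the `map` method and the `Section` enum below are hypothetical additions for illustration, not existing mergiraf code. The variant is written with its type parameter spelled out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Revision {
    Base,
    Left,
    Right,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Conflict<T> {
    pub base: T,
    pub left: T,
    pub right: T,
}

impl<T> Conflict<T> {
    /// Access a single side of the conflict.
    pub fn at(&self, rev: Revision) -> &T {
        match rev {
            Revision::Base => &self.base,
            Revision::Left => &self.left,
            Revision::Right => &self.right,
        }
    }

    /// Hypothetical helper: apply `f` to all three sides at once.
    pub fn map<U>(self, mut f: impl FnMut(T) -> U) -> Conflict<U> {
        Conflict {
            base: f(self.base),
            left: f(self.left),
            right: f(self.right),
        }
    }
}

/// Hypothetical enum embedding the struct in its conflict variant.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Section {
    Merged(String),
    Conflict(Conflict<String>),
}

impl Section {
    /// Render this section as it appears in the given revision.
    pub fn at(&self, rev: Revision) -> &str {
        match self {
            Section::Merged(contents) => contents,
            Section::Conflict(conflict) => conflict.at(rev),
        }
    }
}
```

With something like `map`, code that needs every side can transform the conflict as a whole, while single-revision lookups go through `at`.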


wetneb commented

2025-05-07 09:23:49 +02:00

I had a quick look at the rest of the codebase and there are indeed a couple of other places where we do similar things. I wouldn't be opposed to such a refactoring, even though I'm not 100% sure if it will make the code more readable / maintainable (the Conflict(Conflict) definition is perhaps a bit confusing). Either way, perhaps it's worth doing it as a separate follow-up PR, no?


ada4a commented

2025-05-07 10:26:55 +02:00

> Either way, perhaps it's worth doing it as a separate follow-up PR, no?

Of course. I just wanted to note that down for myself so that I don't forget


ada4a marked this conversation as resolved

src/merged_tree.rs Outdated

					
				@ -278,0 +297,4 @@

				                ast_node.isomorphic_to(other_node)

				            }

				            MergedTree::MixedTree { node, children, .. } => {

				                node.grammar_name() == other_node.grammar_name && {

ada4a commented

2025-05-06 20:35:44 +02:00

I have a couple of remarks for this part of code, which are somewhat hard to isolate, so here they all are:

1. Wrapping the whole second part in a block creates unnecessary indentation. What I'd do instead is early-return if the grammar names don't match.
2. For contains_line_based_merge, I'd say there's no reason to intermingle its calculation with that of children_at_rev. Something like this should be clearer:

       // (the big comment)
       let contains_line_based_merge = children
           .iter()
           .any(|c| matches!(c, MergedTree::LineBasedMerge { .. }));

   Then one could early-return directly there.
3. You collect the iterator, I guess in order to be able to short-circuit on mismatched lengths. I feel like the length comparison would make sense if we had the two vecs from the beginning, but here we first allocate everything just to drop it directly afterwards (if the lengths do end up being different). I think it would be more efficient to just compare the iterators directly.


wetneb commented

2025-05-07 09:30:24 +02:00

Absolutely! About point 3, I've been looking for a more elegant way of doing it. If we just keep them as iterators, then the iterator will need to be exhausted twice to first compare the length and then iterate again through it, no?

ada4a commented

2025-05-07 10:25:39 +02:00

Exactly, that's why I'd suggest removing the length comparison entirely – it'd make sense if we had two vecs (so that we could iterate multiple times over the same vec), but since we don't, running the iterator twice would actually run all the computations in it twice, which is obviously a waste.

wetneb commented

2025-05-07 11:25:34 +02:00

Ok but then we need to replace zip by zip_longest from itertools right? And arguably in some cases it might be more efficient to just abort the isomorphism check earlier as soon as the lengths differ. But I guess our expectation is that isomorphism checks generally succeed, so maybe it's not a big concern.


ada4a commented

2025-05-07 11:32:24 +02:00

Ohh, that's why you wanted to check the lengths! Then yes, something like zip_longest is definitely needed imo. Or maybe just collect after all...
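The trade-off discussed here can be sketched with the standard library alone. Plain `zip` stops silently at the shorter iterator, so a length mismatch would go unnoticed; the hypothetical helper `all_pairs` below (not part of mergiraf or itertools) walks both iterators in lockstep, fails as soon as one ends before the other, and never collects into a `Vec`. `zip_longest` from `itertools` would play a similar role by yielding an `EitherOrBoth` value for each step.

```rust
/// Returns `true` if both inputs yield the same number of items and
/// `pred` holds for every aligned pair. Stops at the first failure,
/// so each iterator is advanced at most once per item.
pub fn all_pairs<A, B>(
    a: A,
    b: B,
    mut pred: impl FnMut(A::Item, B::Item) -> bool,
) -> bool
where
    A: IntoIterator,
    B: IntoIterator,
{
    let (mut a, mut b) = (a.into_iter(), b.into_iter());
    loop {
        match (a.next(), b.next()) {
            // Both sides still have an item: compare them.
            (Some(x), Some(y)) => {
                if !pred(x, y) {
                    return false;
                }
            }
            // Both ended at the same time: lengths match.
            (None, None) => return true,
            // One side ended early: lengths differ.
            _ => return false,
        }
    }
}
```

The cost compared with collecting first is that a length mismatch is only detected after all the earlier pairs have been compared, which matters little if the checks are expected to succeed most of the time.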


src/merged_tree.rs Outdated

					
				@ -278,0 +322,4 @@

				                                            Revision::Left => left,

				                                            Revision::Right => right,

				                                        };

				                                        nodes

ada4a commented

2025-05-06 20:13:11 +02:00

this could be replaced with nodes.iter().copied().map(MergedChild::Original).collect(), which even fits on one line!


ada4a marked this conversation as resolved

src/merged_tree.rs Outdated

					
				@ -278,0 +341,4 @@

				                                    ) => !separator.trim().is_empty(),

				                                    MergedChild::Merged(MergedTree::MixedTree {

				                                        children, ..

				                                    }) if children.is_empty() => false,

ada4a commented

2025-05-06 20:39:39 +02:00

I think it's more consistent to do !children.is_empty() here. One could alternatively pull the other arm's thing into the match arm, but this style feels better to me


ada4a marked this conversation as resolved

src/structured.rs Outdated

					
				@ -112,0 +116,4 @@

				            lang_profile,

				            &arena,

				            &ref_arena,

				        )?;

ada4a commented

2025-05-06 20:41:48 +02:00

I think it'd be good to give a bit of context to this `?`. What does it mean for us to fail to parse the tree in this case?


ada4a commented

2025-05-07 10:29:50 +02:00

What I had in mind is map_err, but a comment works as well, sure! Thanks:)


wetneb commented

2025-05-07 11:24:06 +02:00

Ah, map_err does sound like a better idea, in case we rely on the downstream error later on.
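For illustration, attaching context with `map_err` before propagating with `?` could look like the sketch below. The toy `parse` function, the `check_rendered_revision` wrapper and the error message are all hypothetical; the actual parsing call and error type in `src/structured.rs` are not shown in this thread.

```rust
/// Toy "parser" (hypothetical): succeeds only if braces are balanced,
/// returning the length of the source as a stand-in for a tree.
fn parse(source: &str) -> Result<usize, String> {
    let mut depth = 0i32;
    for c in source.chars() {
        match c {
            '{' => depth += 1,
            '}' => depth -= 1,
            _ => {}
        }
        if depth < 0 {
            return Err("unexpected '}'".to_owned());
        }
    }
    if depth == 0 {
        Ok(source.len())
    } else {
        Err("missing '}'".to_owned())
    }
}

/// Re-parse one rendered revision, saying which revision failed and why,
/// instead of forwarding the bare parser error with a plain `?`.
pub fn check_rendered_revision(revision: &str, source: &str) -> Result<usize, String> {
    let tree = parse(source).map_err(|err| {
        format!("rendered {} revision does not parse: {}", revision, err)
    })?;
    Ok(tree)
}
```

The caller then sees which step of the merge failed, and the original parser message stays available at the end of the string for anything that relies on it.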


ada4a marked this conversation as resolved

src/structured.rs Outdated

					
				@ -112,0 +130,4 @@

				    } else {

				        STRUCTURED_RESOLUTION_METHOD

				    };

				    Ok(MergeResult::from_merged_text(

ada4a commented

2025-05-06 19:48:33 +02:00

I feel like the logic flow would be better if from_merged_text were replaced with into_merge_result, similar to result_tree.to_merged_text above. I guess you went with this option because of the already-existing MergeResult::from_parsed_merge, so maybe you could add MergedText::into_merge_result that just calls from_merged_text internally. Not really important though.


wetneb commented

2025-05-07 10:04:32 +02:00

If you prefer into to from, then I'm happy to just remove the from methods and replace them by into versions of them. Or do you see a reason to keep both?


ada4a commented

2025-05-07 10:22:55 +02:00

Sure, why not!
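A rough sketch of the two shapes under discussion, on toy types: the fields of `MergedText` and `MergeResult` below are invented for the example and do not reflect mergiraf's real definitions. The `into_` method here consumes `self`, which is the usual Rust naming convention; the thread does not say which receiver the final code uses.

```rust
/// Toy merged text (hypothetical fields).
#[derive(Debug, Clone, PartialEq)]
pub struct MergedText {
    pub contents: String,
    pub conflict_count: usize,
}

/// Toy merge result (hypothetical fields).
#[derive(Debug, Clone, PartialEq)]
pub struct MergeResult {
    pub contents: String,
    pub conflict_count: usize,
    pub method: &'static str,
}

impl MergeResult {
    /// Constructor style: the call site reads "result from text".
    pub fn from_merged_text(text: MergedText, method: &'static str) -> Self {
        MergeResult {
            contents: text.contents,
            conflict_count: text.conflict_count,
            method,
        }
    }
}

impl MergedText {
    /// Conversion style: chains after the step that produced the text.
    /// In this sketch it simply delegates to the constructor above.
    pub fn into_merge_result(self, method: &'static str) -> MergeResult {
        MergeResult::from_merged_text(self, method)
    }
}
```

Keeping only the `into_` form, as agreed above, removes the delegation and leaves a single entry point that reads left to right at the call site.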

wetneb added 6 commits

2025-05-07 10:09:52 +02:00

early return when grammar names don't match 03a7448fad

Check for presence of line-based merges independently 9ffbbc2605

More compact mapping to MergedChild::Original a26f84bfd4

move children.is_empty() 759785daba

Add comment about failing the merge if a revision doesn't parse bd8b5ee619

Remove unused variant LineBased bf800d18ac

/ test (pull_request) Successful in 1m0s

wetneb added 1 commit

2025-05-07 11:26:45 +02:00

Avoid collecting an intermediate iterator de12ba00bc

/ test (pull_request) Successful in 1m0s

wetneb added 1 commit

2025-05-07 18:44:52 +02:00

ParsedMerge::from… converted to 'into_parsed_merge' c688809813

/ test (pull_request) Failing after 50s

wetneb force-pushed 320-v3 from 70b9f94195 to 99b840cb75

2025-05-07 21:27:50 +02:00


wetneb added 1 commit

2025-05-08 06:08:03 +02:00

Allow 'into' methods to take self by reference eb7a3aa3cf

/ test (pull_request) Successful in 1m4s

ada4a reviewed

2025-05-08 09:49:59 +02:00

src/ast.rs Outdated

					
				@ -669,9 +669,11 @@ impl<'a> AstNode<'a> {

				            .then(|| format!(" {}", Color::Red.paint(self.source.replace('\n', "\\n"))))

				            .unwrap_or_default();

				        let commutative = (next_parent.is_some())