fix: Check that the merged text is syntactically valid and consistent with the merged tree #368

Merged
wetneb merged 16 commits from wetneb/mergiraf:320-v3 into main 2025-05-10 03:02:36 +02:00
Owner

Fixes #320.

Instead of just parsing the text obtained in the final rendered merge, this goes a step further and checks that the corresponding trees are "isomorphic" to the merged tree.

The isomorphism check is rather optimistic, as it gives up on sections of the merged tree that are obtained using line-based merges. To be able to also check the validity of those, we'd need to be able to parse snippets of text with a given node type as target, which is an open feature request in tree-sitter: https://github.com/tree-sitter/tree-sitter/issues/711.

Fixes #320. Instead of just parsing the text obtained in the final rendered merge, this goes a step further and checks that the corresponding trees are "isomorphic" to the merged tree. The isomorphism check is rather optimistic, as it gives up on sections of the merged tree that are obtained using line-based merges. To be able to also check the validity of those, we'd need to be able to parse snippets of text with a given node type as target, which is an open feature request in tree-sitter: https://github.com/tree-sitter/tree-sitter/issues/711.
ada4a requested changes 2025-05-06 20:43:38 +02:00
Dismissed
ada4a left a comment
Owner

mostly stylistic nits

I like the idea of having a README, and linking to the corresponding issue:)

mostly stylistic nits I like the idea of having a README, and linking to the corresponding issue:)
@ -138,0 +166,4 @@
.iter()
.map(|section| match section {
MergeSection::Merged(contents) => contents.as_ref(),
MergeSection::Conflict { left, base, right } => match revision {
Owner

misc: I feel like we have this logic of "get a particular revision from a conflict" repeated a lot of times throughout the codebase. I wonder if it would be worth it to add a struct called Conflict which would look something like this:

struct Conflict<T> {
    base: T,
    left: T,
    right: T,
}

impl<T> Conflict<T> {
    fn at(&self, rev: Revision) -> &T {
        match rev {
            Revision::Base => &self.base,
            Revision::Left => &self.left,
            Revision::Right => &self.right,
        }
    }
}

and use it inside all the different enums that have Conflict as a variant:

enum Tree {
    /* other variants */
    Conflict(Conflict)
}

One argument against doing this would be that sometimes we want to do something will all sides, not just one.

misc: I feel like we have this logic of "get a particular revision from a conflict" repeated a lot of times throughout the codebase. I wonder if it would be worth it to add a struct called `Conflict` which would look something like this: ```rust struct Conflict<T> { base: T, left: T, right: T, } impl<T> Conflict<T> { fn at(&self, rev: Revision) -> &T { match rev { Revision::Base => &self.base, Revision::Left => &self.left, Revision::Right => &self.right, } } } ``` and use it inside all the different enums that have `Conflict` as a variant: ```rust enum Tree { /* other variants */ Conflict(Conflict) } ``` One argument against doing this would be that sometimes we want to do something will _all_ sides, not just one.
Author
Owner

I had a quick look at the rest of the codebase and there are indeed a couple of other places where we do similar things. I wouldn't be opposed to such a refactoring, even though I'm not 100% sure if it will make the code more readable / maintainable (the Conflict(Conflict) definition is perhaps a bit confusing). Either way, perhaps it's worth doing it as a separate follow-up PR, no?

I had a quick look at the rest of the codebase and there are indeed a couple of other places where we do similar things. I wouldn't be opposed to such a refactoring, even though I'm not 100% sure if it will make the code more readable / maintainable (the `Conflict(Conflict)` definition is perhaps a bit confusing). Either way, perhaps it's worth doing it as a separate follow-up PR, no?
Owner

Either way, perhaps it's worth doing it as a separate follow-up PR, no?

Of course. I just wanted to note that down for myself so that I don't forget

> Either way, perhaps it's worth doing it as a separate follow-up PR, no? Of course. I just wanted to note that down for myself so that I don't forget
ada4a marked this conversation as resolved
@ -278,0 +297,4 @@
ast_node.isomorphic_to(other_node)
}
MergedTree::MixedTree { node, children, .. } => {
node.grammar_name() == other_node.grammar_name && {
Owner

I have a couple of remarks for this part of code, which are somewhat hard to isolate, so here they all are:

  1. wrapping the whole second part in a block creates unnecessary indentation. What I'd do instead is early-return if the grammar names don't match
  2. for contains_line_based_merge, I'd say there's no reason to intermingle its calculation with that of children_at_rev -- something like this should be more clear
// (the big comment)
let contains_line_based_merge = children
    .iter()
    .any(|c| matches!(c, MergedTree::LineBaseMerge { .. }));

Then one could early-return directly there.
3. You collect the iterator, I guess in order to be able to short-circuit on mismatched lengths. I feel like that the length-comparison would make sense if we had the two vecs from the beginning -- but here we first allocate everything just to drop directly afterwards (if the lengths do end up being different). I think it would be more efficient to just compare the iterators directly

I have a couple of remarks for this part of code, which are somewhat hard to isolate, so here they all are: 1. wrapping the whole second part in a block creates unnecessary indentation. What I'd do instead is early-return if the grammar names _don't_ match 2. for `contains_line_based_merge`, I'd say there's no reason to intermingle its calculation with that of `children_at_rev` -- something like this should be more clear ```rs // (the big comment) let contains_line_based_merge = children .iter() .any(|c| matches!(c, MergedTree::LineBaseMerge { .. })); ``` Then one could early-return directly there. 3. You `collect` the iterator, I guess in order to be able to short-circuit on mismatched lengths. I feel like that the length-comparison would make sense if we had the two vecs from the beginning -- but here we first allocate everything just to drop directly afterwards (if the lengths do end up being different). I think it would be more efficient to just compare the iterators directly
Author
Owner

Absolutely! About point 3, I've been looking for a more elegant way of doing it. If we just keep them as iterators, then the iterator will need to be exhausted twice to first compare the length and then iterate again through it, no?

Absolutely! About point 3, I've been looking for a more elegant way of doing it. If we just keep them as iterators, then the iterator will need to be exhausted twice to first compare the length and then iterate again through it, no?
Owner

Exactly, that's why I'd suggest removing the length comparison entirely – it'd make sense if we had two vecs (so that we could iterate multiple times over the same vec), but since we don't, running the iterator twice would actually run all the computations in it twice, which is obviously a waste.

Exactly, that's why I'd suggest removing the length comparison entirely – it'd make sense if we had two vecs (so that we could iterate multiple times over the same vec), but since we don't, running the iterator twice would actually run all the computations in it twice, which is obviously a waste.
Author
Owner

Ok but then we need to replace zip by zip_longest from itertools right? And arguably in some cases it might be more efficient to just abort the isomorphism check earlier as soon as the lengths differ. But I guess our expectation is that isomorphism checks generally succeed, so maybe it's not a big concern.

Ok but then we need to replace `zip` by `zip_longest` from `itertools` right? And arguably in some cases it might be more efficient to just abort the isomorphism check earlier as soon as the lengths differ. But I guess our expectation is that isomorphism checks generally succeed, so maybe it's not a big concern.
Owner

Ohh, that's why you wanted to check the lengths! Then yes, something like zip_longest is definitely needed imo. Or maybe just collect after all...

Ohh, _that's_ why you wanted to check the lengths! Then yes, something like `zip_longest` is definitely needed imo. Or maybe just `collect` after all...
@ -278,0 +322,4 @@
Revision::Left => left,
Revision::Right => right,
};
nodes
Owner

this could be replaced with nodes.iter().copied().map(MergedChild::Original).collect(), which even fits on one line!

this could be replaced with `nodes.iter().copied().map(MergedChild::Original).collect()`, which even fits on one line!
ada4a marked this conversation as resolved
@ -278,0 +341,4 @@
) => !separator.trim().is_empty(),
MergedChild::Merged(MergedTree::MixedTree {
children, ..
}) if children.is_empty() => false,
Owner

I think it's more consistent to do !children.is_empty() here. One could alternatively pull the other arm's thing into the match arm, but this style feels better to me

I think it's more consistent to do `!children.is_empty()` here. One could alternatively pull the other arm's thing into the match arm, but this style feels better to me
ada4a marked this conversation as resolved
@ -112,0 +116,4 @@
lang_profile,
&arena,
&ref_arena,
)?;
Owner

I think it'd be good to give a bit of a context to this ?? What does it mean for us to fail to parse the tree in this case?

I think it'd be good to give a bit of a context to this `?`? What does it mean for us to fail to parse the tree in this case?
Owner

What I had in mind is map_err, but a comment works as well, sure! Thanks:)

What I had in mind is `map_err`, but a comment works as well, sure! Thanks:)
Author
Owner

Ah, map_err does sound like a better idea, in case we rely on the downstream error later on.

Ah, `map_err` does sound like a better idea, in case we rely on the downstream error later on.
ada4a marked this conversation as resolved
@ -112,0 +130,4 @@
} else {
STRUCTURED_RESOLUTION_METHOD
};
Ok(MergeResult::from_merged_text(
Owner

I feel like the logic flow would be better if from_merged_text would be replaced with into_merge_result, similar to result_tree.to_merged_text above. I guess you've went with this option because of the already-existing MergeResult::from_parsed_merge, so maybe you could add MergedText::into_merge_result that just called from_merged_text internally. Not really important though

I feel like the logic flow would be better if `from_merged_text` would be replaced with `into_merge_result`, similar to `result_tree.to_merged_text` above. I guess you've went with this option because of the already-existing `MergeResult::from_parsed_merge`, so maybe you could add `MergedText::into_merge_result` that just called `from_merged_text` internally. Not really important though
Author
Owner

If you prefer into to from, then I'm happy to just remove the from methods and replace them by into versions of them. Or do you see a reason to keep both?

If you prefer `into` to `from`, then I'm happy to just remove the `from` methods and replace them by `into` versions of them. Or do you see a reason to keep both?
Owner

Sure, why not!

Sure, why not!
Avoid collecting an intermediate iterator
All checks were successful
/ test (pull_request) Successful in 1m0s
de12ba00bc
ParsedMerge::from… converted to 'into_parsed_merge'
Some checks failed
/ test (pull_request) Failing after 50s
c688809813
Allow 'into' methods to take self by reference
All checks were successful
/ test (pull_request) Successful in 1m4s
eb7a3aa3cf
src/ast.rs Outdated
@ -669,9 +669,11 @@ impl<'a> AstNode<'a> {
.then(|| format!(" {}", Color::Red.paint(self.source.replace('\n', "\\n"))))
.unwrap_or_default();
let commutative = (next_parent.is_some())
Owner

what's this all about? 😅

what's this all about? 😅
Author
Owner

That's a different version of clippy wanting to rewrite this for us… :-D I'll revert it.

That's a different version of clippy wanting to rewrite this for us… :-D I'll revert it.
ada4a approved these changes 2025-05-09 21:21:16 +02:00
wetneb merged commit 6627da9b92 into main 2025-05-10 03:02:36 +02:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: mergiraf/mergiraf#368
No description provided.