feat: Distinguish between exact and inexact initial matchings #523

Merged
ada4a merged 7 commits from wetneb/initial_exact_matching into main 2025-07-26 14:53:51 +02:00 AGit
Owner

The tree-matching heuristics we use are often seeded with pre-existing matchings between the trees to match. This happens in two cases:

  1. When we have a line-based merge between the two sources, we create an initial matching which pre-matches elements that lie in merged regions (as they are identical in all revisions)
  2. When we have Base-Left and Base-Right matchings, we compose them to obtain an initial Left-Right matching.

So far, those two sorts of initial matchings were handled by the matching heuristics in the same way, but their nature is quite different. In the first case, the matching relates nodes whose sources are literally the same, so they are of course also isomorphic as subtrees. In the second case, by contrast, the initial matching is obtained by composing potentially inexact matches. Since the first pass of the GumTree algorithm is concerned with matching isomorphic nodes, it is useful for it to know whether it can rely on the initial matches all being exact.
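The composition in case 2 can be sketched roughly as follows. This is a hypothetical illustration, not mergiraf's actual code: plain integer node ids and `HashMap`-based matchings are assumptions made for the sake of the example.

```rust
use std::collections::HashMap;

// Compose a Base-Left and a Base-Right matching into an initial
// Left-Right matching: a left and a right node are matched whenever
// they are both matched to the same base node.
fn compose(
    base_left: &HashMap<u32, u32>,  // base node id -> left node id
    base_right: &HashMap<u32, u32>, // base node id -> right node id
) -> HashMap<u32, u32> {
    base_left
        .iter()
        .filter_map(|(base, left)| base_right.get(base).map(|right| (*left, *right)))
        .collect()
}

fn main() {
    let base_left = HashMap::from([(1, 10), (2, 20)]);
    let base_right = HashMap::from([(1, 100), (3, 300)]);
    // Only base node 1 is matched on both sides, so one composed pair survives.
    let left_right = compose(&base_left, &base_right);
    assert_eq!(left_right, HashMap::from([(10, 100)]));
}
```

Note how the composed matches inherit the inexactness of their inputs: if either side's match was approximate, the resulting Left-Right pair may relate non-isomorphic subtrees.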

This PR makes it possible to distinguish between the exact and inexact matches supplied to the matching algorithm. It happens to solve an existing failing test case (closes #139).

wetneb changed title from Mark rust/undetected_use_statement as working to feat: Distinguish between exact and inexact initial matchings 2025-07-22 11:23:23 +02:00
cargo fmt
All checks were successful
/ test (pull_request) Successful in 35s
ed10beb06a
Author
Owner

Running this on my toy evaluation dataset, I also notice a really significant speed-up, which makes sense: we avoid quite a few isomorphism checks this way. The speed-up is all the more significant for cases where the line-based merge is able to return large merged sections (which explains why the speed-up is smaller for JavaScript files, as that set contains more minified files, for which line-based merges aren't helpful).

| Language | Cases | Conflict | Exact | Format | Differ | Parse | Panic | Time (s) |
| -------- | ----- | -------- | ----- | ------ | ------ | ----- | ----- | -------- |
| `*.py` | 815 | -1 **(-0.1%)** | +1 **(+0.1%)** | +0 | +0 | +0 | +0 | -0.071 (-35.4%) |
| `*.cpp` | 461 | -4 **(-0.9%)** | +3 **(+0.7%)** | -1 **(-0.2%)** | +2 **(+0.4%)** | +0 | +0 | -0.055 (-40.6%) |
| `*.js` | 1514 | -2 **(-0.1%)** | +0 | +1 **(+0.1%)** | +1 **(+0.1%)** | +0 | +0 | -0.187 (-24.4%) |
| `*.java` | 327 | -1 **(-0.3%)** | +1 **(+0.3%)** | +1 **(+0.3%)** | -1 **(-0.3%)** | +0 | +0 | -0.066 (-46.8%) |
| **Total** | 3117 | -8 **(-0.3%)** | +5 **(+0.2%)** | +1 **(+0.0%)** | +2 **(+0.1%)** | +0 | +0 | -0.124 (-27.1%) |
Owner

> which explains why the speed up is smaller for javascript files, as this contains more minified files for which line-based merges aren't helpful

Ohh, that makes a lot of sense! But why would anyone store these minified files, let alone modify them in different branches... Aren't they essentially the machine code of the web development world?

But given that we can't change this fact, maybe we could employ a heuristic where, if we detect a minified file (the whole file is one line + size bigger than, say, 1kB), we try to split it into lines, do a line-merge + structured resolution of that, and if we succeed, put all the lines back together and return, otherwise fail? Though the question would be how exactly to split into lines – I imagine doing so on things like opening/closing braces would yield results that a line-merge would be more successful on.
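The detection part of that heuristic could be as simple as the following sketch. The thresholds come from the suggestion above (one line, bigger than ~1 kB) and the function name is made up:

```rust
// Hypothetical minified-file detection: the whole file fits on a single
// line and is larger than roughly 1 kB. Thresholds are illustrative only.
fn looks_minified(contents: &str) -> bool {
    let single_line = !contents.trim_end().contains('\n');
    single_line && contents.len() > 1024
}

fn main() {
    // One long line, well over 1 kB: flagged as minified.
    let minified = format!("var a=1;{}", "f();".repeat(300));
    assert!(looks_minified(&minified));

    // A small multi-line file: not flagged.
    let regular = "fn main() {\n    println!(\"hi\");\n}\n";
    assert!(!looks_minified(regular));
}
```

The splitting step would be the hard part, as noted above; the detection itself is cheap enough to run on every input.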

Author
Owner

Yes, or resorting to word-based merging instead of line-based, but arguably merging minified files isn't really supposed to happen in the first place (like merging compiled binaries), so I'm not sure it's worth investing a lot of effort into such use cases.

Owner

> arguably merging minified files isn't really supposed to happen in the first place (like merging compiled binaries)

That's what I was trying to say with the first paragraph, yes. We could get even more audacious and complain on stdout about being forced to merge files where we know that we'll just be wasting our and the user's time (lockfiles, minified JS, something else?)

Owner

It took me a bit of time to see what this PR actually changed -- it looked like it added a new optional parameter to a method, and then proceeded to just put `None`s into it at all the call sites. Only then did I realize that sometimes _the other_ parameter was shifted to the last position by adding a `None` before it. Still, this feels a bit brittle to me, type-safety-wise.

Maybe we could do something like:

1. add `ExactMatching`, a newtype around `Matching`, which guarantees that all the matches in it are exact
2. make constructing it tricky -- maybe with an `unsafe` constructor like

   ```rs
   /// # Safety
   ///
   /// `matching` must contain only exact matches
   unsafe fn new(matching: Matching) -> Self {
       Self(matching)
   }
   ```

3. call this unsafe constructor in the places where we know that the matching we create is indeed exact, with an appropriate safety comment
4. change the signature of `match_trees` to take one `Matching` and one `ExactMatching`
5. we probably could implement `Deref<Target = Matching>`, for the places where we want to "forget" about the exactness -- similarly to `RevNESet`[^1] vs `RevSet`
6. in `top_down_pass`, we'll be doing `self.exact_matching.add_matching(initial_exact_matching)` -- that would probably need to be a method like `ExactMatching::add_matching(&mut self, other: ExactMatching)` (accepting a regular `Matching` would of course be wrong because of the broken invariant)

[^1]: I feel like putting the `NE` at the start, so `NERevSet`, would read a bit better... but that's a different discussion entirely
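Fleshed out a bit, the steps above might look like the following sketch. Only the names `Matching`, `ExactMatching`, `new`, and `add_matching` come from the suggestion; the internal representation of `Matching` as a list of node-id pairs is an assumption for illustration.

```rust
use std::ops::Deref;

#[derive(Default)]
struct Matching {
    pairs: Vec<(u32, u32)>, // (left node id, right node id)
}

impl Matching {
    fn len(&self) -> usize {
        self.pairs.len()
    }
}

/// Newtype whose invariant is that every contained match is exact
/// (i.e. relates isomorphic subtrees).
struct ExactMatching(Matching);

impl ExactMatching {
    /// # Safety
    ///
    /// `matching` must contain only exact matches
    unsafe fn new(matching: Matching) -> Self {
        Self(matching)
    }

    /// Merging in another `ExactMatching` preserves the invariant;
    /// accepting a plain `Matching` here would break it.
    fn add_matching(&mut self, other: ExactMatching) {
        self.0.pairs.extend(other.0.pairs);
    }
}

// "Forget" the exactness where a plain `Matching` is expected.
impl Deref for ExactMatching {
    type Target = Matching;
    fn deref(&self) -> &Matching {
        &self.0
    }
}

fn main() {
    // SAFETY: an empty matching trivially contains only exact matches.
    let mut exact = unsafe { ExactMatching::new(Matching::default()) };
    // SAFETY: pretend (1, 1) is known to be an exact match.
    let more = unsafe { ExactMatching::new(Matching { pairs: vec![(1, 1)] }) };
    exact.add_matching(more);
    assert_eq!(exact.len(), 1); // `Deref` lets us call `Matching::len` directly
}
```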

Owner

Maybe the same logic could be applied to recovery and container matchings, but I'm still not completely clear on their semantics to be honest, so I'm not sure

Owner

Also, do you think it's possible for both initial matchings to ever be `Some` at the same time? It doesn't seem to be the case from the diff, anyway. I wonder whether `Option<Either<Matching, ExactMatching>>` would be more fitting... it is a bit less understandable, though, I'm afraid

Author
Owner

Yes, I need to give you a bit more context here.

The reason I undertook this change was primarily to enable another follow-up change, to solve #482. I wanted to change the generation of initial matchings from line-based merges so that it returns not just the exact matches (for nodes lying exclusively in a merged region) but also approximate ones (for nodes whose start and end are in merged regions, but potentially different ones). To prepare for that, it made sense to make sure the matching heuristics can accept both types of initial matchings, and are able to distinguish them. (So it's a bit of an accident that this preparatory change turned out to be such a big performance boost.)

So I'm interested in being able to supply both an initial approximate matching and an initial exact matching. My follow-up changes actually introduce a struct for that purpose, so maybe I can add this commit here too.

I generally enjoy introducing types which encode properties about their contents, without adding more fields. To make their construction harder, I thought that could be achieved using constructor visibility, no? If the type is defined in the same module as the code that is legitimately allowed to produce it, then isn't that enough to prevent its construction elsewhere? I guess one issue here is that it will typically be constructed from the first pass of the GumTree algorithm, but also from `ParsedMerge`, so maybe it's difficult to use this solution.
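The visibility-based alternative could look roughly like this sketch. The module layout and the `from_merged_regions` constructor are invented for illustration; the point is only that field privacy, rather than an `unsafe` constructor, guards the invariant.

```rust
// Hypothetical sketch: the invariant is protected by field privacy.
// Outside the `matching` module, the tuple field of `ExactMatching` is
// inaccessible, so the type cannot be built directly there -- only the
// legitimate producers living in (or exposed by) this module can make one.
mod matching {
    #[derive(Default)]
    pub struct Matching {
        pub pairs: Vec<(u32, u32)>,
    }

    /// Invariant: all contained matches are exact.
    pub struct ExactMatching(Matching);

    impl ExactMatching {
        /// Invented stand-in for a legitimate producer, e.g. code deriving
        /// matches from identical merged regions of a line-based merge.
        pub fn from_merged_regions(pairs: Vec<(u32, u32)>) -> Self {
            Self(Matching { pairs })
        }

        pub fn len(&self) -> usize {
            self.0.pairs.len()
        }
    }
}

fn main() {
    let exact = matching::ExactMatching::from_merged_regions(vec![(0, 0), (1, 1)]);
    assert_eq!(exact.len(), 2);
    // matching::ExactMatching(matching::Matching::default()); // would not compile
}
```

As noted above, this only works cleanly if all legitimate producers can live near the definition, which the GumTree-pass-plus-`ParsedMerge` situation makes awkward.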

Owner

> Yes, I need to give you a bit more context here.
>
> The reason I undertook this change was primarily to enable another follow-up change, to solve #482. I wanted to change the generation of initial matchings from line-based merges so that it returns not just the exact matches (for nodes lying exclusively in a merged region) but also approximate ones (for nodes whose start and end are in merged regions, but potentially different ones). To prepare for that, it made sense to make sure the matching heuristics can accept both types of initial matchings, and are able to distinguish them.

The context was indeed quite useful^^ I too sometimes find it hard to choose between "open one big PR which the other person will probably ask to split up into multiple anyway" and "just do it as a series of small PRs from the start". As we've seen a couple of times, the latter approach has a risk of creating PRs whose purpose is not quite clear. Because of that, I'd say let's open those big PRs -- splitting them up is not that complicated anyway.

> To make their construction harder, I thought that can be achieved using constructor visibility, no?

I mean, if you're in the same module as the item definition, you can just skip constructors entirely... but I see your point. Yes, that could probably work as well, but the GumTree thing does make that solution difficult, as you said

Owner

> My follow-up changes actually introduce a struct for that purpose, so maybe I can add this commit here too.

I can't quite imagine what kind of a struct you mean, but if you say it's relevant then it probably is^^ Do as you see fit

Use ApproxExactMatching in TreeMatcher
All checks were successful
/ test (pull_request) Successful in 36s
9a55c5ad38
Author
Owner

Here is the `struct` I meant. It's a struct so that it can be used to supply both an approximate and an exact matching at the same time (which doesn't happen yet, but will later, for #482).

Concerning making one big PR instead, I do think those two changes are worth reviewing separately, because changes to the matching algorithm can have really subtle impacts on the merged output and performance… even for this PR, I'd like to run a benchmark on a wider dataset to make sure the merged outputs are favorably impacted. Merge cases changing from **Conflict** to **Differ** are generally suspicious, as they can introduce wrong merges that don't get reviewed by users.
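For reference, a plausible shape of such a combined struct is sketched below. The name `ApproxExactMatching` comes from the commit title; the field names and the reduction of `Matching` to a list of node-id pairs are guesses for illustration, not the actual definition.

```rust
#[derive(Default)]
struct Matching {
    pairs: Vec<(u32, u32)>, // (left node id, right node id)
}

/// Bundles both seeds for the matching heuristics, kept apart so that the
/// first (isomorphism-based) GumTree pass can fully trust the exact part.
#[derive(Default)]
struct ApproxExactMatching {
    /// Matches that may be inexact, e.g. composed Base-Left/Base-Right matches.
    approx: Matching,
    /// Matches known to relate isomorphic subtrees, e.g. from merged regions.
    exact: Matching,
}

fn main() {
    let seed = ApproxExactMatching {
        approx: Matching { pairs: vec![(3, 4)] },
        exact: Matching { pairs: vec![(1, 1), (2, 2)] },
    };
    assert_eq!(seed.exact.pairs.len(), 2);
    assert_eq!(seed.approx.pairs.len(), 1);
}
```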

Author
Owner

But I'm happy to try and give you more context upfront; I need to remember to do this spontaneously… (though sometimes it feels a bit like drowning the other person in irrelevant stuff, perhaps)

ada4a approved these changes 2025-07-24 17:18:30 +02:00
ada4a left a comment
Owner

I'm not sure how I feel about the new struct, but let's see how it looks in the context of the second PR

@ -84,0 +90,4 @@
/// - `initial_matching`, which may contain inexact matches, but which the exact matching produced in this method must
/// be compatible with (we cannot generate an exact match of a node with another one that's already matched in the
/// initial_matching)
/// - `initial_exact_matching`, which is an initial seed of matches that are known to be isomorphic.
Owner

The docs need to be updated then

ada4a marked this conversation as resolved
Update doc comment
All checks were successful
/ test (pull_request) Successful in 38s
bd4a6d7cfb
ada4a merged commit e6a6c62822 into main 2025-07-26 14:53:51 +02:00
Reference: mergiraf/mergiraf#523