feat: Distinguish between exact and inexact initial matchings #523

Merged
ada4a merged 7 commits from wetneb/initial_exact_matching into main 2025-07-26 14:53:51 +02:00 AGit
Owner

The tree-matching heuristics we use are often seeded with pre-existing matchings between the trees to match. This happens in two cases:

  1. When we have a line-based merge between the two sources, we create an initial matching which pre-matches elements that lie in merged regions (as they are identical in all revisions)
  2. When we have Base-Left and Base-Right matchings, we compose them to obtain an initial Left-Right matching.

So far, those two sorts of initial matchings were handled by the matching heuristics in the same way, but their nature is quite different. In the first case, the matching relates nodes whose sources are literally the same, so they are of course also isomorphic as subtrees. In the second case, by contrast, the initial matching is obtained by composing potentially inexact matches. Since the first pass of the GumTree algorithm is concerned with matching isomorphic nodes, it is useful for it to know whether it can rely on the initial matches all being exact.
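The composition in case 2 can be sketched roughly as follows. This is a hypothetical illustration, not mergiraf's actual code: plain integer node ids and `HashMap`-based matchings are assumptions made for the sake of the example.

```rust
use std::collections::HashMap;

// Compose a Base-Left and a Base-Right matching into an initial
// Left-Right matching: a left and a right node are matched whenever
// they are both matched to the same base node.
fn compose(
    base_left: &HashMap<u32, u32>,  // base node id -> left node id
    base_right: &HashMap<u32, u32>, // base node id -> right node id
) -> HashMap<u32, u32> {
    base_left
        .iter()
        .filter_map(|(base, left)| base_right.get(base).map(|right| (*left, *right)))
        .collect()
}

fn main() {
    let base_left = HashMap::from([(1, 10), (2, 20)]);
    let base_right = HashMap::from([(1, 100), (3, 300)]);
    // Only base node 1 is matched on both sides, so one composed pair survives.
    let left_right = compose(&base_left, &base_right);
    assert_eq!(left_right, HashMap::from([(10, 100)]));
}
```

Note how the composed matches inherit the inexactness of their inputs: if either side's match was approximate, the resulting Left-Right pair may relate non-isomorphic subtrees.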

This PR makes it possible to distinguish between the exact and inexact matches supplied to the matching algorithm. It happens to solve an existing failing test case (closes #139).

wetneb changed title from Mark rust/undetected_use_statement as working to feat: Distinguish between exact and inexact initial matchings 2025-07-22 11:23:23 +02:00
cargo fmt
All checks were successful
/ test (pull_request) Successful in 35s
ed10beb06a
Author
Owner

Running this on my toy evaluation dataset, I also notice a really significant speed-up, which makes sense: we avoid quite a few isomorphism checks this way. The speed-up is all the more significant for cases where the line-based merge is able to return large merged sections (which explains why the speed-up is smaller for JavaScript files, as that set contains more minified files, for which line-based merges aren't helpful).

| Language | Cases | Conflict | Exact | Format | Differ | Parse | Panic | Time (s) |
| -------- | ----- | -------- | ----- | ------ | ------ | ----- | ----- | -------- |
| `*.py` | 815 | -1 **(-0.1%)** | +1 **(+0.1%)** | +0 | +0 | +0 | +0 | -0.071 (-35.4%) |
| `*.cpp` | 461 | -4 **(-0.9%)** | +3 **(+0.7%)** | -1 **(-0.2%)** | +2 **(+0.4%)** | +0 | +0 | -0.055 (-40.6%) |
| `*.js` | 1514 | -2 **(-0.1%)** | +0 | +1 **(+0.1%)** | +1 **(+0.1%)** | +0 | +0 | -0.187 (-24.4%) |
| `*.java` | 327 | -1 **(-0.3%)** | +1 **(+0.3%)** | +1 **(+0.3%)** | -1 **(-0.3%)** | +0 | +0 | -0.066 (-46.8%) |
| **Total** | 3117 | -8 **(-0.3%)** | +5 **(+0.2%)** | +1 **(+0.0%)** | +2 **(+0.1%)** | +0 | +0 | -0.124 (-27.1%) |
Owner

> which explains why the speed up is smaller for javascript files, as this contains more minified files for which line-based merges aren't helpful

Ohh, that makes a lot of sense! But why would anyone store these minified files, let alone modify them in different branches... Aren't they essentially the machine code of the web development world?

But given that we can't change this fact, maybe we could employ a heuristic where, if we detect a minified file (the whole file is one line + size bigger than, say, 1kB), we try to split it into lines, do a line-merge + structured resolution of that, and if we succeed, put all the lines back together and return, otherwise fail? Though the question would be how exactly to split into lines – I imagine doing so on things like opening/closing braces would yield results that a line-merge would be more successful on.
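The detection part of that heuristic could be as simple as the following sketch. The thresholds come from the suggestion above (one line, bigger than ~1 kB) and the function name is made up:

```rust
// Hypothetical minified-file detection: the whole file fits on a single
// line and is larger than roughly 1 kB. Thresholds are illustrative only.
fn looks_minified(contents: &str) -> bool {
    let single_line = !contents.trim_end().contains('\n');
    single_line && contents.len() > 1024
}

fn main() {
    // One long line, well over 1 kB: flagged as minified.
    let minified = format!("var a=1;{}", "f();".repeat(300));
    assert!(looks_minified(&minified));

    // A small multi-line file: not flagged.
    let regular = "fn main() {\n    println!(\"hi\");\n}\n";
    assert!(!looks_minified(regular));
}
```

The splitting step would be the hard part, as noted above; the detection itself is cheap enough to run on every input.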

Author
Owner

Yes, or resorting to word-based merging instead of line-based, but arguably merging minified files isn't really supposed to happen in the first place (like merging compiled binaries), so I'm not sure it's worth investing a lot of effort into such use cases.

Owner

> arguably merging minified files isn't really supposed to happen in the first place (like merging compiled binaries)

That's what I was trying to say with the first paragraph, yes. We could get even more audacious and complain on stdout about being forced to merge files where we know that we'll just be wasting our and the user's time (lockfiles, minified JS, something else?)

Owner

It took me a bit of time to see what this PR actually changed -- it looked like it added a new optional parameter to a method, and then proceeded to just put `None`s into it at all the call sites. Only then did I realize that sometimes _the other_ parameter was shifted to the last position by adding a `None` before it. Still, this feels a bit brittle to me, type-safety-wise.

Maybe we could do something like:

1. add `ExactMatching`, a newtype around `Matching`, which guarantees that all the matches in it are exact
2. make constructing it tricky -- maybe with an `unsafe` constructor like

   ```rs
   /// # Safety
   ///
   /// `matching` must contain only exact matches
   unsafe fn new(matching: Matching) -> Self {
       Self(matching)
   }
   ```

3. call this unsafe constructor in the places where we know that the matching we create is indeed exact, with an appropriate safety comment
4. change the signature of `match_trees` to take one `Matching` and one `ExactMatching`
5. we probably could implement `Deref<Target = Matching>`, for the places where we want to "forget" about the exactness -- similarly to `RevNESet`[^1] vs `RevSet`
6. in `top_down_pass`, we'll be doing `self.exact_matching.add_matching(initial_exact_matching)` -- that would probably need to be a method like `ExactMatching::add_matching(&mut self, other: ExactMatching)` (accepting a regular `Matching` would of course be wrong because of the broken invariant)

[^1]: I feel like putting the `NE` at the start, so `NERevSet`, would read a bit better... but that's a different discussion entirely
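Fleshed out a bit, the steps above might look like the following sketch. Only the names `Matching`, `ExactMatching`, `new`, and `add_matching` come from the suggestion; the internal representation of `Matching` as a list of node-id pairs is an assumption for illustration.

```rust
use std::ops::Deref;

#[derive(Default)]
struct Matching {
    pairs: Vec<(u32, u32)>, // (left node id, right node id)
}

impl Matching {
    fn len(&self) -> usize {
        self.pairs.len()
    }
}

/// Newtype whose invariant is that every contained match is exact
/// (i.e. relates isomorphic subtrees).
struct ExactMatching(Matching);

impl ExactMatching {
    /// # Safety
    ///
    /// `matching` must contain only exact matches
    unsafe fn new(matching: Matching) -> Self {
        Self(matching)
    }

    /// Merging in another `ExactMatching` preserves the invariant;
    /// accepting a plain `Matching` here would break it.
    fn add_matching(&mut self, other: ExactMatching) {
        self.0.pairs.extend(other.0.pairs);
    }
}

// "Forget" the exactness where a plain `Matching` is expected.
impl Deref for ExactMatching {
    type Target = Matching;
    fn deref(&self) -> &Matching {
        &self.0
    }
}

fn main() {
    // SAFETY: an empty matching trivially contains only exact matches.
    let mut exact = unsafe { ExactMatching::new(Matching::default()) };
    // SAFETY: pretend (1, 1) is known to be an exact match.
    let more = unsafe { ExactMatching::new(Matching { pairs: vec![(1, 1)] }) };
    exact.add_matching(more);
    assert_eq!(exact.len(), 1); // `Deref` lets us call `Matching::len` directly
}
```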

Owner

Maybe the same logic could be applied to recovery and container matchings, but I'm still not completely clear on their semantics to be honest, so I'm not sure

Owner

Also, do you think it's possible for both initial matchings to ever be `Some` at the same time? It doesn't seem to be the case from the diff, anyway. I wonder whether `Option<Either<Matching, ExactMatching>>` would be more fitting... it is a bit less understandable, though, I'm afraid

Author
Owner

Yes, I need to give you a bit more context here.

The reason I undertook this change was primarily to enable another follow-up change, to solve #482. I wanted to change the generation of initial matchings from line-based merges so that it returns not just the exact matches (for nodes lying exclusively in a merged region) but also approximate ones (for nodes whose start and end are in merged regions, but potentially different ones). To prepare for that, it made sense to make sure the matching heuristics can accept both types of initial matchings, and are able to distinguish them. (So it's a bit of an accident that this preparatory change turned out to be such a big performance boost.)

So I'm interested in being able to supply both an initial approximate matching and an initial exact matching. My follow-up changes actually introduce a struct for that purpose, so maybe I can add this commit here too.

I generally enjoy introducing types which encode properties about their contents, without adding more fields. To make their construction harder, I thought that could be achieved using constructor visibility, no? If the type is defined in the same module as the code that is legitimately allowed to produce it, then isn't that enough to prevent its construction elsewhere? I guess one issue here is that it will typically be constructed from the first pass of the GumTree algorithm, but also from `ParsedMerge`, so maybe it's difficult to use this solution.
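The visibility-based alternative could look roughly like this sketch. The module layout and the `from_merged_regions` constructor are invented for illustration; the point is only that field privacy, rather than an `unsafe` constructor, guards the invariant.

```rust
// Hypothetical sketch: the invariant is protected by field privacy.
// Outside the `matching` module, the tuple field of `ExactMatching` is
// inaccessible, so the type cannot be built directly there -- only the
// legitimate producers living in (or exposed by) this module can make one.
mod matching {
    #[derive(Default)]
    pub struct Matching {
        pub pairs: Vec<(u32, u32)>,
    }

    /// Invariant: all contained matches are exact.
    pub struct ExactMatching(Matching);

    impl ExactMatching {
        /// Invented stand-in for a legitimate producer, e.g. code deriving
        /// matches from identical merged regions of a line-based merge.
        pub fn from_merged_regions(pairs: Vec<(u32, u32)>) -> Self {
            Self(Matching { pairs })
        }

        pub fn len(&self) -> usize {
            self.0.pairs.len()
        }
    }
}

fn main() {
    let exact = matching::ExactMatching::from_merged_regions(vec![(0, 0), (1, 1)]);
    assert_eq!(exact.len(), 2);
    // matching::ExactMatching(matching::Matching::default()); // would not compile
}
```

As noted above, this only works cleanly if all legitimate producers can live near the definition, which the GumTree-pass-plus-`ParsedMerge` situation makes awkward.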

Owner

> Yes, I need to give you a bit more context here.
>
> The reason I undertook this change was primarily to enable another follow-up change, to solve #482. I wanted to change the generation of initial matchings from line-based merges so that it returns not just the exact matches (for nodes lying exclusively in a merged region) but also approximate ones (for nodes whose start and end are in merged regions, but potentially different ones). To prepare for that, it made sense to make sure the matching heuristics can accept both types of initial matchings, and are able to distinguish them.

The context was indeed quite useful^^ I too sometimes find it hard to choose between "open one big PR which the other person will probably ask to split up into multiple anyway" and "just do it as a series of small PRs from the start". As we've seen a couple of times, the latter approach has a risk of creating PRs whose purpose is not quite clear. Because of that, I'd say let's open those big PRs -- splitting them up is not that complicated anyway.

> To make their construction harder, I thought that can be achieved using constructor visibility, no?

I mean, if you're in the same module as the item definition, you can just skip constructors entirely... but I see your point. Yes, that could probably work as well, but the GumTree thing does make that solution difficult, as you said

Owner

> My follow-up changes actually introduce a struct for that purpose, so maybe I can add this commit here too.

I can't quite imagine what kind of a struct you mean, but if you say it's relevant then it probably is^^ Do as you see fit

Use ApproxExactMatching in TreeMatcher
All checks were successful
/ test (pull_request) Successful in 36s
9a55c5ad38
Author
Owner

Here is the `struct` I meant. It's a struct so that it can be used to supply both an approximate and an exact matching at the same time (which doesn't happen yet, but will later, for #482).

Concerning making one big PR instead, I do think those two changes are worth reviewing separately, because changes to the matching algorithm can have really subtle impacts on the merged output and performance… even for this PR, I'd like to run a benchmark on a wider dataset to make sure the merged outputs are favorably impacted. Merge cases changing from **Conflict** to **Differ** are generally suspicious, as they can introduce wrong merges that don't get reviewed by users.
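For reference, a plausible shape of such a combined struct is sketched below. The name `ApproxExactMatching` comes from the commit title; the field names and the reduction of `Matching` to a list of node-id pairs are guesses for illustration, not the actual definition.

```rust
#[derive(Default)]
struct Matching {
    pairs: Vec<(u32, u32)>, // (left node id, right node id)
}

/// Bundles both seeds for the matching heuristics, kept apart so that the
/// first (isomorphism-based) GumTree pass can fully trust the exact part.
#[derive(Default)]
struct ApproxExactMatching {
    /// Matches that may be inexact, e.g. composed Base-Left/Base-Right matches.
    approx: Matching,
    /// Matches known to relate isomorphic subtrees, e.g. from merged regions.
    exact: Matching,
}

fn main() {
    let seed = ApproxExactMatching {
        approx: Matching { pairs: vec![(3, 4)] },
        exact: Matching { pairs: vec![(1, 1), (2, 2)] },
    };
    assert_eq!(seed.exact.pairs.len(), 2);
    assert_eq!(seed.approx.pairs.len(), 1);
}
```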

Author
Owner

But I'm happy to try and give you more context upfront; I need to remember to do this spontaneously… (though sometimes it feels a bit like drowning the other person in irrelevant stuff, perhaps)

ada4a approved these changes 2025-07-24 17:18:30 +02:00
ada4a left a comment
Owner

I'm not sure how I feel about the new struct, but let's see how it looks in the context of the second PR

@ -84,0 +90,4 @@
/// - `initial_matching`, which may contain inexact matches, but which the exact matching produced in this method must
/// be compatible with (we cannot generate an exact match of a node with another one that's already matched in the
/// initial_matching)
/// - `initial_exact_matching`, which is an initial seed of matches that are known to be isomorphic.
Owner

The docs need to be updated then

ada4a marked this conversation as resolved
Update doc comment
All checks were successful
/ test (pull_request) Successful in 38s
bd4a6d7cfb
ada4a merged commit e6a6c62822 into main 2025-07-26 14:53:51 +02:00
Reference: mergiraf/mergiraf#523