feat: Distinguish between exact and inexact initial matchings #523
The tree-matching heuristics we use are often seeded with pre-existing matchings between the trees to match. This happens in two cases.

So far, those two sorts of initial matchings were handled by the matching heuristics in the same way, but their nature is quite different. In the first case, the matching relates nodes whose sources are literally the same, so the matched subtrees are necessarily isomorphic. In the second case, by contrast, the initial matching is obtained by composing potentially inexact matches. Since the first pass of the GumTree algorithm is concerned with matching isomorphic nodes, it is useful for it to know that it can indeed rely on the initial matches all being exact.
This PR makes it possible to distinguish between the exact and inexact matches supplied to the matching algorithm. It happens to solve an existing failing test case (closes #139).
Mark rust/undetected_use_statement as working

Running this on my toy evaluation dataset, I also notice a really significant speed-up, which makes sense, because we avoid quite a few isomorphism checks by doing that. The speed-up is all the more significant for cases where the line-based merge is able to return large merged sections (which explains why the speed-up is smaller for JavaScript files, as that part of the dataset contains more minified files, for which line-based merges aren't helpful).
[Per-extension speed-up chart: *.py, *.cpp, *.js, *.java]
Ohh, that makes a lot of sense! But why would anyone store these minified files, let alone modify them in different branches? Aren't they essentially the machine code of the web development world?
But given that we can't change this fact, maybe we could employ a heuristic where, if we detect a minified file (the whole file is one line + size bigger than, say, 1kB), we try to split it into lines, do a line-merge + structured resolution of that, and if we succeed, put all the lines back together and return; otherwise, fail? Though the question would be how exactly to split into lines – I imagine doing so on things like opening/closing braces would yield results that a line-based merge would be more successful on.
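The detection half of that heuristic could be sketched as follows (a minimal sketch: the 1kB threshold and the single-line criterion come from the comment above, and the function name is made up):

```rust
/// Sketch of the minified-file detection heuristic discussed above:
/// treat a file as "minified" when its whole content fits on a single
/// line and it exceeds a size threshold (1 kB here, as suggested).
fn looks_minified(contents: &str) -> bool {
    const SIZE_THRESHOLD: usize = 1024;
    // `trim_end` tolerates a single trailing newline at the end of the file.
    contents.len() > SIZE_THRESHOLD && !contents.trim_end().contains('\n')
}
```

The splitting step (e.g. on opening/closing braces) is the genuinely tricky part and is left out here.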
Yes, or resorting to word-based merging instead of line-based, but arguably merging minified files isn't really supposed to happen in the first place (like merging compiled binaries), so I'm not sure it's worth investing a lot of effort into such use cases.
That's what I was trying to say with the first paragraph, yes. We could get even more audacious and complain on stdout about being forced to merge files where we know we'll just be wasting our and the user's time (lockfiles, minified JS, something else?)
It took me a bit of time to see what this PR actually changed -- it looked like it added a new optional parameter to a method, and then proceeded to just put `None`s into it at all the callsites. Only then did I realize that sometimes the other parameter was shifted to the last position by adding a `None` before it. Still, this feels a bit brittle to me, type-safety-wise. Maybe we could do something like:

- `ExactMatching`, a newtype around `Matching`, which guarantees that all the matches in it are exact
- `match_trees` to take one `Matching` and one `ExactMatching`
- `Deref<Target = Matching>`, for the places where we want to "forget" about the exactness -- similarly to `RevNESet`[^1] vs `RevSet`
- in `top_down_pass`, we'll be doing `self.exact_matching.add_matching(initial_exact_matching)` -- that would probably need to be a method like `ExactMatching::add_matching(&mut self, other: ExactMatching)` (accepting a regular `Matching` would of course be wrong because of the broken invariant)

Maybe the same logic could be applied to recovery and container matchings, but I'm still not completely clear on their semantics to be honest, so I'm not sure.

[^1]: I feel like putting the `NE` at the start, so `NERevSet`, would read a bit better... but that's a different discussion entirely
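The newtype proposal above could look roughly like this (a minimal sketch, assuming `Matching` is essentially a map from node ids to node ids, which is a simplification of mergiraf's actual type):

```rust
use std::collections::HashMap;
use std::ops::Deref;

/// Hypothetical stand-in for mergiraf's `Matching`: a map between node ids.
#[derive(Default, Clone)]
pub struct Matching {
    pairs: HashMap<usize, usize>,
}

impl Matching {
    pub fn add(&mut self, left: usize, right: usize) {
        self.pairs.insert(left, right);
    }
    pub fn len(&self) -> usize {
        self.pairs.len()
    }
}

/// Newtype guaranteeing the invariant that every contained match is exact,
/// i.e. that the matched subtrees are isomorphic.
#[derive(Default, Clone)]
pub struct ExactMatching(Matching);

impl ExactMatching {
    /// Only code that can vouch for exactness should call this.
    pub fn add_exact(&mut self, left: usize, right: usize) {
        self.0.add(left, right);
    }

    /// Merging two exact matchings preserves the invariant, so this is
    /// the only "bulk add" exposed; accepting a plain `Matching` here
    /// would silently break the invariant.
    pub fn add_matching(&mut self, other: ExactMatching) {
        self.0.pairs.extend(other.0.pairs);
    }
}

/// "Forget" exactness where only a plain `Matching` is needed.
impl Deref for ExactMatching {
    type Target = Matching;
    fn deref(&self) -> &Matching {
        &self.0
    }
}
```

Thanks to the `Deref` impl, an `&ExactMatching` can be passed anywhere an `&Matching` is expected, while the reverse direction stays impossible without going through the invariant-preserving methods.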
Also, do you think it's possible for both initial matchings to ever be `Some` at the same time? Doesn't seem to be the case from the diff anyway. I wonder whether `Option<Either<Matching, ExactMatching>>` would be more fitting... it is a bit less understandable though, I'm afraid.

Yes, I need to give you a bit more context here.
The reason I undertook this change was primarily to enable another follow-up change, to solve #482. I wanted to change the generation of initial matchings from line-based merges so that it returns not just the exact matches (for nodes lying exclusively in a merged region) but also approximate ones (for nodes whose start and end are in merged regions, but potentially different ones). To prepare for that, it made sense to make sure the matching heuristics can accept both types of initial matchings, and are able to distinguish them. (So it's a bit of an accident that this preparatory change turned out to be such a big performance boost.)
So I'm interested in being able to supply both an initial approximate matching and an initial exact matching. My follow-up changes actually introduce a struct for that purpose, so maybe I can add this commit here too.
I generally enjoy introducing types which encode properties about their contents, without adding more fields. To make their construction harder, I thought that could be achieved using constructor visibility, no? If the type is defined in the same unit as the code that is legitimate to produce it, then isn't that enough to prevent its construction elsewhere? I guess one issue here is that it will typically be constructed from the first pass of the GumTree algorithm, but also from `ParsedMerge`, so maybe it's difficult to use this solution.

The context was indeed quite useful^^ I too sometimes find it hard to choose between "open one big PR which the other person will probably ask to split up into multiple anyway" and "just do it as a series of small PRs from the start". As we've seen a couple of times, the latter approach has a risk of creating PRs whose purpose is not quite clear. Because of that, I'd say let's open those big PRs -- splitting them up is not that complicated anyway.
I mean, if you're in the same module as the item definition, you can just skip constructors entirely... but I see your point. Yes, that could probably work as well, but the GumTree thing does make that solution difficult, as you said.
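The visibility idea being discussed could be sketched like this (all names hypothetical; the point is that the private field makes literal construction impossible outside the defining module):

```rust
mod matching {
    /// A matching whose entries are guaranteed exact; the field is
    /// private, so no code outside this module can build one directly.
    pub struct ExactMatching {
        pairs: Vec<(usize, usize)>,
    }

    impl ExactMatching {
        /// Visible to the parent module only; callers elsewhere must go
        /// through code that this module chooses to expose.
        pub(super) fn from_verified(pairs: Vec<(usize, usize)>) -> Self {
            ExactMatching { pairs }
        }

        pub fn len(&self) -> usize {
            self.pairs.len()
        }
    }
}
```

As noted above, this breaks down if legitimate producers live in several modules (the GumTree first pass and `ParsedMerge`), which is exactly the difficulty raised in the discussion.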
I can't quite imagine what kind of a struct you mean, but if you say it's relevant then it probably is^^ Do as you see fit
Here is the `struct` I meant. It's a struct so that it can be used to supply both an approximate and an exact matching at the same time (which doesn't happen yet but will later, for #482).

Concerning making one big PR instead, I do think those two changes are worth reviewing separately, because changes to the matching algorithm can have really subtle impacts on the merged output and performance… even for this PR, I'd like to run a benchmark on a wider dataset to make sure the merged outputs are favorably impacted. Merge cases changing from Conflict to Differ are generally suspicious, as they can introduce wrong merges that don't get reviewed by users.
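A rough sketch of what such a struct could look like (field and type shapes here are guesses, not the actual definition from the linked commit):

```rust
use std::collections::HashMap;

// Minimal stand-ins for mergiraf's matching types (hypothetical shapes).
#[derive(Default)]
pub struct Matching(HashMap<usize, usize>);

#[derive(Default)]
pub struct ExactMatching(Matching);

/// Bundles both kinds of seed matchings supplied to the matching
/// heuristics, instead of threading two optional parameters around.
#[derive(Default)]
pub struct InitialMatchings {
    pub approximate: Matching,
    pub exact: ExactMatching,
}
```

Bundling both fields in one value sidesteps the earlier question of whether both can be `Some` at once: the empty matching plays the role of `None`.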
But I'm happy to try and give you more context upfront, I need to remember to do this spontaneously… (but sometimes it feels a bit like drowning the other in irrelevant stuff, perhaps)
I'm not sure how I feel about the new struct, but let's see how it looks in the context of the second PR
@ -84,0 +90,4 @@

```rust
/// - `initial_matching`, which may contain inexact matches, but which the exact matching produced in this method must
///   be compatible with (we cannot generate an exact match of a node with another one that's already matched in the
///   `initial_matching`)
/// - `initial_exact_matching`, which is an initial seed of matches that are known to be isomorphic.
```

The docs need to be updated, then.