Broader test suite for benchmarks #485
Our existing set of `examples/` is focused on minimal, artificial test cases to check that a specific type of conflict resolution is supported. For some types of changes, such as tweaks to the matching algorithms, it would be good to have an overview of the impact on real-world usage. During initial development, I've used the replication test suite of Spork for this. It's a set of real merge scenarios extracted from merge commits found in the wild; the expected merge file is set to be the one found in the merge commit. With such a dataset, one can run mergiraf on all cases and compare its output to the extracted ones. Then we can categorize those based on what mergiraf returns.
I would like a script to run this benchmark over such a dataset; the `helpers/suite.sh` script is not far from that. The dataset should cover the file types we support, such as `*.rs` or `go.mod`.
I think this would be really helpful for tweaking the matching heuristics (such as for #406 or #325), because I feel quite anxious about making changes to them without this feedback.
I'm working on the dataset collection, which I plan to do via https://www.softwareheritage.org/.
There are other things that would be interesting to track, too. It would also be good to be able to run the test suite only on certain file types (typically for parser updates or changes to specific language profiles).
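For concreteness, here is a minimal sketch of what such a suite runner could look like, including the restriction to certain file types. The dataset layout (one directory per case with `base`, `left`, `right`, `expected` and a `filename` file) and the `mergiraf merge` behaviour (merged result on stdout, non-zero exit when conflicts remain) are assumptions for the sketch, not what `helpers/suite.sh` actually does.

```python
#!/usr/bin/env python3
"""Sketch of a benchmark runner over a corpus of real merge scenarios.

Assumed layout: one directory per case, containing `base`, `left`, `right`,
`expected` and a `filename` file holding the original path. The exact
`mergiraf merge` CLI behaviour (output on stdout, non-zero exit on
conflicts) is also an assumption.
"""
import subprocess
import sys
import time
from pathlib import Path


def run_case(case_dir: Path) -> tuple[int, str, float]:
    base, left, right = (case_dir / name for name in ("base", "left", "right"))
    start = time.monotonic()
    proc = subprocess.run(
        ["mergiraf", "merge", str(base), str(left), str(right)],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout, time.monotonic() - start


def main(dataset: Path, patterns: list[str]) -> None:
    for case_dir in sorted(p for p in dataset.iterdir() if p.is_dir()):
        filename = (case_dir / "filename").read_text().strip()
        # Only run the cases matching the requested file types, e.g. "*.rs" or "go.mod".
        if patterns and not any(Path(filename).match(pattern) for pattern in patterns):
            continue
        returncode, _merged, elapsed = run_case(case_dir)
        print(f"{case_dir.name}\t{filename}\texit={returncode}\t{elapsed:.2f}s")


if __name__ == "__main__":
    main(Path(sys.argv[1]), sys.argv[2:])
```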
I'm slowly getting there! Here is a table summarizing a benchmark on a small dataset of real merge scenarios extracted from merge commits:
[benchmark table for `*.py`, `*.cpp`, `*.js` and `*.java` files]
Here's what the columns mean: the line-based reference merge is produced by `git merge-file` (with Myers diff), and the timing column is the duration of the `mergiraf merge` process, in seconds.

So far the panics are all instances of #520, #521 or #333. It's interesting to see that almost half of the `*.cpp` files fail to parse (likely due to pre-processor usage). Also, the `*.js` files take significantly longer; I'm not sure if it's because they are often bigger, or if the parser is inefficient. Of course the numbers aren't so insightful on their own: the real deal will be observing the difference between two versions of mergiraf.

I'm still gathering a wider dataset of test cases (covering all file formats).
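As a rough illustration of how each merge result could be bucketed into such categories, here is a small sketch. The category names (`Exact`, `Format`, `Conflicts`, `Differ`) and the whitespace-based notion of a formatting-only difference are simplifying assumptions for the example, not the definitions used by the actual benchmark.

```python
import re


def categorize(merged: str, expected: str) -> str:
    """Bucket a merge result against the expected file from the merge commit."""
    if "<<<<<<<" in merged:
        return "Conflicts"
    if merged == expected:
        return "Exact"
    # Deliberately crude: call the difference "formatting only" when the two
    # files agree once all whitespace is stripped out.
    if re.sub(r"\s+", "", merged) == re.sub(r"\s+", "", expected):
        return "Format"
    return "Differ"
```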
Very impressive!
I think for the "Conflicts" column, it could make sense to weight each merge result by the amount of conflicts created – maybe Mergiraf solves all but one? I guess one could argue for doing the same for "Format" using something like tree-edit-distance, but that might come too close to an attempt to white-wash Mergiraf 😅
What sort of weighting are you thinking about?
For me, the number of conflicts solved isn't super important… as long as there are conflicts in the file, I expect that some human will have a look at the merged results to solve the remaining conflicts, and so if there are issues with the resolution of any conflicts, they are likely to get noticed in the same go.
Another issue is that there can be more conflicts than in the line-based file, because they're narrower. So we could also track the conflict mass, but then again I'm not sure how to report that in the table.
Hm, my idea would be something like `conflict_mass(structured_merge) / conflict_mass(line_based_merge)`, but I guess that's not guaranteed to be favorable to us, given that a fully structured merge can create just different conflicts, which can't really be compared to those created by a line-based merge.

Yes… I guess one category we could still add is one for cases where mergiraf returns conflicts which are identical to the line-based merge output. This would let us distinguish cases where mergiraf really did something.
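A sketch of what that metric could look like: the conflict-marker parsing is naive and the comparison against the line-based output is a simplification, not how mergiraf counts conflicts internally.

```python
def conflict_mass(merged: str) -> int:
    """Count the lines sitting inside conflict markers (very naive parsing)."""
    mass, inside = 0, False
    for line in merged.splitlines():
        if line.startswith("<<<<<<<"):
            inside = True
        elif line.startswith(">>>>>>>"):
            inside = False
        elif inside:
            mass += 1
    return mass


def compare_to_baseline(structured: str, line_based: str) -> str:
    """Relate mergiraf's output to the line-based merge of the same case."""
    if structured == line_based:
        return "identical to the line-based merge"
    ratio = conflict_mass(structured) / max(conflict_mass(line_based), 1)
    return f"conflict mass ratio: {ratio:.2f}"
```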
That does make sense, I think. But now I'm wondering – how do you recognize the case where Mergiraf simply falls back to line-based merging (because one of the sides doesn't parse)? By analyzing the logs, probably? Because this case and the case you mentioned could theoretically get conflated.
For now I just manually check with `mgf_dev` whether all revisions parse. It's true that in that case the line-based fallback also kicks in, so one would need to make the definitions of the categories clear.

Just for fun, I tried comparing the effect of #522 on benchmark results:
[benchmark comparison table for `*.py` files]
I'm still working on improving the rendering. Another useful piece of info would be the test cases that changed, to check that they went in the right direction (in particular the ones that land in `Differ`).
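Something along these lines could list the cases that changed between two runs, assuming each run is saved as a tab-separated `case<TAB>category` file (a hypothetical format, just for the sketch):

```python
import csv
import sys


def load_run(path: str) -> dict[str, str]:
    """Read a run saved as tab-separated `case<TAB>category` lines."""
    with open(path, newline="") as f:
        return {case: category for case, category in csv.reader(f, delimiter="\t")}


def changed_cases(before_path: str, after_path: str):
    """Yield the cases whose category differs between the two runs."""
    before, after = load_run(before_path), load_run(after_path)
    for case in sorted(before.keys() & after.keys()):
        if before[case] != after[case]:
            yield case, before[case], after[case]


if __name__ == "__main__":
    for case, old, new in changed_cases(sys.argv[1], sys.argv[2]):
        print(f"{case}: {old} -> {new}")
```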
Here are some results on a big-ish dataset, for mergiraf 0.13.0:
[benchmark table for mergiraf 0.13.0, covering `*.java`, `*.xml`, `*.cc`, `*.py`, `*.json`, `*.php`, `*.js`, `*.h`, `*.md`, `*.c`, `*.cpp`, `*.ts`, `*.scala`, `*.go`, `*.hpp`, `*.html`, `*.cs`, `*.rs`, `*.yml`, `*.rb`, `*.kt`, `*.tsx`, `*.toml`, `*.properties`, `*.jsx`, `*.yaml`, `*.mk`, `*.dart`, `*.ini`, `*.lua`, `*.hs`, `*.phtml`, `*.sbt`, `*.htm`, `*.ex`, `*.exs`, `*.hh`, `*.mjs`, `*.hxx` files]