Broader test suite for benchmarks #485

Open
opened 2025-07-09 10:26:58 +02:00 by wetneb · 10 comments
Owner

Our existing set of examples/ is focused on minimal, artificial test cases to check that a specific type of conflict resolution is supported.

For some types of changes, such as tweaks to the matching algorithms, it would be good to have an overview of the impact on real-world usage. During initial development, I've used [the replication test suite of Spork](https://github.com/ASSERT-KTH/spork/tree/master/replication) for this. It's a set of real merge scenarios extracted from merge commits found in the wild. The expected merge file is set to be the one found in the merge commit. With such a dataset, one can run mergiraf on all cases and compare its output to the extracted ones. Then we can categorize those based on whether mergiraf returns:

  • conflicts, in which case its output is almost certainly different from the extracted file (unless merge conflicts were committed into the merge…)
  • a clean merge that's exactly identical to the extracted file. In this case we are obviously very happy.
  • a clean merge that's commutatively isomorphic to the extracted file. Perhaps the formatting isn't perfect, but we are still pretty happy
  • a clean merge that's different from the extracted file. This is the category we want to reduce as much as possible, even though this will include plenty of cases where mergiraf's output is actually fine.
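
For illustration, a possible categorization loop could look something like the sketch below (the case layout, the way mergiraf is invoked and the `is_isomorphic` helper are assumptions on my side, not existing tooling):

```python
import subprocess
from pathlib import Path


def is_isomorphic(merged: str, expected: str) -> bool:
    # Placeholder: would delegate to a structural comparison of the two files,
    # e.g. parsing both and comparing trees up to reordering of commutative children.
    raise NotImplementedError


def categorize(case: Path, ext: str) -> str:
    """Classify one extracted merge scenario into the categories listed above.

    Assumes each case directory contains Base/Left/Right/Expected files and that
    the merged result ends up on stdout (the actual CLI flags may differ).
    """
    result = subprocess.run(
        ["mergiraf", "merge",
         case / f"Base{ext}", case / f"Left{ext}", case / f"Right{ext}"],
        capture_output=True, text=True)
    merged = result.stdout
    expected = (case / f"Expected{ext}").read_text()
    if "<<<<<<<" in merged:
        return "conflicts"
    if merged == expected:
        return "identical"
    if is_isomorphic(merged, expected):
        return "isomorphic"
    return "different"
```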

I would like:

  1. a big dataset containing extracted merge cases, like Spork's replication test suite but for any file types, and only containing the cases where line-based merging produces conflicts
  2. a utility to benchmark mergiraf on this dataset, computing statistics about each of the categories above, broken down by file type. The helpers/suite.sh script is not far from that.
  3. (cherry on the cake) a CI pipeline we can trigger on demand on PRs to produce a summary of the changes to the benchmark statistics before and after the PR. Something that would look like this:
| Glob | Total cases | Conflicts | Identical | Isomorphic | Different |
| --------- | ----------- | --------- | --------- | ---------- | --------- |
| `*.rs` | 1815 | +16 (+1%) | +0 | +0 | -16 (-1%) |
| `go.mod` | … | … | … | … | … |
| … | … | … | … | … | … |
| **Total** | … | … | … | … | … |
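
The per-glob deltas in such a summary could be computed along these lines (sketch; the per-category `Counter` per glob is a hypothetical intermediate format produced by the benchmark utility):

```python
from collections import Counter

CATEGORIES = ["Conflicts", "Identical", "Isomorphic", "Different"]


def delta_row(glob: str, before: Counter, after: Counter) -> str:
    """Render one row of the before/after summary as a markdown table line."""
    total = sum(before.values())
    cells = [f"`{glob}`", str(total)]
    for category in CATEGORIES:
        delta = after[category] - before[category]
        percent = 100 * delta / total if total else 0
        # omit the percentage when nothing changed, as in the example table above
        cells.append(f"{delta:+d}" if delta == 0 else f"{delta:+d} ({percent:+.0f}%)")
    return "| " + " | ".join(cells) + " |"


# Example:
# delta_row("*.rs",
#           Counter(Conflicts=1000, Identical=600, Isomorphic=100, Different=115),
#           Counter(Conflicts=1016, Identical=600, Isomorphic=100, Different=99))
# -> "| `*.rs` | 1815 | +16 (+1%) | +0 | +0 | -16 (-1%) |"
```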

I think this would be really helpful to tweak the matching heuristics (such as for #406 or #325), because I feel quite anxious about making changes to it without this feedback.

I'm working on the dataset collection, which I plan to do via https://www.softwareheritage.org/.

Author
Owner

Other things that would be interesting to track:

  • the proportion of files where we fail to parse one of the source files (useful metric to track when upgrading a parser to a new version)
  • of course, performance metrics. Average time to merge? Or time to merge per kilobytes of input files? Or per bytes of conflicts? Not sure what's the most sensible thing.

Also, it would be good to be able to run the test suite only on certain file types (typically for parser updates or changes to specific language profiles).
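
Filtering could be as simple as matching the case's file name against the requested globs (sketch; the `--only` option name is hypothetical):

```python
from fnmatch import fnmatch


def selected(case_filename: str, only_globs: list[str]) -> bool:
    """Keep a case if it matches any requested glob; keep everything when no glob is given."""
    return not only_globs or any(fnmatch(case_filename, glob) for glob in only_globs)


# selected("Foo.java", ["*.java", "*.xml"])  -> True
# selected("Foo.rs", ["*.java", "*.xml"])    -> False
```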

wetneb self-assigned this 2025-07-16 01:20:02 +02:00
Author
Owner

I'm slowly getting there! Here is a table summarizing a benchmark on a small dataset of real merge scenarios extracted from merge commits:

| Language | Cases | Conflict | Exact | Format | Differ | Parse | Panic | Time (s) |
| -------- | ----- | -------- | ----- | ------ | ------ | ----- | ----- | -------- |
| `*.py` | 815 | 550 (67%) | 123 (15%) | 45 (6%) | 87 (11%) | 7 (1%) | 3 (0%) | 0.187 |
| `*.cpp` | 461 | 156 (34%) | 44 (10%) | 14 (3%) | 36 (8%) | 207 (45%) | 4 (1%) | 0.129 |
| `*.js` | 1514 | 1129 (75%) | 131 (9%) | 52 (3%) | 151 (10%) | 48 (3%) | 3 (0%) | 0.731 |
| `*.java` | 327 | 181 (55%) | 59 (18%) | 18 (6%) | 55 (17%) | 14 (4%) | 0 | 0.131 |
| **Total** | 3117 | 2016 (65%) | 357 (11%) | 129 (4%) | 329 (11%) | 276 (9%) | 10 (0%) | 0.437 |

Here's what the columns mean:

  • Cases: the total number of test cases considered. Line-based merging returns conflicts for all of them (using `git merge-file` with Myers diff)
  • Conflict: mergiraf also returns conflicts
  • Exact: mergiraf returns exactly the merge that is recorded in the merge commit (up to blank lines)
  • Format: mergiraf returns something slightly different, but commutatively isomorphic to the target file from the merge commit
  • Differ: mergiraf returns a conflict-free merge output, that isn't commutatively isomorphic to the target file
  • Parse: one of the source revisions failed to parse with the associated tree-sitter parser (so mergiraf fell back on line-based merging)
  • Panic: mergiraf panicked during the merge process
  • Time: the average duration of the `mergiraf merge` process, in seconds
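
For the Exact column, "up to blank lines" boils down to something like this comparison (simplified sketch, the actual normalization might differ slightly):

```python
def equal_up_to_blank_lines(merged: str, expected: str) -> bool:
    """Compare two files while ignoring empty and whitespace-only lines."""
    def significant_lines(text: str) -> list[str]:
        return [line for line in text.splitlines() if line.strip()]
    return significant_lines(merged) == significant_lines(expected)
```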

So far the panics are all instances of #520, #521 or #333. It's interesting to see that almost half of the `*.cpp` files fail to parse (likely due to pre-processor usage). Also, the `*.js` files take significantly longer - not sure if that's because they are often bigger, or if the parser is inefficient. Of course the numbers aren't so insightful on their own; the real deal will be observing the difference between two versions of mergiraf.

I'm still gathering a wider dataset of test cases (covering all file formats).

Owner

Very impressive!

I think for the "Conflicts" column, it could make sense to weight each merge result by the amount of conflicts created – maybe Mergiraf solves all but one? I guess one could argue for doing the same for "Format" using something like tree-edit-distance, but that might come too close to an attempt to white-wash Mergiraf 😅

Author
Owner

I think for the "Conflicts" column, it could make sense to weight each merge result by the amount of conflicts created – maybe Mergiraf solves all but one?

What sort of weighting are you thinking about?
For me, the number of conflicts solved isn't super important… as long as there are conflicts in the file, I expect that some human will have a look at the merged results to solve the remaining conflicts, and so if there are issues with the resolution of any conflicts, they are likely to get noticed in the same go.
Another issue is that there can be more conflicts than in the line-based file, because they're narrower. So we could also track the conflict mass, but then again I'm not sure how to report that in the table.

Owner

Hm, my idea would be something like `conflict_mass(structured_merge) / conflict_mass(line_based_merge)`, but I guess that's not guaranteed to be favorable to us, given that a fully structured merge can create just _different_ conflicts, which can't really be compared to those created by a line-based merge
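
For concreteness, the conflict_mass I have in mind is just counting the lines that sit inside conflict markers, something like this sketch (one possible definition among others):

```python
def conflict_mass(merged: str) -> int:
    """Count the lines inside conflict markers (markers included)."""
    mass, inside_conflict = 0, False
    for line in merged.splitlines():
        if line.startswith("<<<<<<<"):
            inside_conflict = True
        if inside_conflict:
            mass += 1
        if line.startswith(">>>>>>>"):
            inside_conflict = False
    return mass
```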

Author
Owner

Yes… I guess one category we could still add is one for cases where mergiraf returns conflicts which are identical to the line-based merge output. This would let us distinguish cases where mergiraf really did something.
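
That check could just rerun the line-based merge and compare outputs, roughly like this sketch (assuming both sides use the same conflict marker style):

```python
import subprocess


def same_as_line_based(mergiraf_output: str, left: str, base: str, right: str) -> bool:
    """True if mergiraf's conflicted output is byte-for-byte the line-based merge.

    `git merge-file -p` prints the (possibly conflicted) line-based merge to stdout.
    """
    line_based = subprocess.run(
        ["git", "merge-file", "-p", left, base, right],
        capture_output=True, text=True)
    return mergiraf_output == line_based.stdout
```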

Owner

That does make sense, I think. But now I'm wondering – how do you recognize the case where Mergiraf simply falls back to line-based merging (because one of the sides doesn't parse)? By analyzing the logs, probably? Because this case and the case you mentioned could theoretically get conflated.

Author
Owner

For now I just manually check with `mgf_dev` whether all revisions parse. It's true that in that case the line-based fallback also kicks in, so one would need to make the definitions of the categories clear.

Author
Owner

Just for fun I tried comparing the effect of #522 on benchmark results:

| Language | Cases | Conflict | Exact | Format | Differ | Parse | Panic | Time (s) |
| -------- | ----- | -------- | ----- | ------ | ------ | ----- | ----- | -------- |
| `*.py` | 815 | 546 **(-4)** | 124 **(+1)** | 46 **(+1)** | 89 **(+2)** | 7 | 3 | 0.197 (+0.010) |

I'm still working on improving the rendering. Another useful piece of info would be the test cases that changed, to inspect that they went in the right direction (in particular the ones that land in Differ).
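
A sketch of extracting that, assuming each benchmark run is stored as a mapping from test case to category:

```python
def changed_cases(before: dict[str, str], after: dict[str, str]) -> list[tuple[str, str, str]]:
    """List (case, old category, new category) for every case whose category changed."""
    return sorted(
        (case, before[case], category)
        for case, category in after.items()
        if case in before and before[case] != category
    )


# e.g. the cases that newly land in Differ:
# [case for case, old, new in changed_cases(before, after) if new == "Differ"]
```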

Author
Owner

Here are some results on a big-ish dataset, for mergiraf 0.13.0.

| Language | Cases | Exact | Format | Conflict | Differ | Parse | Panic | Time (s) |
| -------- | ----- | ----- | ------ | -------- | ------ | ----- | ----- | -------- |
| `*.java` | 34,530 | 2,512 (7%) | 1,083 (3%) | 28,312 (82%) | 2,603 (8%) | 18 (0%) | 2 (0%) | 0.121 |
| `*.xml` | 17,491 | 466 (3%) | 38 (0%) | 16,204 (93%) | 447 (3%) | 328 (2%) | 8 (0%) | 0.395 |
| `*.cc` | 15,014 | 708 (5%) | 113 (1%) | 1,671 (11%) | 369 (2%) | 12,148 (81%) | 5 (0%) | 0.156 |
| `*.py` | 10,817 | 1,981 (18%) | 528 (5%) | 6,890 (64%) | 1,373 (13%) | 41 (0%) | 4 (0%) | 0.161 |
| `*.json` | 10,239 | 1,372 (13%) | 334 (3%) | 7,657 (75%) | 813 (8%) | 43 (0%) | 20 (0%) | 0.208 |
| `*.php` | 10,015 | 1,822 (18%) | 488 (5%) | 6,504 (65%) | 1,038 (10%) | 158 (2%) | 5 (0%) | 0.363 |
| `*.js` | 9,795 | 1,323 (14%) | 456 (5%) | 6,349 (65%) | 1,197 (12%) | 454 (5%) | 16 (0%) | 1.288 |
| `*.h` | 8,096 | 493 (6%) | 110 (1%) | 3,079 (38%) | 372 (5%) | 4,037 (50%) | 5 (0%) | 0.053 |
| `*.md` | 5,997 | 412 (7%) | 36 (1%) | 4,911 (82%) | 458 (8%) | 168 (3%) | 12 (0%) | 0.443 |
| `*.c` | 5,841 | 244 (4%) | 56 (1%) | 1,449 (25%) | 217 (4%) | 3,846 (66%) | 29 (0%) | 0.151 |
| `*.cpp` | 4,671 | 290 (6%) | 89 (2%) | 2,271 (49%) | 153 (3%) | 1,846 (40%) | 22 (0%) | 0.276 |
| `*.ts` | 4,458 | 908 (20%) | 291 (7%) | 2,474 (55%) | 664 (15%) | 117 (3%) | 4 (0%) | 0.264 |
| `*.scala` | 3,633 | 935 (26%) | 405 (11%) | 1,166 (32%) | 378 (10%) | 747 (21%) | 2 (0%) | 0.078 |
| `*.go` | 3,333 | 549 (16%) | 156 (5%) | 2,128 (64%) | 489 (15%) | 11 (0%) | 0 | 0.247 |
| `*.hpp` | 2,212 | 30 (1%) | 8 (0%) | 1,624 (73%) | 30 (1%) | 514 (23%) | 6 (0%) | 0.021 |
| `*.html` | 2,157 | 170 (8%) | 36 (2%) | 1,346 (62%) | 184 (9%) | 420 (19%) | 1 (0%) | 0.438 |
| `*.cs` | 1,651 | 182 (11%) | 178 (11%) | 981 (59%) | 174 (11%) | 130 (8%) | 6 (0%) | 0.127 |
| `*.rs` | 1,383 | 297 (21%) | 95 (7%) | 720 (52%) | 258 (19%) | 13 (1%) | 0 | 0.811 |
| `*.yml` | 1,256 | 190 (15%) | 62 (5%) | 828 (66%) | 148 (12%) | 28 (2%) | 0 | 0.101 |
| `*.rb` | 898 | 129 (14%) | 43 (5%) | 619 (69%) | 97 (11%) | 10 (1%) | 0 | 0.067 |
| `*.kt` | 622 | 164 (26%) | 28 (5%) | 320 (51%) | 79 (13%) | 31 (5%) | 0 | 0.079 |
| `*.tsx` | 602 | 171 (28%) | 92 (15%) | 218 (36%) | 114 (19%) | 7 (1%) | 0 | 0.119 |
| `*.toml` | 597 | 180 (30%) | 25 (4%) | 288 (48%) | 104 (17%) | 0 | 0 | 0.023 |
| `*.properties` | 491 | 34 (7%) | 10 (2%) | 386 (79%) | 61 (12%) | 0 | 0 | 0.094 |
| `*.jsx` | 391 | 73 (19%) | 11 (3%) | 181 (46%) | 50 (13%) | 76 (19%) | 0 | 0.086 |
| `*.yaml` | 334 | 45 (13%) | 8 (2%) | 237 (71%) | 23 (7%) | 21 (6%) | 0 | 1.103 |
| `*.mk` | 211 | 16 (8%) | 7 (3%) | 133 (63%) | 15 (7%) | 40 (19%) | 0 | 0.007 |
| `*.dart` | 185 | 28 (15%) | 21 (11%) | 50 (27%) | 83 (45%) | 3 (2%) | 0 | 0.039 |
| `*.ini` | 125 | 4 (3%) | 2 (2%) | 60 (48%) | 3 (2%) | 56 (45%) | 0 | 0.001 |
| `*.lua` | 39 | 5 (13%) | 0 | 26 (67%) | 8 (21%) | 0 | 0 | 0.092 |
| `*.hs` | 27 | 1 (4%) | 1 (4%) | 23 (85%) | 1 (4%) | 1 (4%) | 0 | 0.210 |
| `*.phtml` | 20 | 6 (30%) | 3 (15%) | 3 (15%) | 5 (25%) | 3 (15%) | 0 | 0.045 |
| `*.sbt` | 13 | 3 (23%) | 0 | 9 (69%) | 1 (8%) | 0 | 0 | 0.022 |
| `*.htm` | 11 | 0 | 0 | 7 (64%) | 0 | 4 (36%) | 0 | 0.315 |
| `*.ex` | 7 | 3 (43%) | 0 | 1 (14%) | 3 (43%) | 0 | 0 | 0.120 |
| `*.exs` | 7 | 1 (14%) | 0 | 4 (57%) | 2 (29%) | 0 | 0 | 0.139 |
| `*.hh` | 4 | 0 | 0 | 3 (75%) | 1 (25%) | 0 | 0 | 0.050 |
| `*.mjs` | 2 | 1 (50%) | 1 (50%) | 0 | 0 | 0 | 0 | 0.220 |
| `*.hxx` | 2 | 0 | 0 | 1 (50%) | 0 | 1 (50%) | 0 | 1.745 |
| **Total** | 157,177 | 15,748 (10%) | 4,814 (3%) | 99,133 (63%) | 12,015 (8%) | 25,320 (16%) | 147 (0%) | 0.281 |