WIP: tests: Utilities to benchmark mergiraf on large corpora #537

Draft
wetneb wants to merge 19 commits from benchmark_utils into main
Owner

For #485.

This introduces two benchmark utilities:

  • `helpers/benchmark.sh`, which runs the benchmark itself: it runs mergiraf on each test case contained in a directory and writes the outcome of each test case as a row of a TSV file
  • `helpers/summarize_benchmark.py`, which analyzes such a TSV file to show human-readable statistics (a rough sketch of this step is shown below).
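
To give a rough idea of what the summarization step involves, here is a minimal, hypothetical sketch (not the actual implementation; the column names come from the TSV header shown in the example output below):

```python
#!/usr/bin/env python3
"""Minimal sketch: tally benchmark outcomes per language from a TSV log."""
import csv
import sys
from collections import Counter, defaultdict

# one Counter of statuses (Exact, Conflict, ...) per language pattern (*.sbt, ...)
per_language: defaultdict[str, Counter] = defaultdict(Counter)

with open(sys.argv[1], newline="") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        # columns as written by benchmark.sh: status, timing, language, case
        per_language[row["language"]][row["status"]] += 1

for language, counts in sorted(per_language.items()):
    print(language, sum(counts.values()), dict(counts))
```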

I first started writing this benchmarking code in Rust, but decided against it because we shouldn't run many test cases in the same process. That's due to the way the `--timeout` feature is implemented: the source revision files are `.leak()`-ed at the beginning of the merge process, and the threads doing the merging attempts aren't stopped when the timeout is reached. Those threads would therefore keep running after we've moved on to other test cases, and memory usage would keep growing because of the `leak()` calls.

As for `summarize_benchmark.py`, it could well be implemented in Rust, but it doesn't share any code with mergiraf, so I'm not sure it's worth it. I find a scripting language easier to work with for something that should be easy to tweak and iterate on, offering different views on the benchmark results depending on the needs.

I'm marking this as WIP because I continue to improve the scripts as I run benchmarks, but they're already in a state that can be shared. I would also like to learn about type annotations in Python and apply them more consistently.

Here are some examples of the outputs of the two scripts:

### Running a benchmark

```shell
$ helpers/benchmark.sh target/release/mergiraf /path/to/test/suite/ | tee first_benchmark_log.tsv
status	timing	language	case
Exact	0.00	*.sbt	/path/to/test/suite/sbt/akka/akka/78597ed7c1/fd94c094cc
Conflict	0.01	*.sbt	/path/to/test/suite/sbt/akka/akka/8ab02738b7/fd94c094cc
Exact	0.05	*.sbt	/path/to/test/suite/sbt/scala-js/scala-js/60f50d2844/667e2e4fb6
Conflict	0.02	*.sbt	/path/to/test/suite/sbt/scala-js/scala-js/24361a4d68/ca0ddd12e2
Conflict	0.01	*.sbt	/path/to/test/suite/sbt/scala-js/scala-js/9a0ad57b5a/ca0ddd12e2
Conflict	0.03	*.sbt	/path/to/test/suite/sbt/scala-js/scala-js/9a0ad57b5a/667e2e4fb6
Conflict	0.01	*.sbt	/path/to/test/suite/sbt/scala-js/scala-js/9e912e6eef/ca0ddd12e2
Conflict	0.06	*.sbt	/path/to/test/suite/sbt/scala-js/scala-js/a9e4ee4179/667e2e4fb6
Differ	0.00	*.sbt	/path/to/test/suite/sbt/scala-js/scala-js/d54cb2cf28/ca0ddd12e2
Conflict	0.01	*.sbt	/path/to/test/suite/sbt/scala-js/scala-js/59f6e495d3/ca0ddd12e2
...
```

### Summarizing one benchmark

```shell
$ helpers/summarize_benchmark.py first_benchmark_log.tsv
```

gives a table

| Language | Cases | Exact | Format | Conflict | Differ | Parse | Panic | Time (s) |
| -------- | ----- | ----- | ------ | -------- | ------ | ----- | ----- | -------- |
| `*.sbt` | 13 | 3 (23%) | 0 | 9 (69%) | 1 (8%) | 0 | 0 | 0.022 |
| `*.htm` | 11 | 0 | 0 | 7 (64%) | 0 | 4 (36%) | 0 | 0.315 |
| ... | | | | | | | | |
| **Total** | 157,177 | 15,748 (10%) | 4,814 (3%) | 99,133 (63%) | 12,015 (8%) | 25,320 (16%) | 147 (0%) | 0.281 |

### Comparing two benchmarks

```shell
$ helpers/summarize_benchmark.py first_benchmark_log.tsv second_benchmark_log.tsv
```

Gives a table comparing category statistics:

| Language | Cases | Exact | Format | Conflict | Differ | Parse | Panic | Time (s) |
| -------- | ----- | ----- | ------ | -------- | ------ | ----- | ----- | -------- |
| `*.sbt` | 13 | +0 | +0 | +0 | +0 | +0 | +0 | -0.007 (-32.1%) |
| `*.htm` | 11 | +0 | +0 | +0 | +0 | +0 | +0 | -0.023 (-7.2%) |
| ... | | | | | | | | |
| **Total** | 133,555 | +73 **(+0.1%)** | +22 **(+0.0%)** | -199 **(-0.1%)** | +105 **(+0.1%)** | -2 **(-0.0%)** | +1 **(+0.0%)** | -0.063 (-27.6%) |

and also a sort of "confusion matrix", showing from which status to which status test cases changed:

| ↓ Before \ After → | Exact | Format | Conflict | Differ | Parse | Panic |
| ------------------ | ----- | ------ | -------- | ------ | ----- | ----- |
| Exact | 12,316 | **17** | **28** | **6** | | |
| Format | **14** | 3,473 | **11** | **1** | | |
| Conflict | **100** | **25** | 87,625 | **174** | | **1** |
| Differ | **8** | **6** | **62** | 9,543 | | |
| Parse | **2** | | | | 20,051 | |
| Panic | | | | | | 92 |

In this table, the categories are intuitively ordered from "best to worst". This means that we are happy to see numbers below the diagonal (as it means they changed from one category to a better one, such as Conflict → Exact) and want to avoid numbers above the diagonal (such as Conflict → Panic).
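
For reference, a transition matrix like this can be computed by pairing the cases of the two TSV logs by path and counting status changes; here is a minimal, hypothetical sketch of that idea (not the actual script):

```python
import csv
from collections import Counter

def read_statuses(path: str) -> dict[str, str]:
    """Map each test case path to the status it got in that benchmark run."""
    with open(path, newline="") as f:
        return {row["case"]: row["status"] for row in csv.DictReader(f, delimiter="\t")}

before = read_statuses("first_benchmark_log.tsv")
after = read_statuses("second_benchmark_log.tsv")

# count (old status, new status) pairs for the cases present in both runs
transitions = Counter(
    (before[case], after[case]) for case in before.keys() & after.keys()
)
for (old, new), count in sorted(transitions.items()):
    print(f"{old} -> {new}: {count}")
```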

Finally, it also produces a list of the test cases which moved from one category to a "worse" one:

```
### Format → Differ
/path/to/test/suite/go/uber/cadence/31e89e11c2/fad5af1dbc

### Conflict → Differ
/path/to/test/suite/php/phpmyadmin/phpmyadmin/13890a947a/b08e478474
/path/to/test/suite/py/achael/eht-imaging/6cd766248b/835c98abbd
/path/to/test/suite/rs/wasmerio/wasmer/974fa2b699/255a9980b1
/path/to/test/suite/php/phpmyadmin/phpmyadmin/13890a947a/6b50be9206
...
```
tests: Utilities to benchmark mergiraf on large corpora
ada4a left a comment
Owner

A cursory overview, since both of these languages are.. a bit scary to me

@ -0,0 +1,79 @@
#!/usr/bin/env bash
Owner

just an aside -- do you think we should run shellcheck on our shell scripts? I've heard it's pretty helpful, especially with the many footguns that Bash has

Author
Owner

I don't know about shellcheck, that sounds like a good idea!

Owner

Could you please add it to the docker image?

Author
Owner

I think it's easier if we first install it manually in the PR workflow, as updating the Docker image is a bit cumbersome. Then we can batch up things we want to add to it and do it in a single update (newer Rust version, git-cliff, …)

@ -0,0 +67,4 @@
else
outcome="Parse"
fi
elif cargo run --release -p mgf_dev -- compare --commutative $tmp_dir/our_merge$ext $tmp_dir/expected_merge$ext > /dev/null 2>&1; then
Owner

I wanted to suggest using the cargo shortcut, but I see that you probably avoid it because you need to pass `--release` -- maybe we should just add that flag to the shortcut?
ada4a marked this conversation as resolved
@ -0,0 +52,4 @@
self.timing = TimingStats()
self.states = defaultdict(int)
def add(self, case):
Owner

nit: `case` is a keyword in Python I think (for match arms), so the highlighting is confused -- maybe call this `testcase`?.. Although that wouldn't be consistent with `cases` that we use elsewhere

Also, I think its type would be `dict[str, float]`
Owner

although looking elsewhere in the code, what-would-be-keywords are syntax-highlighted even when inside docstrings, so it's probably just the syntax highlighter being trigger-happy and the actual Python isn't really getting confused about the naming of `case`
ada4a marked this conversation as resolved
@ -0,0 +53,4 @@
self.states = defaultdict(int)
def add(self, case):
timing = float(case['timing'])
Owner

I don't think `TimingStats` requires `timing` to be a `float`? You should be able to add the type hint of `timing: int` to `TimingStats::add`, and see if that checks out
Author
Owner

I am confused… Timings are currently measured in seconds, so I do want them to be floating-point numbers and not integers, right?

Owner

Ah, so `case['timing']` is initially a string here, and you use `float` to parse it? In that case, you're right of course
ada4a marked this conversation as resolved
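
Putting the two threads above together, here is a self-contained sketch of how the accumulator could be annotated. The class layout below is hypothetical and only mirrors the diff excerpts; in particular, `case` is annotated as `dict[str, str]` because `csv.DictReader` yields strings:

```python
from collections import defaultdict


class TimingStats:
    """Hypothetical running aggregate of timings (sketch only)."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def add(self, timing: float) -> None:
        self.total += timing
        self.count += 1


class LanguageStats:
    """Hypothetical per-language accumulator mirroring the diff excerpts."""

    def __init__(self) -> None:
        self.timing = TimingStats()
        self.states: defaultdict[str, int] = defaultdict(int)

    def add(self, case: dict[str, str]) -> None:
        # csv.DictReader yields every field as a string, hence the float() parse
        timing = float(case["timing"])
        self.timing.add(timing)
        self.states[case["status"]] += 1
```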
@ -0,0 +55,4 @@
def add(self, case):
timing = float(case['timing'])
self.timing.add(timing)
status = case['status']
Owner

Having these as plain strings is a bit iffy -- it's all too easy to make a typo. One option could be to not have `case` be a dict of course, but short of that, having a const for each field name would also work. To make a const in Python, all you need is to call your variable in SCREAMING_CASE
ada4a marked this conversation as resolved
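
A minimal sketch of that suggestion, with hypothetical constant names (not the script's actual code):

```python
# Hypothetical module-level constants for the TSV column names, so that a
# misspelled field name is flagged by linters as an undefined name instead of
# only surfacing as a KeyError when that row is processed.
STATUS = "status"
TIMING = "timing"
LANGUAGE = "language"
CASE = "case"

row = {"status": "Exact", "timing": "0.05", "language": "*.sbt", "case": "/path/to/case"}
timing = float(row[TIMING])
status = row[STATUS]
print(status, timing)
```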
sh: use usr/bin/env time to accomodate Nix
c161a47f2d
Author
Owner

Thanks for the review!

One issue I have noticed is that by doing the parsing checks via `mgf_dev`, we make it more difficult to ensure that both the binary being tested and `mgf_dev` use the same parser. Ideally, we'd detect the parsing failure from mergiraf's output already. It would also avoid parsing things twice, so it would be a bit more efficient.
ada4a force-pushed benchmark_utils from d12b24b2b5 to bee3983591 2025-07-31 20:40:46 +02:00
benchmark: use cargo compare
3c7a831a66
summarize: give case a type
1b1485691f
summarize: use csv.DictReader
f2aa3b8b8d
Owner

it would be nice not to rerun the whole library test suite when the PR only touches non-Rust files -- I think [this](https://forgejo.org/docs/latest/user/actions/reference/#jobsjob_idif) is the way to achieve that?

Speaking of tests, it would be nice to have one (or two, for having passed either one or two arguments) for `summarize_benchmark.py` -- just to see that nothing silly (formatting changes, erroring out) is happening
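
For illustration, such a smoke test could look roughly like this (a hedged pytest sketch, assuming it runs from the repository root and that the script prints the language column; the expectations are hypothetical):

```python
import subprocess
from pathlib import Path


def test_summarize_single_log(tmp_path: Path) -> None:
    """Smoke test: the script accepts a tiny TSV without erroring out."""
    log = tmp_path / "log.tsv"
    log.write_text(
        "status\ttiming\tlanguage\tcase\n"
        "Exact\t0.01\t*.rs\t/suite/rs/example/aaaa/bbbb\n"
        "Conflict\t0.02\t*.rs\t/suite/rs/example/cccc/dddd\n"
    )
    result = subprocess.run(
        ["helpers/summarize_benchmark.py", str(log)],
        capture_output=True,
        text=True,
        check=True,  # fails the test if the script errors out
    )
    # hypothetical expectation: the summary mentions the language pattern
    assert "*.rs" in result.stdout
```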
summarize: make clear that status_order is a constant
b3ce89e146
ada4a referenced this pull request from a commit 2025-08-14 17:21:55 +02:00