WIP: tests: Utilities to benchmark mergiraf on large corpora #537

Draft
wetneb wants to merge 19 commits from benchmark_utils into main
Owner

For #485.

This introduces two benchmark utilities:

  • `helpers/benchmark.sh`, which runs the benchmark itself: it runs mergiraf on each test case contained in a directory and writes the outcome of each test case as a row of a TSV file
  • `helpers/summarize_benchmark.py`, which analyzes such a TSV file to show human-readable statistics (a rough sketch of this step is shown below).
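
To give a rough idea of what the summarization step involves, here is a minimal, hypothetical sketch (not the actual implementation; the column names come from the TSV header shown in the example output below):

```python
#!/usr/bin/env python3
"""Minimal sketch: tally benchmark outcomes per language from a TSV log."""
import csv
import sys
from collections import Counter, defaultdict

# one Counter of statuses (Exact, Conflict, ...) per language pattern (*.sbt, ...)
per_language: defaultdict[str, Counter] = defaultdict(Counter)

with open(sys.argv[1], newline="") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        # columns as written by benchmark.sh: status, timing, language, case
        per_language[row["language"]][row["status"]] += 1

for language, counts in sorted(per_language.items()):
    print(language, sum(counts.values()), dict(counts))
```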

I first started writing this benchmarking code in Rust, but decided against it because we shouldn't run many test cases in the same process. That's due to the way the `--timeout` feature is implemented: the source revision files are `.leak()`-ed at the beginning of the merge process, and the threads doing the merging attempts aren't stopped when the timeout is reached. Those threads would therefore keep running after we've moved on to other test cases, and memory usage would keep growing because of the `leak()` calls.

As for `summarize_benchmark.py`, it could well be implemented in Rust, but it doesn't share any code with mergiraf, so I'm not sure it's worth it. I find a scripting language easier to work with for something that should be easy to tweak and iterate on, offering different views on the benchmark results depending on the needs.

I'm marking this as WIP because I continue to improve the scripts as I run benchmarks, but they're already in a state that can be shared. I would also like to learn about type annotations in Python and apply them more consistently.

Here are some examples of the outputs of the two scripts:

### Running a benchmark

```shell
$ helpers/benchmark.sh target/release/mergiraf /path/to/test/suite/ | tee first_benchmark_log.tsv
status	timing	language	case
Exact	0.00	*.sbt	/path/to/test/suite/sbt/akka/akka/78597ed7c1/fd94c094cc
Conflict	0.01	*.sbt	/path/to/test/suite/sbt/akka/akka/8ab02738b7/fd94c094cc
Exact	0.05	*.sbt	/path/to/test/suite/sbt/scala-js/scala-js/60f50d2844/667e2e4fb6
Conflict	0.02	*.sbt	/path/to/test/suite/sbt/scala-js/scala-js/24361a4d68/ca0ddd12e2
Conflict	0.01	*.sbt	/path/to/test/suite/sbt/scala-js/scala-js/9a0ad57b5a/ca0ddd12e2
Conflict	0.03	*.sbt	/path/to/test/suite/sbt/scala-js/scala-js/9a0ad57b5a/667e2e4fb6
Conflict	0.01	*.sbt	/path/to/test/suite/sbt/scala-js/scala-js/9e912e6eef/ca0ddd12e2
Conflict	0.06	*.sbt	/path/to/test/suite/sbt/scala-js/scala-js/a9e4ee4179/667e2e4fb6
Differ	0.00	*.sbt	/path/to/test/suite/sbt/scala-js/scala-js/d54cb2cf28/ca0ddd12e2
Conflict	0.01	*.sbt	/path/to/test/suite/sbt/scala-js/scala-js/59f6e495d3/ca0ddd12e2
...
```

### Summarizing one benchmark

```shell
$ helpers/summarize_benchmark.py first_benchmark_log.tsv
```

gives a table

| Language | Cases | Exact | Format | Conflict | Differ | Parse | Panic | Time (s) |
| -------- | ----- | ----- | ------ | -------- | ------ | ----- | ----- | -------- |
| `*.sbt` | 13 | 3 (23%) | 0 | 9 (69%) | 1 (8%) | 0 | 0 | 0.022 |
| `*.htm` | 11 | 0 | 0 | 7 (64%) | 0 | 4 (36%) | 0 | 0.315 |
| ... | | | | | | | | |
| **Total** | 157,177 | 15,748 (10%) | 4,814 (3%) | 99,133 (63%) | 12,015 (8%) | 25,320 (16%) | 147 (0%) | 0.281 |

### Comparing two benchmarks

```shell
$ helpers/summarize_benchmark.py first_benchmark_log.tsv second_benchmark_log.tsv
```

Gives a table comparing category statistics:

| Language | Cases | Exact | Format | Conflict | Differ | Parse | Panic | Time (s) |
| -------- | ----- | ----- | ------ | -------- | ------ | ----- | ----- | -------- |
| `*.sbt` | 13 | +0 | +0 | +0 | +0 | +0 | +0 | -0.007 (-32.1%) |
| `*.htm` | 11 | +0 | +0 | +0 | +0 | +0 | +0 | -0.023 (-7.2%) |
| ... | | | | | | | | |
| **Total** | 133,555 | +73 **(+0.1%)** | +22 **(+0.0%)** | -199 **(-0.1%)** | +105 **(+0.1%)** | -2 **(-0.0%)** | +1 **(+0.0%)** | -0.063 (-27.6%) |

and also a sort of "confusion matrix", showing from which status to which status test cases changed:

| ↓ Before \ After → | Exact | Format | Conflict | Differ | Parse | Panic |
| ------------------ | ----- | ------ | -------- | ------ | ----- | ----- |
| Exact | 12,316 | **17** | **28** | **6** | | |
| Format | **14** | 3,473 | **11** | **1** | | |
| Conflict | **100** | **25** | 87,625 | **174** | | **1** |
| Differ | **8** | **6** | **62** | 9,543 | | |
| Parse | **2** | | | | 20,051 | |
| Panic | | | | | | 92 |

In this table, the categories are intuitively ordered from "best to worst". This means that we are happy to see numbers below the diagonal (as it means they changed from one category to a better one, such as Conflict → Exact) and want to avoid numbers above the diagonal (such as Conflict → Panic).
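
For reference, a transition matrix like this can be computed by pairing the cases of the two TSV logs by path and counting status changes; here is a minimal, hypothetical sketch of that idea (not the actual script):

```python
import csv
from collections import Counter

def read_statuses(path: str) -> dict[str, str]:
    """Map each test case path to the status it got in that benchmark run."""
    with open(path, newline="") as f:
        return {row["case"]: row["status"] for row in csv.DictReader(f, delimiter="\t")}

before = read_statuses("first_benchmark_log.tsv")
after = read_statuses("second_benchmark_log.tsv")

# count (old status, new status) pairs for the cases present in both runs
transitions = Counter(
    (before[case], after[case]) for case in before.keys() & after.keys()
)
for (old, new), count in sorted(transitions.items()):
    print(f"{old} -> {new}: {count}")
```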

Finally, it also produces a list of the test cases which moved from one category to a "worse" one:

```
### Format → Differ
/path/to/test/suite/go/uber/cadence/31e89e11c2/fad5af1dbc

### Conflict → Differ
/path/to/test/suite/php/phpmyadmin/phpmyadmin/13890a947a/b08e478474
/path/to/test/suite/py/achael/eht-imaging/6cd766248b/835c98abbd
/path/to/test/suite/rs/wasmerio/wasmer/974fa2b699/255a9980b1
/path/to/test/suite/php/phpmyadmin/phpmyadmin/13890a947a/6b50be9206
...
```
tests: Utilities to benchmark mergiraf on large corpora
ada4a left a comment
Owner

A cursory overview, since both of these languages are.. a bit scary to me

@ -0,0 +1,79 @@
#!/usr/bin/env bash
Owner

just an aside -- do you think we should run shellcheck on our shell scripts? I've heard it's pretty helpful, especially with the many footguns that Bash has

Author
Owner

I don't know about shellcheck, that sounds like a good idea!

Owner

Could you please add it to the docker image?

Author
Owner

I think it's easier if we first install it manually in the PR workflow, as updating the Docker image is a bit cumbersome. Then we can batch up things we want to add to it and do it in a single update (newer Rust version, git-cliff, …)

@ -0,0 +67,4 @@
else
outcome="Parse"
fi
elif cargo run --release -p mgf_dev -- compare --commutative $tmp_dir/our_merge$ext $tmp_dir/expected_merge$ext > /dev/null 2>&1; then
Owner

I wanted to suggest using the cargo shortcut, but I see that you probably avoid it because you need to pass `--release` -- maybe we should just add that flag to the shortcut?
ada4a marked this conversation as resolved
@ -0,0 +52,4 @@
self.timing = TimingStats()
self.states = defaultdict(int)
def add(self, case):
Owner

nit: `case` is a keyword in Python I think (for match arms), so the highlighting is confused -- maybe call this `testcase`?.. Although that wouldn't be consistent with `cases` that we use elsewhere

Also, I think its type would be `dict[str, float]`
Owner

although looking elsewhere in the code, what-would-be-keywords are syntax-highlighted even when inside docstrings, so it's probably just the syntax highlighter being trigger-happy and the actual Python isn't really getting confused about the naming of `case`
ada4a marked this conversation as resolved
@ -0,0 +53,4 @@
self.states = defaultdict(int)
def add(self, case):
timing = float(case['timing'])
Owner

I don't think `TimingStats` requires `timing` to be a `float`? You should be able to add the type hint of `timing: int` to `TimingStats::add`, and see if that checks out
Author
Owner

I am confused… Timings are currently measured in seconds, so I do want them to be floating-point numbers and not integers, right?

Owner

Ah, so `case['timing']` is initially a string here, and you use `float` to parse it? In that case, you're right of course
ada4a marked this conversation as resolved
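
Putting the two threads above together, here is a self-contained sketch of how the accumulator could be annotated. The class layout below is hypothetical and only mirrors the diff excerpts; in particular, `case` is annotated as `dict[str, str]` because `csv.DictReader` yields strings:

```python
from collections import defaultdict


class TimingStats:
    """Hypothetical running aggregate of timings (sketch only)."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def add(self, timing: float) -> None:
        self.total += timing
        self.count += 1


class LanguageStats:
    """Hypothetical per-language accumulator mirroring the diff excerpts."""

    def __init__(self) -> None:
        self.timing = TimingStats()
        self.states: defaultdict[str, int] = defaultdict(int)

    def add(self, case: dict[str, str]) -> None:
        # csv.DictReader yields every field as a string, hence the float() parse
        timing = float(case["timing"])
        self.timing.add(timing)
        self.states[case["status"]] += 1
```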
@ -0,0 +55,4 @@
def add(self, case):
timing = float(case['timing'])
self.timing.add(timing)
status = case['status']
Owner

Having these as plain strings is a bit iffy -- it's all too easy to make a typo. One option could be to not have `case` be a dict of course, but short of that, having a const for each field name would also work. To make a const in Python, all you need is to call your variable in SCREAMING_CASE
ada4a marked this conversation as resolved
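
A minimal sketch of that suggestion, with hypothetical constant names (not the script's actual code):

```python
# Hypothetical module-level constants for the TSV column names, so that a
# misspelled field name is flagged by linters as an undefined name instead of
# only surfacing as a KeyError when that row is processed.
STATUS = "status"
TIMING = "timing"
LANGUAGE = "language"
CASE = "case"

row = {"status": "Exact", "timing": "0.05", "language": "*.sbt", "case": "/path/to/case"}
timing = float(row[TIMING])
status = row[STATUS]
print(status, timing)
```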
sh: use usr/bin/env time to accomodate Nix
c161a47f2d
Author
Owner

Thanks for the review!

One issue I have noticed is that by doing the parsing checks via `mgf_dev`, we make it more difficult to ensure that both the binary being tested and `mgf_dev` use the same parser. Ideally, we'd detect the parsing failure from mergiraf's output already. It would also avoid parsing things twice, so it would be a bit more efficient.
ada4a force-pushed benchmark_utils from d12b24b2b5 to bee3983591 2025-07-31 20:40:46 +02:00
benchmark: use cargo compare
3c7a831a66
summarize: give case a type
1b1485691f
summarize: use csv.DictReader
f2aa3b8b8d
Owner

it would be nice not to rerun the whole library test suite when the PR only touches non-Rust files -- I think [this](https://forgejo.org/docs/latest/user/actions/reference/#jobsjob_idif) is the way to achieve that?

Speaking of tests, it would be nice to have one (or two, for having passed either one or two arguments) for `summarize_benchmark.py` -- just to see that nothing silly (formatting changes, erroring out) is happening
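
For illustration, such a smoke test could look roughly like this (a hedged pytest sketch, assuming it runs from the repository root and that the script prints the language column; the expectations are hypothetical):

```python
import subprocess
from pathlib import Path


def test_summarize_single_log(tmp_path: Path) -> None:
    """Smoke test: the script accepts a tiny TSV without erroring out."""
    log = tmp_path / "log.tsv"
    log.write_text(
        "status\ttiming\tlanguage\tcase\n"
        "Exact\t0.01\t*.rs\t/suite/rs/example/aaaa/bbbb\n"
        "Conflict\t0.02\t*.rs\t/suite/rs/example/cccc/dddd\n"
    )
    result = subprocess.run(
        ["helpers/summarize_benchmark.py", str(log)],
        capture_output=True,
        text=True,
        check=True,  # fails the test if the script errors out
    )
    # hypothetical expectation: the summary mentions the language pattern
    assert "*.rs" in result.stdout
```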
summarize: make clear that status_order is a constant
b3ce89e146
ada4a referenced this pull request from a commit 2025-08-14 17:21:55 +02:00