WIP: tests: Utilities to benchmark mergiraf on large corpora #537
For #485.

This introduces two benchmark utilities:

- `helpers/benchmark.sh`, which runs the benchmark proper: it runs mergiraf on each test case contained in a directory and writes the outcome of each test as a row of a TSV file;
- `helpers/summarize_benchmark.py`, which analyzes such a TSV file to show human-readable statistics.

I first started writing this benchmarking code in Rust, but I decided against it, because we shouldn't run many test cases in the same process. That's due to the way we implemented the `--timeout` feature: the source revision files are `.leak()`-ed at the beginning of the merge process, and the threads doing the merging attempts aren't stopped after reaching the timeout. That means those threads would continue running after we've moved on to other test cases, and we'd keep accumulating memory due to the `.leak()` calls.

As for `summarize_benchmark.py`, it could well be implemented in Rust, but it doesn't share any code with mergiraf, so I'm not sure it's worth it. I find a scripting language a bit easier to work with for something that should be easy to tweak and iterate on, offering different views on the benchmark results depending on the needs.

I'm marking this as WIP because I keep improving the scripts as I run benchmarks, but they're already in a state that can be shared. I would also like to learn about type annotations in Python and apply them more consistently.
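To make the data flow concrete, here is a rough sketch of the summarizing side. This is not the actual `helpers/summarize_benchmark.py`: the `status` and `timing` column names are taken from the review discussion below, and everything else is an assumption.

```python
#!/usr/bin/env python3
"""Rough sketch only, not the actual summarize_benchmark.py.

Reads the TSV written by benchmark.sh and prints per-status counts,
assuming `status` and `timing` columns (the real schema may differ).
"""
import csv
import sys
from collections import defaultdict


def summarize(tsv_path: str) -> None:
    counts: dict[str, int] = defaultdict(int)
    total_time = 0.0
    rows = 0
    with open(tsv_path, newline="") as f:
        for case in csv.DictReader(f, delimiter="\t"):
            counts[case["status"]] += 1          # e.g. "Exact", "Conflict", "Parse", "Panic"
            total_time += float(case["timing"])  # timings are seconds, stored as strings
            rows += 1
    if not rows:
        print("no test cases found")
        return
    for status, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(f"{status}\t{n}\t{n / rows:.1%}")
    print(f"total\t{rows}\taverage {total_time / rows:.3f}s")


if __name__ == "__main__":
    summarize(sys.argv[1])
```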
Here are some examples of the outputs of the two scripts:
Running a benchmark:

(output omitted)

Summarizing one benchmark gives a table:

(tables for `*.sbt` and `*.htm` omitted)

Comparing two benchmarks gives a table comparing category statistics:

(tables for `*.sbt` and `*.htm` omitted)
It also produces a sort of "confusion matrix", showing from which status to which status test cases changed:

(table omitted)

In this table, the categories are intuitively ordered from best to worst. This means that we are happy to see numbers below the diagonal (they mean a test case moved from one category to a better one, such as `Conflict → Exact`) and want to avoid numbers above the diagonal (such as `Conflict → Panic`).

Finally, it also produces a list of test cases which moved to a "worse" category:

(list omitted)
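As a rough illustration of the comparison logic described above (this is not the actual script: the `case` column name and the exact `STATUS_ORDER` list are assumptions, using only the statuses mentioned in this PR):

```python
#!/usr/bin/env python3
"""Sketch only: build the status "confusion matrix" between two benchmark runs.

Assumes each TSV has a `case` column identifying the test case and a `status`
column; the real summarize_benchmark.py may differ.
"""
import csv
import sys
from collections import Counter

# Hypothetical ordering from "best" to "worst"; the real status order may differ.
STATUS_ORDER = ["Exact", "Conflict", "Parse", "Panic"]


def load(tsv_path: str) -> dict[str, str]:
    with open(tsv_path, newline="") as f:
        return {row["case"]: row["status"] for row in csv.DictReader(f, delimiter="\t")}


def compare(old_path: str, new_path: str) -> None:
    old, new = load(old_path), load(new_path)
    common = old.keys() & new.keys()
    matrix = Counter((old[c], new[c]) for c in common)

    # Rows are the old status, columns the new one: cells below the diagonal
    # are improvements (e.g. Conflict → Exact), cells above it are regressions.
    print("old\\new\t" + "\t".join(STATUS_ORDER))
    for row_status in STATUS_ORDER:
        cells = "\t".join(str(matrix[(row_status, col)]) for col in STATUS_ORDER)
        print(f"{row_status}\t{cells}")

    # List the test cases that moved to a "worse" category.
    for case in sorted(common):
        if STATUS_ORDER.index(new[case]) > STATUS_ORDER.index(old[case]):
            print(f"worse: {case}: {old[case]} → {new[case]}")


if __name__ == "__main__":
    compare(sys.argv[1], sys.argv[2])
```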
A cursory overview, since both of these languages are… a bit scary to me
@@ -0,0 +1,79 @@
#!/usr/bin/env bash
just an aside -- do you think we should run shellcheck on our shell scripts? I've heard it's pretty helpful, especially with the many footguns that Bash has
I don't know about shellcheck -- that sounds like a good idea!
Could you please add it to the docker image?
I think it's easier if we first install it manually in the PR workflow, as updating the Docker image is a bit cumbersome. Then we can batch up things we want to add to it and do it in a single update (newer Rust version, git-cliff, …)
@@ -0,0 +67,4 @@
else
outcome="Parse"
fi
elif cargo run --release -p mgf_dev -- compare --commutative $tmp_dir/our_merge$ext $tmp_dir/expected_merge$ext > /dev/null 2>&1; then
I wanted to suggest using the cargo shortcut, but I see that you probably avoid it because you need to pass `--release` -- maybe we should just add that flag to the shortcut?

@@ -0,0 +52,4 @@
self.timing = TimingStats()
self.states = defaultdict(int)
def add(self, case):
nit: `case` is a keyword in Python I think (for match arms), so the highlighting is confused -- maybe call this `testcase`?… Although that wouldn't be consistent with `cases` that we use elsewhere.

Also, I think its type would be `dict[str, float]`

although looking elsewhere in the code, what-would-be-keywords are syntax-highlighted even when inside docstrings, so it's probably just the syntax highlighter being trigger-happy and the actual Python isn't really getting confused about the naming of `case`
@@ -0,0 +53,4 @@
self.states = defaultdict(int)
def add(self, case):
timing = float(case['timing'])
I don't think `TimingStats` requires `timing` to be a `float`? You should be able to add the type hint of `timing: int` to `TimingStats::add`, and see if that checks out

I am confused… Timings are currently measured in seconds, so I do want them to be floating-point numbers and not integers, right?

Ah, so `case['timing']` is initially a string here, and you use `float` to parse it? In that case, you're right of course
@@ -0,0 +55,4 @@
def add(self, case):
timing = float(case['timing'])
self.timing.add(timing)
status = case['status']
Having these as plain strings is a bit iffy -- it's all too easy to make a typo. One option could be to not have `case` be a dict, of course, but short of that, having a const for each field name would also work. To make a const in Python, all you need is to name your variable in SCREAMING_CASE.
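A minimal sketch of that suggestion (the constant names below are illustrative; only the `status` and `timing` column names come from the quoted code):

```python
from collections import defaultdict

# Field-name constants for the TSV columns: a typo now fails loudly as a
# NameError instead of a silent KeyError on a misspelled string literal.
STATUS = "status"
TIMING = "timing"


def tally(case: dict[str, str], states: defaultdict[str, int]) -> float:
    """Record one test case's status and return its timing in seconds."""
    states[case[STATUS]] += 1
    return float(case[TIMING])


if __name__ == "__main__":
    states: defaultdict[str, int] = defaultdict(int)
    print(tally({"status": "Exact", "timing": "0.12"}, states), dict(states))
```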
Commits added:

- … `cmd` (2e8fee3850)
- `grep -c` instead of `grep | wc -l` (2d746d728c)
- `usr/bin/env time` to accommodate Nix
- `mgf_dev compare` in release mode (#539)

Thanks for the review!
One issue I have noticed is that by doing the parsing checks via `mgf_dev`, we make it more difficult to ensure that both the binary being tested and `mgf_dev` use the same parser. Ideally, we'd detect the parsing failure from mergiraf's output already. It would also avoid parsing things twice, so it would be a bit more efficient.

(force-pushed from d12b24b2b5 to bee3983591)
Commits added:

- `cargo compare`
- `case` a type
- `case` fields (dc78e9fb2b)
- `restrict_to` a type (5919c8c0a2)
- `case_to_status` a type
- `csv.DictReader`
it would be nice not to rerun the whole library test suite when the PR only touches non-Rust files -- I think this is the way to achieve that?
Speaking of tests, it would be nice to have one (or two, to cover passing either one or two arguments) for `summarize_benchmark.py` -- just to see that nothing silly (formatting changes, erroring out) is happening
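A rough sketch of what such smoke tests could look like (pytest-style; the TSV columns and the exact invocation are assumptions rather than taken from this PR):

```python
import subprocess
import sys
from pathlib import Path

# Path of the script under test, relative to the repository root.
SCRIPT = Path("helpers/summarize_benchmark.py")


def write_tsv(path: Path) -> Path:
    # Tiny hand-written benchmark result; column names are assumed.
    path.write_text(
        "case\tstatus\ttiming\n"
        "a/b.java\tExact\t0.12\n"
        "c/d.rs\tConflict\t0.34\n"
    )
    return path


def test_summarize_single_benchmark(tmp_path: Path) -> None:
    tsv = write_tsv(tmp_path / "run1.tsv")
    result = subprocess.run(
        [sys.executable, str(SCRIPT), str(tsv)],
        capture_output=True, text=True, check=True,
    )
    # Just check that it ran and printed something, nothing more precise.
    assert result.stdout.strip()


def test_compare_two_benchmarks(tmp_path: Path) -> None:
    tsv1 = write_tsv(tmp_path / "run1.tsv")
    tsv2 = write_tsv(tmp_path / "run2.tsv")
    result = subprocess.run(
        [sys.executable, str(SCRIPT), str(tsv1), str(tsv2)],
        capture_output=True, text=True, check=True,
    )
    assert result.stdout.strip()
```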
`status_order` is a constant