Syntactically invalid files are attempted to be parsed twice #551

Open
opened 2025-07-30 12:26:57 +02:00 by wetneb · 13 comments
Owner

In mergiraf solve mergiraf merge, when one of the input files fails to parse, we still parse it twice:

  • a first time when attempting to solve the line-based merge
  • a second time when attempting a full structured merge

It would be better to do it only once.

In <s>`mergiraf solve`</s> `mergiraf merge`, when one of the input files fails to parse, we still parse it twice: * a first time when attempting to solve the line-based merge * a second time when attempting a full structured merge It would be better to do it only once.
Author
Owner

Another optimization that could be worth doing in the case of syntactically invalid files is avoiding to translate tree-sitter's data structures to our AstNode, by calling root.has_error().

Note that this would likely prevent us from returning an error with information about where the parse error happened, which would be detrimental in the case of mgf_dev parse, or even in the case of mergiraf merge -v which reports it in the logs.

Another optimization that could be worth doing in the case of syntactically invalid files is avoiding to translate tree-sitter's data structures to our `AstNode`, by calling [`root.has_error()`](https://docs.rs/tree-sitter/0.25.8/tree_sitter/struct.Node.html#method.has_error). Note that this would likely prevent us from returning an error with information about where the parse error happened, which would be detrimental in the case of `mgf_dev parse`, or even in the case of `mergiraf merge -v` which reports it in the logs.
Owner

@wetneb wrote in #551 (comment):

In mergiraf solve, when one of the input files fails to parse, [...]

you mean mergiraf merge? I wasn't able to find the steps your describe in mergiraf solve..

@wetneb wrote in https://codeberg.org/mergiraf/mergiraf/issues/551#issue-1985479: > In `mergiraf solve`, when one of the input files fails to parse, [...] you mean `mergiraf merge`? I wasn't able to find the steps your describe in `mergiraf solve`..
Author
Owner

Sorry, yes, my bad.

Sorry, yes, my bad.
Owner

I struggle to come up with a solution to this. Here are some options that come to mind

Option 1

  1. return a triple of Results from line_based_merge_with_duplicate_signature_detection
  2. in cascading_merge, check whether any of those are Errs, and if so, don't even attempt to do any structural things

That would have a few problems:

  • we can't (currently) pass the successfully parsed ASTs into structured_merge, so we'll still duplicate some work
  • we'd need to inline the zdiff detection logic (see if only the base revision fails to parse) into cascading_merge

Option 2

Change structured_merge to accept contents_*: Result<AstNode, String>, so that we can pass the already parsed ASTs into there

That would work I guess, but it would mean that we'd need to also pre-parse contents in all the other callsites of structured_merge, which sounds a bit unfortunate?

I struggle to come up with a solution to this. Here are some options that come to mind ## Option 1 1. return a triple of `Result`s from `line_based_merge_with_duplicate_signature_detection` 2. in `cascading_merge`, check whether any of those are `Err`s, and if so, don't even attempt to do any structural things That would have a few problems: - we can't (currently) pass the successfully parsed ASTs into `structured_merge`, so we'll still duplicate _some_ work - we'd need to inline the zdiff detection logic (see if only the base revision fails to parse) into `cascading_merge` ## Option 2 Change `structured_merge` to accept `contents_*: Result<AstNode, String>`, so that we can pass the already parsed ASTs into there That would work I guess, but it would mean that we'd need to also pre-parse `contents` in all the other callsites of `structured_merge`, which sounds a bit unfortunate?
Author
Owner

Okay, you went for the ambitious (and proper) fix of avoiding double parsing not just if the parsing fails, but in general. That makes sense.

Option 2 sounds the cleanest to me. It shouldn't be hard to still provide another function that takes revisions as strings, does the parsing, and then delegates it to the function that takes revisions as trees, no? For the other function, I think it would make sense to have it take &AstNode directly (and not Result<&AstNode, String> because the check for zdiff3 could also be done externally (we don't need to do it twice either).

To minimize the diff, we could even try to keep the existing structured_merge function with its existing signature, and introduce a new one that takes AstNodes as arguments. But anyway, structured_merge is only called from 4 different places at the moment, so the diff should be manageable either way.

Okay, you went for the ambitious (and proper) fix of avoiding double parsing not just if the parsing fails, but in general. That makes sense. Option 2 sounds the cleanest to me. It shouldn't be hard to still provide another function that takes revisions as strings, does the parsing, and then delegates it to the function that takes revisions as trees, no? For the other function, I think it would make sense to have it take `&AstNode` directly (and not `Result<&AstNode, String>` because the check for zdiff3 could also be done externally (we don't need to do it twice either). To minimize the diff, we could even try to keep the existing `structured_merge` function with its existing signature, and introduce a new one that takes `AstNode`s as arguments. But anyway, `structured_merge` is only called from 4 different places at the moment, so the diff should be manageable either way.
Author
Owner

One slight subtlety is that this refactoring might pull the parsing out of the code section that's protected by our timeout mechanism.
It's not uncommon for parsers to run wild, for instance if it relies on a hand-written lexer (scanner) that falls into an infinite loop. So if it's not too hard, it could be worth maintaining this protection one way or another.

One slight subtlety is that this refactoring might pull the parsing out of the code section that's protected by our timeout mechanism. It's not uncommon for parsers to run wild, for instance if it relies on a hand-written lexer (scanner) that falls into an infinite loop. So if it's not too hard, it could be worth maintaining this protection one way or another.
Owner

One slight subtlety is that this refactoring might pull the parsing out of the code section that's protected by our timeout mechanism.

I'm pretty sure it already is outside (and also inside, but that's why we're here in the first place^^) – we already parse in line_based_merge_with_duplicate_signature_detection

> One slight subtlety is that this refactoring might pull the parsing out of the code section that's protected by our timeout mechanism. I'm pretty sure it already is outside (and also inside, but that's why we're here in the first place^^) – we already parse in `line_based_merge_with_duplicate_signature_detection`
Owner

It shouldn't be hard to still provide another function that takes revisions as strings, does the parsing, and then delegates it to the function that takes revisions as trees, no?

Yes, that sounds good.

For the other function, I think it would make sense to have it take &AstNode directly (and not Result<&AstNode, String> because the check for zdiff3 could also be done externally)

Hm, not so sure about this one. It is true that structured_merge (and its AST-taking counterpart) won't get called that often, but duplicating the zdiff3 check at every callsite still sounds pretty fragile to me.

> It shouldn't be hard to still provide another function that takes revisions as strings, does the parsing, and then delegates it to the function that takes revisions as trees, no? Yes, that sounds good. > For the other function, I think it would make sense to have it take &AstNode directly (and not `Result<&AstNode, String>` because the check for zdiff3 could also be done externally) Hm, not so sure about this one. It is true that `structured_merge` (and its AST-taking counterpart) won't get called that often, but duplicating the zdiff3 check at every callsite still sounds pretty fragile to me.
Author
Owner

Hm, not so sure about this one. It is true that structured_merge (and its AST-taking counterpart) won't get called that often, but duplicating the zdiff3 check at every callsite still sounds pretty fragile to me.

Isn't this check only relevant for mergiraf solve, and not for mergiraf merge? I think this refactoring would be a great opportunity to have it only in the solve command (which, as far as I can tell, is only one call site of structured_merge) and not in the merge command, where it will only display false alarms.

> Hm, not so sure about this one. It is true that structured_merge (and its AST-taking counterpart) won't get called that often, but duplicating the zdiff3 check at every callsite still sounds pretty fragile to me. Isn't this check only relevant for `mergiraf solve`, and not for `mergiraf merge`? I think this refactoring would be a great opportunity to have it only in the `solve` command (which, as far as I can tell, is only one call site of `structured_merge`) and not in the `merge` command, where it will only display false alarms.
Owner

Oh, haven't thought of that – you're totally right! I'll take a look at the other callsites – maybe they don't need the check either

Oh, haven't thought of that – you're totally right! I'll take a look at the other callsites – maybe they don't need the check either
Owner

It turns out that the function is called three times in solve.rs:

  1. when working with revisions reconstructed from ParsedMerge (in resolve_merge)
  2. when working with revisions extracted from Git
  3. when working with revisions extracted using OIDs

Now that I think of it, the functions used for the latter two could be refactored, by removing the actual merging from them and only leaving their respective revision extraction logic. Then, in resolve_merge_cascading, we could use the new functions like this:

let (contents_base, contents_left, contents_right) =
    extract_all_revisions_from_git(working_dir, fname_base))
        .or_else(|| extract_all_revisions_from_oids())?;

let parsed_base = parse(contents_base);
/* same for the other sides */

match (parsed_base, parsed_left, parsed_right) {
    /* zdiff3 detection */

let merge = structured_merge(...);

So that's one fewer place of duplication.

But for the first one, hoisting the check out of resolve_merge seems a bit more complicated. One could think that we solve that with the following:

  1. extract the revision reconstruction into a separate method
  2. use that in the .or_else chain above

But that has a problem of revision reconstruction actually being infallible.

It looks to me that it would actually be easier to define a function like structured_merge_with_zdiff3_detection which does have the check, use that in solve.rs, and structured_merge elsewhere. But that wouldn't help with the re-parsing problem we started with.

It turns out that the function is called _three times_ in `solve.rs`: 1. when working with revisions reconstructed from `ParsedMerge` (in `resolve_merge`) 2. when working with revisions extracted from Git 3. when working with revisions extracted using OIDs Now that I think of it, the functions used for the latter two could be refactored, by removing the actual merging from them and only leaving their respective revision extraction logic. Then, in `resolve_merge_cascading`, we could use the new functions like this: ```rs let (contents_base, contents_left, contents_right) = extract_all_revisions_from_git(working_dir, fname_base)) .or_else(|| extract_all_revisions_from_oids())?; let parsed_base = parse(contents_base); /* same for the other sides */ match (parsed_base, parsed_left, parsed_right) { /* zdiff3 detection */ let merge = structured_merge(...); ``` So that's one fewer place of duplication. But for the first one, hoisting the check out of `resolve_merge` seems a bit more complicated. One could think that we solve that with the following: 1. extract the revision reconstruction into a separate method 2. use that in the `.or_else` chain above But that has a problem of revision reconstruction actually being infallible. It looks to me that it would actually be easier to define a function like `structured_merge_with_zdiff3_detection` which does have the check, use that in `solve.rs`, and `structured_merge` elsewhere. But that wouldn't help with the re-parsing problem we started with.
Author
Owner

It looks to me that it would actually be easier to define a function like structured_merge_with_zdiff3_detection which does have the check, use that in solve.rs, and structured_merge elsewhere. But that wouldn't help with the re-parsing problem we started with.

That sounds good to me. Some places need to call a function with zdiff3 detection, some without. Some places need to call a function taking strings, other AstNode. We can just introduce all of the combinations of those criteria as needed, that would still be just 4 small functions, no? Anyway, I'm sure you can figure out something that works well.

> It looks to me that it would actually be easier to define a function like structured_merge_with_zdiff3_detection which does have the check, use that in solve.rs, and structured_merge elsewhere. But that wouldn't help with the re-parsing problem we started with. That sounds good to me. Some places need to call a function with zdiff3 detection, some without. Some places need to call a function taking strings, other AstNode. We can just introduce all of the combinations of those criteria as needed, that would still be just 4 small functions, no? Anyway, I'm sure you can figure out something that works well.
Owner

Unfortunately I got kind of stuck already when trying to extract the parsing out of structured_merge -- the lifetimes just don't add up... I'll keep trying and see what I get

Unfortunately I got kind of stuck already when trying to extract the parsing out of `structured_merge` -- the lifetimes just don't add up... I'll keep trying and see what I get
Sign in to join this conversation.
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: mergiraf/mergiraf#551
No description provided.