Unexpected file changes and git access #37

Closed
opened 2024-11-17 01:20:34 +01:00 by wilfred · 8 comments

Given the file:

<<<<<<< HEAD
use email::{send, validate};
||||||| parent of 4c2e8030b (Option parsing and conflict marker parsing)
use email::send;
=======
use email::{mark_as_spam, send};
>>>>>>> 4c2e8030b (Option parsing and conflict marker parsing)

I get the following output from mergiraf:

$ mergiraf solve foo.rs
error while retrieving Base revision for foo.rs:
git checkout-index: foo.rs is not in the cache

WARN Could not retrieve conflict sides from Git.
INFO 1 conflict(s) remaining.

I don't understand why mergiraf is trying to access my current git repository? All the information it needs should be in the current file.

It also looks like it modifies the file, removing the helpful commit information from the conflict markers:

<<<<<<< left
use email::{send, validate};
||||||| base
use email::send;
=======
use email::{mark_as_spam, send};
>>>>>>> right
Given the file: ``` <<<<<<< HEAD use email::{send, validate}; ||||||| parent of 4c2e8030b (Option parsing and conflict marker parsing) use email::send; ======= use email::{mark_as_spam, send}; >>>>>>> 4c2e8030b (Option parsing and conflict marker parsing) ``` I get the following output from mergiraf: ``` $ mergiraf solve foo.rs error while retrieving Base revision for foo.rs: git checkout-index: foo.rs is not in the cache WARN Could not retrieve conflict sides from Git. INFO 1 conflict(s) remaining. ``` I don't understand why mergiraf is trying to access my current git repository? All the information it needs should be in the current file. It also looks like it modifies the file, removing the helpful commit information from the conflict markers: ``` <<<<<<< left use email::{send, validate}; ||||||| base use email::send; ======= use email::{mark_as_spam, send}; >>>>>>> right ```
Owner

The output isn't great indeed. The tool first tries to solve the conflicts just looking at the file you provided, fails, and then tries to solve it again with the original revisions retrieved from Git. That's because in general, some conflicts can only be solved from the original revisions and not the merged file with conflicts.
In this case, since there are no merged sections in your file, the file is indeed equivalent to the three revisions, so it's indeed pointless to try fetching the revisions from Git. One could potentially add logic to detect such cases, but I think it wouldn't be used so often (your example strikes me as a rather unusual one in real use cases).

The removal of the commit ids is indeed concerning, and definitely something that I'd like to fix. There are two problems:

  • the fact that we're removing those ids (they should be detected from the file and reused)
  • the fact that we're writing the file at all (since we didn't manage to solve anything)
The output isn't great indeed. The tool first tries to solve the conflicts just looking at the file you provided, fails, and then tries to solve it again with the original revisions retrieved from Git. That's because in general, some conflicts can only be solved from the original revisions and not the merged file with conflicts. In this case, since there are no merged sections in your file, the file is indeed equivalent to the three revisions, so it's indeed pointless to try fetching the revisions from Git. One could potentially add logic to detect such cases, but I think it wouldn't be used so often (your example strikes me as a rather unusual one in real use cases). The removal of the commit ids is indeed concerning, and definitely something that I'd like to fix. There are two problems: * the fact that we're removing those ids (they should be detected from the file and reused) * the fact that we're writing the file at all (since we didn't manage to solve anything)
Author

That's because in general, some conflicts can only be solved from the original revisions and not the merged file with conflicts.

How so? When you have the base, left and right (diff3 format like this), you can construct the file state exclusively from the file with conflicts.

I guess if you have a diff2 conflict style you need to consult git?

> That's because in general, some conflicts can only be solved from the original revisions and not the merged file with conflicts. How so? When you have the base, left and right (diff3 format like this), you can construct the file state exclusively from the file with conflicts. I guess if you have a diff2 conflict style you need to consult git?
Owner

How so? When you have the base, left and right (diff3 format like this), you can construct the file state exclusively from the file with conflicts.

Not if some parts of the file with conflicts were successfully merged. In some cases, it's important to have the original revisions, for instance for move/edit conflicts.

> How so? When you have the base, left and right (diff3 format like this), you can construct the file state exclusively from the file with conflicts. Not if some parts of the file with conflicts were successfully merged. In some cases, it's important to have the original revisions, for instance for [move/edit conflicts](https://mergiraf.org/conflicts.html#moving-edited-elements).
Contributor

So I researched this a little, and I think I know the answer to

the fact that we're writing the file at all (since we didn't manage to solve anything)

The method where all of this takes is resolve_merge_cascading. It cascades as follows:

  1. tries to parse the input file contents -- succeeds
  2. tries structured_resolution1 -- succeeds
  3. checks whether the structured resolution succeeded and solves all the conflicts -- not all the conflicts are solved, I'll explain the reason later
  4. since the previous check fails, it tries structured_merge2 -- fails because of the lack of Git revisions
  5. it's now left with two merge results: the one reconstructed from the input (called parsed_merge), and the one from structured_resolution. To find out which one's best, it calls select_best_merge
  6. select_best_merge finds the merge with the smallest conflict_mass which doesn't have additional conflict. Neither of the 2 merges has additional conflicts, so it all comes to conflict_mass
  7. and herein (I'm pretty sure) lies the problem -- even though the 2 merges have identical contents (whether the revision names are retained makes no difference here), structured_resolution reports having a conflict_mass of 61, which is smaller than parsed_merge's 79. Therefore, the former gets written into the file.

Possible solutions

1

  1. correct structured_resolution's conflict_mass reporting so that it's equal to that of an identical parsed_merge
  2. change select_best_merge to prefer parsed_merge in case of a tie

Then, the following would happen in cases like in this issue:

  1. structured_resolution doesn't win over parsed_merge
  2. select_best_merge returns parsed_merge
  3. mergiraf solve recognizes parsed_merge by its `method="original", and throws it away without writing/outputting anything:

    Ok(merged) if merged.method == "original" => 1,

2

  1. same as above
  2. make select_best_merge return an Option, so that if parsed_merge seems to be the (equally) best, it returns None

Then:

  1. parsed_merge ties with structured_resolution
  2. select_best_merge returns None
  3. resolve_merge_cascading catches that and returns a corresponding Err
  4. mergiraf solve handles that reasonably:

    Lines 252 to 255 in 68c7d9f
    Err(e) => {
    warn!("Mergiraf: {e}");
    1
    }


  1. structured_merge(..., parsed_merge=Some(parsed_merge)) ↩︎

  2. structured_merge(..., parsed_merge=None) ↩︎

So I researched this a little, and I think I know the answer to > the fact that we're writing the file at all (since we didn't manage to solve anything) The method where all of this takes is `resolve_merge_cascading`. It cascades as follows: 1) tries to parse the input file contents -- succeeds 2) tries `structured_resolution`[^1] -- succeeds 3) checks whether the structured resolution succeeded _and solves all the conflicts_ -- not all the conflicts are solved, I'll explain the reason later 4) since the previous check fails, it tries `structured_merge`[^2] -- fails because of the lack of Git revisions 5) it's now left with two merge results: the one reconstructed from the input (called `parsed_merge`), and the one from `structured_resolution`. To find out which one's best, it calls `select_best_merge` 6) `select_best_merge` finds the merge with the smallest `conflict_mass` which doesn't have additional conflict. Neither of the 2 merges has additional conflicts, so it all comes to `conflict_mass` 7) and herein (I'm pretty sure) lies the problem -- even though the 2 merges have identical contents (whether the revision names are retained makes no difference here), `structured_resolution` reports having a `conflict_mass` of 61, which is smaller than `parsed_merge`'s 79. Therefore, the former gets written into the file. ## Possible solutions ### 1 1) correct `structured_resolution`'s `conflict_mass` reporting so that it's equal to that of an identical `parsed_merge` 2) change `select_best_merge` to prefer `parsed_merge` in case of a tie Then, the following would happen in cases like in this issue: 1) `structured_resolution` doesn't win over `parsed_merge` 2) `select_best_merge` returns `parsed_merge` 3) `mergiraf solve` recognizes `parsed_merge` by its `method="original", and throws it away without writing/outputting anything: https://codeberg.org/mergiraf/mergiraf/src/commit/68c7d9f07f17283fe40b31b07101b04bc539c0fc/src/bin/mergiraf.rs#L239 ### 2 1) same as above 2) make `select_best_merge` return an `Option`, so that if `parsed_merge` seems to be the (equally) best, it returns `None` Then: 1) `parsed_merge` ties with `structured_resolution` 2) `select_best_merge` returns `None` 2) `resolve_merge_cascading` catches that and returns a corresponding `Err` 3) `mergiraf solve` handles that reasonably: https://codeberg.org/mergiraf/mergiraf/src/commit/68c7d9f07f17283fe40b31b07101b04bc539c0fc/src/bin/mergiraf.rs#L252-L255 [^1]: `structured_merge(..., parsed_merge=Some(parsed_merge))` [^2]: `structured_merge(..., parsed_merge=None)`
Contributor

Ok, so the answer is much dumber than I thought -- but it has to do with the reason why structured_resolution fails in the first place.

The thing is that structured_resolution (just like structured_merge) uses treesitter to parse all three sides of the conflicts and then merges by matching the parts of three trees. The image below shows the matching between the base revision (here, use email::send;) and the left one (use email::{send, validate};).

For one, it shows why the conflict couldn't be resolved completely -- the lone email::send in base is represented as scoped_identifier, whereas a very similar email::{send,...} in left becomes a scoped_use_list, which prevents their child nodes (email, ::, and send) from being matched. I'm actually not sure whether this is an error on treesitter side (someone more familiar with it might give some context), or maybe this is the correct representation and it's us who'd need to teach Mergiraf to handle such cases (which would be a nice enhancement IMO).

At the same time, it shows why the conflict from resolved_merge is nevertheless smaller than the one in parsed_merge -- the former at least manages to match the common tokens use and ;, and as such, reduces conflict_mass, while parsed_merge reduces nothing (by definition, since it just parses the input file). To prove this, here is some debugprint from MergedTree::conflict_mass and ParsedMerge::conflict_mass, respectively:

# MergedTree
l=email::{send, validate}
b=email::send
r=email::{mark_as_spam, send}

l.len() = 23
b.len() = 11
r.len() = 27

# ParsedMerge
l=use email::{send, validate};
b=use email::send;
r=use email::{mark_as_spam, send};

l.len() = 29
b.len() = 17
r.len() = 33

All in all, since #74 has solved the stripped revision names problem, the outputs from both merge methods should be identical, so this is pretty much solved I'd say? Maybe, it's just the worrying output that needs to be silenced.

Ok, so the answer is much dumber than I thought -- but it has to do with the reason why `structured_resolution` fails in the first place. The thing is that `structured_resolution` (just like `structured_merge`) uses treesitter to parse all three sides of the conflicts and then merges by matching the parts of three trees. The image below shows the matching between the base revision (here, `use email::send;`) and the left one (`use email::{send, validate};`). For one, it shows why the conflict couldn't be resolved completely -- the lone `email::send` in base is represented as `scoped_identifier`, whereas a very similar `email::{send,...}` in left becomes a `scoped_use_list`, which prevents their child nodes (`email`, `::`, and `send`) from being matched. I'm actually not sure whether this is an error on treesitter side (someone more familiar with it might give some context), or maybe this is the correct representation and it's us who'd need to teach Mergiraf to handle such cases (which would be a nice enhancement IMO). At the same time, it shows why the conflict from `resolved_merge` is nevertheless smaller than the one in `parsed_merge` -- the former at least manages to match the common tokens `use` and `;`, and as such, reduces `conflict_mass`, while `parsed_merge` reduces nothing (by definition, since it just parses the input file). To prove this, here is some debugprint from [`MergedTree::conflict_mass`](https://codeberg.org/mergiraf/mergiraf/src/commit/68c7d9f07f17283fe40b31b07101b04bc539c0fc/src/merged_tree.rs#L566) and [`ParsedMerge::conflict_mass`](https://codeberg.org/mergiraf/mergiraf/src/commit/68c7d9f07f17283fe40b31b07101b04bc539c0fc/src/parsed_merge.rs#L359), respectively: ``` # MergedTree l=email::{send, validate} b=email::send r=email::{mark_as_spam, send} l.len() = 23 b.len() = 11 r.len() = 27 # ParsedMerge l=use email::{send, validate}; b=use email::send; r=use email::{mark_as_spam, send}; l.len() = 29 b.len() = 17 r.len() = 33 ``` All in all, since #74 has solved the stripped revision names problem, the outputs from both merge methods should be identical, so this is pretty much solved I'd say? Maybe, it's just the worrying output that needs to be silenced.
Owner

For one, it shows why the conflict couldn't be resolved completely -- the lone email::send in base is represented as scoped_identifier, whereas a very similar email::{send,...} in left becomes a scoped_use_list, which prevents their child nodes (email, ::, and send) from being matched. I'm actually not sure whether this is an error on treesitter side (someone more familiar with it might give some context), or maybe this is the correct representation and it's us who'd need to teach Mergiraf to handle such cases (which would be a nice enhancement IMO).

Yes, this is a known limitation of the current algorithm. I agree would be worth solving, but I am not sure how.

All in all, since #74 has solved the stripped revision names problem, the outputs from both merge methods should be identical, so this is pretty much solved I'd say? Maybe, it's just the worrying output that needs to be silenced.

Yes. I guess one could still consider changing how the conflict mass is computed, to measure the size of conflicts when fully rendered (expanded to match line boundaries) instead of their more compact version. But I am not sure it's worth it, because arguably, merge tools should be able to highlight the differences between the lines to make the conflict resolution simpler (meaning that the user's effort corresponds more or less to the compact conflict mass).

> For one, it shows why the conflict couldn't be resolved completely -- the lone `email::send` in base is represented as `scoped_identifier`, whereas a very similar `email::{send,...}` in left becomes a `scoped_use_list`, which prevents their child nodes (`email`, `::`, and `send`) from being matched. I'm actually not sure whether this is an error on treesitter side (someone more familiar with it might give some context), or maybe this is the correct representation and it's us who'd need to teach Mergiraf to handle such cases (which would be a nice enhancement IMO). Yes, this is a known limitation of the current algorithm. I agree would be worth solving, but I am not sure how. > All in all, since #74 has solved the stripped revision names problem, the outputs from both merge methods should be identical, so this is pretty much solved I'd say? Maybe, it's just the worrying output that needs to be silenced. Yes. I guess one could still consider changing how the conflict mass is computed, to measure the size of conflicts when fully rendered (expanded to match line boundaries) instead of their more compact version. But I am not sure it's worth it, because arguably, merge tools should be able to highlight the differences between the lines to make the conflict resolution simpler (meaning that the user's effort corresponds more or less to the compact conflict mass).
Contributor

I think the leftover issue should be pretty much solved by #423 in the sense that it changes the message to be clearer?

The message should now be along the lines of

WARN Couldn't retrieve the original revisions from Git. This limits Mergiraf's ability to solve certain types of conflicts.
INFO 1 conflict(s) remaining.

I think the leftover issue should be pretty much solved by #423 in the sense that it changes the message to be clearer? The message should now be along the lines of > WARN Couldn't retrieve the original revisions from Git. This limits Mergiraf's ability to solve certain types of conflicts. > INFO 1 conflict(s) remaining.
Owner

Agreed, let's close this.

Agreed, let's close this.
Sign in to join this conversation.
No milestone
No project
No assignees
4 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: mergiraf/mergiraf#37
No description provided.