feat(Git): support reading language from git attributes #565

Closed
mathstuf wants to merge 1 commit from mathstuf/mergiraf:attribute-based-language-selection into main
Member

When used as a merge driver, mergiraf only has the file extension as
guidance for language specification. This may not be suitable in all
cases (e.g., custom languages that are supported-language-adjacent or
.in template files that are mostly another language). Allow
specification of the language using the mergiraf.language attribute.


I'm back with that feature I mentioned before :) .

When used as a merge driver, `mergiraf` only has the file extension as guidance for language specification. This may not be suitable in all cases (e.g., custom languages that are supported-language-adjacent or `.in` template files that are mostly another language). Allow specification of the language using the `mergiraf.language` attribute. --- I'm back with that feature I mentioned before :) .
feat(Git): support reading language from git attributes
Some checks failed
/ test (pull_request) Failing after 12s
0ff639574d
When used as a merge driver, `mergiraf` only has the file extension as
guidance for language specification. This may not be suitable in all
cases (e.g., custom languages that are supported-language-adjacent or
`.in` template files that are mostly another language). Allow
specification of the language using the `mergiraf.language` attribute.
mathstuf force-pushed attribute-based-language-selection from 0ff639574d to 02b0e496b4 2025-08-13 06:01:26 +02:00 Compare
mathstuf force-pushed attribute-based-language-selection from 02b0e496b4 to 3171817d8e 2025-08-13 06:05:56 +02:00 Compare
ada4a left a comment
Owner

This is awesome, thank you! Left a couple of things

This is awesome, thank you! Left a couple of things
@ -60,6 +60,11 @@ $ mergiraf languages --gitattributes
If you want to enable Mergiraf only in a certain repository, add the lines above in the `.gitattributes` file at the root of that repository instead, or in `.git/info/attributes` if you don't want it to be tracked in the repository.
If `mergiraf` does not recognize your file's language by extension, you may set the `mergiraf.language` attribute on the file to specify it manually:
Owner

I think this could go into the "Manually specifying the file's language" section below

I think this could go into the "Manually specifying the file's language" section below
@ -97,0 +103,4 @@
Command::new("git")
.args([
"check-attr",
"-z",
Owner

Is there a particular reason not to have the results separated by newlines, as is the default? As far as I can see, the only change that will require is replacing .split(b'0') with .lines()

Is there a particular reason not to have the results separated by newlines, as is the default? As far as I can see, the only change that will require is replacing `.split(b'0')` with `.lines()`
Author
Member

I don't want to deal with the encoding of spaces in the filename. Paths cannot have NUL bytes, so I don't have to deal with anything more complicated here. We only want the value, so we need some kind of field parsing; NUL is simpler.

I don't want to deal with the encoding of spaces in the filename. Paths cannot have NUL bytes, so I don't have to deal with anything more complicated here. We only want the value, so we need some kind of field parsing; NUL is simpler.
Owner

Makes a lot of sense! Could you please add this explanation as a comment?

Makes a lot of sense! Could you please add this explanation as a comment?
@ -97,0 +106,4 @@
"-z",
attr,
"--",
&format!("{}", file_name.display()),
Owner

If you put this into a separate .arg, you will be able to just pass file_name directly, since all that method requires is an impl AsRef<OsStr>

If you put this into a separate `.arg`, you will be able to just pass `file_name` directly, since all that method requires is an `impl AsRef<OsStr>`
@ -97,0 +112,4 @@
.output()
.ok()
.filter(|output| output.status.success())
.and_then(|output| {
Owner

I think you should be able to replace these and_thens with ? to reduce the nesting a bit – that should also you to avoid needing to call to_vec

I think you should be able to replace these `and_then`s with `?` to reduce the nesting a bit – that should also you to avoid needing to call `to_vec`
@ -97,0 +115,4 @@
.and_then(|output| {
output
.stdout
.split(|b| *b == b'\0')
Owner

I do believe that you parse the output of git-check-attr correctly, but still it would be nice to put a comment here which shows what the format of that is. Looking at the man page, it's:

<path> COLON SP <attribute> COLON SP <info> LF
I do believe that you parse the output of `git-check-attr` correctly, but still it would be nice to put a comment here which shows what the format of that is. Looking at the [man page](https://www.man7.org/linux/man-pages/man1/git-check-attr.1.html), it's: <path> COLON SP <attribute> COLON SP <info> LF
@ -97,0 +124,4 @@
pub(crate) fn read_lang_attribute(repo_dir: &Path, file_name: &Path) -> Option<String> {
read_attribute_for_file(repo_dir, file_name, "mergiraf.language")
.filter(|value| value != "unspecified" && value != "set" && value != "unset")
Owner

It would be nice to have an explanation what the meanings are of the value that we filter out. It seems like set and unset actually refer to boolean values, so that could theoretically be a part of the output of read_attribute_for_file, e.g. with an enum like:

enum AttrValue {
    Bool(bool),
    Value(&str),
    Unspecified,
}

But that's not at all necessary for this PR

It would be nice to have an explanation what the meanings are of the `value` that we filter out. It seems like `set` and `unset` actually refer to boolean values, so that could _theoretically_ be a part of the output of `read_attribute_for_file`, e.g. with an enum like: ```rs enum AttrValue { Bool(bool), Value(&str), Unspecified, } ``` But that's not at all necessary for this PR
Author
Member

I'm not sure that I'd map set/unset to bool (I have another crate where I did it here), but I can make it an actual structure.

I'm not sure that I'd map set/unset to bool (I have another crate where I did it [here](https://docs.rs/git-checks-core/latest/git_checks_core/enum.AttributeState.html)), but I can make it an actual structure.
Owner

That's the semantics I figured they have, based on the docs. But looking at some attributes later in that same file (text, eol), it looks like they have some freedom to decide their own semantics. So I guess Set and Unset variants are okay, even though they don't necessarily make sense for mergiraf.language

That's the semantics I figured they have, based on the [docs](https://git-scm.com/docs/gitattributes#_description). But looking at some attributes later in that same file ([`text`](https://git-scm.com/docs/gitattributes#_text), [`eol`](https://git-scm.com/docs/gitattributes#_eol)), it looks like they have some freedom to decide their own semantics. So I guess `Set` and `Unset` variants are okay, even though they don't necessarily make sense for `mergiraf.language`
@ -97,1 +102,4 @@
git::read_lang_attribute(repo_dir, filename.as_ref())
}
/// Loads a language either by name or by detecting it from a filename
Owner

Could you please update the docs to mention the new detection mechanism? I think at this point you'll need to make it a list

Could you please update the docs to mention the new detection mechanism? I think at this point you'll need to make it a list
@ -97,2 +103,4 @@
}
/// Loads a language either by name or by detecting it from a filename
pub fn find_by_filename_or_name<P>(
Owner

I think at this point this could be called just find

I think at this point this could be called just `find`
@ -106,6 +115,12 @@ impl LangProfile {
if let Some(lang_name) = language_name {
Self::find_by_name(lang_name)
.ok_or_else(|| format!("Specified language '{lang_name}' could not be found"))
Owner

I think it's a bit unfortunate that, even if both language_name and repo_dir are not None, we only ever try to get the filename based on the former.

What I think would be the solution is to separate the step of getting filename from that of find_by_name, so that we could chain all the ways of getting the former, and then try to do the latter once. Although now that I think of it, that would backfire if the language is specified both in the CLI and in .gitattributes, but the former specification is faulty while the second one is correct.

This is complicated... Any ideas?

I think it's a bit unfortunate that, even if both `language_name` and `repo_dir` are not `None`, we only ever try to get the filename based on the former. What I think would be the solution is to separate the step of getting `filename` from that of `find_by_name`, so that we could chain all the ways of getting the former, and then try to do the latter once. Although now that I think of it, that would backfire if the language is specified both in the CLI and in `.gitattributes`, but the former specification is faulty while the second one is correct. This is complicated... Any ideas?
Author
Member

I could have sworn I replied to this…but oh well.

If the user explicitly specifies a language, they may be expecting a newer mergiraf that does support their language. To me, an explicit language request is higher than any ambient setting and an explicit attribute-based setting is higher priority than heuristics based on file extension. Additionally, the first explicit request should be taken without looking at fallbacks (imagine if git merge -s bogus would ignore the bogus and instead just go and use ort or recursive strategies).

I could have sworn I replied to this…but oh well. If the user explicitly specifies a language, they may be expecting a newer `mergiraf` that *does* support their language. To me, an explicit language request is higher than any ambient setting and an explicit attribute-based setting is higher priority than heuristics based on file extension. Additionally, the first explicit request should be taken without looking at fallbacks (imagine if `git merge -s bogus` would ignore the `bogus` and instead just go and use `ort` or `recursive` strategies).
Owner

To me, an explicit language request is higher than any ambient setting and an explicit attribute-based setting is higher priority than heuristics based on file extension

Oh, so the placement of the new condition is intentional! I might not be the brightest, because I didn't realize that.. I would appreciate a comment about that as well^^

Additionally, the first explicit request should be taken without looking at fallbacks

Makes sense as well. I would like to have a comment mentioning that, but maybe that could go together with the other one.

> To me, an explicit language request is higher than any ambient setting and an explicit attribute-based setting is higher priority than heuristics based on file extension Oh, so the placement of the new condition _is_ intentional! I might not be the brightest, because I didn't realize that.. I would appreciate a comment about that as well^^ > Additionally, the first explicit request should be taken without looking at fallbacks Makes sense as well. I would like to have a comment mentioning that, but maybe that could go together with the other one.
@ -106,6 +115,12 @@ impl LangProfile {
if let Some(lang_name) = language_name {
Self::find_by_name(lang_name)
.ok_or_else(|| format!("Specified language '{lang_name}' could not be found"))
} else if let Some(lang_name) =
Owner

I got very confused by this if let Some(lang_name) = repo_dir thing – I thought it was the same as the previous branch or something. I think it could be made more clear by using let-chains:

if let Some(repo_dir) = repo_dir
    && let Some(lang_name) = Self::detect_language_from_vcs_attr(repo_dir, filename)
    {
I got very confused by this `if let Some(lang_name) = repo_dir` thing – I thought it was the same as the previous branch or something. I think it could be made more clear by using [let-chains](https://rust-lang.github.io/rfcs/2497-if-let-chains.html): ```rs if let Some(repo_dir) = repo_dir && let Some(lang_name) = Self::detect_language_from_vcs_attr(repo_dir, filename) { ```
@ -428,0 +447,4 @@
fn find_by_filename_or_name_vcs() {
let mut working_dir = env::current_exe().unwrap();
working_dir.pop();
let tempdir = tempfile::tempdir_in(working_dir).unwrap();
Owner

Is there any reason not to do just tempfile::tempdir?

Is there any reason not to do just `tempfile::tempdir`?
Author
Member

I think it is better to place it under $OUT_DIR rather than env::temp_dir() just to avoid polluting the system.

I think it is better to place it under `$OUT_DIR` rather than `env::temp_dir()` just to avoid polluting the system.
Owner

I mean the directory will get removed at the end of the test either way (thanks to the Drop impl of tempdir).. but sure that works

I mean the directory will get removed at the end of the test either way (thanks to the `Drop` impl of `tempdir`).. but sure that works
@ -428,0 +489,4 @@
) -> Result<&'static LangProfile, String> {
LangProfile::find_by_filename_or_name(filename, name, Some(repo_dir))
}
let find = |filename, name| find_impl(filename, name, tempdir.path());
Owner

At this point, why not inline find_impl into find?

At this point, why not inline `find_impl` into `find`?
Author
Member

Only closures can capture tempdir. This seemed cleaner than passing it from every callsite.

Only closures can capture `tempdir`. This seemed cleaner than passing it from every callsite.
Owner

Yes, I meant leaving find as a closure, but inlining find_impl inside it

Yes, I meant leaving `find` as a closure, but inlining `find_impl` inside it
Owner

Though I wonder whether this information should even go into .gitattributes specifically? It's not really VCS-specific, and isn't going to be used by Git. I mean I guess so, since Mergiraf is "just" a merge driver, so a sub-tool of Git. But maybe this could be the start of the mergiraf.toml configuration file 👀

Though I wonder whether this information should even go into `.gitattributes` specifically? It's not really VCS-specific, and isn't going to be used by Git. I mean I guess so, since Mergiraf is "just" a merge driver, so a sub-tool of Git. But maybe this could be the start of the `mergiraf.toml` configuration file 👀
Owner

Thanks a lot for this initiative! I can relate to the need.
Before diving too deep in the specifics of the implementation, I think it's worth gathering consensus on whether the overall approach is the one we want to go for. This is useful to avoid wasting work on the fine details of this PR if we end up choosing another solution.

At the moment, one way to achieve this is to register a separate merge driver, which contains the appropriate --language option in the CLI invocation. For instance, in .gitconfig:

[merge "mergiraf"]
    name = mergiraf
    driver = mergiraf merge --git %O %A %B -s %S -x %X -y %Y -p %P -l %L
    
[merge "mergiraf-myjs"]
    name = mergiraf
    driver = mergiraf merge --git  --language javascript %O %A %B -s %S -x %X -y %Y -p %P -l %L

and then in .gittatributes

*.py merge=mergiraf
*.java merge=mergiraf
*.myjs merge=mergiraf-myjs

Obviously, it's not super convenient, but it's already one way to do it. So, one approach to addressing the need could consist in just documenting this approach (not saying it's necessarily the best one).

If we are not satisfied with this UX, then it's worth thinking about all the sorts of things we want to let users configure. Association of filenames to languages is one thing, but there could be many others:

  • tweaking the language profile, for instance as suggested in #249 by enabling specific commutative parents
  • setting a specific timeout value for some file types, or other types of configuration (--fast, --compact…)

I think there is value in keeping configuration contained in not too many different places (for usability but also to avoid having to read many files at runtime), so I don't know if it's worth starting reading .gitattributes ourselves. It might not prove very convenient to extend later on. And as @ada4a mentioned, it might tie mergiraf to git a bit more, making it harder to use it with other VCS (which I think is a worthy goal to keep in mind).

Thanks a lot for this initiative! I can relate to the need. Before diving too deep in the specifics of the implementation, I think it's worth gathering consensus on whether the overall approach is the one we want to go for. This is useful to avoid wasting work on the fine details of this PR if we end up choosing another solution. At the moment, one way to achieve this is to register a separate merge driver, which contains the appropriate `--language` option in the CLI invocation. For instance, in `.gitconfig`: ``` [merge "mergiraf"] name = mergiraf driver = mergiraf merge --git %O %A %B -s %S -x %X -y %Y -p %P -l %L [merge "mergiraf-myjs"] name = mergiraf driver = mergiraf merge --git --language javascript %O %A %B -s %S -x %X -y %Y -p %P -l %L ``` and then in `.gittatributes` ``` *.py merge=mergiraf *.java merge=mergiraf *.myjs merge=mergiraf-myjs ``` Obviously, it's not super convenient, but it's already one way to do it. So, one approach to addressing the need could consist in just documenting this approach (not saying it's necessarily the best one). If we are not satisfied with this UX, then it's worth thinking about all the sorts of things we want to let users configure. Association of filenames to languages is one thing, but there could be many others: * tweaking the language profile, for instance as suggested in https://codeberg.org/mergiraf/mergiraf/issues/249 by enabling specific commutative parents * setting a specific timeout value for some file types, or other types of configuration (`--fast`, `--compact`…) I think there is value in keeping configuration contained in not too many different places (for usability but also to avoid having to read many files at runtime), so I don't know if it's worth starting reading `.gitattributes` ourselves. It might not prove very convenient to extend later on. And as @ada4a mentioned, it might tie mergiraf to git a bit more, making it harder to use it with other VCS (which I think is a worthy goal to keep in mind).
Author
Member

@ada4a wrote in #565 (comment):

Though I wonder whether this information should even go into .gitattributes specifically? It's not really VCS-specific, and isn't going to be used by Git. I mean I guess so, since Mergiraf is "just" a merge driver, so a sub-tool of Git. But maybe this could be the start of the mergiraf.toml configuration file 👀

git already supports things like:

  • specifying globs
  • unsetting it for exceptions

While it is not VCS-specific, other VCS tools don't have attributes like this though. jj has a feature request issue to implement it; hg doesn't AFAICS either.

I've found attributes a good place to store these kinds of details about files (even if Git itself doesn't care). For example, in our projects, we use it to indicate which files should be formatted. So we have format.clang-format=16 means "format this file using clang-format version 16. For vendored third party projects, we can the unset this (e.g., if we set that for "all *.cxx files" so that we are not reformatting vendored code).

@ada4a wrote in https://codeberg.org/mergiraf/mergiraf/pulls/565#issuecomment-6285487: > Though I wonder whether this information should even go into `.gitattributes` specifically? It's not really VCS-specific, and isn't going to be used by Git. I mean I guess so, since Mergiraf is "just" a merge driver, so a sub-tool of Git. But maybe this could be the start of the `mergiraf.toml` configuration file :eyes: `git` already supports things like: - specifying globs - *unsetting* it for exceptions While it is not VCS-specific, other VCS tools don't have attributes like this though. `jj` has a feature request issue to implement it; `hg` doesn't AFAICS either. I've found attributes a good place to store these kinds of details about files (even if Git itself doesn't care). For example, in our projects, we use it to indicate which files should be formatted. So we have `format.clang-format=16` means "format this file using `clang-format` version 16. For vendored third party projects, we can the *unset* this (e.g., if we set that for "all `*.cxx` files" so that we are *not* reformatting vendored code).
Owner

Not necessarily disagreeing with you @mathstuf, but globbing at least we could implement thanks to glob

Not necessarily disagreeing with you @mathstuf, but globbing at least we could implement thanks to [`glob`](https://docs.rs/glob)
Owner

reading .gitattributes ourselves

somewhat of a tangent: one could argue that a merge driver should be able to ask Git to pass to it attribute values for the to-be-merged file during invocation -- that kind of already happens with %L, with which a merge driver can get the value of the conflict-marker-size attribute

> reading `.gitattributes` ourselves somewhat of a tangent: one could argue that a merge driver should be able to ask Git to pass to it attribute values for the to-be-merged file during invocation -- that kind of already happens with `%L`, with which a merge driver can get the value of the `conflict-marker-size` attribute
Author
Member

Obviously, it's not super convenient, but it's already one way to do it. So, one approach to addressing the need could consist in just documenting this approach (not saying it's necessarily the best one).

I think it is a solution, yes. However, with a single merge=mergiraf, I can rely on a global "I know about a mergiraf merge driver. With custom drivers like this, I am "required" to have in-repo instructions for each special override. Global settings may conflict with other settings.

I do doubt it to really be an issue in practice, but Prolog and Perl do both share the .pl extension and if both get support, I think something like this would be very useful for whichever loses the default extension/language link (likely Prolog…but it is probably far easier to have support here for Prolog given the relative syntactic complexity of the two).

> Obviously, it's not super convenient, but it's already one way to do it. So, one approach to addressing the need could consist in just documenting this approach (not saying it's necessarily the best one). I think it is *a* solution, yes. However, with a single `merge=mergiraf`, I can rely on a *global* "I know about a `mergiraf` merge driver. With custom drivers like this, I am "required" to have in-repo instructions for each special override. Global settings may conflict with other settings. I do doubt it to really be an issue in practice, but Prolog and Perl do both share the `.pl` extension and if both get support, I think something like this would be very useful for whichever loses the default extension/language link (likely Prolog…but it is probably far easier to have support here for Prolog given the relative syntactic complexity of the two).
Author
Member

somewhat of a tangent: one could argue that a merge driver should be able to ask Git to pass to it attribute values for the to-be-merged file during invocation -- that kind of already happens with %L, with which a merge driver can get the value of the conflict-marker-size attribute

That does seem useful. However, that would require some kind of markup to suppress the --language flag if it is unset or unspecified.

> somewhat of a tangent: one could argue that a merge driver should be able to ask Git to pass to it attribute values for the to-be-merged file during invocation -- that kind of already happens with %L, with which a merge driver can get the value of the conflict-marker-size attribute That does seem useful. However, that would require some kind of markup to suppress the `--language` flag if it is unset or unspecified.
Owner

I think an unset value would be plain wrong, because I'm pretty sure that that does mean false, so a bool, whereas we require a string. So we could emit a warning in that case.

But otherwise we could handle the two cases the same – I'm pretty sure the language field is already an Option, and so we can set it to None on both unset and unspecified

I think an unset value would be plain wrong, because I'm pretty sure that that does mean `false`, so a bool, whereas we require a string. So we could emit a warning in that case. But otherwise we could handle the two cases the same – I'm pretty sure the `language` field is already an `Option`, and so we can set it to `None` on both `unset` and `unspecified`
Owner

Thanks both for your points! I've thought about it more and .gitattributes does seem like a good fit. I like @ada4a's idea of introducing a dedicated configuration file for the tool, as it would give us a lot of control over its structure, having it accommodate a lot more configuration settings, but I also see advantages to using gitattributes:

  • as @mathstuf wrote, git's existing cascading logic to look up attributes (be they checked into git or not, repository-specific or global) is quite helpful
  • I can imagine it being useful to rely on other attributes. For instance the linguist ones, which seem to be a sort of standard by now. We could access all those attributes at no further cost, with the existing invocation of git check-attr.

If we want to introduce a mergiraf.toml config file, it should still be possible to do that later… I guess we'll have to keep the git check-attr call for backwards compatibility but it doesn't feel like an outrageously costly thing to do.

So I'm happy to go ahead with this approach :)

Thanks both for your points! I've thought about it more and `.gitattributes` does seem like a good fit. I like @ada4a's idea of introducing a dedicated configuration file for the tool, as it would give us a lot of control over its structure, having it accommodate a lot more configuration settings, but I also see advantages to using gitattributes: * as @mathstuf wrote, git's existing cascading logic to look up attributes (be they checked into git or not, repository-specific or global) is quite helpful * I can imagine it being useful to rely on other attributes. For instance [the linguist ones](https://github.com/github-linguist/linguist/blob/main/docs/overrides.md#using-gitattributes), which seem to be a sort of standard by now. We could access all those attributes at no further cost, with the existing invocation of `git check-attr`. If we want to introduce a `mergiraf.toml` config file, it should still be possible to do that later… I guess we'll have to keep the `git check-attr` call for backwards compatibility but it doesn't feel like an outrageously costly thing to do. So I'm happy to go ahead with this approach :)
Owner

Right, about the linguist attributes -- it seems that they're indeed a bit of a standard, so maybe we should just use them instead of mergiraf.language? Though we should check whether they call their languages the same way we do

Right, about the linguist attributes -- it seems that they're indeed a bit of a standard, so maybe we should just use them instead of `mergiraf.language`? Though we should check whether they call their languages the same way we do
Owner

Circling back to this: I think it would be worth addressing the points @ada4a raised and after that it seems to me that it would be good to merge.

I'm traveling and won't be able to make a release before end of next week (since it still relies on a cross-compilation set up that I only have on another machine - I know… I have it as a goal to migrate that to the CI instead). So I would aim to make the release then, so perhaps it could be nice to merge this PR before that? I'm happy to help incorporate @ada4a's points if needed.

Circling back to this: I think it would be worth addressing the points @ada4a raised and after that it seems to me that it would be good to merge. I'm traveling and won't be able to make a release before end of next week (since it still relies on a cross-compilation set up that I only have on another machine - I know… I have it as a goal to migrate that to the CI instead). So I would aim to make the release then, so perhaps it could be nice to merge this PR before that? I'm happy to help incorporate @ada4a's points if needed.
Owner

I'm not allowed to edit this PR so I'm proposing to continue it in #599.

I'm not allowed to edit this PR so I'm proposing to continue it in #599.
wetneb closed this pull request 2025-09-17 14:53:38 +02:00
All checks were successful
/ test (pull_request) Successful in 1m37s

Pull request closed

Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
3 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: mergiraf/mergiraf#565
No description provided.