feat(Git): support reading language from git attributes #565
No reviewers
Labels
No labels
Compat/Breaking
Kind
Bad merge
Kind
Bug
Kind
Documentation
Kind
Enhancement
Kind
Feature
Kind
New language
Kind
Security
Kind
Testing
Priority
Critical
Priority
High
Priority
Low
Priority
Medium
Reviewed
Confirmed
Reviewed
Duplicate
Reviewed
Invalid
Reviewed
Won't Fix
Status
Abandoned
Status
Blocked
Status
Need More Info
No milestone
No project
No assignees
3 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference: mergiraf/mergiraf#565
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "mathstuf/mergiraf:attribute-based-language-selection"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
When used as a merge driver,
mergiraf
only has the file extension asguidance for language specification. This may not be suitable in all
cases (e.g., custom languages that are supported-language-adjacent or
.in
template files that are mostly another language). Allowspecification of the language using the
mergiraf.language
attribute.I'm back with that feature I mentioned before :) .
0ff639574d
to02b0e496b4
02b0e496b4
to3171817d8e
This is awesome, thank you! Left a couple of things
@ -60,6 +60,11 @@ $ mergiraf languages --gitattributes
If you want to enable Mergiraf only in a certain repository, add the lines above in the `.gitattributes` file at the root of that repository instead, or in `.git/info/attributes` if you don't want it to be tracked in the repository.
If `mergiraf` does not recognize your file's language by extension, you may set the `mergiraf.language` attribute on the file to specify it manually:
I think this could go into the "Manually specifying the file's language" section below
@ -97,0 +103,4 @@
Command::new("git")
.args([
"check-attr",
"-z",
Is there a particular reason not to have the results separated by newlines, as is the default? As far as I can see, the only change that will require is replacing
.split(b'0')
with.lines()
I don't want to deal with the encoding of spaces in the filename. Paths cannot have NUL bytes, so I don't have to deal with anything more complicated here. We only want the value, so we need some kind of field parsing; NUL is simpler.
Makes a lot of sense! Could you please add this explanation as a comment?
@ -97,0 +106,4 @@
"-z",
attr,
"--",
&format!("{}", file_name.display()),
If you put this into a separate
.arg
, you will be able to just passfile_name
directly, since all that method requires is animpl AsRef<OsStr>
@ -97,0 +112,4 @@
.output()
.ok()
.filter(|output| output.status.success())
.and_then(|output| {
I think you should be able to replace these
and_then
s with?
to reduce the nesting a bit – that should also you to avoid needing to callto_vec
@ -97,0 +115,4 @@
.and_then(|output| {
output
.stdout
.split(|b| *b == b'\0')
I do believe that you parse the output of
git-check-attr
correctly, but still it would be nice to put a comment here which shows what the format of that is. Looking at the man page, it's:@ -97,0 +124,4 @@
pub(crate) fn read_lang_attribute(repo_dir: &Path, file_name: &Path) -> Option<String> {
read_attribute_for_file(repo_dir, file_name, "mergiraf.language")
.filter(|value| value != "unspecified" && value != "set" && value != "unset")
It would be nice to have an explanation what the meanings are of the
value
that we filter out. It seems likeset
andunset
actually refer to boolean values, so that could theoretically be a part of the output ofread_attribute_for_file
, e.g. with an enum like:But that's not at all necessary for this PR
I'm not sure that I'd map set/unset to bool (I have another crate where I did it here), but I can make it an actual structure.
That's the semantics I figured they have, based on the docs. But looking at some attributes later in that same file (
text
,eol
), it looks like they have some freedom to decide their own semantics. So I guessSet
andUnset
variants are okay, even though they don't necessarily make sense formergiraf.language
@ -97,1 +102,4 @@
git::read_lang_attribute(repo_dir, filename.as_ref())
}
/// Loads a language either by name or by detecting it from a filename
Could you please update the docs to mention the new detection mechanism? I think at this point you'll need to make it a list
@ -97,2 +103,4 @@
}
/// Loads a language either by name or by detecting it from a filename
pub fn find_by_filename_or_name<P>(
I think at this point this could be called just
find
@ -106,6 +115,12 @@ impl LangProfile {
if let Some(lang_name) = language_name {
Self::find_by_name(lang_name)
.ok_or_else(|| format!("Specified language '{lang_name}' could not be found"))
I think it's a bit unfortunate that, even if both
language_name
andrepo_dir
are notNone
, we only ever try to get the filename based on the former.What I think would be the solution is to separate the step of getting
filename
from that offind_by_name
, so that we could chain all the ways of getting the former, and then try to do the latter once. Although now that I think of it, that would backfire if the language is specified both in the CLI and in.gitattributes
, but the former specification is faulty while the second one is correct.This is complicated... Any ideas?
I could have sworn I replied to this…but oh well.
If the user explicitly specifies a language, they may be expecting a newer
mergiraf
that does support their language. To me, an explicit language request is higher than any ambient setting and an explicit attribute-based setting is higher priority than heuristics based on file extension. Additionally, the first explicit request should be taken without looking at fallbacks (imagine ifgit merge -s bogus
would ignore thebogus
and instead just go and useort
orrecursive
strategies).Oh, so the placement of the new condition is intentional! I might not be the brightest, because I didn't realize that.. I would appreciate a comment about that as well^^
Makes sense as well. I would like to have a comment mentioning that, but maybe that could go together with the other one.
@ -106,6 +115,12 @@ impl LangProfile {
if let Some(lang_name) = language_name {
Self::find_by_name(lang_name)
.ok_or_else(|| format!("Specified language '{lang_name}' could not be found"))
} else if let Some(lang_name) =
I got very confused by this
if let Some(lang_name) = repo_dir
thing – I thought it was the same as the previous branch or something. I think it could be made more clear by using let-chains:@ -428,0 +447,4 @@
fn find_by_filename_or_name_vcs() {
let mut working_dir = env::current_exe().unwrap();
working_dir.pop();
let tempdir = tempfile::tempdir_in(working_dir).unwrap();
Is there any reason not to do just
tempfile::tempdir
?I think it is better to place it under
$OUT_DIR
rather thanenv::temp_dir()
just to avoid polluting the system.I mean the directory will get removed at the end of the test either way (thanks to the
Drop
impl oftempdir
).. but sure that works@ -428,0 +489,4 @@
) -> Result<&'static LangProfile, String> {
LangProfile::find_by_filename_or_name(filename, name, Some(repo_dir))
}
let find = |filename, name| find_impl(filename, name, tempdir.path());
At this point, why not inline
find_impl
intofind
?Only closures can capture
tempdir
. This seemed cleaner than passing it from every callsite.Yes, I meant leaving
find
as a closure, but inliningfind_impl
inside itThough I wonder whether this information should even go into
.gitattributes
specifically? It's not really VCS-specific, and isn't going to be used by Git. I mean I guess so, since Mergiraf is "just" a merge driver, so a sub-tool of Git. But maybe this could be the start of themergiraf.toml
configuration file 👀Thanks a lot for this initiative! I can relate to the need.
Before diving too deep in the specifics of the implementation, I think it's worth gathering consensus on whether the overall approach is the one we want to go for. This is useful to avoid wasting work on the fine details of this PR if we end up choosing another solution.
At the moment, one way to achieve this is to register a separate merge driver, which contains the appropriate
--language
option in the CLI invocation. For instance, in.gitconfig
:and then in
.gittatributes
Obviously, it's not super convenient, but it's already one way to do it. So, one approach to addressing the need could consist in just documenting this approach (not saying it's necessarily the best one).
If we are not satisfied with this UX, then it's worth thinking about all the sorts of things we want to let users configure. Association of filenames to languages is one thing, but there could be many others:
--fast
,--compact
…)I think there is value in keeping configuration contained in not too many different places (for usability but also to avoid having to read many files at runtime), so I don't know if it's worth starting reading
.gitattributes
ourselves. It might not prove very convenient to extend later on. And as @ada4a mentioned, it might tie mergiraf to git a bit more, making it harder to use it with other VCS (which I think is a worthy goal to keep in mind).@ada4a wrote in #565 (comment):
git
already supports things like:While it is not VCS-specific, other VCS tools don't have attributes like this though.
jj
has a feature request issue to implement it;hg
doesn't AFAICS either.I've found attributes a good place to store these kinds of details about files (even if Git itself doesn't care). For example, in our projects, we use it to indicate which files should be formatted. So we have
format.clang-format=16
means "format this file usingclang-format
version 16. For vendored third party projects, we can the unset this (e.g., if we set that for "all*.cxx
files" so that we are not reformatting vendored code).Not necessarily disagreeing with you @mathstuf, but globbing at least we could implement thanks to
glob
somewhat of a tangent: one could argue that a merge driver should be able to ask Git to pass to it attribute values for the to-be-merged file during invocation -- that kind of already happens with
%L
, with which a merge driver can get the value of theconflict-marker-size
attributeI think it is a solution, yes. However, with a single
merge=mergiraf
, I can rely on a global "I know about amergiraf
merge driver. With custom drivers like this, I am "required" to have in-repo instructions for each special override. Global settings may conflict with other settings.I do doubt it to really be an issue in practice, but Prolog and Perl do both share the
.pl
extension and if both get support, I think something like this would be very useful for whichever loses the default extension/language link (likely Prolog…but it is probably far easier to have support here for Prolog given the relative syntactic complexity of the two).That does seem useful. However, that would require some kind of markup to suppress the
--language
flag if it is unset or unspecified.I think an unset value would be plain wrong, because I'm pretty sure that that does mean
false
, so a bool, whereas we require a string. So we could emit a warning in that case.But otherwise we could handle the two cases the same – I'm pretty sure the
language
field is already anOption
, and so we can set it toNone
on bothunset
andunspecified
Thanks both for your points! I've thought about it more and
.gitattributes
does seem like a good fit. I like @ada4a's idea of introducing a dedicated configuration file for the tool, as it would give us a lot of control over its structure, having it accommodate a lot more configuration settings, but I also see advantages to using gitattributes:git check-attr
.If we want to introduce a
mergiraf.toml
config file, it should still be possible to do that later… I guess we'll have to keep thegit check-attr
call for backwards compatibility but it doesn't feel like an outrageously costly thing to do.So I'm happy to go ahead with this approach :)
Right, about the linguist attributes -- it seems that they're indeed a bit of a standard, so maybe we should just use them instead of
mergiraf.language
? Though we should check whether they call their languages the same way we doCircling back to this: I think it would be worth addressing the points @ada4a raised and after that it seems to me that it would be good to merge.
I'm traveling and won't be able to make a release before end of next week (since it still relies on a cross-compilation set up that I only have on another machine - I know… I have it as a goal to migrate that to the CI instead). So I would aim to make the release then, so perhaps it could be nice to merge this PR before that? I'm happy to help incorporate @ada4a's points if needed.
I'm not allowed to edit this PR so I'm proposing to continue it in #599.
Pull request closed