manual-language-selection #296

Closed
mathstuf wants to merge 5 commits from mathstuf/mergiraf:manual-language-selection into main
Member

Given that I found #295, this would help users work around the problem without needing to patch the source code.

--- Given that I found #295, this would help users work around the problem without needing to patch the source code.
Lifetime elision otherwise ties the lifetime to the input `filename`
lifetime which is not true.
If autodetection is not accurate, allow an override.
Author
Member

I'll add test cases if the code looks OK.

I'll add test cases if the code looks OK.
ada4a left a comment
Owner

Hi, thanks for the PR(s)!

I've left a... larger-than-usual number of comments, but I guess that's partly because this a non-trivial change

Hi, thanks for the PR(s)! I've left a... larger-than-usual number of comments, but I guess that's partly because this a non-trivial change
@ -34,0 +36,4 @@
/// Load a profile by language name.
pub fn find_by_name(name: &str) -> Option<&'static Self> {
SUPPORTED_LANGUAGES.iter().find(|lang_profile| {
lang_profile.name.eq_ignore_ascii_case(name)
Owner

having this be case-insensitive is definitely a good call

having this be case-insensitive is definitely a good call
mathstuf marked this conversation as resolved
@ -34,2 +46,3 @@
/// Detects the language of a file based on its filename
pub fn detect_from_filename<P>(filename: &P) -> Option<&Self>
pub fn detect_from_filename<P>(filename: &P) -> Option<&'static Self>
Owner

good catch!

good catch!
mathstuf marked this conversation as resolved
src/lib.rs Outdated
@ -113,14 +113,27 @@ pub fn line_merge_and_structured_resolution(
attempts_cache: Option<&AttemptsCache>,
debug_dir: Option<&'static Path>,
timeout: Duration,
language: Option<String>,
Owner

Option<&str> ought to suffice here. Looks like a false-negative of clippy::needless_pass_by_value (which we have on)

Same for resolve_merge_cascading

`Option<&str>` ought to suffice here. Looks like a false-negative of `clippy::needless_pass_by_value` (which we have on) Same for `resolve_merge_cascading`
mathstuf marked this conversation as resolved
src/lib.rs Outdated
@ -124,0 +118,4 @@
let lang_profile = if let Some(lang_name) = language {
let Some(lang_profile) = LangProfile::find_by_name(&lang_name) else {
debug!(
"Unknown language {}. Falling back to a line-based merge.",
Owner

I feel like "unknown language" is not very descriptive. I'd like this to somehow mention that it has to do with the specified language -- something as simple as "Language specified could not be found".

Also this should arguably be a warn!. I think the "Could not find a supported language ..." thing sets a false precedent -- we should probably be warning there as well. I think we use warnings too sparingly in general, but that can be a discussion for another time -- this is good for now.

I feel like "unknown language" is not very descriptive. I'd like this to somehow mention that it has to do with the _specified_ language -- something as simple as "Language specified could not be found". Also this should arguably be a `warn!`. I think the "Could not find a supported language ..." thing sets a false precedent -- we should probably be warning there as well. I think we use warnings too sparingly in general, but that can be a discussion for another time -- this is good for now.
Owner

Here's the reasoning behind the existing debug!: #135 (comment). I think I forgot about info! at the time, which IMO would've been a better option than debug!. So maybe this could be info! as well.

Here's the reasoning behind the existing `debug!`: https://codeberg.org/mergiraf/mergiraf/pulls/135#issuecomment-2579213. I think I forgot about `info!` at the time, which IMO would've been a better option than `debug!`. So maybe this could be `info!` as well.
Author
Member

I think a warning for a manually-specified language makes sense; info for the auto-detection fallback also makes sense.

I think a warning for a manually-specified language makes sense; `info` for the auto-detection fallback also makes sense.
ada4a marked this conversation as resolved
src/lib.rs Outdated
@ -124,0 +119,4 @@
let Some(lang_profile) = LangProfile::find_by_name(&lang_name) else {
debug!(
"Unknown language {}. Falling back to a line-based merge.",
lang_name
Owner

Could you inline lang_name into the format string please? We have this warning for {e,}println!s, but it doesn't flag the logging functions unfortunately... (since they aren't from std)

Could you inline `lang_name` into the format string please? We have this warning for `{e,}println!`s, but it doesn't flag the logging functions unfortunately... (since they aren't from std)
mathstuf marked this conversation as resolved
src/lib.rs Outdated
@ -124,0 +121,4 @@
"Unknown language {}. Falling back to a line-based merge.",
lang_name
);
return line_based_merge(contents_base, contents_left, contents_right, &settings);
Owner

Part of me wishes that we'd at least try to LangProfile::detect_from_filetype here... I guess that depends on whether false detection is more probable than failed detection -- I.. don't think it is?

Part of me wishes that we'd at least try to `LangProfile::detect_from_filetype` here... I guess that depends on whether false detection is more probable than failed detection -- I.. don't think it is?
Author
Member

The compelling case, to me, is the .pl extension: it is used for both Perl and Prolog. Any kind of auto-detection there is destined for failure without considering the contents of the file as well (but Perl is certainly the global safer-choice default). My view is that the user said it is language X, let's not second guess if things are going downhill.

The compelling case, to me, is the `.pl` extension: it is used for both Perl and Prolog. Any kind of auto-detection there is destined for failure without considering the contents of the file as well (but Perl is certainly the global safer-choice default). My view is that the user said it is language X, let's not second guess if things are going downhill.
Owner

I mean even if we were to choose the wrong lang_profile, we'd fail almost immediately after that, when we try to parse the file with the wrong parser. But I see your point -- let's leave this as is

I mean even if we were to choose the wrong `lang_profile`, we'd fail almost immediately after that, when we try to parse the file with the wrong parser. But I see your point -- let's leave this as is
ada4a marked this conversation as resolved
src/lib.rs Outdated
@ -419,3 +430,1 @@
fname_base.display()
)
})?;
let lang_profile = language
Owner

This might be a personal thing, but I find or_else and ok_or_else in the same method chain to be very confusing. Also I think needing to resort to Option::transpose is almost always.. not a good sign.

The following should be identical logic-wise while also being more legible?:

let lang_profile = if let Some(lang_name) = language {
    LangProfile::find_by_name(&lang_name)
        .ok_or_else(|| format!("Unknown language {lang_name}"))?
} else {
    LangProfile::detect_from_filename(fname_base).ok_or_else(|| {
        format!(
            "Could not find a supported language for {}",
            fname_base.display()
        )
    })?
};

It being similar to the thing over at line_merge_and_structured_resolution might be considered an additional benefit

This might be a personal thing, but I find `or_else` and `ok_or_else` in the same method chain to be very confusing. Also I think needing to resort to `Option::transpose` is almost always.. not a good sign. The following should be identical logic-wise while also being more legible?: ```rust let lang_profile = if let Some(lang_name) = language { LangProfile::find_by_name(&lang_name) .ok_or_else(|| format!("Unknown language {lang_name}"))? } else { LangProfile::detect_from_filename(fname_base).ok_or_else(|| { format!( "Could not find a supported language for {}", fname_base.display() ) })? }; ``` It being similar to the thing over at `line_merge_and_structured_resolution` might be considered an additional benefit
mathstuf marked this conversation as resolved
src/lib.rs Outdated
@ -422,0 +430,4 @@
let lang_profile = language
.map(|lang_name| {
LangProfile::find_by_name(&lang_name)
.ok_or_else(|| format!("Unknown language {lang_name}"))
Owner
See https://codeberg.org/mergiraf/mergiraf/pulls/296/files#issuecomment-3129242
ada4a marked this conversation as resolved
@ -162,6 +163,7 @@ pub static SUPPORTED_LANGUAGES: LazyLock<Vec<LangProfile>> = LazyLock::new(|| {
},
LangProfile {
name: "Java properties",
Owner

I'd definitively have a test case for languages whose names contain spaces -- but that's for later

I'd definitively have a test case for languages whose names contain spaces -- but that's for later
@ -349,6 +354,7 @@ pub static SUPPORTED_LANGUAGES: LazyLock<Vec<LangProfile>> = LazyLock::new(|| {
},
LangProfile {
name: "Javascript",
alternate_names: &[],
Owner

(This is just an example but) I really don't expect anyone to type out "javascript" -- the first instinct will definitively be "js". This raises a question of whether we should allow specifying languages just by their extensions

(This is just an example but) I really don't expect anyone to type out "javascript" -- the first instinct will definitively be "js". This raises a question of whether we should allow specifying languages just by their extensions
Author
Member

I think the list can certainly be expanded in the future. Probably want a test to make sure there aren't any ambiguous options…

I think the list can certainly be expanded in the future. Probably want a test to make sure there aren't any ambiguous options…
mathstuf force-pushed manual-language-selection from 8bbf5415d4 to 6bf6706c2e 2025-03-30 16:29:16 +02:00 Compare
ada4a left a comment
Owner

Ok I think the code looks pretty good now! The only thing left are tests...

Ok I think the code looks pretty good now! The only thing left are tests...
Owner

Oh and this will definitely need docs.. I think a paragraph in mergiraf.org/usage should do, but I'm not sure where exactly one would put it. Since this is supposed to be a last-resort (or at least unhappy-path) kind of option, probably somewhere near the end?

Oh and this will definitely need docs.. I think a paragraph in mergiraf.org/usage should do, but I'm not sure where exactly one would put it. Since this is supposed to be a last-resort (or at least unhappy-path) kind of option, probably somewhere near the end?
Owner

I think the option would make sense for mgf_dev as well (for both subcommands in fact), but we can leave that for another PR

I think the option would make sense for `mgf_dev` as well (for both subcommands in fact), but we can leave that for another PR
Author
Member

Just wanted to chime back in: not sure when I'll get time to look at this again.

Another thing I thought about was using gitattributes to store language information for files that use this to avoid having to pass the flag all the time. But that's a separate PR :) . Something like mergiraf.language=c++ or the like.

Just wanted to chime back in: not sure when I'll get time to look at this again. Another thing I thought about was using `gitattributes` to store language information for files that use this to avoid having to pass the flag all the time. But that's a separate PR :) . Something like `mergiraf.language=c++` or the like.
Owner

@mathstuf thanks a lot for the update, and for the contributions! :)
I'd be happy to help out with any of the outstanding things (tests, documentation) to get this PR merged, so it doesn't get stale.

@mathstuf thanks a lot for the update, and for the contributions! :) I'd be happy to help out with any of the outstanding things (tests, documentation) to get this PR merged, so it doesn't get stale.
Owner

Another thing I thought about was using gitattributes to store language information for files that use this to avoid having to pass the flag all the time. But that's a separate PR :) .

Using .gitattributes for this does make a lot of sense.

Something like mergiraf.language=c++ or the like.

The only reservation I have about this is that this kind of language-identifier-attribute feels like something many more programs would use, not just Mergiraf. Therefore I'd be keen to use some standard attribute instead of introducing a Mergiraf-specific one. Unfortunately, the closest thing to that (that I know of) is GitHub's Linguist, and it doesn't even need the specifier most of the time, since its heuristics seem to work good enough on their own.

> Another thing I thought about was using `gitattributes` to store language information for files that use this to avoid having to pass the flag all the time. But that's a separate PR :) . Using `.gitattributes` for this does make a lot of sense. > Something like `mergiraf.language=c++` or the like. The only reservation I have about this is that this kind of language-identifier-attribute feels like something many more programs would use, not just Mergiraf. Therefore I'd be keen to use some standard attribute instead of introducing a Mergiraf-specific one. Unfortunately, the closest thing to that (that I know of) is GitHub's [Linguist](https://github.com/github-linguist), and it doesn't even need the specifier most of the time, since its heuristics seem to work good enough on their own.
wetneb self-assigned this 2025-04-16 12:45:14 +02:00
Owner

I'm working on fixing conflicts, adding tests and documenting this PR.

I'm working on fixing conflicts, adding tests and documenting this PR.
Owner

Let's continue this in #327 (I can't force-push to this branch).

Let's continue this in #327 (I can't force-push to this branch).
wetneb closed this pull request 2025-04-16 13:32:39 +02:00
Author
Member

Sorry for going dormant here, just haven't had time for this. I subscribed to the new PR (strange that a mention didn't do so…).

Also, is there an option I missed to allow maintainers to push or does Forgejo just not support it (yet)?

Sorry for going dormant here, just haven't had time for this. I subscribed to the new PR (strange that a mention didn't do so…). Also, is there an option I missed to allow maintainers to push or does Forgejo just not support it (yet)?
Owner

@mathstuf normally, maintainers should be allowed to push to PR branches, but maybe not force-push (which is what I tried to fix the conflicts)

@mathstuf normally, maintainers should be allowed to push to PR branches, but maybe not force-push (which is what I tried to fix the conflicts)
wetneb referenced this pull request from a commit 2025-04-28 20:29:11 +02:00
All checks were successful
/ test (pull_request) Successful in 55s

Pull request closed

Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
3 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: mergiraf/mergiraf#296
No description provided.