WIP: feat: Gettext support (*.po, *.pot) #584

Closed
wetneb wants to merge 1 commit from wetneb/mergiraf:gettext_support into main
Owner

Interestingly, support for Gettext files is essentially useless without #576, because each entry in the file is normally preceded by a comment. Also, this `comment` node is not marked as "extra" in the grammar. We will probably need an additional field in the language profiles to specify a set of node types that should also be bundled, even though they are not extras. This would also be useful for Python docstrings (assuming we can change the grammar appropriately).
feat: Gettext support (*.po, *.pot)
Some checks failed
/ test (pull_request) Failing after 1m41s
8ac9df1914
Owner

> Also, this `comment` node is not marked as "extra" in the grammar.

Hm, looking at the [spec](https://www.gnu.org/software/gettext/manual/html_node/PO-File-Entries.html), I don't think those comments are really extra; rather, they're optional attributes of the message they annotate. Bundling them together with the message looks to me like something that should be implemented by the grammar...
Author
Owner

I agree, but this sentence makes me think that some comments might be allowed to appear as isolated lines:

> The comment lines beginning with `#,` are special because they are not completely ignored by the programs as comments generally are.

Doesn't that imply that the other comments (starting with `# `) are meant to be completely ignored by the program, and hence could appear in any order and on any line in the file?

For instance, you could perhaps have something like this:

```
# Dear translator, you are awesome, thank you for volunteering!

# --- Dialogs ---

msgid "ok"
msgstr "OK"

msgid "cancel"
msgstr "Cancel"

# --- Error messages ---

msgid "unknown-error"
msgstr "An unknown error happened"

msgid "not-found"
msgstr "File not found"

# Thank you for completing the translation of foobar!
```
Owner

Ah, I see what you mean. I [tried searching GitHub](https://github.com/search?q=path%3A*.po+%22%23+%22&type=code) for some examples, and apparently what happens instead is that the `# ` comments are put onto a dummy entry at the beginning of the file, like this:

```po
# something something name
# something something licence
msgid ""
msgstr ""
```

This would suggest that the comments can't be free-standing after all (even though your example does look plausible and helpful).

But that's of course no hard proof. I'm assuming that the Gettext format support was requested by someone? If so, I'd probably ask them about this, because they're likely more familiar with the format.
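One way to look for evidence either way would be to scan real `.po` files for comment blocks that are not directly followed by an entry. The following is only a rough sketch: `find_freestanding_comments` is a hypothetical helper, not part of mergiraf, and it assumes that a comment block followed by a blank line (or by the end of the file) counts as free-standing. Lines starting with `#~` are skipped because they mark obsolete entries rather than comments.

```python
def find_freestanding_comments(po_text: str) -> list[list[str]]:
    """Return comment blocks that are not directly followed by an entry line.

    Heuristic only: a block counts as free-standing when the line right
    after it is blank or when the file ends.
    """
    freestanding = []
    block = []
    # The trailing "" acts as a sentinel so that a block at end of file is flushed.
    for line in po_text.splitlines() + [""]:
        stripped = line.strip()
        # "#~" lines are obsolete entries, so they are treated as entry lines.
        if stripped.startswith("#") and not stripped.startswith("#~"):
            block.append(stripped)
        else:
            if block and stripped == "":
                freestanding.append(block)
            block = []
    return freestanding
```

Run over a corpus of files, the number of non-empty results would give a rough idea of how often free-standing comments occur in practice.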
Author
Owner

Support for this format wasn't asked for by anyone, but in the dataset of merge cases that I'm gathering, it's the most common file format that we don't handle yet.

I totally believe that, for the overwhelming majority of PO files out there, the comments can indeed be bundled in the grammar like you said. We can actually use the dataset to check whether the number of parsing errors increases if we change the grammar.
wetneb force-pushed gettext_support from 8ac9df1914 to 452d2d3398 2025-09-17 15:04:30 +02:00 Compare
Author
Owner

So, thanks to the bundling feature, the motivating test case now passes, but when I run this on real examples of merge conflicts in `.po` files, the results are not very convincing.

This seems to come mostly from cases like this:

```
<<<<<<< LEFT
#: template.php:92
msgid "today"
msgstr "dnes"
||||||| BASE
=======
#: template.php:108
msgid "today"
msgstr "dnes"
>>>>>>> RIGHT
```

There is no way for mergiraf to know how to resolve the conflict between `template.php:92` and `template.php:108`.
The desired behavior from a user standpoint would be to pick either of the two, because it doesn't matter: at the next update of the file, the line number will be corrected. What matters is the `msgid` to `msgstr` mapping, for which there is no conflict here.

So maybe mergiraf isn't a great fit for `.po` files.
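To make the desired behavior concrete, here is a minimal sketch of a format-specific resolution rule that compares two entries while ignoring their `#:` reference comments. This is not something mergiraf does; `Entry`, `parse_entry` and `resolve` are hypothetical names, and the parser assumes simple entries with single-line `msgid` and `msgstr` values (no `msgctxt`, plural forms or multi-line strings).

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    references: tuple[str, ...]      # "#:" lines (source locations)
    other_comments: tuple[str, ...]  # every other comment line
    msgid: str
    msgstr: str


def parse_entry(text: str) -> Entry:
    """Parse one simple PO entry (single-line msgid/msgstr only)."""
    references, others = [], []
    msgid = msgstr = ""
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith("#:"):
            references.append(line)
        elif line.startswith("#"):
            others.append(line)
        elif line.startswith("msgid "):
            msgid = line[len("msgid "):]
        elif line.startswith("msgstr "):
            msgstr = line[len("msgstr "):]
    return Entry(tuple(references), tuple(others), msgid, msgstr)


def resolve(left: str, right: str) -> Optional[str]:
    """Return the left side if both sides differ only in "#:" references."""
    l, r = parse_entry(left), parse_entry(right)
    if l._replace(references=()) == r._replace(references=()):
        return left  # arbitrary choice: the reference is refreshed on the next update
    return None      # a genuine conflict, left for the user
```

With the two sides of the conflict above, `resolve` returns the left entry; if the `msgstr` values differed, it would return `None` and the conflict would remain.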
Author
Owner

Here are more detailed statistics:

| Language | Cases | Exact | Format | Conflict | Differ | Parse | Panic | Time (s) |
| -------- | ----- | ----- | ------ | -------- | ------ | ----- | ----- | -------- |
| `*.po` | 5,864 | 10 (0%) | 0 | 5,524 (94%) | 9 (0%) | 320 (5%) | 1 (0%) | 0.764 |

Not exactly convincing.
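As a quick sanity check on these figures, the snippet below recomputes the shares from the raw counts. It assumes that the outcome columns partition the cases, which the counts are consistent with since they add up to the 5,864 total.

```python
total = 5864
counts = {
    "Exact": 10,
    "Format": 0,
    "Conflict": 5524,
    "Differ": 9,
    "Parse": 320,
    "Panic": 1,
}

# The outcome counts should add up to the total number of cases.
assert sum(counts.values()) == total

# Share of each outcome in percent, unrounded.
shares = {name: 100 * n / total for name, n in counts.items()}
for name, share in shares.items():
    print(f"{name:<8} {share:5.1f}%")
```

Unrounded, the conflict share comes out just above 94% and the exact-match share below 0.2%, in line with the rounded values in the table.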
wetneb closed this pull request 2025-09-17 20:47:43 +02:00
All checks were successful
/ test (pull_request) Successful in 1m38s

Reference: mergiraf/mergiraf#584