Repository Languages can provide inaccurate percentages with some files (Puppet seen as Pascal)
Summary
The repository languages feature can show incorrect language percentages for repositories that contain Puppet (`.pp`) files. These files are misidentified as "Pascal", which also uses the `.pp` extension.
For example, this test project (selected at random because it contains a good mix of Puppet files), copied from a source on a competing Git hosting site, shows different percentages on each platform:
Source | GitLab.com |
---|---|
86.1% Puppet, 9.8% Shell, 3.1% Ruby | 50% Pascal, 31.4% Puppet, 9.2% Shell, 5.4% M4 |
Additionally, we see different results when analyzing the repository locally with several different tools. As noted in this epic, GitLab switched from `github-linguist` to `go-enry` in 15.x, but the results are not always accurate. I've also included `SCC` for comparison:
github-linguist
81.47% 108946 Puppet
9.23% 12349 Shell
5.38% 7193 M4
2.93% 3916 Ruby
0.99% 1321 HTML
go-enry
50.04% Pascal
31.43% Puppet
9.23% Shell
5.38% M4
2.93% Ruby
0.99% HTML+ERB
SCC
───────────────────────────────────────────────────────────────────────────────
Language Files Lines Blanks Comments Code Complexity
───────────────────────────────────────────────────────────────────────────────
Puppet 497 5481 534 2199 2748 48
Ruby 23 165 23 0 142 2
Shell 23 324 87 73 164 58
Ruby HTML 3 41 9 0 32 0
Markdown 2 72 19 0 53 0
Gemfile 1 10 1 0 9 0
Monkey C 1 176 0 0 176 9
Rakefile 1 24 6 1 17 0
YAML 1 14 0 0 14 0
gitignore 1 1 0 0 1 0
───────────────────────────────────────────────────────────────────────────────
Total 553 6308 679 2273 3356 117
───────────────────────────────────────────────────────────────────────────────
Author note: in the SCC output, Puppet accounts for about 87% of "Lines" (5481 of 6308) and about 82% of "Code" (2748 of 3356) in the repository.
The values from `github-linguist` and `SCC` are closer to the actual composition of the repository, and neither tool incorrectly guesses that Pascal is used, which is what `go-enry` currently does.
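
As a quick sanity check on the numbers above, the following Python sketch recomputes the `github-linguist` percentages from the byte counts it reported, then adds up the two `go-enry` shares for languages that use the `.pp` extension. The function and variable names are illustrative only; the figures are taken from the tool output in this issue.

```python
# Byte counts per language, as reported by github-linguist in this issue.
linguist_bytes = {
    "Puppet": 108946,
    "Shell": 12349,
    "M4": 7193,
    "Ruby": 3916,
    "HTML": 1321,
}


def shares(byte_counts):
    """Return each language's share of the total bytes, in percent (2 decimals)."""
    total = sum(byte_counts.values())
    return {lang: round(100 * n / total, 2) for lang, n in byte_counts.items()}


linguist_shares = shares(linguist_bytes)

# go-enry percentages for the two languages that claim the .pp extension.
enry_pascal, enry_puppet = 50.04, 31.43
enry_pp_total = round(enry_pascal + enry_puppet, 2)

print("linguist Puppet share:", linguist_shares["Puppet"])
print("go-enry Pascal + Puppet:", enry_pp_total)
```

Both values come out at 81.47, so the Pascal share reported by `go-enry` appears to be made up entirely of `.pp` files that `github-linguist` counts as Puppet.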
Steps to reproduce
View, clone, or fork the example project to see the current language percentages.
Example Project
The same project as linked above.
What is the current bug behavior?
In repositories that contain Puppet files, the `.pp` files are misidentified as Pascal, which skews the language percentages.
What is the expected correct behavior?
The language percentages are correct, or at least as accurate as those shown by our competitors.
Output of checks
This bug happens on GitLab.com
Possible fixes
It looks like both enry and linguist include content-based detection rules for telling Puppet and Pascal apart. Is it possible that some of this detection is skipped or missed when a repository is scanned?
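
To illustrate the kind of content-based check meant here, below is a minimal Python sketch that classifies a `.pp` file by looking for language-specific markers. The two patterns and the `classify_pp` helper are simplified assumptions for illustration; they are not copied from enry or linguist, whose actual rules may differ.

```python
import re

# Illustrative patterns only (assumptions, not the real enry/linguist rules).
# Pascal blocks typically close with "end." or "end;".
PASCAL_HINT = re.compile(r"^\s*end[.;]", re.MULTILINE | re.IGNORECASE)
# Puppet resources typically contain indented "attribute => value" lines.
PUPPET_HINT = re.compile(r"^\s+\w+\s+=>\s", re.MULTILINE)


def classify_pp(content: str) -> str:
    """Guess the language of a .pp file from its content (hypothetical helper)."""
    if PUPPET_HINT.search(content):
        return "Puppet"
    if PASCAL_HINT.search(content):
        return "Pascal"
    # Neither marker was found, so the extension alone cannot decide.
    return "ambiguous"


puppet_sample = """class ntp {
  package { 'ntp':
    ensure => installed,
  }
}
"""

pascal_sample = """program Hello;
begin
  writeln('Hello');
end.
"""

print(classify_pp(puppet_sample))
print(classify_pp(pascal_sample))
print(classify_pp("include ntp\n"))
```

In this sketch, a file that matches neither pattern stays "ambiguous". How a real detector resolves that case (for example, by falling back to a default language for the extension) could be one place where Puppet files end up counted as Pascal, but that is a guess and would need to be checked against the actual scanning code.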