PeerGuardian / Bugs / #335 Default list encoding

I don't agree that it's not able to load the UTF-8 files. I started looking into this because of other oddities I was seeing with the default block lists, and have come to the conclusion that pgld is not only capable of properly loading UTF-8, it is in fact converting to UTF-8 by default (although incorrectly if you don't explicitly tell it to load a particular file as UTF-8 using the -c flag to pgld).

So what happens when you run pgld:
1. pgld.c processes the command line options, looking for -c for the charset of the files (if provided).
2. pgld.c builds a list of all blocklists passed to it and assigns them all the same charset. This is the first flaw in this method, because it's a huge assumption: if one of the lists is in a charset other than ISO8859-1 or ASCII and you don't explicitly provide the correct charset, the conversion assumes ISO8859-1, mangles any non-ASCII characters in the loaded strings and converts the result to UTF-8.
3. pgld.c loads all the lists provided by calling load_list in parser.c.
4. In parser.c, load_list takes the file name and the charset given to pgld with the -c option (or the default of ISO8859-1). This is where, if you don't provide a charset, it uses ISO8859-1 and mangles any non-ASCII characters but still loads the file (I don't know how it will behave against a more exotic charset). Either way, it uses iconv_open to convert the file to UTF-8 once loaded, regardless of its input charset.
5. pgld either dumps the contents of the file back to you and exits, or merges all the lists and starts the blocking queue, if I understand correctly what that is.
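
A rough way to picture steps 2 and 4, as a Python sketch rather than pgld's actual C code: every list is decoded with one shared charset and then always written out as UTF-8. The function name load_lists and the constant DEFAULT_CHARSET are made up for illustration; only the ISO8859-1 default and the UTF-8 output come from what I described above.

```python
# Hypothetical model of the load path described above; NOT pgld's real code.
DEFAULT_CHARSET = "ISO-8859-1"  # the default named in this post


def load_lists(raw_lists, charset=DEFAULT_CHARSET):
    """Decode every list with one shared charset, then emit UTF-8 bytes."""
    converted = []
    for raw in raw_lists:
        text = raw.decode(charset)              # interpret the bytes as `charset`
        converted.append(text.encode("utf-8"))  # output is always UTF-8
    return converted


# One label as it would sit in a UTF-8 encoded list file.
raw_label = "Collège Boréal".encode("utf-8")

# Charset declared correctly (like passing -c UTF-8).
with_flag = load_lists([raw_label], charset="UTF-8")[0].decode("utf-8")
# No charset given: the UTF-8 bytes are misread as ISO8859-1.
without_flag = load_lists([raw_label])[0].decode("utf-8")

print(with_flag)
print(without_flag)
```

If the model is right, the second value should come out double-encoded, which is the kind of damage you would expect to see in real output when the flag is left off.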

With the above it's easy to test that this is what's happening. We know that the iblocklist edu block list is UTF-8:

$ wget 'http://list.iblocklist.com/?list=imlmncgrkbnacgcwfjvh&fileformat=p2p&archiveformat=gz' -O edu.gz
$ gunzip edu.gz
$ file edu
edu: UTF-8 Unicode text

Now test what pgld is actually doing:

$ pgld -c UTF-8 -m edu > test_utf8
WARN: No valid ASCII blocklist format line: # List distributed by iblocklist.com
WARN: No valid ASCII blocklist format line: 
INFO: ASCII: 48227 entries loaded from "edu"
INFO: Merged 11443 of 48227 entries.
INFO: Blocking 36784 IP ranges (227996116 IPs).
$ pgld -m edu > test_ascii
WARN: No valid ASCII blocklist format line: # List distributed by iblocklist.com
WARN: No valid ASCII blocklist format line: 
INFO: ASCII: 48227 entries loaded from "edu"
INFO: Merged 11443 of 48227 entries.
INFO: Blocking 36784 IP ranges (227996116 IPs).
$ diff test_utf8 test_ascii | head -n 15
1238c1238
< Collège Boréal:38.112.18.84-38.112.18.87
---
> CollÃ¨ge BorÃ©al:38.112.18.84-38.112.18.87
1370c1370
< reassign to \\\"Rajabhat Rajanagarindra University ÁËÒÇÔ·:58.137.146.0-58.137.146.255
---
> reassign to \\\"Rajabhat Rajanagarindra University ÃÃÃ:58.137.146.0-58.137.146.255
1643c1643
< QINGDAO UNIVERSITY OF SCIENCE AND TECHNOLOGY¡ê? EASTERN AREA-:60.209.128.160-60.209.128.191
---
> QINGDAO UNIVERSITY OF SCIENCE AND TECHNOLOGYÂ¡Ãª? EASTERN A:60.209.128.160-60.209.128.191
3325c3325
< XiaoGan vocational technical education institute£:61.183.22.128-61.183.22.255

So based on the above, the worst case is either a failure to load because it can't read the original file as ISO8859-1 (which I don't even know is possible), or it converts to the best equivalent in ISO8859-1, or it outright truncates the label, as in the case of QINGDAO above.
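
On the truncation: counting bytes, the QINGDAO label is exactly 63 bytes of UTF-8 in both outputs, which would be consistent with a fixed-size label buffer applied after conversion. That limit is my guess from the diff, not something I have confirmed in the pgld source, so MAX_LABEL_BYTES and convert_label below are hypothetical names in a Python sketch.

```python
# ASSUMPTION: a 63-byte label limit, inferred from the diff output only.
MAX_LABEL_BYTES = 63


def convert_label(raw, charset):
    """Convert a raw label to UTF-8, then cut it to the assumed byte limit."""
    utf8 = raw.decode(charset).encode("utf-8")
    # errors="ignore" drops a multi-byte character split by the cut.
    return utf8[:MAX_LABEL_BYTES].decode("utf-8", errors="ignore")


# Label as shown in the -c UTF-8 output.
label = "QINGDAO UNIVERSITY OF SCIENCE AND TECHNOLOGY¡ê? EASTERN AREA-"
raw = label.encode("utf-8")

as_utf8 = convert_label(raw, "UTF-8")         # charset declared correctly
as_latin1 = convert_label(raw, "ISO-8859-1")  # default charset assumed

print(len(raw))
print(as_utf8)
print(as_latin1)
```

Each misread non-ASCII character doubles from 2 to 4 bytes, so the tail of the label gets pushed past the limit and cut off.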

Did we lose anything in the conversion to UTF-8?

$ wc -l edu 
48229 edu
$ wc -l test_utf8 
36784 test_utf8
$ wc -l test_ascii
36784 test_ascii

So our original edu file with 48229 lines, minus a blank line and a comment line, is 48227 entries; minus the 11443 merged (which I'm assuming, hopefully correctly, means the IP ranges got merged), that gives us the 36784 in both output files, regardless of the assumed starting character set.
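
To illustrate what I assume "merged" means, here is a generic interval merge in Python. It is not taken from pgld; in particular, whether pgld also joins ranges that are merely adjacent is an assumption, and merge_ranges is a made-up name.

```python
import ipaddress


def merge_ranges(ranges):
    """Merge overlapping or adjacent (start, end) IPv4 ranges."""
    as_ints = sorted(
        (int(ipaddress.IPv4Address(start)), int(ipaddress.IPv4Address(end)))
        for start, end in ranges
    )
    merged = []
    for start, end in as_ints:
        # ASSUMPTION: adjacent ranges (end + 1 == next start) are joined too.
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [
        (str(ipaddress.IPv4Address(s)), str(ipaddress.IPv4Address(e)))
        for s, e in merged
    ]


sample = [
    ("10.0.0.0", "10.0.0.127"),
    ("10.0.0.128", "10.0.0.255"),  # adjacent to the first range
    ("10.0.0.64", "10.0.0.200"),   # overlaps both of the above
    ("192.0.2.0", "192.0.2.255"),  # unrelated range
]
result = merge_ranges(sample)
print(result)

# The same bookkeeping as in the pgld output above.
remaining = 48229 - 2 - 11443
print(remaining)
```

Here four entries collapse into two ranges, so two entries count as merged, the same way 48227 entries minus 11443 merged leaves 36784.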

Now the only way I can see this going south is with other, more exotic charsets if you don't specify the initial charset correctly, but I haven't yet figured out what options are being passed to pgld and whether it's properly changing the charset for each file or not. Regardless, it most likely doesn't have a UTF-8 loading issue for most of these lists.
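
If the charset has to be chosen per file, something like this Python heuristic could pick the value to hand to -c. It is only a sketch: guess_charset is a made-up helper, it only tells ASCII, UTF-8 and "anything else" apart, and I am assuming the returned names are spelled the way pgld (via iconv) expects.

```python
def guess_charset(raw):
    """Guess a charset name for a blocklist from its raw bytes (heuristic)."""
    try:
        raw.decode("ascii")
        return "ASCII"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        # ASSUMPTION: fall back to the ISO8859-1 default; a more exotic
        # charset would need real detection and is not handled here.
        return "ISO-8859-1"


utf8_guess = guess_charset("Collège Boréal".encode("utf-8"))
latin1_guess = guess_charset("Collège Boréal".encode("iso-8859-1"))
ascii_guess = guess_charset(b"Some University:1.2.3.0-1.2.3.255")

print(utf8_guess, latin1_guess, ascii_guess)
```

Valid UTF-8 is rarely produced by accident from ISO8859-1 text, which is why trying UTF-8 first is usually safe, but it is still a guess and not a guarantee.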

Last edit: Julio Lajara 2016-08-20
