#335 Default list encoding

Milestone: PeerGuardian_Linux
Status: accepted
Owner: nobody
Labels: None
Priority: 9
Updated: 2016-08-20
Created: 2016-06-11
Creator: Ganply
Private: No

The default charset is set to "ISO8859-1" in parser.c.
IBlockList files are encoded in "UTF-8", so PeerGuardian is no longer able to read them.

Discussion

  • jre-phoenix

    jre-phoenix - 2016-06-12
    • status: open --> accepted
    • Priority: 5 --> 9
     
  • jre-phoenix

    jre-phoenix - 2016-06-12

    Thanks a lot! I hope to look into this and make a bugfix release soon. Unfortunately (for pgl) I'm working on other projects currently.

    I assume you found this because you got a master blocklist that was way too small?
    Have you fixed this on your side and tested it successfully?

     
  • Julio Lajara

    Julio Lajara - 2016-08-20

    I don't agree that it's not able to load the UTF-8 files. I started looking into this because of other oddities I was seeing with the default blocklists, and I have come to the conclusion that pgld is not only capable of properly loading UTF-8, it in fact converts to UTF-8 by default (although incorrectly, unless you explicitly tell it to load a particular file as UTF-8 using the -c flag to pgld).

    So here is what happens when you run pgld:
    1. pgld.c processes the command line options, looking for -c, which sets the default charset of the files (if provided).
    2. pgld.c builds a list of all blocklists passed to it and assigns them all the same charset. This is the first flaw in this method, because it's a huge assumption: if one of the lists is in a charset other than ISO8859-1 or ASCII and you don't explicitly provide the correct charset, the conversion assumes ISO8859-1, strips any non-ASCII characters from the loaded strings, and then converts the result to UTF-8.
    3. pgld.c loads all the lists provided by calling load_list in parser.c.
    4. In parser.c, load_list takes the file name and the charset provided to pgld on the command line with the -c option (or the default of ISO8859-1). This is where, if you don't provide a charset to pgld, it uses ISO8859-1 and strips away any characters that are not ASCII, but still loads the file (I don't know how it will behave with a more exotic charset). It then uses iconv_open to convert the file to UTF-8 once loaded, regardless of its input charset.
    5. pgld either dumps the contents of the file back to you and exits, or merges all the lists and starts the blocking queue, if I understand correctly what that is.
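
    The flow in steps 1 to 4 can be sketched roughly as follows. This is a Python stand-in written for illustration, not pgld's actual C code: `load_list` and `load_all` here are hypothetical helpers, and the sketch does a plain charset conversion, so it does not reproduce any stripping or truncation that pgld itself may apply.

```python
# Hypothetical stand-in for the flow described above, not pgld's real code.
DEFAULT_CHARSET = "iso-8859-1"  # plays the role of the ISO8859-1 default


def load_list(raw, charset=DEFAULT_CHARSET):
    """Decode one blocklist with the given charset and return it as UTF-8."""
    return raw.decode(charset).encode("utf-8")


def load_all(lists, charset=DEFAULT_CHARSET):
    """Apply the same charset to every list, as in step 2."""
    return [load_list(raw, charset) for raw in lists]


# One line in the name:start-end form seen in the edu list, encoded as UTF-8.
utf8_list = "Collège Boréal:38.112.18.84-38.112.18.87\n".encode("utf-8")

right = load_all([utf8_list], charset="utf-8")[0]  # like passing -c UTF-8
wrong = load_all([utf8_list])[0]                   # no charset given

print(right.decode("utf-8"), end="")
print(wrong.decode("utf-8"), end="")
```

    With the correct charset the bytes come back unchanged. With the default, each byte of a multi-byte UTF-8 character is treated as its own ISO8859-1 character and encoded again, so the label text is garbled while the line still loads.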

    With the above, it's easy to test that this is what's happening. We know that the iblocklist edu blocklist is UTF-8:

    $ wget 'http://list.iblocklist.com/?list=imlmncgrkbnacgcwfjvh&fileformat=p2p&archiveformat=gz' -O edu.gz
    $ gunzip edu.gz
    $ file edu
    edu: UTF-8 Unicode text
    

    Now test what pgld is actually doing:

    $ pgld -c UTF-8 -m edu > test_utf8
    WARN: No valid ASCII blocklist format line: # List distributed by iblocklist.com
    WARN: No valid ASCII blocklist format line: 
    INFO: ASCII: 48227 entries loaded from "edu"
    INFO: Merged 11443 of 48227 entries.
    INFO: Blocking 36784 IP ranges (227996116 IPs).
    $ pgld -m edu > test_ascii
    WARN: No valid ASCII blocklist format line: # List distributed by iblocklist.com
    WARN: No valid ASCII blocklist format line: 
    INFO: ASCII: 48227 entries loaded from "edu"
    INFO: Merged 11443 of 48227 entries.
    INFO: Blocking 36784 IP ranges (227996116 IPs).
    $ diff test_utf8 test_ascii | head -n 15
    1238c1238
    < Collège Boréal:38.112.18.84-38.112.18.87
    ---
    > Collège Boréal:38.112.18.84-38.112.18.87
    1370c1370
    < reassign to \\\"Rajabhat Rajanagarindra University ÁËÒÇÔ·:58.137.146.0-58.137.146.255
    ---
    > reassign to \\\"Rajabhat Rajanagarindra University ÁËÒ:58.137.146.0-58.137.146.255
    1643c1643
    < QINGDAO UNIVERSITY OF SCIENCE AND TECHNOLOGY¡ê? EASTERN AREA-:60.209.128.160-60.209.128.191
    ---
    > QINGDAO UNIVERSITY OF SCIENCE AND TECHNOLOGY¡ê? EASTERN A:60.209.128.160-60.209.128.191
    3325c3325
    < XiaoGan vocational technical education institute£:61.183.22.128-61.183.22.255
    

    So based on the above, the worst case is one of three things: a failure to load because it can't interpret the original file as ISO8859-1 (which I don't even know is possible), a conversion to the best equivalent in ISO8859-1, or an outright truncation of the text, as in the QINGDAO case above.

    Did we lose anything in the conversion to UTF-8?
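
    One reason the damage stays confined to the labels: the IP range part of each line is pure ASCII, and ASCII bytes mean the same thing in UTF-8 and ISO8859-1. A minimal sketch, assuming the name:start-end layout visible in the diff (`ip_range` is a hypothetical helper, not a pgld function):

```python
def ip_range(raw, charset):
    """Return the text after the last ':' once the line is decoded."""
    return raw.decode(charset).rsplit(":", 1)[1]


# A UTF-8 encoded line taken from the diff output above.
line = "Collège Boréal:38.112.18.84-38.112.18.87".encode("utf-8")

as_utf8 = ip_range(line, "utf-8")
as_latin1 = ip_range(line, "iso-8859-1")

# The label differs between the two decodings, the range does not.
print(as_utf8)
print(as_latin1)
```

    Both calls return the same range, which is consistent with pgld reporting identical entry counts with and without -c UTF-8.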

    $ wc -l edu 
    48229 edu
    $ wc -l test_utf8 
    36784 test_utf8
    $ wc -l test_ascii
    36784 test_ascii
    

    So our original edu file has 48229 lines. Minus a blank line and a comment line, that is 48227 entries, and minus the 11443 merged entries (which I'm assuming means the IP ranges got merged, hopefully), that gives the 36784 lines in both output files, regardless of the assumed starting character set.
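
    To make the bookkeeping concrete, here is a small sketch of what merging IP ranges could look like. It is only an assumption about what "Merged" means: `merge_ranges` is a hypothetical helper, the sample addresses are made up, and whether pgld also merges adjacent ranges (as this sketch does) is not confirmed anywhere in this thread.

```python
import ipaddress


def to_int(ip):
    """Convert a dotted IPv4 address to an integer."""
    return int(ipaddress.IPv4Address(ip))


def merge_ranges(ranges):
    """Merge overlapping or adjacent (start, end) integer ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Extend the previous range instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Made-up sample ranges, not taken from the edu list.
ranges = [
    (to_int("10.0.0.0"), to_int("10.0.0.255")),
    (to_int("10.0.0.128"), to_int("10.0.1.255")),   # overlaps the first
    (to_int("10.0.2.0"), to_int("10.0.2.255")),     # adjacent to the second
    (to_int("192.168.0.0"), to_int("192.168.0.255")),
]
merged = merge_ranges(ranges)
print(len(ranges) - len(merged), "merged,", len(merged), "remaining")

# The counts reported by pgld follow the same bookkeeping.
entries = 48229 - 2          # drop the comment line and the blank line
remaining = entries - 11443  # subtract the merged entries
print(entries, remaining)
```

    Under that reading, every merged entry removes one line from the output, which matches the arithmetic above.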

    Now, the only way I can see this going south is with other, more exotic charsets if you don't specify the initial charset correctly, but I haven't yet figured out what options are being passed to pgld and whether it properly changes the charset for each file. Regardless, most of these lists most likely do not have a UTF-8 loading issue.

     

    Last edit: Julio Lajara 2016-08-20
