I don't agree that it's not able to load UTF-8 files. I started looking into this because of other oddities I was seeing with the default block lists, and I have come to the conclusion that pgld is not only capable of properly loading UTF-8, it is in fact converting to UTF-8 by default (although incorrectly, unless you explicitly tell it to load a particular file as UTF-8 using the -c flag to pgld).
So what happens when you run pgld:
1. pgld.c processes the command line options looking for -c for the default charset of the file (if provided).
2. pgld.c builds a list of all blocklists passed to it and assigns them all the same charset. This is the first flaw in this method, because it's a huge assumption: if one of them is in a charset other than ISO8859-1 or ASCII and you don't explicitly provide the correct charset, the conversion assumes ISO8859-1, strips any non-ASCII characters from the loaded strings, and then converts the result to UTF-8.
3. pgld.c loads all lists provided by calling load_list in parser.c.
4. In parser.c, load_list takes the file name and the charset provided to pgld on the command line with the -c option (or the default of ISO8859-1). This is where, if you don't provide a charset to pgld, it uses ISO8859-1 and strips away any non-ASCII characters but still loads the file (I don't know how it will behave against a more exotic charset). It uses iconv_open to convert the file to UTF-8 once loaded, regardless of its input charset.
5. pgld either dumps the contents of the file back to you and exits, or merges all the lists and starts the blocking queue, if I understand correctly what that is.
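The flow in steps 1 to 4 can be sketched roughly as follows. This is an illustration in Python, not pgld's C code: `parse_charset`, `load_all` and this `load_list` are hypothetical stand-ins, and the only details taken from the description above are the `-c` option, the ISO8859-1 default, and the fact that every list gets the same charset.

```python
import argparse
from typing import Dict, List

# Default described in the post; "iso-8859-1" is Python's name for ISO8859-1.
DEFAULT_CHARSET = "iso-8859-1"


def parse_charset(argv: List[str]) -> str:
    """Step 1: look for -c on the command line, else use the default."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", dest="charset", default=DEFAULT_CHARSET)
    args, _unused = parser.parse_known_args(argv)
    return args.charset


def load_list(raw: bytes, charset: str) -> List[str]:
    """Step 4 (hypothetical stand-in): decode one list using the given charset."""
    # errors="replace" is a choice made for this sketch, not pgld's behaviour.
    return raw.decode(charset, errors="replace").splitlines()


def load_all(blocklists: Dict[str, bytes], charset: str) -> Dict[str, List[str]]:
    """Steps 2 and 3: every list is loaded with the same single charset."""
    return {name: load_list(raw, charset) for name, raw in blocklists.items()}
```

With one UTF-8 list containing the line `é`, `load_all` returns `é` when the charset is `utf-8` and `Ã©` when it is left at the default, which is the "one charset for all lists" problem in miniature.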
With the above, it's easy to test that this is what's happening. We know that the iblocklist edu block list is UTF-8.
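One way to check a claim like this for any blocklist is a strict decode. The sketch below is only an assumption about how one might verify it; `is_valid_utf8` is a hypothetical helper, and the sample strings are made up rather than taken from the edu list.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Made-up sample text, encoded two different ways.
sample_utf8 = "Universität".encode("utf-8")
sample_latin1 = "Universität".encode("iso-8859-1")
```

Here `is_valid_utf8(sample_utf8)` is true and `is_valid_utf8(sample_latin1)` is false. Note that a pure ASCII file also passes, so a successful decode shows the file is compatible with UTF-8, not that it contains any non-ASCII text.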
Based on testing what pgld is actually doing, the worst case is either a failure to load because it can't interpret the original file as ISO8859-1 (which I don't even know is possible), a conversion to the best equivalent in ISO8859-1, or outright truncation, as happened with QINGDAO in that test.
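To make these cases concrete, here is a small sketch using Python's codecs, which may not match what iconv does inside pgld. `misread_as_latin1` and `strip_non_ascii` are hypothetical helpers, the stripping step is an assumption modelled on the behaviour described above, and the sample string is made up.

```python
def misread_as_latin1(data: bytes) -> str:
    """Interpret arbitrary bytes as ISO-8859-1 text."""
    # Python's iso-8859-1 codec maps all 256 byte values, so this never raises.
    return data.decode("iso-8859-1")


def strip_non_ascii(text: str) -> str:
    """Drop every character outside the ASCII range (assumed extra step)."""
    return "".join(ch for ch in text if ord(ch) < 128)


utf8_bytes = "Universität".encode("utf-8")
misread = misread_as_latin1(utf8_bytes)   # two characters where "ä" was
stripped = strip_non_ascii(misread)       # the non-ASCII characters are gone
```

In this sketch the "failure to load" case cannot occur, because every byte is a valid ISO-8859-1 character. What is left is either garbled text (`misread`) or lost characters (`stripped`).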
Did we lose anything in the conversion to UTF-8? Our original edu file has 48229 lines; minus a blank line and a comment line, that is 48227. Subtracting the 11443 merged (which I'm assuming means the IP ranges got merged) gives us the 36784 in both output files, regardless of the assumed starting character set.
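The arithmetic, plus a sketch of why merging ranges lowers the entry count, looks like this. `merge_ranges` is a hypothetical illustration, not pgld's merge code, and treating adjacent ranges as mergeable is an assumption.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive (start, end), IPs as integers


def merge_ranges(ranges: List[Range]) -> List[Range]:
    """Merge overlapping or adjacent inclusive ranges."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Numbers from the post: total lines, minus one blank and one comment line,
# minus the entries reported as merged.
total_lines = 48229
entries = total_lines - 2
remaining = entries - 11443
```

`remaining` comes out to 36784, matching the count in both output files. As a toy example, `merge_ranges([(1, 5), (3, 8), (10, 12)])` collapses three entries into two.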
Now, the only way I can see this going south is with other, more exotic charsets if you don't specify the initial charset correctly, but I haven't yet figured out what options are being passed to pgld and whether it's properly changing the charset for each file or not. Regardless, it's most likely not a UTF-8 loading issue for most of these lists.
Last edit: Julio Lajara 2016-08-20
Thanks a lot! I hope to look into this and make a bugfix release soon. Unfortunately (for pgl) I'm working on other projects currently.
I assume you found this because you got a master blocklist that was way too small?
Have you fixed this on your side and tested it successfully?