jEdit often opens files in the default encoding instead
of opening them in the encoding saved in recent.xml. It
also doesn't detect that the file is UTF-8 based on the
UTF-8 magic numbers like it used to.
You can reproduce it as follows.
1.) Create a new file.
2.) Open the buffer options and set the encoding to UTF-8.
3.) Copy and paste some non-ASCII characters into the
buffer, such as German umlauts.
4.) Save the buffer.
5.) Close jEdit.
6.) Verify that the recent.xml file has the correct
encoding in it.
7.) Start up jEdit again.
8.) Right click on the file in the browser and look at
the encoding. It is now the DEFAULT encoding rather
than UTF-8.
9.) Open the file and see the characters as garbage. It
also fails to detect that the file is UTF-8 from the magic
characters.
This has caused a number of files to inadvertently get
saved in the wrong encoding, because the files are read
using the wrong encoding and then saved in that same
encoding. This corrupts many non-ASCII characters.
The activity.log after starting jEdit at step #7
Try the scenario with this file.
Logged In: YES
user_id=478898
It seems to depend on the file but happens every time with a
particular file I've created using jEdit. I attached it below.
Logged In: YES
user_id=75113
Hi Ian,
I can't reproduce the problem following your tests. By any
chance, does the file *name* you're saving have any extra
characters? I think the current code might mess things up in
some cases if that happens...
Logged In: YES
user_id=864970
That file does not contain any Byte Order Mark (
http://en.wikipedia.org/wiki/Byte_Order_Mark ), so jEdit
can't see that the file is supposed to be UTF-8.
It should start with EF BB BF but starts with 23 20 4A.
Logged In: YES
user_id=75113
UTF-8 doesn't have any BOM; UTF-8Y does. As for the problem,
I'd need a copy of the $HOME/.jedit/recent.xml file that
causes the problem, otherwise, I can't reproduce it...
Logged In: YES
user_id=864970
So the Unicode organization is wrong when they show the BOM
for UTF-8?
http://www.unicode.org/unicode/faq/utf_bom.html#BOM :-)
To be precise: UTF-8 doesn't *need* to have a BOM, but jEdit
will know from it that the file is UTF-8. How else should it
know?
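
One common heuristic for files without a BOM, sketched here only as an illustration and not as what jEdit does, is to try a strict UTF-8 decode and see whether it fails. The class and method names below are hypothetical. Note that pure ASCII also passes this check, so it can only rule UTF-8 out, never prove it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Guess {
    /** Hypothetical helper: true if the bytes are well-formed UTF-8. */
    public static boolean looksLikeUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // German umlauts, written as escapes so the source file stays ASCII.
        String text = "Gr\u00fc\u00dfe";
        byte[] asUtf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] asLatin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("UTF-8 bytes:   " + looksLikeUtf8(asUtf8));
        System.out.println("Latin-1 bytes: " + looksLikeUtf8(asLatin1));
    }
}
```

A check like this has to read the whole file (or a large prefix) to be useful, since the first non-ASCII byte may appear late in the file.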
Logged In: YES
user_id=75113
That's beside the point of the bug; jEdit will recognize the
BOM if it's there, and should reopen the file with whatever
encoding the history file says it was last edited with. So
until we see the recent.xml that causes the problem, no
discussion here is gonna do any good.
recent.xml file.
Logged In: YES
user_id=478898
I can reproduce this bug with these steps (I also added the
recent.xml after running this scenario).
1.) Delete the recent.xml file.
2.) Open jEdit.
3.) Right click on the file in the file browser. Select
encoding and see that it is set to the default encoding
rather than UTF-8 (that by itself is a bug). Select UTF-8 as
the encoding from the File Browser for this file.
4.) Open the file.
5.) Close the file.
6.) Close jEdit.
7.) Open jEdit. Right click on the file in the file browser
and note the encoding (it's the default encoding). Don't
select a new encoding.
8.) Open the file. Notice it's opened in the default
encoding rather than UTF-8.
Logged In: YES
user_id=864970
I downloaded the file and it didn't open as UTF-8 when I switched my default
encoding to MacRoman.
When I copied the first 2 Japanese characters (from .ok=...) to the comment in
the first line, it opened as UTF-8.
Then I concatenated all lines up to these first characters into one line and found
that the Japanese characters appear at position 188 (or a bit later). Maybe jEdit
doesn't check more than 128 characters to find the proper encoding?
Just wild guessing...
Logged In: YES
user_id=1483238
Originator: NO
Hi Ian. I can't see what the bug is in the steps you
sent.
In step 3, it is normal that the encoding is set to
the default encoding because jEdit has no information
about the encoding of the file.
In step 8, it is normal that the file is opened in
the default encoding because that is what is selected in step 7.
I can see an issue at step 7. It might be more correct
if the File System Browser looked up the encoding in
recent.xml. If this is your issue, it seems it should be
a new feature request. This tracker item is full of
non-essential information.
Logged In: YES
user_id=478898
Originator: YES
Hey k_satoda,
jEdit automatically detects if a file is UTF-8 or not. It's done this for a while. Even if no information is contained in the recent.xml file (It's in BufferIORequest.java if you care to take a look). Perhaps my mentioning that, and how jEdit is failing to do so is beside the point and perhaps a separate bug. But any perceived "non-essential" information is simply me trying to explain in the best way I can the issue that I am having. As well as attempting to provide the most information I can about reproducing the issue.
You are right that step 3 is not a bug. You are right that the recent.xml file has no information about the encoding of the file at that point, so jEdit may not see that the file is encoded in UTF-8 at that point.
However, in step 4 I open the file in UTF-8. In step 5 and 6 I close the file and close jEdit which should write that the file is encoded in UTF-8 to the recent.xml file. It is after all the encoding the file was set at when I closed the file. (I believe it does if you check the recent.xml after step 6).
However, when I reopen jEdit and reopen the file (step 8), it doesn't know the file is in UTF-8, even though it should have read that info from the recent.xml file. That is definitely a bug. I did not select the default encoding in step 7. It was already selected. I just looked at it. In fact, I say in step 7 specifically NOT to select an encoding.
Logged In: YES
user_id=478898
Originator: YES
Ok, attempting to simplify things.
1.) Apparently you're right. jEdit, since at least 4.2, has not used the encoding from the recent.xml file when opening the file using the File Browser. Though it does use the caret info. I think not using the encoding is a bug since it uses the other info from the recent.xml. Though, I'm willing to move it to a feature request.
2.) I noticed that BufferIORequest doesn't exist anymore so you can't look at it in SVN. But it used to detect whether a file was UTF-8, UTF-8Y, or UTF-16 when opening the file. If it could detect that it was one of these encodings it would open the file in that encoding rather than the default encoding. jEdit is no longer doing that. Whether that is a bug, or a feature that was removed I'm not sure.
Logged In: YES
user_id=1483238
Originator: NO
> 1.) ...
I looked into the code.
The problem is that the browser has a "current encoding".
Though an "Encoding" menu is shown in the pop-up menu for a
file, it's not for that file. It shows the current
encoding of the browser. The current encoding is
initialized from the global default encoding at start-up
of the browser and is used for all
files opened from the browser. So the encoding in
recent.xml is not used, because an encoding for opening
the file is already specified by the browser.
A new feature request is fine. This problem seems to
need some discussion of the way to be fixed.
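
As a starting point for that discussion, here is a minimal sketch of one possible resolution order: an encoding the user explicitly picked for this open wins, then the one remembered in recent.xml, then the global default. This is not jEdit's code; the class, method, and parameter names are all hypothetical, and a plain map stands in for recent.xml.

```java
import java.util.HashMap;
import java.util.Map;

public class EncodingChoice {
    /**
     * Hypothetical resolution order, not jEdit's actual logic.
     *
     * @param explicitEncoding encoding the user chose for this open, or null
     * @param recentEncodings  path -> encoding, standing in for recent.xml
     */
    public static String chooseEncoding(String path,
                                        String explicitEncoding,
                                        Map<String, String> recentEncodings,
                                        String defaultEncoding) {
        if (explicitEncoding != null) {
            return explicitEncoding;      // the user's choice always wins
        }
        String remembered = recentEncodings.get(path);
        if (remembered != null) {
            return remembered;            // fall back to the history entry
        }
        return defaultEncoding;           // nothing known about this file
    }

    public static void main(String[] args) {
        Map<String, String> recent = new HashMap<>();
        recent.put("/tmp/notes.txt", "UTF-8");

        // No explicit choice: the remembered encoding is used.
        System.out.println(chooseEncoding("/tmp/notes.txt", null, recent, "MacRoman"));
        // Unknown file: the default is used.
        System.out.println(chooseEncoding("/tmp/other.txt", null, recent, "MacRoman"));
    }
}
```

The behavior described above corresponds to the browser always passing a non-null explicit encoding, so the history entry is never reached.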
> 2.) ...
jEdit can detect UTF-8Y, but can't detect UTF-8 without
a BOM. UTF-8 can be detected only if the file
is an XML file and has the right encoding declaration. jEdit
can also detect UTF-16 LE and BE with a BOM. I think
this has not changed between 4.2final and 4.3pre9 (and
current svn trunk).
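
For illustration, BOM-based detection of the kind described here can be sketched as below. This is not taken from the jEdit source; the class and method names are hypothetical, and only the three BOMs mentioned in this thread are handled.

```java
public class BomSniffer {
    /**
     * Hypothetical helper: returns "UTF-8Y", "UTF-16BE", "UTF-16LE",
     * or null when the leading bytes carry none of these BOMs.
     */
    public static String detectByBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8Y";   // UTF-8 with BOM, as named in this thread
        }
        if (head.length >= 2
                && (head[0] & 0xFF) == 0xFE
                && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2
                && (head[0] & 0xFF) == 0xFF
                && (head[1] & 0xFF) == 0xFE) {
            // A UTF-32LE BOM also starts with FF FE; this sketch ignores that case.
            return "UTF-16LE";
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x23};
        byte[] withoutBom = {0x23, 0x20, 0x4A};  // the bytes quoted earlier in the thread
        System.out.println(detectByBom(withBom));
        System.out.println(detectByBom(withoutBom));
    }
}
```

A file that starts with 23 20 4A, like the attached one, yields no match, which is consistent with it being opened in the default encoding.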
Logged In: YES
user_id=478898
Originator: YES
Created new feature request 1721796.
Logged In: YES
user_id=1483238
Originator: NO
Closing this one because the remaining issue was moved
into the feature request.