UNICODE in Java Help

Hello all.

I am trying to write a Java3D loader for a geometry file from a
game, which has Unicode characters (Korean) in it. I wrote the loader
and it works in Windows, but I recently brushed off Windows completely
and am now under Linux. When I try to load the filenames now, I get ??????.
This is the block of code in my loader which reads the strings from the file:

    /** get the number of texture files */
    numTextures = in.readInt();

    /** skip ahead 4 bytes */
    in.skipBytes(4);

    /** load the texture files strings */
    textures = new String[numTextures];
    for (int i=0; i < numTextures; i++) {
        /** read in the 40 byte buffer */
        in.read(bmpPath);

        /** trim buffer to length and store */
        for (len=0; len < 40; len++) {
            if (bmpPath[len] == 0)
                break;
        }
        textures[i] = new String(bmpPath, 0, len);

        /** skip ahead 40 bytes */
        in.skipBytes(40);
    }

By the time it enters the String array it is all messed up and does not
properly represent the correct paths anymore.
The current reader takes in 40 bytes and then figures out how long the
string is from there. There is no string length indication in the file,
so I have to figure it out within the byte array.

Does anyone have any suggestions on how I fix this so I can read the
Korean text in both Windows and Linux (and other OSs)?

Thank you for any help!
Jul 17 '05 #1
13 Replies


On Thu, 27 May 2004 03:54:01 GMT, Nicholas Pappas
<no*****@rightstep.org> wrote or quoted :
numTextures = in.readInt();


The key is the declaration of in. What format are these data? What
encoding?

See http://mindprod.com/fileio.html
to select the correct method.

--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
Jul 17 '05 #2

Roedy Green wrote:
On Thu, 27 May 2004 03:54:01 GMT, Nicholas Pappas
numTextures = in.readInt();


The key is the declaration of in. What format are these data? What
encoding?


'in' is a LittleEndianInputStream, which extends FilterInputStream and
implements DataInput. In the case of reading bytes (as done to
construct the Strings in question), the read() function is simply a
pass-through -- no changes to the default behavior.
I received another suggestion about the encoding and will be trying
that this evening when I get home. However, I'm concerned that I am
going to get this working for Linux (perhaps by selecting UTF-8) and then
it will stop working in Windows.
Will forcing the input stream to a certain encoding under Linux break
Windows?
Jul 17 '05 #3

Under Java you are supposed to be shielded from machine-level details, and
that includes Unicode issues. Is the JRE the same as or later than the
one used on your Windows platform?

- perry


Jul 17 '05 #4

Nicholas Pappas wrote:
This is the block of code in my loader which reads the strings
from the file:
[...]
    /** read in the 40 byte buffer */
    in.read(bmpPath);

    /** trim buffer to length and store */
    for (len=0; len < 40; len++) {
        if (bmpPath[len] == 0)
            break;
    }
    textures[i] = new String(bmpPath, 0, len);
[...]
Does anyone have any suggestions on how I fix this so I can read the
Korean text in both Windows and Linux (and other OSs)?


You need a basic understanding of the relationship between bytes and
characters, and of the concept of character encodings. And you need
more information about your input file; specifically, what character
encoding it uses. There are a number of potential problems here:

1. (Actually not related to character encodings) Your call to in.read is
flawed. Take a look at the API documentation for that method.
Specifically, the method is not guaranteed to fill the entire array. It
is only specified to read at least one byte but not more than the length
of the array, and to return the number of bytes that it has read. If you
want to fill the entire byte array, you'll need to write a loop, sorta
like this:

    int pos = 0;
    while (pos < bmpPath.length)
    {
        int len = in.read(bmpPath, pos, bmpPath.length - pos);

        if (len == -1) handlePrematureEOF();
        else pos += len;
    }

Of course, handlePrematureEOF() should be replaced with appropriate
error-handling code, such as throwing an exception indicating the bad
file format.

2. You don't specify an encoding when you convert the data in the byte
array to text. That data was encoded in some specific encoding when
the file was written. The code you've written will work only if you get
lucky and the platform-default character encoding happens to match the
encoding in the file. To make this work reliably in a cross-platform
way, you need to discover what encoding was used in the file, and
specify that in a separate parameter, for example:

textures[i] = new String(bmpPath, 0, len, "UTF-8");

(That gets you UTF-8 encoding, which is probably a decent guess; but you
need to find out the real encoding to be sure this will work. It should
be documented with the file format spec.)

3. This is a bit of a subtle one, actually. The test for bytes to equal
zero, which you use to determine the end of the String, will not work
reliably across character encodings. In some multi-byte character
encodings (UTF-16, for example), a zero byte can be embedded inside a
character whose character code is itself non-zero.

To work around this, you need to swap the order. If your strings are
null-terminated, then convert your byte array to characters first, then
look for a null character (i.e., Unicode value zero), rather than a zero
byte. That looks like this:

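    // named 'reader' because 'in' already refers to the file stream
    InputStreamReader reader = new InputStreamReader(
            new ByteArrayInputStream(bmpPath), "UTF-8");
    StringWriter sw = new StringWriter();

    // copy decoded characters until a null character or end of input
    int c;
    while ((c = reader.read()) > 0) sw.write((char) c);

    textures[i] = sw.toString();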

This is an alternative to the String constructor you used to convert to
characters, and notice that you still need to know the proper character
encoding.
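
The three points above can be combined into one small helper. What follows is only a sketch under stated assumptions: the class name TextureNameReader and the method readName are made up for illustration, the 40-byte field width comes from the question, and UTF-16LE is used in the demo purely because it contains embedded zero bytes; the real encoding of the game file is still unknown. It relies on DataInput.readFully, which the poster's stream should offer if it really implements DataInput, and on StandardCharsets, which needs a newer JDK than the 1.4 releases mentioned in this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class TextureNameReader {
    static final int FIELD_SIZE = 40; // fixed field width taken from the question

    /** Reads one fixed-width field and returns the text before the first null character. */
    static String readName(DataInput in, Charset charset) {
        byte[] field = new byte[FIELD_SIZE];
        try {
            // Point 1: readFully fills the whole array or throws EOFException.
            in.readFully(field);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Point 2: decode with an explicit charset, never the platform default.
        String decoded = new String(field, charset);
        // Point 3: look for the terminator among characters, not among bytes.
        int end = decoded.indexOf('\0');
        return end < 0 ? decoded : decoded.substring(0, end);
    }

    public static void main(String[] args) {
        // Demo only: a Korean file name (written with Unicode escapes) stored
        // in UTF-16LE, where a byte-level scan for zero would stop too early.
        Charset charset = StandardCharsets.UTF_16LE;
        byte[] field = new byte[FIELD_SIZE];
        byte[] name = "\uD55C\uAE00.bmp".getBytes(charset);
        System.arraycopy(name, 0, field, 0, name.length);

        DataInput in = new DataInputStream(new ByteArrayInputStream(field));
        System.out.println(readName(in, charset).length()); // prints 6
    }
}
```

In the loader itself the call would presumably look like textures[i] = readName(in, charset), with charset replaced by whatever the file format actually uses.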

Hope that gets you started,

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
Jul 17 '05 #5

That is what I thought too (the shielding part). :)

I last used 1.4.2 in Windows, but am using 1.4.1 right now under Linux.
I've been trying to upgrade to 1.5, but the self-installer bin doesn't
seem to want to install correctly on Gentoo Linux. :(


Jul 17 '05 #6

On Thu, 27 May 2004 09:43:04 -0400, Nicholas Pappas
<no*****@rightstep.org> wrote or quoted :
Will forcing the input stream to a certain encoding under Linux break
Windows?


If the files have different encodings on different platforms you are
going to be in trouble. If you both write and read the file you can
force the encoding and thereby be consistent on all platforms. If you
don't specify the encoding you get a pig in a poke: whatever the
locale thinks is reasonable, which is highly unlikely to be something
exotic like your file.
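
As a rough illustration of forcing the encoding on both sides, here is a sketch that writes and reads text through in-memory streams with the same explicit charset. The class RoundTripDemo and its two helpers are hypothetical names, and UTF-8 is just an example choice, not the encoding of the game file discussed here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RoundTripDemo {
    /** Encodes text with the given charset instead of the platform default. */
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, charset)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray(); // writer is closed and flushed by now
    }

    /** Decodes bytes with the given charset instead of the platform default. */
    static String read(byte[] data, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            int c;
            while ((c = in.read()) != -1) text.append((char) c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String original = "\uD55C\uAE00.bmp"; // Korean file name, written with escapes
        byte[] stored = write(original, StandardCharsets.UTF_8);
        System.out.println(read(stored, StandardCharsets.UTF_8).equals(original)); // prints true
    }
}
```

Because neither side consults the locale, the same bytes come back as the same text on Windows and Linux alike.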

--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
Jul 17 '05 #7

On Thu, 27 May 2004 13:35:29 -0600, Chris Smith <cd*****@twu.net>
wrote or quoted :
You need a basic understanding of the relationship between bytes and
characters, and of the concept of character encodings. And you need
more information about your input file; specifically, what character
encoding it uses. There are a number of potential problems here:


Read up on encodings, http://mindprod.com/jgloss/encoding.html. OP
has not told us enough about what he is doing. Where did this file
come from? Is it encoded the same way on all platforms or is it being
provided in a variety of encodings by some third party software?

--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
Jul 17 '05 #8

Chris Smith wrote:
2. You don't specify an encoding when you convert the data in the byte
array to text. That data was encoding in some specific encoding when
the file was written. The code you've written will work only if you get
lucky and the platform-default character encoding happens to match the
encoding in the file. To make this work reliably in a cross-platform
way, you need to discover what encoding was used in the file, and
specify that in a separate parameter, for example:

textures[i] = new String(bmpPath, 0, len, "UTF-8");


Well, I tried UTF-8 and all the encodings listed on this page:
http://java.sun.com/j2se/1.4.2/docs/...t/Charset.html
No luck. :(

All the directories show up correctly in Konqueror (KDE browser). Is
there some way I can detect the encoding being used there? What is
Windows default encoding, anyone know? :)

Thanks again for all the help!
Jul 17 '05 #9

Roedy Green wrote:
On Thu, 27 May 2004 09:43:04 -0400, Nicholas Pappas
Will forcing the input stream to a certain encoding under Linux break
Windows?


If the files have different encodings in different platforms you are
going to be in trouble. If you both write and read the file you can
force the encoding and thereby be consistent on all platforms.


Thankfully I do not need to write to the files, so I only need to
figure out how to read them.
Is there a Linux/UNIX command (or even a Windows command) that will
display the character set? Seems unlikely, but I'll cross my fingers
anyway. The files show up correctly in the KDE browser -- might I be
able to figure something out from there?

Thanks again!
Jul 17 '05 #10

Nicholas Pappas wrote:
Chris Smith wrote:
2. You don't specify an encoding when you convert the data in the byte
array to text. That data was encoding in some specific encoding when
the file was written. The code you've written will work only if you
get lucky and the platform-default character encoding happens to match
the encoding in the file. To make this work reliably in a
cross-platform way, you need to discover what encoding was used in the
file, and specify that in a separate parameter, for example:

textures[i] = new String(bmpPath, 0, len, "UTF-8");

Well, I gave UTF-8 and all the encoding listed on this page:
http://java.sun.com/j2se/1.4.2/docs/...t/Charset.html
No luck. :(

All the directories show up correctly in Konqueror (KDE browser).
Is there some way I can detect the encoding being used there? What is
Windows default encoding, anyone know? :)


The Windows default encoding is CP1252 (or CP1251?). Actually, it
varies depending on the locale. In any case you are unlikely to find
this to be supported in Linux.
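
Instead of guessing, the JVM can be asked which charset it uses by default and which names it supports on the current machine. The sketch below is illustrative only: DefaultCharsetDemo is a made-up class name, and the probed names are examples (x-windows-949 is, to my knowledge, Java's name for the code page used by Korean Windows; whether it is present depends on the JRE installed). Charset.defaultCharset() requires Java 5 or later.

```java
import java.nio.charset.Charset;

final class DefaultCharsetDemo {
    /** Name of the charset Java falls back on when no encoding is given. */
    static String defaultName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("default: " + defaultName());

        // Example names to probe; availability varies between JRE builds.
        String[] probes = {"UTF-8", "windows-1252", "EUC-KR", "x-windows-949"};
        for (String name : probes) {
            System.out.println(name + " supported: " + Charset.isSupported(name));
        }
    }
}
```

Running this on both the Windows and the Linux machine shows directly whether the two defaults differ, which would explain why the same loader behaves differently.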

The simplest thing to do is to create the file using an editor that
allows you to specify the encoding. Windows Notepad will let you save
files as Unicode, but I do not know if it is UTF-8, UTF-16, UTF-16LE,
UTF-16BE, etc. You could use jEdit (a free Java-based editor).

Also, it strikes me as odd that you are trying to read a file using
ObjectInputStream that you did not create using ObjectOutputStream.

Ray
Jul 17 '05 #11

On Fri, 28 May 2004 00:57:13 GMT, Nicholas Pappas
<no*****@rightstep.org> wrote or quoted :
All the directories show up correctly in Konqueror (KDE browser). Is
there some way I can detect the encoding being used there? What is
Windows default encoding, anyone know? :)


It depends on the locale. You set the locale in the Control Panel.

You can find out what it is using Wassup.

See http://mindprod.com/wassup.html
Maybe it is encoded in some proprietary way.

Do a hex dump of it and display that here. Maybe some of us will
recognise it.
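
If no hex dump tool is at hand, a few lines of Java will do. This is a minimal sketch; HexDumpDemo and hexDump are hypothetical names, and the sample bytes in main are invented for the demo, not taken from the poster's file.

```java
final class HexDumpDemo {
    /** Formats up to `length` bytes starting at `offset` as space-separated hex pairs. */
    static String hexDump(byte[] data, int offset, int length) {
        StringBuilder out = new StringBuilder();
        for (int i = offset; i < offset + length && i < data.length; i++) {
            if (i > offset) out.append(' ');
            // mask to 0..255 so negative bytes do not print as FFFFFFxx
            out.append(String.format("%02X", data[i] & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xC7, (byte) 0xD1, 0x2E, 0x62, 0x6D, 0x70, 0x00};
        System.out.println(hexDump(sample, 0, sample.length)); // prints C7 D1 2E 62 6D 70 00
    }
}
```

As a rough guide when reading the output: Hangul syllables in UTF-8 are three bytes each, starting with a byte from EA to ED, while in EUC-KR they are two bytes each with a lead byte from B0 to C8.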

Unfortunately, the powers that be decided not to make encodings
self-identifying.

Encodings are something that tries to impose order on chaos, where
everyone used a different national local 8-bit encoding, and never
exchanged files with others so the problem never came up.

--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
Jul 17 '05 #12

also, as part of the new fast i/o api under java there is another way to
specify and handle unicode issues...

let me go check on it and get back to you...

- perry

Jul 17 '05 #13

This is what I'm talking about:

http://java.sun.com/j2se/1.4.2/docs/...nio/index.html
http://java.sun.com/j2se/1.4.2/docs/...ple/index.html
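
One thing the java.nio.charset package adds is a decoder that can report bad input instead of silently substituting '?' characters, which makes it possible to rule out candidate encodings. The following is a sketch, not a detector: StrictDecodeDemo and decodeStrict are hypothetical names, the candidate list is an assumption, and a clean decode only means the bytes are legal in that charset, not that it is the right one (ISO-8859-1, for instance, accepts every byte sequence).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictDecodeDemo {
    /** Returns the decoded text, or null if the bytes are not valid in the charset. */
    static String decodeStrict(byte[] raw, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return null; // this charset cannot have produced these bytes
        }
    }

    public static void main(String[] args) {
        // Demo bytes: a Korean name encoded as UTF-8 (written with escapes).
        byte[] raw = "\uD55C\uAE00".getBytes(StandardCharsets.UTF_8);

        // Candidate names are examples; skip any the JRE does not provide.
        String[] candidates = {"US-ASCII", "UTF-8", "EUC-KR", "x-windows-949"};
        for (String name : candidates) {
            if (!Charset.isSupported(name)) continue;
            String text = decodeStrict(raw, Charset.forName(name));
            System.out.println(name + ": " + (text == null ? "rejected" : "decodes cleanly"));
        }
    }
}
```

Feeding the raw 40-byte fields from the geometry file through such a loop would narrow the list of plausible encodings before comparing the surviving results by eye.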

- perry

Jul 17 '05 #14
