UNICODE in Java Help

Hello all.

I am trying to write a Java3D loader for a geometry file from a
game, which has Unicode characters (Korean) in it. I wrote the loader
and it works in Windows, but I recently brushed off Windows completely
and am now under Linux. When I try to load the filenames now, I get ??????.
This is the block of code in my loader which reads the strings from the file:

/** get the number of texture files */
numTextures = in.readInt();

/** skip ahead 4 bytes */
in.skipBytes(4);

/** load the texture files strings */
textures = new String[numTextures];
for (int i = 0; i < numTextures; i++) {
    /** read in the 40 byte buffer */
    in.read(bmpPath);

    /** trim buffer to length and store */
    for (len = 0; len < 40; len++) {
        if (bmpPath[len] == 0)
            break;
    }
    textures[i] = new String(bmpPath, 0, len);

    /** skip ahead 40 bytes */
    in.skipBytes(40);
}

By the time it enters the String array it is all messed up and does not
properly represent the correct paths anymore.
The current reader takes in 40 bytes and then figures out how long the
string is from there. There is no string length indication in the file,
so I have to figure it out within the byte array.

Does anyone have any suggestions on how I fix this so I can read the
Korean text in both Windows and Linux (and other OSs)?

Thank you for any help!
Jul 17 '05 #1
On Thu, 27 May 2004 03:54:01 GMT, Nicholas Pappas
<no*****@rightstep.org> wrote or quoted :
numTextures = in.readInt();


The key is the declaration of in. What format are these data? What
encoding?

See http://mindprod.com/fileio.html
to select the correct method.

--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
Jul 17 '05 #2
Roedy Green wrote:
On Thu, 27 May 2004 03:54:01 GMT, Nicholas Pappas
numTextures = in.readInt();


The key is the declaration of in. What format are these data? What
encoding?


'in' is a LittleEndianInputStream, which extends FilterInputStream and
implements DataInput. In the case of reading bytes (as done to
construct the Strings in question), the read() function is simply a
pass-through -- no changes to the default behavior.
I received another suggestion about the encoding and will be trying
that this evening when I get home. However, I'm concerned that I am
going to get this working for Linux (perhaps by selecting UTF-8) and then
it will stop working in Windows.
Will forcing the input stream to a certain encoding under Linux break
Windows?
Jul 17 '05 #3
Under Java you are supposed to be shielded from machine-level details,
and that includes Unicode issues. Is the JRE the same as or later than
the one used on your Windows platform?

- perry

Jul 17 '05 #4
Nicholas Pappas wrote:
This is the block of code in my loader which reads the strings
from the file:
[...]
/** read in the 40 byte buffer */
in.read(bmpPath);

/** trim buffer to length and store */
for (len=0; len < 40; len++) {
if (bmpPath[len] == 0)
break;
}
textures[i] = new String(bmpPath, 0, len);
[...]
Does anyone have any suggestions on how I fix this so I can read the
Korean text in both Windows and Linux (and other OSs)?


You need a basic understanding of the relationship between bytes and
characters, and of the concept of character encodings. And you need
more information about your input file; specifically, what character
encoding it uses. There are a number of potential problems here:

1. (Actually not related to character encodings) Your call to in.read is
flawed. Take a look at the API documentation for that method.
Specifically, the method is not guaranteed to fill the entire array. It
is only specified to read at least one byte but not more than the length
of the array, and to return the number of bytes that it has read. If you
want to read the entire byte array, you'll need to write a loop, sort of
like this:

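int pos = 0;
while (pos < bmpPath.length)
{
    // read() may return fewer bytes than requested, so keep asking;
    // named count so it does not clash with the existing len variable
    int count = in.read(bmpPath, pos, bmpPath.length - pos);

    if (count == -1) handlePrematureEOF();
    else pos += count;
}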

Of course, handlePrematureEOF() should be replaced with appropriate
error-handling code, such as throwing an exception indicating the bad
file format.
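
A possible shortcut, offered as a sketch rather than a drop-in fix: the DataInput interface declares readFully, which keeps reading until the array is full and throws EOFException if the stream ends first. Whether LittleEndianInputStream implements it to spec is an assumption, so the example below uses the standard DataInputStream instead. The class name FixedFieldReader and the method readField are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

class FixedFieldReader {
    /** Reads exactly size bytes, or throws EOFException if the stream ends early. */
    static byte[] readField(InputStream in, int size) throws IOException {
        byte[] field = new byte[size];
        // readFully loops internally until the whole array has been filled
        new DataInputStream(in).readFully(field);
        return field;
    }

    public static void main(String[] args) throws IOException {
        // stand-in for the game file: 80 bytes, only the first one non-zero
        byte[] data = new byte[80];
        data[0] = 'a';
        byte[] field = readField(new ByteArrayInputStream(data), 40);
        System.out.println(field.length);
    }
}
```

DataInputStream does no buffering of its own, so wrapping the stream for a single call does not swallow bytes that later reads expect.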

2. You don't specify an encoding when you convert the data in the byte
array to text. That data was encoded in some specific encoding when
the file was written. The code you've written will work only if you get
lucky and the platform-default character encoding happens to match the
encoding in the file. To make this work reliably in a cross-platform
way, you need to discover what encoding was used in the file, and
specify that in a separate parameter, for example:

textures[i] = new String(bmpPath, 0, len, "UTF-8");

(That gets you UTF-8 encoding, which is probably a decent guess; but you
need to find out the real encoding to be sure this will work. It should
be documented with the file format spec.)
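
To make the effect of a wrong guess concrete, here is a minimal sketch. It assumes a newer JDK than the 1.4 releases discussed in this thread (StandardCharsets needs Java 7), and the sample bytes are produced in the program itself rather than taken from the real file. The class name DecodeDemo is made up. Nothing in the thread establishes the file's real encoding; for a Korean game a legacy encoding such as EUC-KR or Windows code page 949 would be another candidate to try, but that is only a guess.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    /** Decodes the first len bytes of buf using an explicitly named charset. */
    static String decode(byte[] buf, int len, Charset cs) {
        return new String(buf, 0, len, cs);
    }

    public static void main(String[] args) {
        // two Hangul syllables plus ".bmp", written with Unicode escapes so
        // the encoding of this source file cannot interfere
        String original = "\uD55C\uAE00.bmp";
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);

        // same bytes, two different charsets
        String right = decode(raw, raw.length, StandardCharsets.UTF_8);
        String wrong = decode(raw, raw.length, StandardCharsets.US_ASCII);

        System.out.println(right.equals(original));  // matching charset round-trips
        System.out.println(wrong.equals(original));  // mismatched charset does not
    }
}
```

The mismatched decode keeps the ASCII part (".bmp") intact and turns the rest into replacement characters, which is consistent with the question marks described in the original post.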

3. This is a bit of a subtle one, actually. The test for bytes to equal
zero, which you use to determine the end of the String, will not work
reliably across character encodings. In any multi-byte character
encoding, there's a chance that there will be an embedded zero byte
inside of a character, but the character code itself will be non-zero.

To work around this, you need to swap the order. If your strings are
null-terminated, then convert your byte array to characters first, then
look for a null character (i.e., Unicode value zero), rather than a zero
byte. That looks like this:

InputStreamReader reader = new InputStreamReader(
        new ByteArrayInputStream(bmpPath), "UTF-8");
StringWriter sw = new StringWriter();

// copy decoded characters until the NUL terminator (0) or end of data (-1)
int c;
while ((c = reader.read()) > 0) sw.write((char) c);

textures[i] = sw.toString();

This is an alternative to the String constructor you used to convert to
characters, and notice that you still need to know the proper character
encoding.
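
The same decode-first idea can be sketched without a reader, by decoding the whole fixed-width field and cutting the resulting String at the first NUL character. This assumes Java 7 or later for StandardCharsets, and UTF-16LE is used here only because it makes the embedded-zero-byte problem easy to see, not because the file is known to use it. NulTerminated and decodeField are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class NulTerminated {
    /** Decodes a fixed-width field and cuts it at the first NUL character. */
    static String decodeField(byte[] field, Charset cs) {
        String s = new String(field, cs);
        int end = s.indexOf('\0');
        return end < 0 ? s : s.substring(0, end);
    }

    public static void main(String[] args) {
        // UTF-16LE stores 'a' as the bytes 0x61 0x00, so a byte-level scan
        // for zero would stop after the very first byte
        byte[] text = "a\uD55C.bmp".getBytes(StandardCharsets.UTF_16LE);
        byte[] field = Arrays.copyOf(text, 40);   // zero-padded to 40 bytes
        System.out.println(decodeField(field, StandardCharsets.UTF_16LE).length());
    }
}
```

Any bytes after the terminator are decoded too, but they end up after the NUL in the String and are discarded by the substring call.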

Hope that gets you started,

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
Jul 17 '05 #5
That is what I thought too (the shielding part). :)

I last used 1.4.2 in Windows, but am using 1.4.1 right now under Linux.
I've been trying to upgrade to 1.5, but the self-installer bin doesn't
seem to want to install correctly on Gentoo Linux. :(

perry anderson wrote:
Under Java you are supposed to be shielded from machine-level details,
and that includes Unicode issues. Is the JRE the same as or later than
the one used on your Windows platform?

- perry

Jul 17 '05 #6
On Thu, 27 May 2004 09:43:04 -0400, Nicholas Pappas
<no*****@rightstep.org> wrote or quoted :
Will forcing the input stream to a certain encoding under Linux break
Windows?


If the files have different encodings on different platforms you are
going to be in trouble. If you both write and read the file you can
force the encoding and thereby be consistent on all platforms. If you
don't specify the encoding you get a pig in a poke: whatever the
locale thinks is reasonable, which is highly unlikely to be something
exotic like your file.

--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
Jul 17 '05 #7
On Thu, 27 May 2004 13:35:29 -0600, Chris Smith <cd*****@twu.net>
wrote or quoted :
You need a basic understanding of the relationship between bytes and
characters, and of the concept of character encodings. And you need
more information about your input file; specifically, what character
encoding it uses. There are a number of potential problems here:


Read up on encodings, http://mindprod.com/jgloss/encoding.html. OP
has not told us enough about what he is doing. Where did this file
come from? Is it encoded the same way on all platforms or is it being
provided in a variety of encodings by some third party software?

--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
Jul 17 '05 #8
Chris Smith wrote:
2. You don't specify an encoding when you convert the data in the byte
array to text. That data was encoded in some specific encoding when
the file was written. The code you've written will work only if you get
lucky and the platform-default character encoding happens to match the
encoding in the file. To make this work reliably in a cross-platform
way, you need to discover what encoding was used in the file, and
specify that in a separate parameter, for example:

textures[i] = new String(bmpPath, 0, len, "UTF-8");


Well, I tried UTF-8 and all the encodings listed on this page:
http://java.sun.com/j2se/1.4.2/docs/...t/Charset.html
No luck. :(

All the directories show up correctly in Konqueror (KDE browser). Is
there some way I can detect the encoding being used there? What is
Windows default encoding, anyone know? :)

Thanks again for all the help!
Jul 17 '05 #9
Roedy Green wrote:
On Thu, 27 May 2004 09:43:04 -0400, Nicholas Pappas
Will forcing the input stream to a certain encoding under Linux break
Windows?


If the files have different encodings in different platforms you are
going to be in trouble. If you both write and read the file you can
force the encoding and thereby be consistent on all platforms.


Thankfully I do not need to write to the files, so I only need to
figure out how to read them.
Is there a Linux/UNIX command (or even a Windows command) that will
display the character set? Seems unlikely, but I'll cross my fingers
anyway. The files show up correctly in the KDE browser -- might I be
able to figure something out from there?

Thanks again!
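
No standard command reports the encoding of arbitrary bytes, but a brute-force check is possible from Java itself. The sketch below decodes a sample with every charset the running JVM has installed and keeps those whose output consists only of printable ASCII and Hangul syllables (U+AC00 to U+D7A3). The heuristic, the class name CharsetGuesser and the sample bytes are all assumptions made for illustration; with the real file, the sample would be one of the 40-byte path buffers trimmed to its text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class CharsetGuesser {
    /** True if s has at least one Hangul syllable and otherwise only printable ASCII. */
    static boolean looksKorean(String s) {
        boolean sawHangul = false;
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch >= 0xAC00 && ch <= 0xD7A3) sawHangul = true;
            else if (ch < 0x20 || ch > 0x7E) return false;
        }
        return sawHangul;
    }

    /** Names of installed charsets under which the sample decodes to plausible Korean. */
    static List<String> candidates(byte[] sample) {
        List<String> names = new ArrayList<>();
        for (Charset cs : Charset.availableCharsets().values()) {
            try {
                if (looksKorean(new String(sample, cs))) names.add(cs.name());
            } catch (RuntimeException e) {
                // a few charsets cannot decode at all; skip them
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // synthetic sample; replace with bytes taken from the real file
        byte[] sample = "\uD55C\uAE00.bmp".getBytes(StandardCharsets.UTF_8);
        System.out.println(candidates(sample));
    }
}
```

More than one charset may pass the test, so the output is a shortlist to check by eye against what Konqueror shows, not a definitive answer.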
Jul 17 '05 #10
Nicholas Pappas wrote:
Chris Smith wrote:
2. You don't specify an encoding when you convert the data in the byte
array to text. That data was encoded in some specific encoding when
the file was written. The code you've written will work only if you
get lucky and the platform-default character encoding happens to match
the encoding in the file. To make this work reliably in a
cross-platform way, you need to discover what encoding was used in the
file, and specify that in a separate parameter, for example:

textures[i] = new String(bmpPath, 0, len, "UTF-8");

Well, I tried UTF-8 and all the encodings listed on this page:
http://java.sun.com/j2se/1.4.2/docs/...t/Charset.html
No luck. :(

All the directories show up correctly in Konqueror (KDE browser).
Is there some way I can detect the encoding being used there? What is
Windows default encoding, anyone know? :)


The Windows default encoding is CP1252 (or CP1251?). Actually, it
varies depending on the locale. In any case you are unlikely to find
this to be supported in Linux.

The simplest thing to do is to create the file using an editor that
allows you to specify the encoding. Windows Notepad will let you save
files as Unicode, but I do not know if it is UTF-8, UTF-16, UTF-16LE,
UTF-16BE, etc. You could use jEdit (a free Java-based editor).

Also, it strikes me as odd that you are trying to read a file using
ObjectInputStream that you did not create using ObjectOutputStream.

Ray
Jul 17 '05 #11
On Fri, 28 May 2004 00:57:13 GMT, Nicholas Pappas
<no*****@rightstep.org> wrote or quoted :
All the directories show up correctly in Konqueror (KDE browser). Is
there some way I can detect the encoding being used there? What is
Windows default encoding, anyone know? :)


It depends on the locale. You set the locale in the Control Panel.

You can find out what it is using Wassup.

See http://mindprod.com/wassup.html
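
The JVM can also be asked directly which default it picked up from the locale. A small sketch, assuming Java 5 or later for Charset.defaultCharset() (on the 1.4 releases mentioned in this thread only the file.encoding system property is available); the class name ShowDefaultCharset is made up:

```java
import java.nio.charset.Charset;

class ShowDefaultCharset {
    /** Describes the platform default that new String(byte[]) falls back on. */
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", default charset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
        // whether this particular JVM ships a given charset can be checked too
        System.out.println("windows-1252 supported: "
                + Charset.isSupported("windows-1252"));
    }
}
```

Running this on both the Windows and the Linux machine would show whether the two defaults differ, which is the likely reason the same code behaves differently on each.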
Maybe it is encoded in some proprietary way.

Do a hex dump of it and display that here. Maybe some of us will
recognise it.
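
A hex dump of one path buffer takes only a few lines of Java. In this sketch the class name HexDump is hypothetical and the sample bytes are invented; the real input would be the bmpPath array from the loader.

```java
class HexDump {
    /** Formats bytes as two-digit hex values, 16 per line. */
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i++) {
            if (i > 0) sb.append(i % 16 == 0 ? '\n' : ' ');
            // mask with 0xFF so negative bytes print as 80..ff
            sb.append(String.format("%02x", data[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // invented sample: two high bytes, ".bmp", and a zero terminator
        byte[] bmpPath = { (byte) 0xC7, (byte) 0xD1, 0x2E, 0x62, 0x6D, 0x70, 0 };
        System.out.println(hex(bmpPath));
    }
}
```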

Unfortunately, the powers that be decided not to make encodings
self-identifying.

Encodings are something that tries to impose order on chaos, where
everyone used a different national local 8-bit encoding, and never
exchanged files with others so the problem never came up.

--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
Jul 17 '05 #12
Also, as part of the new fast I/O API (NIO) under Java, there is another
way to specify and handle Unicode issues...

let me go check on it and get back to you...

- perry
Jul 17 '05 #13
This is what I'm talking about:

http://java.sun.com/j2se/1.4.2/docs/...nio/index.html
http://java.sun.com/j2se/1.4.2/docs/...ple/index.html
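
As a rough sketch of what the java.nio.charset classes offer here: a CharsetDecoder can be told to report bad input instead of silently substituting a replacement character, which makes a wrong encoding guess fail loudly. The class name StrictDecode and the sample bytes are invented for illustration; the API calls themselves have been part of java.nio.charset since 1.4.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class StrictDecode {
    /** Decodes len bytes, throwing instead of substituting replacement characters. */
    static String decodeStrict(byte[] data, int len, String charsetName)
            throws CharacterCodingException {
        return Charset.forName(charsetName)
                .newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data, 0, len))
                .toString();
    }

    public static void main(String[] args) {
        // invented bytes that do not form a valid UTF-8 sequence
        byte[] notUtf8 = { (byte) 0xC7, (byte) 0xD1 };
        try {
            System.out.println(decodeStrict(notUtf8, notUtf8.length, "UTF-8"));
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-8: " + e);
        }
    }
}
```

A decode that succeeds under REPORT is not proof that the charset is right, since single-byte charsets accept almost anything, but a failure does rule a candidate out.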

- perry
Jul 17 '05 #14
