Need help with unicode strings.

P: n/a
I have a file that I want to read a UTF-16 unicode string from.
What is the easiest way to accomplish that?

Thanks in advance,
Nick Z.
Nov 17 '05 #1
14 Replies


P: n/a
Nick Z. <an*******@none.com> wrote:
> I have a file that I want to read a UTF-16 unicode string from.
> What is the easiest way to accomplish that?


Using a StreamReader with Encoding.Unicode - it's a doddle.

(I don't believe you'll see any difference between UCS-2 and UTF-16 - I
don't know which Encoding.Unicode is supposed to represent, but I'm
sure it *actually* ends up working out as UTF-16, because .NET strings
are UTF-16 in some sense anyway...)
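
A minimal sketch of the StreamReader approach, assuming the whole file is UTF-16 little-endian text. The class and method names (Utf16FileDemo, ReadAllUtf16) are made up for illustration; StreamReader and Encoding.Unicode are the real .NET types.

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf16FileDemo
{
    // Reads an entire file as UTF-16 little-endian text.
    // Hypothetical helper; it assumes the file holds nothing but text.
    public static string ReadAllUtf16(string path)
    {
        using (var reader = new StreamReader(path, Encoding.Unicode))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Encoding.Unicode is the little-endian flavour; Encoding.BigEndianUnicode would be the one to try for big-endian data.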

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 17 '05 #2

P: n/a
Use a StreamReader with the Unicode (UTF-16, little-endian) encoding:

Dim reader As New StreamReader(fs, System.Text.Encoding.Unicode)

where fs is a FileStream.

hth,
Alan.

Nov 17 '05 #3

P: n/a
Thanks,
I can't believe I overlooked this.
I even played with the Encoding class before and it didn't work for
me for some reason, but it works now...

Thanks for taking the time to answer,
Nick Z.

Nov 17 '05 #4

P: n/a
I am having trouble with the StreamReader.
As far as I can see there is no way to read a string.
Really, I don't know why this is so hard.

In the file there is a simple string terminated with three empty bytes.
Is there no function like ReadString() anywhere? One that would simply
read a Unicode string until it gets to a terminating character?

StreamReader:

The ReadLine method reads past the null bytes and goes on to read another
1000 bytes until it gets to a 0x0D or something of that sort.

Read() is described as returning the next character, yet it returns an
int? What? Casting the int to a char doesn't seem to work...

Read(char[],int,int) reads the string fine (I think), assuming I found
the length of the string beforehand. However, right after the method
returns, the Position property of the BaseStream has been fast-forwarded
1000 bytes or so when only 35 characters were read.

ReadBlock() is the same as Read(char[],int,int) as far as I can see.

BinaryReader:

ReadString(), oh yeah, I thought. However, this is straight from the docs:
"The string is prefixed with the length, encoded as an integer seven
bits at a time.". Isn't this UTF-7? In any case it doesn't work for me.

So what are my options at this point?

Thanks for taking the time to answer,
Nick Z.

Nov 17 '05 #5

P: n/a
Nick Z. <an*******@none.com> wrote:
> I am having trouble with the StreamReader.
> As far as I can see there is no way to read a string.
> Really, I don't know why this is so hard.
>
> In the file there is a simple string terminated with three empty bytes.
> Is there no function like ReadString() anywhere? One that would simply
> read a Unicode string until it gets to a terminating character?

What exactly do you mean by "terminating character"?

> StreamReader:
>
> The ReadLine method reads past the null bytes and goes on to read another
> 1000 bytes until it gets to a 0x0D or something of that sort.

Indeed.

> Read() is described as returning the next character, yet it returns an
> int? What? Casting the int to a char doesn't seem to work...
>
> Read(char[],int,int) reads the string fine (I think), assuming I found
> the length of the string beforehand. However, right after the method
> returns, the Position property of the BaseStream has been fast-forwarded
> 1000 bytes or so when only 35 characters were read.

Yes. StreamReader reads ahead into an internal buffer, so
BaseStream.Position tells you how much has been buffered, not how much
you have consumed.

> ReadBlock() is the same as Read(char[],int,int) as far as I can see.
>
> BinaryReader:
>
> ReadString(), oh yeah, I thought. However, this is straight from the docs:
> "The string is prefixed with the length, encoded as an integer seven
> bits at a time.". Isn't this UTF-7? In any case it doesn't work for me.

No, it's not UTF-7. Only the *length* is encoded as an integer seven
bits at a time.

> So what are my options at this point?


Well, you could start by telling us what your file format is. It sounds
like it's a mixture of binary and text, which is bad news to start with
I'm afraid.
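
To illustrate the point about the length prefix, here is a small sketch of what BinaryWriter.Write(string) actually puts on the stream and how BinaryReader.ReadString reads it back. The class and method names (LengthPrefixDemo, WritePrefixed, ReadPrefixed) are hypothetical; the reader and writer calls are the standard ones. The prefix holds the length of the encoded string in bytes, and the string itself is stored in whatever encoding the reader/writer was constructed with.

```csharp
using System;
using System.IO;
using System.Text;

public static class LengthPrefixDemo
{
    // Serialises text the way BinaryWriter does: a 7-bit-encoded byte count,
    // followed by the string bytes in the writer's encoding (UTF-16LE here).
    public static byte[] WritePrefixed(string text)
    {
        using (var ms = new MemoryStream())
        {
            using (var writer = new BinaryWriter(ms, Encoding.Unicode))
            {
                writer.Write(text);
            }
            // ToArray still works after the stream has been closed.
            return ms.ToArray();
        }
    }

    // Reads back a string in the same length-prefixed layout.
    public static string ReadPrefixed(byte[] data)
    {
        using (var reader = new BinaryReader(new MemoryStream(data), Encoding.Unicode))
        {
            return reader.ReadString();
        }
    }
}
```

For "abc" this gives one prefix byte followed by six bytes of UTF-16 data. A file whose strings are null-terminated rather than prefixed this way presumably will not parse with ReadString, which would explain why it "doesn't work" here.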

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 17 '05 #6

P: n/a
The terminating char is the null character '\0'.
It is a database file. Everything works except that I can't get the strings
read properly.

This seems to work:

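// binReader is assumed to be a BinaryReader created with Encoding.Unicode,
// so that ReadChar() consumes one 16-bit code unit at a time.
private string ReadUtf16String()
{
    try
    {
        string readString = String.Empty;

        // Accumulate characters until the null terminator.
        char c = binReader.ReadChar();
        while (c != '\0')
        {
            readString += c.ToString();
            c = binReader.ReadChar();
        }

        return readString;
    }
    catch (EndOfStreamException)
    {
        // Ran off the end of the file: flag it and report no string.
        eof = true;
        return null;
    }
    catch (Exception ex)
    {
        throw new Exception("Error reading a UTF-16 string.", ex);
    }
}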

However, for some reason some strings have two bytes in front of them
that tell the size, I suppose? Is that a standard for UTF-16 strings or
is this something that is limited to this particular file? Is there a
way to distinguish between the strings that have these two bytes and the
ones that don't? (This is the main part that is causing the trouble.)
Should I dispose of those two bytes or are they part of the string?

I understand that these questions might be relevant only to this
database, but if you have any advice, I will greatly appreciate it.

Thanks,
Nick Z.

Nov 17 '05 #7

P: n/a
Nick Z. <an*******@none.com> wrote:
> The terminating char is the null character '\0'.
> It is a database file. Everything works except that I can't get the strings
> read properly.

Okay, if it's a database file, and therefore a binary file with text
bits in, you'd be best off reading it as binary into a buffer (possibly
writing into a temporary MemoryStream), finding the terminating bytes
(two 0s by the sounds of it) and then using Encoding.GetString to
convert that buffer into a string.

I assume you have no control over the format of this database?
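
A rough sketch of the buffer-based approach described above, assuming the strings are UTF-16 little-endian and end with a two-byte zero terminator on a two-byte boundary. The class name NullTerminatedUtf16 is made up for illustration; Stream.ReadByte, MemoryStream and Encoding.Unicode.GetString are the standard .NET calls.

```csharp
using System;
using System.IO;
using System.Text;

public static class NullTerminatedUtf16
{
    // Copies 16-bit code units from the input into a temporary buffer until
    // a 0x0000 terminator (or end of stream), then decodes the buffer.
    public static string Read(Stream input)
    {
        using (var buffer = new MemoryStream())
        {
            while (true)
            {
                int low = input.ReadByte();
                int high = input.ReadByte();
                if (low < 0 || high < 0) break;    // end of stream
                if (low == 0 && high == 0) break;  // two zero bytes: terminator
                buffer.WriteByte((byte)low);
                buffer.WriteByte((byte)high);
            }
            return Encoding.Unicode.GetString(buffer.ToArray());
        }
    }
}
```

Because this works on the stream directly, the stream position ends up just past the terminator, so the binary fields that follow can be read without the read-ahead problem StreamReader has.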

> This seems to work:
>
> private string ReadUtf16String()
> {
>     try
>     {
>         string readString = String.Empty;
>
>         char c = binReader.ReadChar();
>         while (c != '\0')
>         {
>             readString += c.ToString();
>             c = binReader.ReadChar();
>         }
>
>         return readString;
>     }
>     catch (EndOfStreamException)
>     {
>         eof = true;
>         return null;
>     }
>     catch (Exception ex)
>     {
>         throw new Exception("Error reading a UTF-16 string.", ex);
>     }
> }

Well, that's a really bad way of building up a string, to start with -
use StringBuilder instead. However, other than that it will work - but
it might be slow.
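
For what it's worth, a StringBuilder version of the same loop might look like the sketch below. It assumes the BinaryReader was constructed with Encoding.Unicode; the class name ReadCharLoop and the choice to pass the reader in as a parameter are illustrative, not part of the original code. Repeated += on a string allocates a new string each time round the loop, which is what makes the original slow for long strings.

```csharp
using System;
using System.IO;
using System.Text;

public static class ReadCharLoop
{
    // Reads characters up to (and consuming) a '\0' terminator.
    // The reader is assumed to have been created with Encoding.Unicode.
    // ReadChar throws EndOfStreamException if the stream ends first.
    public static string ReadUtf16String(BinaryReader reader)
    {
        var builder = new StringBuilder();
        char c;
        while ((c = reader.ReadChar()) != '\0')
        {
            builder.Append(c);
        }
        return builder.ToString();
    }
}
```

End-of-stream handling is left to the caller here, rather than being turned into a flag and a null return as in the original.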

> However, for some reason some strings have two bytes in front of them
> that tell the size, I suppose? Is that a standard for UTF-16 strings or
> is this something that is limited to this particular file? Is there a
> way to distinguish between the strings that have these two bytes and the
> ones that don't? (This is the main part that is causing the trouble.)
> Should I dispose of those two bytes or are they part of the string?
>
> I understand that these questions might be relevant only to this
> database, but if you have any advice, I will greatly appreciate it.


No, that's not standard. It sounds like you really need to get hold of
the specs of the database format. Databases are likely to have formats
which are somewhat quicker to understand than this - it seems unlikely
that they'd want to go hunting through a file for terminators just to
find out where a field ends.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 17 '05 #8

P: n/a
Jon Skeet [C# MVP] wrote:
> Nick Z. <an*******@none.com> wrote:
> > The terminating char is the null character '\0'.
> > It is a database file. Everything works except that I can't get the strings
> > read properly.
>
> Okay, if it's a database file, and therefore a binary file with text
> bits in, you'd be best off reading it as binary into a buffer (possibly
> writing into a temporary MemoryStream), finding the terminating bytes
> (two 0s by the sounds of it) and then using Encoding.GetString to
> convert that buffer into a string.
>
> I assume you have no control over the format of this database?


I don't have control over the format, that is correct.

> > This seems to work:
> >
> > private string ReadUtf16String()
> > {
> >     try
> >     {
> >         string readString = String.Empty;
> >
> >         char c = binReader.ReadChar();
> >         while (c != '\0')
> >         {
> >             readString += c.ToString();
> >             c = binReader.ReadChar();
> >         }
> >
> >         return readString;
> >     }
> >     catch (EndOfStreamException)
> >     {
> >         eof = true;
> >         return null;
> >     }
> >     catch (Exception ex)
> >     {
> >         throw new Exception("Error reading a UTF-16 string.", ex);
> >     }
> > }
>
> Well, that's a really bad way of building up a string, to start with -
> use StringBuilder instead. However, other than that it will work - but
> it might be slow.


Yes I realize this, but performance is not my major concern now.
I will improve on it down the road, thanks.

> > However, for some reason some strings have two bytes in front of them
> > that tell the size, I suppose? Is that a standard for UTF-16 strings or
> > is this something that is limited to this particular file? Is there a
> > way to distinguish between the strings that have these two bytes and the
> > ones that don't? (This is the main part that is causing the trouble.)
> > Should I dispose of those two bytes or are they part of the string?
> >
> > I understand that these questions might be relevant only to this
> > database, but if you have any advice, I will greatly appreciate it.
>
> No, that's not standard. It sounds like you really need to get hold of
> the specs of the database format. Databases are likely to have formats
> which are somewhat quicker to understand than this - it seems unlikely
> that they'd want to go hunting through a file for terminators just to
> find out where a field ends.


The specs say that the strings are 16-bit UCS-2 characters (little
endian), variable length. I just need a way of knowing when I need to
dispose of the first two bytes. It seems that when the database contains
only English characters, the bytes are not even there; not empty, just
not there. I have a database that has some Taiwanese strings.

These are the specs, the file I am interested in is H10DB.dat:
http://scribbleninja.org.uk/iriver/w..._Specification
Nov 17 '05 #9

P: n/a
Nick Z. <an*******@none.com> wrote:
> > No, that's not standard. It sounds like you really need to get hold of
> > the specs of the database format. Databases are likely to have formats
> > which are somewhat quicker to understand than this - it seems unlikely
> > that they'd want to go hunting through a file for terminators just to
> > find out where a field ends.
>
> The specs say that the strings are 16-bit UCS-2 characters (little
> endian), variable length. I just need a way of knowing when I need to
> dispose of the first two bytes. It seems that when the database contains
> only English characters, the bytes are not even there; not empty, just
> not there. I have a database that has some Taiwanese strings.
>
> These are the specs, the file I am interested in is H10DB.dat:
> http://scribbleninja.org.uk/iriver/w..._Specification


Is the character you read always the same? If so, could you just test
for that and ignore it?

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 17 '05 #10

P: n/a
No, the first two bytes are not the same for different strings, if that's
what you mean.

But it looks like the bytes are never characters; it's always displayed
as the same "square" when opened in a text editor.

Is there a way to check if these two bytes are valid characters or not?
That way I will probably be able to tell the difference between these
strings.

Nov 17 '05 #11

P: n/a
Nick Z. <an*******@none.com> wrote:
> No, the first two bytes are not the same for different strings, if that's
> what you mean.
>
> But it looks like the bytes are never characters; it's always displayed
> as the same "square" when opened in a text editor.
>
> Is there a way to check if these two bytes are valid characters or not?
> That way I will probably be able to tell the difference between these
> strings.


Well, they probably *are* valid characters - just not characters your
text editor supports. If you only ever get one at the start of a
non-ASCII string, and you *always* get one at the start of a non-ASCII
string, just read it as a character, check whether it's in ASCII or not
(i.e. is its code < 128) and if not, discard it.
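
As a sketch of that check, assuming the string has already been read up to its terminator: the class and method names (PrefixFilter, IsAscii, StripLeadingNonAscii) are made up for illustration. The heuristic only holds under the assumption stated above, namely that a non-ASCII first character is always the extra two-byte prefix. A string that genuinely starts with a non-ASCII character and has no prefix would lose its first character.

```csharp
using System;

public static class PrefixFilter
{
    // ASCII covers code points 0 through 127.
    public static bool IsAscii(char c)
    {
        return c < 128;
    }

    // Drops the first character when it falls outside ASCII, on the
    // assumption that it is the format's two-byte prefix and not text.
    public static string StripLeadingNonAscii(string raw)
    {
        if (raw.Length > 0 && !IsAscii(raw[0]))
        {
            return raw.Substring(1);
        }
        return raw;
    }
}
```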

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 17 '05 #12

P: n/a
I think that hit the spot! =)

Thank you very very much. You've been a tremendous help!
Nick Z.

Nov 17 '05 #13

P: n/a
> But it looks like the bytes are never characters; it's always displayed
> as the same "square" when opened in a text editor.

This is probably the sign that the character has no glyph in the current
font. Try using some big Unicode font (Arial Unicode MS, or Lucida Sans
Unicode).
--
Mihai Nita [Microsoft MVP, Windows - SDK]
------------------------------------------
Replace _year_ with _ to get the real email
Nov 17 '05 #14

P: n/a
Mihai N. <nm**************@yahoo.com> wrote:
> > But it looks like the bytes are never characters; it's always displayed
> > as the same "square" when opened in a text editor.
>
> This is probably the sign that the character has no glyph in the current
> font. Try using some big Unicode font (Arial Unicode MS, or Lucida Sans
> Unicode).


Or look up the character code on www.unicode.org for an authoritative
answer :)

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 17 '05 #15
