Need help with unicode strings.

I have a file that I want to read a UTF-16 unicode string from.
What is the easiest way to accomplish that?

Thanks in advance,
Nick Z.
Nov 17 '05 #1
Nick Z. <an*******@none.com> wrote:
> I have a file that I want to read a UTF-16 unicode string from.
> What is the easiest way to accomplish that?


Using a StreamReader with Encoding.Unicode - it's a doddle.

(I don't believe you'll see any difference between UCS-2 and UTF-16 - I
don't know which Encoding.Unicode is supposed to represent, but I'm
sure it *actually* ends up working out as UTF-16, because .NET strings
are UTF-16 in some sense anyway...)
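
For reference, a minimal C# sketch of that approach. Encoding.Unicode is .NET's UTF-16 little-endian encoding. The names Utf16FileDemo and ReadAllUtf16 are hypothetical, and the sketch assumes the whole file is UTF-16 text (which later turns out not to be the case in this thread).

```csharp
using System.IO;
using System.Text;

public static class Utf16FileDemo
{
    // Reads an entire file as UTF-16 little-endian text.
    // Hypothetical helper; assumes the file contains nothing but text.
    public static string ReadAllUtf16(string path)
    {
        using (StreamReader reader = new StreamReader(path, Encoding.Unicode))
        {
            return reader.ReadToEnd();
        }
    }
}
```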

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 17 '05 #2
Use a StreamReader with the Unicode encoding:

Dim reader As New StreamReader(fs, System.Text.Encoding.Unicode)

where fs is a FileStream.

hth,
Alan.

Nov 17 '05 #3
Thanks,
I can't believe I overlooked this.
I even played with the Encoding class before and it didn't work for
me for some reason; it works now...

Thanks for taking the time to answer,
Nick Z.

Nov 17 '05 #4
I am having trouble with the StreamReader.
As far as I can see there is no way to read a string.
Really, I don't know why this is so hard.

In the file there is a simple string terminated with three empty bytes.
Is there no function like ReadString() anywhere? One that would simply read
a Unicode string until it gets to a terminating character.

StreamReader:

The ReadLine method reads the null bytes and goes on to read another 1000
bytes until it gets to a 0x0D or something of that order.

Read() is described as returning the next character, yet it returns an int?
What? Casting the int to a char doesn't seem to work...

Read(char[],int,int) reads the string fine (I think), assuming I found
the length of the string beforehand. However, right after the method
returns, the Position property of the BaseStream has been fast-forwarded
1000 bytes or so when only 35 characters were read.

ReadBlock() is the same as Read(char[],int,int) as far as I can see.

BinaryReader:

ReadString(), oh yeah, I thought. However, this is straight from the docs:
"The string is prefixed with the length, encoded as an integer seven
bits at a time." Isn't this UTF-7? In any case it doesn't work for me.

So what are my options at this point?

Thanks for taking the time to answer,
Nick Z.

Nov 17 '05 #5
Nick Z. <an*******@none.com> wrote:
> I am having trouble with the StreamReader.
> As far as I can see there is no way to read a string.
> Really, I don't know why this is so hard.
>
> In the file there is a simple string terminated with three empty bytes.
> Is there no function like ReadString() anywhere? One that would simply read
> a Unicode string until it gets to a terminating character.

What exactly do you mean by "terminating character"?

> StreamReader:
>
> The ReadLine method reads the null bytes and goes on to read another 1000
> bytes until it gets to a 0x0D or something of that order.

Indeed.

> Read() is described as returning the next character, yet it returns an int?
> What? Casting the int to a char doesn't seem to work...
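
On the Read() question: it returns an int so that it can signal end of stream with -1, and any other value can be cast to char. A minimal sketch of that loop, with ReadIntDemo and ReadChars as hypothetical names (it takes a TextReader, the base class of StreamReader):

```csharp
using System.Collections.Generic;
using System.IO;

public static class ReadIntDemo
{
    // Reads characters one at a time until Read() reports end of stream (-1).
    public static List<char> ReadChars(TextReader reader)
    {
        List<char> chars = new List<char>();
        int next;
        while ((next = reader.Read()) != -1)
        {
            chars.Add((char)next); // safe: every value other than -1 is a char code
        }
        return chars;
    }
}
```

If the cast appears not to work, the reader was probably not created with the encoding the file actually uses.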

> Read(char[],int,int) reads the string fine (I think), assuming I found
> the length of the string beforehand. However, right after the method
> returns, the Position property of the BaseStream has been fast-forwarded
> 1000 bytes or so when only 35 characters were read.

Yes.
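
The jump in Position comes from buffering: StreamReader fetches a block of bytes from the underlying stream and decodes from its own buffer, so BaseStream.Position reflects bytes fetched, not characters returned. A small sketch to observe this (BufferingDemo and PositionAfterOneChar are hypothetical names, and the exact value depends on the reader's buffer size):

```csharp
using System.IO;
using System.Text;

public static class BufferingDemo
{
    // Returns the underlying stream position after reading a single character.
    public static long PositionAfterOneChar()
    {
        byte[] data = Encoding.Unicode.GetBytes(new string('x', 100)); // 200 bytes
        using (MemoryStream stream = new MemoryStream(data))
        using (StreamReader reader = new StreamReader(stream, Encoding.Unicode))
        {
            reader.Read();          // consumes one character (two bytes of UTF-16)
            return stream.Position; // but the reader has already fetched more
        }
    }
}
```

This is why mixing StreamReader calls with direct reads on the same stream tends to go wrong in a mixed binary/text file.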

> ReadBlock() is the same as Read(char[],int,int) as far as I can see.
>
> BinaryReader:
>
> ReadString(), oh yeah, I thought. However, this is straight from the docs:
> "The string is prefixed with the length, encoded as an integer seven
> bits at a time." Isn't this UTF-7? In any case it doesn't work for me.

No, it's not UTF-7. Only the *length* is encoded as an integer seven
bits at a time.
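
To illustrate: BinaryReader.ReadString() expects the layout that BinaryWriter.Write(string) produces, i.e. a length prefix (the byte count, as a 7-bit-encoded integer) followed by the string's bytes in the chosen encoding. The sketch below dumps those bytes; LengthPrefixDemo and WriteString are hypothetical names.

```csharp
using System.IO;
using System.Text;

public static class LengthPrefixDemo
{
    // Returns the raw bytes BinaryWriter emits for a string in UTF-16 LE.
    public static byte[] WriteString(string value)
    {
        using (MemoryStream stream = new MemoryStream())
        using (BinaryWriter writer = new BinaryWriter(stream, Encoding.Unicode))
        {
            writer.Write(value); // length prefix (byte count), then the encoded bytes
            writer.Flush();
            return stream.ToArray();
        }
    }
}
```

For "Hi" this gives 04 48 00 69 00: one prefix byte saying four bytes follow, then two UTF-16 code units. A null-terminated string has no such prefix, which is most likely why ReadString() does not work on this file.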

> So what are my options at this point?

Well, you could start by telling us what your file format is. It sounds
like it's a mixture of binary and text, which is bad news to start with
I'm afraid.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 17 '05 #6
The terminating char is the null character '\0'.
It is a database file. Everything works except that I can't get the strings
read properly.

This seems to work:

private string ReadUtf16String()
{
    try
    {
        string readString = String.Empty;

        char c = binReader.ReadChar();
        while (c != '\0')
        {
            readString += c.ToString();
            c = binReader.ReadChar();
        }

        return readString;
    }
    catch (EndOfStreamException)
    {
        eof = true;
        return null;
    }
    catch (Exception ex)
    {
        throw new Exception("Error reading a UTF-16 string.", ex);
    }
}

However, for some reason some strings have two bytes in front of them
that tell the size, I suppose? Is that a standard for UTF-16 strings or
is this something that is limited to this particular file? Is there a
way to distinguish between the strings that have these two bytes and the
ones that don't? (This is the main part that is causing the trouble.)
Should I dispose of those two bytes or are they part of the string?

I understand that these questions might be relevant only to this
database, but if you have any advice, I will greatly appreciate it.

Thanks,
Nick Z.
Nov 17 '05 #7
Nick Z. <an*******@none.com> wrote:
> The terminating char is the null character '\0'.
> It is a database file. Everything works except that I can't get the strings
> read properly.

Okay, if it's a database file, and therefore a binary file with text
bits in, you'd be best off reading it as binary into a buffer (possibly
writing into a temporary MemoryStream), finding the terminating bytes
(two 0s by the sounds of it) and then using Encoding.GetString to
convert that buffer into a string.

I assume you have no control over the format of this database?
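
A rough sketch of that buffer-based approach, assuming the strings are UTF-16 little-endian, aligned on two-byte boundaries, and terminated by two zero bytes. NullTerminatedReader and ReadUtf16Z are hypothetical names, not anything from the database's spec.

```csharp
using System.IO;
using System.Text;

public static class NullTerminatedReader
{
    // Copies two-byte code units into a temporary MemoryStream until a
    // 0x0000 terminator (or end of stream), then decodes them in one call.
    public static string ReadUtf16Z(Stream input)
    {
        using (MemoryStream buffer = new MemoryStream())
        {
            while (true)
            {
                int low = input.ReadByte();
                int high = input.ReadByte();
                if (low < 0 || high < 0) break;   // end of stream
                if (low == 0 && high == 0) break; // two zero bytes: terminator
                buffer.WriteByte((byte)low);
                buffer.WriteByte((byte)high);
            }
            return Encoding.Unicode.GetString(buffer.ToArray());
        }
    }
}
```

Reading in aligned pairs matters: a single zero byte is common inside UTF-16 text (it is the high byte of every ASCII character), so only a full zero code unit counts as the terminator.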

> This seems to work:
>
> private string ReadUtf16String()
> {
>     try
>     {
>         string readString = String.Empty;
>
>         char c = binReader.ReadChar();
>         while (c != '\0')
>         {
>             readString += c.ToString();
>             c = binReader.ReadChar();
>         }
>
>         return readString;
>     }
>     catch (EndOfStreamException)
>     {
>         eof = true;
>         return null;
>     }
>     catch (Exception ex)
>     {
>         throw new Exception("Error reading a UTF-16 string.", ex);
>     }
> }

Well, that's a really bad way of building up a string, to start with -
use StringBuilder instead. However, other than that it will work - but
it might be slow.
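
A sketch of the same loop using StringBuilder, which avoids allocating a new string on every character. It assumes the BinaryReader was created with Encoding.Unicode. The class name StringBuilderReader is hypothetical, and the eof flag from the quoted code is left out for brevity.

```csharp
using System.IO;
using System.Text;

public static class StringBuilderReader
{
    // Reads chars until '\0' and accumulates them in a StringBuilder.
    // Returns null if the stream ends before a terminator is found,
    // mirroring the convention of the quoted code.
    public static string ReadUtf16String(BinaryReader binReader)
    {
        StringBuilder builder = new StringBuilder();
        try
        {
            char c;
            while ((c = binReader.ReadChar()) != '\0')
            {
                builder.Append(c);
            }
        }
        catch (EndOfStreamException)
        {
            return null;
        }
        return builder.ToString();
    }
}
```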

> However, for some reason some strings have two bytes in front of them
> that tell the size, I suppose? Is that a standard for UTF-16 strings or
> is this something that is limited to this particular file? Is there a
> way to distinguish between the strings that have these two bytes and the
> ones that don't? (This is the main part that is causing the trouble.)
> Should I dispose of those two bytes or are they part of the string?
>
> I understand that these questions might be relevant only to this
> database, but if you have any advice, I will greatly appreciate it.

No, that's not standard. It sounds like you really need to get hold of
the specs of the database format. Databases are likely to have formats
which are somewhat quicker to understand than this - it seems unlikely
that they'd want to go hunting through a file for terminators just to
find out where a field ends.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 17 '05 #8
Jon Skeet [C# MVP] wrote:
> Nick Z. <an*******@none.com> wrote:
>> The terminating char is the null character '\0'.
>> It is a database file. Everything works except that I can't get the strings
>> read properly.
>
> Okay, if it's a database file, and therefore a binary file with text
> bits in, you'd be best off reading it as binary into a buffer (possibly
> writing into a temporary MemoryStream), finding the terminating bytes
> (two 0s by the sounds of it) and then using Encoding.GetString to
> convert that buffer into a string.
>
> I assume you have no control over the format of this database?

I don't have control over the format, that is correct.

>> This seems to work:
>>
>> private string ReadUtf16String()
>> {
>>     try
>>     {
>>         string readString = String.Empty;
>>
>>         char c = binReader.ReadChar();
>>         while (c != '\0')
>>         {
>>             readString += c.ToString();
>>             c = binReader.ReadChar();
>>         }
>>
>>         return readString;
>>     }
>>     catch (EndOfStreamException)
>>     {
>>         eof = true;
>>         return null;
>>     }
>>     catch (Exception ex)
>>     {
>>         throw new Exception("Error reading a UTF-16 string.", ex);
>>     }
>> }
>
> Well, that's a really bad way of building up a string, to start with -
> use StringBuilder instead. However, other than that it will work - but
> it might be slow.

Yes, I realize this, but performance is not my major concern now.
I will improve on it down the road, thanks.

>> However, for some reason some strings have two bytes in front of them
>> that tell the size, I suppose? Is that a standard for UTF-16 strings or
>> is this something that is limited to this particular file? Is there a
>> way to distinguish between the strings that have these two bytes and the
>> ones that don't? (This is the main part that is causing the trouble.)
>> Should I dispose of those two bytes or are they part of the string?
>>
>> I understand that these questions might be relevant only to this
>> database, but if you have any advice, I will greatly appreciate it.
>
> No, that's not standard. It sounds like you really need to get hold of
> the specs of the database format. Databases are likely to have formats
> which are somewhat quicker to understand than this - it seems unlikely
> that they'd want to go hunting through a file for terminators just to
> find out where a field ends.

The specs say that the strings are 16-bit UCS-2 characters (little
endian), variable length. I just need a way of knowing when I need to
dispose of the first two bytes. It seems that when the database contains
only English characters, the bytes are not even there: not empty, just
not there. I have a database that has some Taiwanese strings.

These are the specs; the file I am interested in is H10DB.dat:
http://scribbleninja.org.uk/iriver/w..._Specification
Nov 17 '05 #9
Nick Z. <an*******@none.com> wrote:
>> No, that's not standard. It sounds like you really need to get hold of
>> the specs of the database format. Databases are likely to have formats
>> which are somewhat quicker to understand than this - it seems unlikely
>> that they'd want to go hunting through a file for terminators just to
>> find out where a field ends.
>
> The specs say that the strings are 16-bit UCS-2 characters (little
> endian), variable length. I just need a way of knowing when I need to
> dispose of the first two bytes. It seems that when the database contains
> only English characters, the bytes are not even there: not empty, just
> not there. I have a database that has some Taiwanese strings.
>
> These are the specs; the file I am interested in is H10DB.dat:
> http://scribbleninja.org.uk/iriver/w..._Specification

Is the character you read always the same? If so, could you just test
for that and ignore it?

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 17 '05 #10
No, the first two bytes are not the same for different strings, if that's
what you mean.

But it looks like the bytes are never characters; it's always displayed
as the same "square" when opened in a text editor.

Is there a way to check if these two bytes are valid characters or not?
That way I will probably be able to tell the difference between these
strings.

Nov 17 '05 #11
Nick Z. <an*******@none.com> wrote:
> No, the first two bytes are not the same for different strings, if that's
> what you mean.
>
> But it looks like the bytes are never characters; it's always displayed
> as the same "square" when opened in a text editor.
>
> Is there a way to check if these two bytes are valid characters or not?
> That way I will probably be able to tell the difference between these
> strings.

Well, they probably *are* valid characters - just not characters your
text editor supports. If you only ever get one at the start of a
non-ASCII string, and you *always* get one at the start of a non-ASCII
string, just read it as a character, check whether it's in ASCII or not
(i.e. is its code < 128) and if not, discard it.
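
A sketch of that check, resting entirely on the assumptions stated above: the extra character appears only, and always, at the start of non-ASCII strings. PrefixSkippingReader and ReadString are hypothetical names, and the BinaryReader is assumed to have been created with Encoding.Unicode.

```csharp
using System.IO;
using System.Text;

public static class PrefixSkippingReader
{
    // Reads a '\0'-terminated string. If the first char is outside ASCII
    // (code > 127), it is treated as the unexplained prefix and discarded.
    public static string ReadString(BinaryReader reader)
    {
        StringBuilder builder = new StringBuilder();
        char c = reader.ReadChar();
        if (c > 127)
        {
            c = reader.ReadChar(); // drop the prefix, move to the real first char
        }
        while (c != '\0')
        {
            builder.Append(c);
            c = reader.ReadChar();
        }
        return builder.ToString();
    }
}
```

If those assumptions do not hold for the file, this will silently eat the first real character of a string, so it is worth checking against known data first.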

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 17 '05 #12
I think that hit the spot! =)

Thank you very very much. You've been a tremendous help!
Nick Z.

Nov 17 '05 #13
> But it looks like the bytes are never characters, its always displayed
> as the same "square" when opened in a text editor.

This is probably the sign that the character has no glyph in the current
font. Try using some big Unicode font (Arial Unicode MS, or Lucida Sans
Unicode).
--
Mihai Nita [Microsoft MVP, Windows - SDK]
------------------------------------------
Replace _year_ with _ to get the real email
Nov 17 '05 #14
Mihai N. <nm**************@yahoo.com> wrote:
>> But it looks like the bytes are never characters, its always displayed
>> as the same "square" when opened in a text editor.
>
> This is probably the sign that the character has no glyph in the current
> font. Try using some big Unicode font (Arial Unicode MS, or Lucida Sans
> Unicode).

Or look up the character code on www.unicode.org for an authoritative
answer :)

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 17 '05 #15
