How to read unicode file line by line on Linux platform

Hi, all:
I need to parse a Unicode file and read its data one line at a
time.
I use _wfopen(), fgetws(), wcslen(), and wcsstr(), and this works
fine on Windows.

However, after migrating the code to Linux, a problem appears.
Linux only has fopen(), and fgetws() does not read the lines
correctly; in fact, it returns nothing.

I thought about using fread() instead, but it does not read data
line by line.

Is there a good way to solve this problem?

Thanks~

Nov 15 '05 #1
<he*******@gmail.com> wrote in message
news:11*********************@g49g2000cwa.googlegroups.com...
> Hi, all:
> I need to parse a Unicode file and read its data one line at a
> time.
> I use _wfopen(), fgetws(), wcslen(), and wcsstr(), and this works
> fine on Windows.
>
> However, after migrating the code to Linux, a problem appears.
> Linux only has fopen(), and fgetws() does not read the lines
> correctly; in fact, it returns nothing.
>
> I thought about using fread() instead, but it does not read data
> line by line.
>
> Is there a good way to solve this problem?


Yes, go to www.unicode.org and get yourself the article "To the BMP and
beyond!" by Muller of Adobe Systems, the Unicode FAQ, the Unicode standard
and some charts. Find out how "code points" are stored in UTF-8 and UTF-16.
Write code to read and write code points in the needed UTF from and to the
file. Then process the file code point by code point. Most likely you'll
only need to look for code points with values of 13 and 10 (i.e. the famous
'\r' and '\n' :) to find out where the lines begin and end. But for full
Unicode coverage, please do read the Unicode FAQ and standard.

HTH
Alex
Nov 15 '05 #2
he*******@gmail.com wrote:
> Hi, all:
> I need to parse a Unicode file and read its data one line at a
> time.
My first guess at "unicode file" would be a file which contains some
documentation on Unicode, kinda like this "unicode file" (not the link,
but the actual file):

http://www.unicode.org/faq/basic_q.html#a
> I use _wfopen(), fgetws(), wcslen(), and wcsstr(), and this works
> fine on Windows.
>
> However, after migrating the code to Linux, a problem appears.
> Linux only has fopen(), and fgetws() does not read the lines
> correctly; in fact, it returns nothing.
>
> I thought about using fread() instead, but it does not read data
> line by line.


So, with what encoding are the file's contents encoded? Note that "unicode"
is not an answer. Possible answers are UTF-16LE, UTF-16BE, UTF-16 with
BOM, UTF-8, UTF-7, ASCII, ISO-8859-1, ISO-2022-JP, Big5, etc.
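
When the encoding is unknown, the first few bytes can at least give a hint. Below is a rough sketch that checks for a byte order mark; encoding_from_bom() is a made-up name, and a missing BOM proves nothing, since UTF-8 and UTF-16 files are often written without one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Guess an encoding from a leading byte order mark.
   p points at the first n bytes of the file.
   Returns a label, or NULL when no BOM is recognised.
   The UTF-32 checks come first because the UTF-32LE BOM starts with
   the same two bytes as the UTF-16LE BOM; a UTF-16LE file that begins
   with BOM + U+0000 would be misreported (rare, but possible). */
const char *encoding_from_bom(const unsigned char *p, size_t n)
{
    assert(p != NULL || n == 0);
    if (n >= 4 && memcmp(p, "\xFF\xFE\x00\x00", 4) == 0)
        return "UTF-32LE";
    if (n >= 4 && memcmp(p, "\x00\x00\xFE\xFF", 4) == 0)
        return "UTF-32BE";
    if (n >= 3 && memcmp(p, "\xEF\xBB\xBF", 3) == 0)
        return "UTF-8";
    if (n >= 2 && memcmp(p, "\xFF\xFE", 2) == 0)
        return "UTF-16LE";
    if (n >= 2 && memcmp(p, "\xFE\xFF", 2) == 0)
        return "UTF-16BE";
    return NULL;
}
```

To use it, fread() the first four bytes of the file into a buffer and pass the buffer along with the number of bytes actually read.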

I'll take a guess, though. Likely it's one of the UTF-16 encodings. In which
case, note that on Linux the natural encoding for Unicode text is UTF-8.
UTF-8 and UTF-16 are wildly different from the standpoint of C: UTF-16 text
is full of zero bytes, which the char-based string functions treat as
terminators. You'll need to convert the file. A great C library for dealing
with the myriad issues with Unicode and UTF is ICU:

http://icu.sourceforge.net/
http://www-306.ibm.com/software/glob.../icu/index.jsp
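
If ICU is more than the job needs, the conversion step alone can be sketched with POSIX iconv(), which glibc provides. To be clear, this is a swapped-in alternative and not ICU. It assumes the input is UTF-16LE and already in memory; utf16le_to_utf8() is a made-up name.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Convert in_len bytes of UTF-16LE text to UTF-8 in out (capacity out_cap).
   Returns the number of bytes written, or (size_t)-1 on failure: the
   converter is unavailable, the input is malformed or truncated, or the
   output buffer is too small. A leading BOM is not stripped; it would be
   converted like any other character. */
size_t utf16le_to_utf8(const char *in, size_t in_len,
                       char *out, size_t out_cap)
{
    assert(in != NULL && out != NULL);

    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");   /* (to, from) */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *src = (char *)in;     /* iconv() takes char **, input is not modified */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_cap - dst_left;
}
```

Once the text is UTF-8, plain fgets() and the char-based string functions can find line breaks safely, because the bytes for '\r' and '\n' never appear inside a multi-byte UTF-8 sequence. The iconv command-line tool does the same conversion for whole files.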

If I sound harsh or condescending it's because Unicode and UTF require a
significant rethinking of how one deals with text, and that cannot be
overstated. It goes way beyond the differences between UTF-16 and UTF-8.
And having to interoperate with broken software all day has hardened me.

Also note that this is all beyond the scope of what comp.lang.c deals with.

- Bill
Nov 15 '05 #3
