he*******@gmail.com wrote:
> Hi, all:
> I just need to parse a Unicode file, reading its data one line at
> a time.
My first guess at "unicode file" would be a file which contains some
documentation on Unicode, kinda like this "unicode file" (not the link,
but the actual file):
http://www.unicode.org/faq/basic_q.html#a
> I use _wfopen(), fgetws(), wcslen(), and wcsstr(), and they work
> normally on the Windows platform.
> However, when I migrate to Linux, a problem occurs.
> Linux has only fopen(), and fgetws() does not read lines correctly;
> in fact, it reads nothing.
> I thought of using fread() instead, but that does not give me the
> data line by line.
So, with what encoding are the file's contents encoded? Note that "unicode"
is not an answer. Possible answers are UTF-16LE, UTF-16BE, UTF-16 with
BOM, UTF-8, UTF-7, ASCII, ISO-8859-1, ISO-2022-JP, Big5, etc.
I'll take a guess, though: it's likely one of the UTF-16 encodings. In
that case, note that on Linux the natural encoding for Unicode text is
UTF-8. UTF-8 and UTF-16 are wildly different from the standpoint of C,
so you'll need to convert the file. A great C library for dealing with
the myriad issues around Unicode and the UTF encodings is ICU:
http://icu.sourceforge.net/
http://www-306.ibm.com/software/glob.../icu/index.jsp
If I sound harsh or condescending, it's because Unicode and UTF require
a significant rethinking of how one deals with text, and that cannot be
overstated. It goes way beyond the differences between UTF-16 and UTF-8.
And having to interoperate with broken software all day has hardened me.
Also note that this is all beyond the scope of what comp.lang.c deals with.
- Bill