how to read a Unicode file

I want to read an XML file encoded in Unicode (UTF-16), UTF-8, or a native
encoding into a wchar_t string, so I wrote the routine below. However,
sometimes a Unicode file containing Chinese characters is not read
completely, and I cannot tell where the problem lies. Any help would be
appreciated.
Thanks.
static Status LoadXMLFile2String(const char *filename, wchar_t *text){
    FILE *f;
    if(!(f = fopen(filename, "r"))){
        __printDebugA(" Input file %s cannot be opened.", filename);
        return ERROR;
    }
    char *encoding;
    //transform routine: other --unicode --other
    const unsigned char UTF_8_HEAD[3] = {239, 187, 191};
    const unsigned char UNICODE_HEAD[2] = {255, 254};
    const unsigned char UNICODE_BIGENDIAN_HEAD[2] = {254, 255};
    unsigned char head[3];
    fread(head, 1, 3, f);
    if(!memcmp(head, UNICODE_HEAD, 2)){
        encoding = "UNICODE";
    }
    else if(!memcmp(head, UNICODE_BIGENDIAN_HEAD, 2)){
        encoding = "UNICODE_BIGENDIAN";
    }
    else if(!memcmp(head, UTF_8_HEAD, 3)){
        encoding = "UTF_8";
    }
    else{
        encoding = "ANSI";
    }
    char *str = (char *) malloc((MAXXMLFILESIZE + 1) * sizeof(char));
    int i = 0;
    if(!strcmp(encoding, "ANSI")){
        str[0] = head[0];
        str[1] = head[1];
        str[2] = head[2];
        i = 3;
    }
    else if(!strcmp(encoding, "UNICODE") || !strcmp(encoding, "UNICODE_BIGENDIAN")){
        str[0] = head[2];
        i = 1;
    }
    while(!feof(f)){
        if(i >= MAXXMLFILESIZE){
            db_error(L"The file is too large.");
            return ERROR;
        }
        str[i] = fgetc(f);
        i++;
    }
    str[i] = '\0';
    if(!strcmp(encoding, "UNICODE")){
        for(int j = 0; j < i - 1; j++){
            if(j % 2){
                text[j/2] += ((unsigned char) str[j]) << 8;
            }
            else{
                text[j/2] = (unsigned char) str[j];
            }
        }
        text[j/2] = 0;
        //db_debug(L"%d", wcslen(text));
    }
    else if(!strcmp(encoding, "UNICODE_BIGENDIAN")){
        for(int j = 0; j < i; j++){
            if(j % 2){
                text[j/2] = (text[j/2] << 8) + (unsigned char) str[j];
            }
            else{
                text[j/2] = (unsigned char) str[j];
            }
        }
        text[j/2] = 0;
    }
    else if(!strcmp(encoding, "UTF_8")){
        UTF2Unicode(str, text);
    }
    else if(!strcmp(encoding, "ANSI")){
        setlocale(LC_CTYPE, "");
        mbstowcs(text, str, MAXXMLFILESIZE + 1);
    }
    else{
        assert(FALSE);
    }
    free(str);
    fclose(f);
    return OK;
}
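
A possible cause, offered as a guess since the thread never confirms it: the routine above opens the file with "r" (text mode). On Windows C runtimes, text mode translates CR LF pairs and treats the byte 0x1A as end-of-file, and many Chinese characters in UTF-16 contain that byte (for example U+4E1A is stored as 1A 4E in little-endian order), so reading could stop early. The while(!feof(f)) loop also stores the EOF value returned by the last fgetc as if it were data. Below is a minimal sketch of reading the raw bytes in binary mode and decoding UTF-16 from a buffer. The names read_all_bytes and decode_utf16 are hypothetical and not part of the original code, and surrogate pairs are copied through unchanged, which is only adequate where wchar_t is 16 bits wide.

```c
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: read a whole file as raw bytes.
   "rb" avoids text-mode translation (CR LF folding, 0x1A as EOF on Windows).
   Returns a malloc'd buffer (caller frees) or NULL on failure. */
unsigned char *read_all_bytes(const char *filename, size_t *len_out)
{
    FILE *f = fopen(filename, "rb");
    if (!f)
        return NULL;

    size_t cap = 4096, len = 0, got;
    unsigned char *buf = malloc(cap);
    if (!buf) {
        fclose(f);
        return NULL;
    }

    /* Loop on the fread result rather than on feof(). */
    while ((got = fread(buf + len, 1, cap - len, f)) > 0) {
        len += got;
        if (len == cap) {              /* buffer full: double it */
            unsigned char *bigger = realloc(buf, cap * 2);
            if (!bigger) {
                free(buf);
                fclose(f);
                return NULL;
            }
            buf = bigger;
            cap *= 2;
        }
    }
    fclose(f);
    *len_out = len;
    return buf;
}

/* Hypothetical helper: decode UTF-16 bytes (BOM already skipped) into out.
   Writes at most out_cap - 1 code units plus a terminating 0 and returns
   the number of code units written. A trailing odd byte is ignored. */
size_t decode_utf16(const unsigned char *bytes, size_t len, int big_endian,
                    wchar_t *out, size_t out_cap)
{
    size_t n = 0;
    if (out_cap == 0)
        return 0;
    for (size_t i = 0; i + 1 < len && n + 1 < out_cap; i += 2) {
        unsigned hi = big_endian ? bytes[i] : bytes[i + 1];
        unsigned lo = big_endian ? bytes[i + 1] : bytes[i];
        out[n++] = (wchar_t)((hi << 8) | lo);
    }
    out[n] = L'\0';
    return n;
}
```

With the two-byte BOM skipped, a call such as decode_utf16(bytes + 2, len - 2, 0, text, capacity) could stand in for the hand-written little-endian loop, where capacity is the size of the caller's text buffer in wchar_t units.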

Nov 7 '06 #1
help
Nov 10 '06 #2

st******@gmail.com wrote:
I want to read an XML file encoded in Unicode (UTF-16), UTF-8, or a native
encoding into a wchar_t string, so I wrote the routine below. However,
sometimes a Unicode file containing Chinese characters is not read
completely, and I cannot tell where the problem lies. Any help would be
appreciated.
Thanks.
[code snipped]

This code is horrible on so many levels. Mostly I suspect because it is
in C rather than C++.

You will have something much easier to work with if you reformulate
this in C++ and apply some more useful abstractions to it.

As for your error, you are only checking a few encodings and assuming
that there is a BOM to tell you which to use. You need to check the XML
prolog. It may be that the Chinese file is using a different encoding.
K
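
As a rough illustration of checking the XML prolog: when a file has no BOM, the XML declaration on its first line may name the encoding, as in <?xml version="1.0" encoding="GB2312"?>, and a file with neither a BOM nor an encoding declaration is UTF-8 by default under the XML specification. The sketch below pulls the declared name out of the first bytes of the file. The function name xml_declared_encoding is hypothetical, and the code assumes the prolog is in an ASCII-compatible encoding with any BOM already skipped, so it does not cover UTF-16 files.

```c
#include <stddef.h>
#include <string.h>

/* True for the whitespace characters allowed inside an XML declaration. */
static int is_xml_space(char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

/* Hypothetical helper: copy the value of encoding="..." from the XML
   declaration at the start of head into out. Returns 1 if a declaration
   with an encoding name was found and fits in out, 0 otherwise. */
int xml_declared_encoding(const char *head, size_t len,
                          char *out, size_t out_cap)
{
    static const char key[] = "encoding";
    const size_t klen = sizeof key - 1;

    if (len < 5 || out_cap == 0 || memcmp(head, "<?xml", 5) != 0)
        return 0;

    /* The declaration ends at the first "?>"; search only inside it. */
    size_t end = 5;
    while (end + 1 < len && !(head[end] == '?' && head[end + 1] == '>'))
        end++;
    if (end + 1 >= len)
        return 0;                      /* no closing "?>" in the buffer */

    for (size_t i = 5; i + klen <= end; i++) {
        if (memcmp(head + i, key, klen) != 0)
            continue;

        size_t p = i + klen;
        while (p < end && is_xml_space(head[p]))
            p++;
        if (p >= end || head[p] != '=')
            continue;
        p++;
        while (p < end && is_xml_space(head[p]))
            p++;
        if (p >= end || (head[p] != '"' && head[p] != '\''))
            continue;

        char quote = head[p++];        /* value may use " or ' */
        size_t n = 0;
        while (p < end && head[p] != quote && n + 1 < out_cap)
            out[n++] = head[p++];
        if (p >= end || head[p] != quote)
            return 0;                  /* unterminated or too long for out */
        out[n] = '\0';
        return 1;
    }
    return 0;
}
```

The returned name (for instance "GB2312") would then have to be mapped to a suitable conversion routine; which conversion facilities are available depends on the platform, so that step is left out here.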

Nov 10 '06 #3
