Reading Unicode Strings from File

I have a file that was written using Java and the file has unicode
strings. What is the best way to deal with these in C? The file
definition reads:

Data Field   Description
CHAR[32]     File identifier (64 bytes corresponding to a Unicode character
             string padded with '0' Unicode characters).
CHAR[16]     File format version (32 bytes corresponding to a Unicode
             character string "x.y.z", where x, y, z are integers
             corresponding to the major, minor and revision numbers of
             the file format version, padded with '0' Unicode characters).
INTEGER      Main file header length [bytes].
....
The data field definitions are from Java primitives:

CHAR         Unicode character. 16-bit Unicode character.
INTEGER      Signed integer number. 32-bit two's complement signed integer.

There is absolutely no need for these strings to be in unicode format and
I am at a loss as to how to convert them to a standard C character array.
Moreover, in the example below, I seem to be out in the byte count as my
integer is garbage. Any ideas would be greatly appreciated.

#include <stdlib.h>
#include <stdio.h>

#define ARGC_FAILURE 100
#define OPEN_FAILURE 101
#define CLOSE_FAILURE 102

int main(int argc, char *argv[])
{
    FILE *fp;
    long n;
    char d_id[64];
    char d_version[32];
    int d_hdrlen;

    if (argc != 2)
    {
        printf("Usage: read_adf filename\n");
        return ARGC_FAILURE;
    }

    // Open the file
    if ( (fp = fopen(argv[1], "r")) == NULL)
    {
        printf("%s: Error opening %s", argv[0], argv[1]);
        return OPEN_FAILURE;
    }

    // Read the contents
    n = fread(d_id, sizeof(d_id), 1, fp);
    n = fread(d_version, sizeof(d_version), 1, fp);
    n = fread(&d_hdrlen, sizeof(d_hdrlen), 1, fp);

    // Display the contents
    printf(" ID: %s\n", d_id);
    printf(" VER: %s\n", d_version);
    printf(" HDR Length: %d\n", d_hdrlen);

    // Close the file
    if (fclose(fp) == EOF)
    {
        printf("%s: Error closing %s", argv[0], argv[1]);
        return CLOSE_FAILURE;
    }

    return 0;
}


Nov 13 '05 #1
On Mon, 24 Nov 2003 16:50:02 +0000, in comp.lang.c , Jamie
<ja*********************@dummy.com> wrote:
I have a file that was written using Java and the file has unicode
strings. What is the best way to deal with these in C?


C knows nothing of Unicode. However, your platform probably does, since
it seems to use it. Your compiler almost certainly has some
platform-specific functions to convert Unicode to C strings and
vice versa. You might also find that the wchar_t type matches Unicode on
your platform. You'd have to experiment to find out.
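
One way to run that experiment is sketched below. It only reports what the implementation says about wchar_t; the helper names wchar_is_unicode and report_wchar are made up for illustration. The macro __STDC_ISO_10646__ is defined by C99 implementations whose wchar_t values are ISO 10646 (Unicode) code points; if it is absent, nothing is promised either way.

```c
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if the implementation promises that
   wchar_t values are ISO 10646 (Unicode) code points, 0 if it makes
   no such promise. */
int wchar_is_unicode(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: prints the size of wchar_t and whether it is
   documented to hold Unicode code points on this platform. */
void report_wchar(FILE *out)
{
    fprintf(out, "sizeof(wchar_t) = %lu bytes\n",
            (unsigned long)sizeof(wchar_t));
    fprintf(out, "wchar_t holds Unicode code points: %s\n",
            wchar_is_unicode() ? "yes" : "not promised");
}
```

Calling report_wchar(stdout) from main shows whether wchar_t is 16 bits wide, as a Java char is, or something else. Even a matching size says nothing about byte order in the file.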

--
Mark McIntyre
CLC FAQ <http://www.eskimo.com/~scs/C-faq/top.html>
CLC readme: <http://www.angelfire.com/ms3/bchambless0/welcome_to_clc.html>
Nov 13 '05 #2
"Jamie" <ja*********************@dummy.com> wrote:
I have a file that was written using Java and the file has unicode
strings. What is the best way to deal with these in C? The file
definition reads:

Data Field Description
CHAR[32] File identifier (64 bytes corresponding to Unicode
character string padded with '0' Unicode characters.
CHAR[16] File format version (32 bytes corresponding to
Unicode character string "x.y.z" where x, y, z
are integers corresponding to major, minor and
revision number of the File format version)
padded with '0' Unicode characters.
INTEGER Main file header length [bytes].


This seems to work on my system. It assumes that the multibyte
and wide character system on your C implementation follows the
same standard as the Java unicode file.

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <assert.h>

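struct header {
    char id[32];
    int major, minor, revision;
    int length;
};

void read_header(struct header *hd, FILE *fp)
{
    wchar_t wid[32];
    wchar_t wver[16];
    char ver[16];

    /* This approach only works where wchar_t matches Java's 16-bit char
       and int matches Java's 32-bit int. */
    assert(sizeof(wchar_t) == 2);
    assert(sizeof(int) == 4);

    /* File identifier: 32 wide characters, converted to a multibyte string */
    fread(wid, sizeof(wchar_t), 32, fp);
    wcstombs(hd->id, wid, 32);

    /* Version string "x.y.z": 16 wide characters, parsed into three ints */
    fread(wver, sizeof(wchar_t), 16, fp);
    wcstombs(ver, wver, 16);
    sscanf(ver, "%d.%d.%d", &hd->major, &hd->minor, &hd->revision);

    /* Header length, read in the machine's native byte order */
    fread(&hd->length, 1, sizeof(int), fp);
}
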
I also wrote some companion functions for testing:

void write_header(const struct header *hd, FILE *fp)
{
    wchar_t wid[32] = {0};
    wchar_t wver[16] = {0};
    char ver[16];

    assert(sizeof(wchar_t) == 2);
    assert(sizeof(int) == 4);

    mbstowcs(wid, hd->id, 32);
    fwrite(wid, sizeof(wchar_t), 32, fp);

    sprintf(ver, "%d.%d.%d", hd->major, hd->minor, hd->revision);
    mbstowcs(wver, ver, 16);
    fwrite(wver, sizeof(wchar_t), 16, fp);

    fwrite(&hd->length, 1, sizeof(int), fp);
}

int main(int argc, char **argv)
{
    if(argc != 3)
    {
        fprintf(stderr, "Usage requires two arguments,\n"
                        "First is either 'r' (read) or 'w' (write)\n"
                        "Second is the file name\n");
    }
    else if(argv[1][0] == 'r')
    {
        FILE *fp = fopen(argv[2], "rb");
        if(fp == NULL)
        {
            fprintf(stderr, "Error opening file %s for binary read\n", argv[2]);
        }
        else
        {
            struct header hd;
            read_header(&hd, fp);
            printf("id = \"%s\"\n", hd.id);
            printf("ver = %d.%d.%d\n", hd.major, hd.minor, hd.revision);
            printf("length = %d\n", hd.length);
            fclose(fp);
        }
    }
    else if(argv[1][0] == 'w')
    {
        FILE *fp = fopen(argv[2], "wb");
        if(fp == NULL)
        {
            fprintf(stderr, "Error opening file %s for binary write\n", argv[2]);
        }
        else
        {
            struct header hd = {"ident", 1, 2, 3, 4};
            write_header(&hd, fp);
            fclose(fp);
        }
    }
    return 0;
}
--
Simon.
Nov 13 '05 #3
On Mon, 24 Nov 2003 11:50:02 -0500, Jamie wrote:
I have a file that was written using Java and the file has unicode
strings. What is the best way to deal with these in C? The file
definition reads:

Data Field Description
CHAR[32] File identifier (64 bytes corresponding to Unicode
character
string padded with '0' Unicode characters.
CHAR[16] File format version (32 bytes corresponding to Unicode
character string "x.y.z" where x, y, z are integers
corresponding to major, minor and revision number of the
File format version) padded with '0' Unicode characters.
INTEGER Main file header length [bytes]. ...
The data field definitions are from Java primitives:

CHAR Unicode character. 16-bit Unicode character. INTEGER Signed
integer number. 32-bit two's complement signed integer.

There is absolutely no need for these strings to be in unicode format
and I am at a loss as to how to convert them to a standard C character
array. Moreover, in the example below, I seem to be out in the byte
count as my integer is garbage. Any ideas would be greatly appreciated.


The easiest way to decode these strings is probably with encdec:

http://www.ioplex.com/~miallen/encdec/

See the dec_mbscpy function and use the identifier "JAVA" or maybe
"UTF-16BE".

Keep in mind that unless you use wide character strings throughout your
program, you will be limited to the locale-dependent codepage. On Linux
and some Unix systems you can run in a UTF-8 locale such as
LANG=en_US.UTF-8 to support Unicode; otherwise a Unicode string with
characters that fall outside the range of the locale-dependent encoding
will generate an EILSEQ error. So to properly support Unicode in your
application you'll need to use wchar_t (required if you use Windows, for
example) or a UTF-8 locale on Unix (see setlocale and the encdec tests).
Or you could just declare that the files must encode these strings using
only characters of the locale-dependent encoding (e.g. ISO-8859-1) and
cross your fingers.
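
As a rough illustration of the locale point, using only the standard library rather than encdec: the sketch below selects the environment's locale and converts a wide string while checking for the conversion failure described above. The helper names use_environment_locale and wide_to_locale are invented for this example.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: adopt the locale named by the environment
   (LANG, LC_ALL, ...). Returns 1 on success, 0 if it is unavailable. */
int use_environment_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Hypothetical helper: convert a NUL-terminated wide string to the
   current locale's multibyte encoding. The result is always
   NUL-terminated and is truncated if it does not fit in dst.
   Returns 0 on success, -1 if dst_size is 0 or if some character
   cannot be represented in the current locale (wcstombs then
   returns (size_t)-1 and sets errno to EILSEQ). */
int wide_to_locale(char *dst, size_t dst_size, const wchar_t *src)
{
    size_t n;

    if (dst_size == 0)
        return -1;

    n = wcstombs(dst, src, dst_size - 1);
    if (n == (size_t)-1)
    {
        dst[0] = '\0';
        return -1;
    }
    dst[n] = '\0';
    return 0;
}
```

A program would call use_environment_locale() once at startup. Without that call it runs in the "C" locale, where characters outside the basic set generally make the conversion fail.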

Also, if DataOutputStream was used to encode the strings there may be
a leading integer denoting the number of characters that follow. But it
doesn't sound like that is the case. It sounds like a custom encoding.

Mike
Nov 13 '05 #4
And what is the best strategy for when wchar_t != 2? I'm running linux on
x86 and ppc hardware and find wchar_t = 4. I am looking for a clever way
of defining a two element char (i.e. Java unicode representation) that
isn't too reliant on hardware :(

Thanks,
Jamie

On Wed, 26 Nov 2003, Simon Biber wrote:
This seems to work on my system. It assumes that the multibyte
and wide character system on your C implementation follows the
same standard as the Java unicode file.


Nov 13 '05 #5
"Jamie" <ja*********************@dummy.com> wrote:
And what is the best strategy for when wchar_t != 2? I'm running linux
on x86 and ppc hardware and find wchar_t = 4. I am looking for a clever
way of defining a two element char (i.e. Java unicode representation)
that isn't too reliant on hardware :(


Well, the type 'unsigned short' is probably two bytes on your system.
However the best way to read the values, given that the endianness is
probably different between your x86 and ppc hardware, is probably to
read unsigned chars and load them into wchar_t by shifting the value
like this:

/* define type twobyte as an array of 2 unsigned char */
typedef unsigned char twobyte[2];

wchar_t wid[32];
twobyte tbid[32];
int i;

fread(tbid, sizeof(twobyte), 32, fp);
for(i = 0; i < 32; i++)
{
    /* High byte first (big-endian); swap the indices if the file is
       little-endian. The parentheses matter: + binds tighter than <<. */
    wid[i] = (tbid[i][0] << 8) | tbid[i][1];
}
wcstombs(hd->id, wid, 32);

Or, if you know that there can't be any characters outside the first 256
code points, therefore the high byte of the unicode representation is
always zero, then you could forget about all the wcstombs crap, and just
copy the low byte, either tbid[i][0] or tbid[i][1], into an array of char.
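
A minimal sketch of that low-byte approach follows. It assumes the file stores each character high byte first and pads with U+0000, neither of which the format description confirms. Big-endian order is what java.io.DataOutputStream produces, and it would also explain a garbage integer when the 32-bit length is read directly into an int on x86. The function names are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: copy a fixed-width field of big-endian 16-bit
   characters into a C string. raw holds nchars * 2 bytes and dst must
   have room for nchars + 1 chars. Copying stops at the first U+0000
   padding character; any character above 0xFF becomes '?'. */
void utf16be_field_to_latin1(char *dst, const unsigned char *raw,
                             size_t nchars)
{
    size_t i;

    for (i = 0; i < nchars; i++)
    {
        unsigned int hi = raw[2 * i];
        unsigned int lo = raw[2 * i + 1];

        if (hi == 0 && lo == 0)
            break;
        dst[i] = (hi == 0) ? (char)lo : '?';
    }
    dst[i] = '\0';
}

/* Hypothetical helper: assemble a big-endian 32-bit two's complement
   integer from four bytes, independent of the host's byte order. */
long be32_to_long(const unsigned char b[4])
{
    unsigned long u = ((unsigned long)b[0] << 24)
                    | ((unsigned long)b[1] << 16)
                    | ((unsigned long)b[2] << 8)
                    |  (unsigned long)b[3];

    /* Map values with the sign bit set onto negative numbers without
       relying on implementation-defined conversions. */
    if (u & 0x80000000UL)
        return -(long)(0xFFFFFFFFUL - u) - 1;
    return (long)u;
}
```

With these, the header could be read as 64 raw bytes, 32 raw bytes and 4 raw bytes using fread into unsigned char buffers, then decoded field by field. That avoids any dependence on sizeof(wchar_t), sizeof(int) or the host's endianness.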

--
Simon.
Nov 13 '05 #6
