
ansi c compiler character encoding

P: n/a
Hi!

Is it guaranteed that a standard C compiler always encodes characters
with the same character encoding? If, for example, the functions Foo and
Bar are compiled by different compilers, is it unambiguous how to
interpret the character string in Bar?

Does string.h expect a specific string format?

void Bar(char *inp);

void Foo(void)
{
    char myTextString[11] = "stuvxyzåäö";
    Bar(myTextString);
}

void Bar(char *inp)
{
    /* What character set to expect? */
}
Aug 18 '08 #1
12 Replies


P: n/a
Andreas Lundgren wrote:
Hi!

Is it determined that the C standard compiler always encode characters
with the same character encoding?
No.
Aug 18 '08 #2

P: n/a
Andreas Lundgren <d9****@efd.lth.se> writes:
Is it determined that the C standard compiler always encode characters
with the same character encoding? If for example the functions Foo and
Bar are compiled by different compilers, is it unambiguous how to
interpret the character string in Bar?

Does string.h expect a specific string format?

void Foo(void)
{
char myTextString[11] = "stuvxyzåäö";
Bar(myTextString);
}

void Bar(char* inp)
{
What character set to expect?
}
No.

But if the two compilers are being used on the same system, it's very
likely that they'll use the same encoding. Since you're calling one
function from the other, presumably you're using the compilers on the
same system and linking the resulting code into a single executable or
equivalent.

Typically a given operating system will impose representations for
certain things. Though this is outside the scope of the C standard,
it's in the best interest of compiler writers to make their generated
code work and play well with that of other compilers. (For example, a
C compiler for Linux that generates code that's incompatible with code
generated by gcc wouldn't be very useful.)

This goes far beyond character set issues and includes things like
integer and floating-point type representations and function calling
conventions.

Your later followup suggests that you're concerned about some
real-world situation, presumably on some specific system. You should
ask in a newsgroup that deals with that system.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Aug 18 '08 #3

P: n/a
Keith Thompson <ks***@mib.org> wrote:
Andreas Lundgren <d9****@efd.lth.se> writes:
Is it determined that the C standard compiler always encode characters
with the same character encoding? If for example the functions Foo and
Bar are compiled by different compilers, is it unambiguous how to
interpret the character string in Bar?

Does string.h expect a specific string format?

void Foo(void)
{
char myTextString[11] = "stuvxyzåäö";
Bar(myTextString);
}

void Bar(char* inp)
{
What character set to expect?
}
No.
But if the two compilers are being used on the same system, it's very
likely that they'll use the same encoding. Since you're calling one
function from the other, presumably you're using the compilers on the
same system and linking the resulting code into a single executable or
equivalent.
Is it actually a question about the compiler at all? As far as
I can see the compiler will happily create a string literal with
whatever there is in the string, not caring a bit about the
encoding of the string. I guess the problem is much more one of
how the source files are generated and the expectations of the
output medium.

Consider the case of using one editor for the first file, set
to output files in e.g. one of the different (and incompatible)
Russian extended ASCII code pages, and the second file generated
with another editor, set to output in a different encoding.
Even if you use the same compiler this should lead to trouble.
And if the terminal that receives the output of the program
is then set to a third encoding it becomes a complete mess ;-)

Regards, Jens
--
\ Jens Thoms Toerring ___ jt@toerring.de
\__________________________ http://toerring.de
Aug 18 '08 #4

P: n/a
Jens Thoms Toerring wrote:
Keith Thompson <ks***@mib.org> wrote:
>[...]
But if the two compilers are being used on the same system, it's very
likely that they'll use the same encoding. Since you're calling one
function from the other, presumably you're using the compilers on the
same system and linking the resulting code into a single executable or
equivalent.

Is it actually a question about the compiler at all? As far as
I can see the compiler will happily create a string literal with
whatever there is in the string, not caring a bit about the
encoding of the string. I guess the problem is much more one of
how the source files are generated and the expectations of the
output medium.
A crucial point here is that the encoding of characters in the
C source files need have nothing to do with the encoding of
characters in the execution environment. The compiler generates
execution-encoded strings from source-encoded string literals, and
the transformation is not necessarily the identity mapping. For
example, consider a compiler that reads ASCII-encoded source and
produces a program for an EBCDIC environment: An X in a source
literal would go into the compiler as the value 88, but produce a
character with the value 231 in the executed program.
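
To make the ASCII-to-EBCDIC example concrete, here is a minimal sketch of the kind of mapping such a cross-compiler would have to apply, restricted to uppercase letters. The function name ascii_upper_to_ebcdic is hypothetical, and the values assume the common EBCDIC layout in which A to I, J to R, and S to Z sit in three separate runs; a real compiler maps the whole character set.

```c
#include <assert.h>

/* Map an uppercase ASCII letter code (65..90) to its EBCDIC code.
   Assumes the usual EBCDIC layout: A..I start at 0xC1, J..R at 0xD1,
   S..Z at 0xE2. It works on numeric codes only, so the result does
   not depend on the host's own execution character set. */
unsigned char ascii_upper_to_ebcdic(unsigned char ascii)
{
    assert(ascii >= 65 && ascii <= 90);            /* 'A'..'Z' in ASCII */

    if (ascii <= 73)                               /* A..I */
        return (unsigned char)(0xC1 + (ascii - 65));
    if (ascii <= 82)                               /* J..R */
        return (unsigned char)(0xD1 + (ascii - 74));
    return (unsigned char)(0xE2 + (ascii - 83));   /* S..Z */
}
```

With this mapping the ASCII code 88 for X becomes 231, the two values quoted above.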

The fact that the source-to-execution mapping might not be
a simple copy is surprising, but it really shouldn't be. There are
plenty of other non-copy steps in the manufacture of an execution
string from a source literal: Escapes (hex, octal, and symbolic)
are translated, adjacent literals are spliced, the quotation marks
vanish, a trailing zero appears out of thin air -- in light of all
the other things that happen to a source character on its way into
the executable program, why should we imagine that the encoding of
an 'X' would be immune to change?

--
Er*********@sun.com
Aug 18 '08 #5

P: n/a
On Aug 18, 7:48 am, Andreas Lundgren <d99...@efd.lth.se> wrote:
Hi!

Is it determined that the C standard compiler always encode characters
with the same character encoding? If for example the functions Foo and
Bar are compiled by different compilers, is it unambiguous how to
interpret the character string in Bar?
No, it does not depend on the compiler...
>
Does string.h expect a specific string format?

void Foo(void)
{
char myTextString[11] = "stuvxyzåäö";
Here, instead of char, try wchar_t and
the related functions if you are using Unicode
for your messages and your .c files.
Bar(myTextString);

}

void Bar(char* inp)
{
What character set to expect?
That depends on the user environment, but if the
user environment is using Unicode, you can expect no
more than an array of bytes; the other case is with
wchar_t and related functions...
>
}
Regards,
DMW
Aug 18 '08 #6

P: n/a
Daniel Molina Wegener wrote:
On Aug 18, 7:48 am, Andreas Lundgren <d99...@efd.lth.se> wrote:
Hi!

Is it determined that the C standard compiler always encode characters
with the same character encoding? If for example the functions Foo and
Bar are compiled by different compilers, is it unambiguous how to
interpret the character string in Bar?

No, it does not depend on the compiler...

Does string.h expect a specific string format?

void Foo(void)
{
char myTextString[11] = "stuvxyzåäö";

Here, instead of char, try wchar_t and
the related functions if you are using Unicode
for your messages and your .c files.
Whether or not wchar_t has anything to do with Unicode depends upon
the compiler; the standard makes no such requirement. When it does,
the way in which you can take advantage of that fact depends upon the
compiler as well.
Aug 18 '08 #7

P: n/a
Daniel Molina Wegener wrote, On 18/08/08 18:29:
On Aug 18, 7:48 am, Andreas Lundgren <d99...@efd.lth.se> wrote:
>Hi!

Is it determined that the C standard compiler always encode characters
with the same character encoding? If for example the functions Foo and
Bar are compiled by different compilers, is it unambiguous how to
interpret the character string in Bar?

No, it does not depend on the compiler...
You are wrong. See the replies others posted before you for details.
>Does string.h expect a specific string format?

void Foo(void)
{
char myTextString[11] = "stuvxyzåäö";

Here, instead of char, try wchar_t and
the related functions if you are using Unicode
for your messages and your .c files.
> Bar(myTextString);

}

void Bar(char* inp)
{
What character set to expect?

That depends on the user environment,
Wrong. It depends on what the function is written to expect and
(assuming the function expects a simple C string, which is likely) on
the encoding the implementation expects.

Actually, the expected encodings for standard C library functions which
handle strings and characters can be changed at run-time using the
setlocale() function, so it could also depend on what the program has
done before calling this function.
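
As a rough illustration of that run-time dependence, the sketch below asks isalpha() about one byte under a named locale and then restores the previous setting. The helper name is_alpha_in_locale is made up for this example; only the "C" locale is guaranteed to exist, and any other locale name is platform-specific and may not be installed.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Return 1 if isalpha() accepts byte c under the named LC_CTYPE locale,
   0 if it does not, or -1 if the locale could not be selected.
   The previous locale is restored before returning. */
int is_alpha_in_locale(const char *locale_name, unsigned char c)
{
    const char *current;
    char *saved;
    int result;

    assert(locale_name != NULL);

    /* The string returned by setlocale() may be overwritten by later
       calls, so keep a private copy of the current setting. */
    current = setlocale(LC_CTYPE, NULL);
    if (current == NULL)
        return -1;
    saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    if (setlocale(LC_CTYPE, locale_name) == NULL) {
        free(saved);
        return -1;
    }
    result = isalpha(c) ? 1 : 0;

    setlocale(LC_CTYPE, saved);   /* put things back as they were */
    free(saved);
    return result;
}
```

In the "C" locale the byte 0xD6 is not a letter, so is_alpha_in_locale("C", 0xD6) gives 0. Under an ISO-8859-1 locale, if one is installed, the same byte would normally be classified as a letter.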
but if the
user environment is using Unicode, you can expect no
more than an array of bytes,
Not necessarily.
other case is with
wchar_t and related functions...
For a start, an array of wchar_t is not simply an array of bytes.
>}
--
Flash Gordon
Aug 18 '08 #8

P: n/a
Many inputs and some disagreement.

A simple example may be the letter Ö that in ASCII is represented by
the number 153, but in ISO-8859-1 and Unicode is represented by the
number 214.

From what I have read, I have to specify to customers that a
specific function takes a city name _encoded in ISO-8859-1_ as
a char pointer. Otherwise 'Göthenborg' stored in ISO-8859-1 encoding
will not match a search for 'Göthenborg' provided in ASCII format.

Best Regards,
Andreas Lundgren
Aug 20 '08 #9

P: n/a
In article <d8**********************************@26g2000hsk.googlegroups.com>,
Andreas Lundgren <d9****@efd.lth.se> wrote:
>A simple example may be the letter Ö that in ASCII is represented by
the number 153
That's not ASCII. It's a Microsoft extension of ASCII called "code
page 437". ASCII has only 128 characters.

-- Richard
--
Please remember to mention me / in tapes you leave behind.
Aug 20 '08 #10

P: n/a
Andreas Lundgren wrote:
Many inputs and some disagreement.

A simple example may be the letter Ö that in ASCII is represented by
the number 153,
There is no letter 'Ö' in ASCII, and the maximum value that an ASCII
character can have is 0177 = 127. 153 is out of range.
but in ISO-8859-1 and Unicode is represented by the
number 214.

From what I have read out, I have to specify to customers that a
specific method has an input of a city name _coded with ISO-8859-1_ in
a char pointer. Otherwise 'Göthenborg' stored in ISO-8859-1 encoding
will not match a search for 'Göthenborg' provided in ASCII format.
If you have two different encodings for the same glyph, for example 214
in ISO-8859-1 and 153 in god-knows-what-but-not-ASCII, then they cannot
compare to be equal. The computer sees values, not the shapes of glyphs
on your output device.
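
A small sketch of that point: strcmp() compares byte values and has no notion of which glyph a byte stands for. The wrapper name same_bytes is only illustrative, and the byte values 214 (0xD6) and 153 (0x99) are the two codes for Ö mentioned in this thread.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if the two strings hold identical byte sequences, else 0.
   strcmp() knows nothing about encodings; it only compares values. */
int same_bytes(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

For instance, same_bytes("\xD6", "\x99") returns 0 even though both bytes display as Ö in their respective code pages, while same_bytes("\xD6", "\xD6") returns 1.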
Aug 20 '08 #11

P: n/a
Andreas Lundgren <d9****@efd.lth.se> writes:
Many inputs and some disagreement.

A simple example may be the letter Ö that in ASCII is represented by
the number 153,
Not ASCII, but your point remains valid.
but in ISO-8859-1 and Unicode is represented by the
number 214.
There is another big issue hidden in that phrase. That character is
indeed 214 in Unicode, but there are at least two well-known ways to
represent U+00D6 (decimal 214) in C: as a wide char, and as a
multi-byte encoded string (UTF-8 being the most commonly used in
Europe and the US).
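
To show how the multi-byte form differs from the single value 214, here is a minimal sketch of a UTF-8 encoder for one code point. The function name utf8_encode and its interface are assumptions made for this example, not part of the standard C library.

```c
#include <assert.h>
#include <stddef.h>

/* Encode a Unicode code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp cannot be encoded
   (a surrogate, or a value above U+10FFFF). */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                       /* 1 byte: plain ASCII range */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)      /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                    /* 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For U+00D6 this produces the two bytes 0xC3 0x96, so a UTF-8 string holding Ö contains neither 214 nor 153 as a single byte.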
From what I have read out, I have to specify to customers that a
specific method has an input of a city name _coded with ISO-8859-1_ in
a char pointer. Otherwise 'Göthenborg' stored in ISO-8859-1 encoding
will not match a search for 'Göthenborg' provided in ASCII format.
You can either mandate one uniform encoding for everything or you can
allow the user to specify the encoding and convert to some suitably
all-embracing encoding internally. You *could* tie the encoding to the
string and convert only as and when you need to, but that will be a
maintenance nightmare.
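
One way the "convert at the boundary" option might look, assuming the internal encoding is UTF-8 and the caller has declared the input to be ISO-8859-1. The function name latin1_to_utf8 and its return convention are hypothetical; a real program would more likely use a conversion library that handles many encodings.

```c
#include <assert.h>
#include <stddef.h>

/* Convert a NUL-terminated ISO-8859-1 string to UTF-8.
   Returns the length of the result (excluding the NUL), or
   (size_t)-1 if out_size is too small to hold it. */
size_t latin1_to_utf8(const char *in, char *out, size_t out_size)
{
    unsigned char *o = (unsigned char *)out;
    size_t n = 0;

    assert(in != NULL && out != NULL);
    if (out_size == 0)
        return (size_t)-1;

    for (; *in != '\0'; ++in) {
        unsigned char c = (unsigned char)*in;
        size_t need = (c < 0x80) ? 1 : 2;

        if (n + need >= out_size)          /* keep room for the NUL */
            return (size_t)-1;
        if (c < 0x80) {
            o[n++] = c;                    /* ASCII passes through */
        } else {
            /* ISO-8859-1 bytes 0x80..0xFF are code points U+0080..U+00FF */
            o[n++] = (unsigned char)(0xC0 | (c >> 6));
            o[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    o[n] = '\0';
    return n;
}
```

Once every incoming string has been normalised this way, the rest of the program can compare and search without caring where the text came from.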

--
Ben.
Aug 20 '08 #12

P: n/a
In article <d8**********************************@26g2000hsk.googlegroups.com> Andreas Lundgren <d9****@efd.lth.se> writes:
A simple example may be the letter Ö that in ASCII is represented by
the number 153, but in ISO-8859-1 and Unicode is represented by the
number 214.
That letter is not represented in ASCII. ASCII contains the code points
0 to 127, no more.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Aug 21 '08 #13
