'hello world' OS

Santanu Chatterjee

Hello all,

I would like to know how an OS makes a computer boot up.
For that, as a start I would like to see an e_ample (read the
underscore as the letter before y, as the keyboard here is
defective) C program (along with instructions on how to compile
and use it) that will take over when the computer boots and
print "hello world" and reboots the machine when 'enter' is
pressed. This will be something of a "hello world" OS (for me)
(I have tried to go through the boot.S (not sure about the name)
program in the Linu_ kernel source, but I could not understand
it since I am not familiar with assembly language).

I would be glad if someone could please point me in the right
direction.

Regards,
Santanu

Nov 14 '05 #1


Case

Santanu Chatterjee wrote:

Hello all,

I would like to know how an OS makes a computer boot up.
For that, as a start I would like to see an e_ample (read the
underscore as the letter before y, as the keyboard here is
defective) C program (along with instructions on how to compile
and use it) that will take over when the computer boots and
print "hello world" and reboots the machine when 'enter' is
pressed. This will be something of a "hello world" OS (for me)
(I have tried to go through the boot.S (not sure about the name)
program in the Linu_ kernel source, but I could not understand
it since I am not familiar with assembly language).

I would be glad if someone could please point me in the right
direction.

Here's a program that waits for an 'x'. Using any other input
for a program like this is dangerous and not standard C.

#include <stdio.h>

int main(void)
{
    printf("hello world\n");
    while (getchar() != 'y' - 1)
    {
    }
}

HTH

Case

Nov 14 '05 #2

Richard Bos

sa*****@softhome.net (Santanu Chatterjee) wrote:

I would like to know how an OS makes a computer boot up.

It doesn't. The ROM bootstrap loader does. _How_ it does this is
completely system-dependent, and therefore you should ask about it in a
newsgroup dedicated to the architecture you are interested in. There is
no C program which can demonstrate this for more than a single
architecture, and for most you will need low-level system calls, since
the higher level functions used by ISO C are dependent on the very OS
you want to replace.

Richard

Nov 14 '05 #3

Richard Bos

Case <no@no.no> wrote:

Santanu Chatterjee wrote:
I would like to know how an OS makes a computer boot up.

Here's a program that waits for an 'x'. Using any other input
for a program like this is dangerous and not standard C.

#include <stdio.h>

int main(void)
{
printf("hello world\n");
while (getchar() != 'y' - 1)
{
}
}

Beautiful. Not only has it nothing whatsoever to do with the question,
it is even wrong. (Hint: what makes you think the ISO C Standard
mandates ASCII?)

Richard

Nov 14 '05 #4

Joona I Palaste

Richard Bos <rl*@hoekstra-uitgeverij.nl> scribbled the following:

Case <no@no.no> wrote:
Santanu Chatterjee wrote:
> I would like to know how an OS makes a computer boot up.

Here's a program that waits for an 'x'. Using any other input
for a program like this is dangerous and not standard C.
#include <stdio.h>

int main(void)
{
printf("hello world\n");
while (getchar() != 'y' - 1)
{
}
}

Beautiful. Not only has it nothing whatsoever to do with the question,
it is even wrong. (Hint: what makes you think the ISO C Standard
mandates ASCII?)

I figure there is no portable way whatsoever to guarantee an integer
value corresponds to the character 'x' without actually using the
character constant
'x'
either by itself, or as part of an array or string, at some point in
the C source code.
For characters corresponding to digits from 0 to 9 it can be done, but
I don't think it can be done for any other characters.
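
The digit case can be sketched as follows. C guarantees that the values of '0' through '9' are consecutive and increasing, so plain arithmetic on '0' reaches every other digit without naming an encoding. The function names digit_value and digit_char are illustrative and do not come from the thread.

```c
/* Illustrative helpers (names are assumptions, not from the thread). */

/* Return 0..9 for a decimal digit character, or -1 otherwise.
   Relies only on the guarantee that '0'..'9' are contiguous. */
int digit_value(int c)
{
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}

/* Return the character for a value 0..9, or -1 if out of range. */
int digit_char(int n)
{
    return (n >= 0 && n <= 9) ? '0' + n : -1;
}
```

No comparable guarantee exists for letters, which is why the same trick cannot be relied on for 'x'.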

--
/-- Joona Palaste (pa*****@cc.helsinki.fi) ------------- Finland --------\
\-- http://www.helsinki.fi/~palaste --------------------- rules! --------/
"I will never display my bum in public again."
- Homer Simpson

Nov 14 '05 #5

Jeremy Yallop

Joona I Palaste wrote:

I figure there is no portable way whatsoever to guarantee an integer
value corresponds to the character 'x' without actually using the
character constant
'x'
either by itself, or as part of an array or string, at some point in
the C source code.

#include <string.h>
#include <ctype.h>

int is_x(int c) /* C locale */
{
    return islower((unsigned char)c)
        && strchr("abcdefghijklmnopqrstuvwyz", c) == NULL;
}

Jeremy.

Nov 14 '05 #6

Dan Pop

In <59*************************@posting.google.com> sa*****@softhome.net (Santanu Chatterjee) writes:

I would like to know how an OS makes a computer boot up.

This is beyond the capabilities of any OS. At least the first stages
of the booting procedure are handled by programs that are not OS-specific.
These programs are, however, heavily platform-specific.

For that, as a start I would like to see an e_ample (read the
underscore as the letter before y, as the keyboard here is
defective) C program (along with instructions on how to compile
and use it) that will take over when the computer boots and
print "hello world" and reboots the machine when 'enter' is
pressed.

There is no way to write such a program in portable C. Because there is
no OS, such a program would have to be written as a freestanding
application, i.e. all its output must be generated by its own means,
with no standard library support. And, with no standard library
support, there is no portable way of generating any output.
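
To make the freestanding point concrete, here is a minimal sketch of what "generating output by its own means" can look like. It assumes a text-mode display that is a memory-mapped array of 16-bit cells (character in the low byte, attribute in the high byte), as on a PC in VGA text mode, where that array conventionally starts at physical address 0xB8000. The function name put_string, the attribute value, and the cell layout are assumptions for illustration; none of this is portable, and the code that hands control to such a function still has to be written for the specific platform.

```c
#include <stddef.h>   /* size_t: available even in freestanding implementations */
#include <stdint.h>   /* uint16_t: optional exact-width type, present on typical targets */

/* Hypothetical helper: write a NUL-terminated string into a text-mode
   cell buffer. `cells` is the display memory, `ncells` its capacity.
   Returns the number of cells written. */
size_t put_string(volatile uint16_t *cells, size_t ncells, const char *s)
{
    const uint16_t attr = 0x07;   /* assumed attribute: light grey on black */
    size_t i = 0;

    while (s[i] != '\0' && i < ncells) {
        /* High byte: attribute. Low byte: character code. */
        cells[i] = (uint16_t)((attr << 8) | (unsigned char)s[i]);
        i++;
    }
    return i;
}
```

On a PC target the call might be put_string((volatile uint16_t *)0xB8000, 80 * 25, "hello world"); on a hosted system the same function can be exercised against an ordinary array.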

Furthermore, such a program may have to do things that cannot be done in
C at all, like setting various CPU registers to appropriate values.

This will be something of a "hello world" OS (for me)
(I have tried to go through the boot.S (not sure about the name)
program in the Linu_ kernel source, but I could not understand
it since I am not familiar with assembly language).

I would be glad if someone could please point me in the right
direction.

The right direction is to start learning assembly programming. It cannot
be bypassed when programming at this level, even if it's merely asm
statements embedded in C code.

And once you learn assembly, you'll discover that you have no need for C
at all for implementing the program you have in mind.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de

Nov 14 '05 #7

Thomas Matthews

Santanu Chatterjee wrote:

Hello all,

I would like to know how an OS makes a computer boot up.

A better place to discuss your issue is in news:comp.os.*.
Another place is news:comp.arch.embedded.

For that, as a start I would like to see an e_ample (read the
underscore as the letter before y, as the keyboard here is
defective) C program (along with instructions on how to compile
and use it) that will take over when the computer boots and
print "hello world" and reboots the machine when 'enter' is
pressed. This will be something of a "hello world" OS (for me)
(I have tried to go through the boot.S (not sure about the name)
program in the Linu_ kernel source, but I could not understand
it since I am not familiar with assembly language).

The boot sequence depends on both the platform and the
operating system.

A common sequence is:
1. Turn off all interrupts
2. Perform diagnostics (e.g. memory testing, device testing)
3. Initialize interrupt and other vectors.
4. Initialize memory structure (including stacks).
5. Initialize 'C' run-time library.
6. Jump to "main" function in C program.

Platforms with more complex operating systems would have different
sequences (and more complicated ones). Much of the boot code is
written in assembly language. The run-time environment for the
high level language must be initialized before a high-level
language can be executed.
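
Step 5 is the part that can be sketched in C. Before main runs, startup code typically copies initialized data from its load location into RAM and zeroes the region that holds static objects without an initializer. The sketch below takes the region boundaries as parameters; in a real system they would come from linker-defined symbols whose names vary by toolchain. The function name init_runtime is hypothetical, and the loops are written by hand because no library is assumed to be usable yet.

```c
#include <stddef.h>

/* Hypothetical startup helper: prepare static storage before main().
   data_dst/data_src/data_len describe the initialized-data region;
   bss/bss_len describe the zero-initialized region. */
void init_runtime(unsigned char *data_dst, const unsigned char *data_src,
                  size_t data_len, unsigned char *bss, size_t bss_len)
{
    size_t i;

    /* Copy initial values (e.g. from ROM) to their run-time location. */
    for (i = 0; i < data_len; i++)
        data_dst[i] = data_src[i];

    /* Static objects without an initializer must start out as zero. */
    for (i = 0; i < bss_len; i++)
        bss[i] = 0;
}
```

Steps 1 to 4, and the jump that reaches this code in the first place, are the parts that usually need assembly.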

I would be glad if someone could please point me in the right
direction.

Regards,
Santanu

On many platforms, your executable program is loaded into memory
and executed by the operating system. Your program has no idea
how long the platform has been operational before your program
is executed.
--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
http://www.josuttis.com -- C++ STL Library book

Nov 14 '05 #8

Case

Richard Bos wrote:

Case <no@no.no> wrote:

Santanu Chatterjee wrote:
I would like to know how an OS makes a computer boot up.

Here's a program that waits for an 'x'. Using any other input
for a program like this is dangerous and not standard C.
#include <stdio.h>

int main(void)
{
printf("hello world\n");
while (getchar() != 'y' - 1)
{
}
}

Beautiful. Not only has it nothing whatsoever to do with the question,
it is even wrong. (Hint: what makes you think the ISO C Standard
mandates ASCII?)

The OP is unable to type an 'x' that's why I used his
suggestion to get there. Thanks for pointing out that
it has nothing to do with the question, I did not know
that. And, on his OS, Linu_, ASCII is quite common, if
I'm right; but could you please confirm. BTW, do you
know a real world character set in which 'x' != 'y' - 1?

Case

Nov 14 '05 #9

Joona I Palaste

Jeremy Yallop <je****@jdyallop.freeserve.co.uk> scribbled the following:

Joona I Palaste wrote:
I figure there is no portable way whatsoever to guarantee an integer
value corresponds to the character 'x' without actually using the
character constant
'x'
either by itself, or as part of an array or string, at some point in
the C source code.

#include <string.h>
#include <ctype.h>

int is_x(int c) /* C locale */
{
    return islower((unsigned char)c)
        && strchr("abcdefghijklmnopqrstuvwyz", c) == NULL;
}

Quite clever. AFAIK, however, this can be only done for one character in
one C program. If we were trying to use such functions to identify *two*
characters, the best we could achieve would be knowing whether the input
is *either* of them, but we couldn't know *which*. If we used those
alphabet strings missing one letter, both letters would show up in each
other's alphabet strings. But if we used only one, missing two letters,
there would be no way to tell which one of them the input was.

--
/-- Joona Palaste (pa*****@cc.helsinki.fi) ------------- Finland --------\
\-- http://www.helsinki.fi/~palaste --------------------- rules! --------/
"I said 'play as you've never played before', not 'play as IF you've never
played before'!"
- Andy Capp

Nov 14 '05 #10

Richard Bos

Case <no@no.no> wrote:

Richard Bos wrote:
Case <no@no.no> wrote:
Santanu Chatterjee wrote:

I would like to know how an OS makes a computer boot up.
Here's a program that waits for an 'x'. Using any other input
for a program like this is dangerous and not standard C.
Beautiful. Not only has it nothing whatsoever to do with the question,
it is even wrong. (Hint: what makes you think the ISO C Standard
mandates ASCII?)

The OP is unable to type an 'x' that's why I used his
suggestion to get there.

Then first of all, you should learn to snip, because you made it look as
though you replied to that whole post; and second, how would a program
that _waits_ for an 'x' help someone who cannot _type_ an 'x'?

And, on his OS, Linu_, ASCII is quite common, if I'm right;

So bloody what? This is comp.lang.c, not comp.lang.c.linux.

BTW, do you know a real world character set in which 'x' != 'y' - 1?

No, but I do know one in which not all letters are consecutive.

Richard

Nov 14 '05 #11

Dan Pop

In <cb**********@oravannahka.helsinki.fi> Joona I Palaste <pa*****@cc.helsinki.fi> writes:

I figure there is no portable way whatsoever to guarantee an integer
value corresponds to the character 'x' without actually using the
character constant
'x'
either by itself, or as part of an array or string, at some point in
the C source code.

I was about to say that '\U0078' would do in C99, but it appears to be
a constraint violation: you can't use UCNs for the members of the basic
source character set. Only the ASCII characters that aren't part of the
basic source character set can be represented with UCNs: $, @ and `.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de

Nov 14 '05 #12

Dan Pop

In <40***********************@news.xs4all.nl> Case <no@no.no> writes:

Santanu Chatterjee wrote:
Hello all,

I would like to know how an OS makes a computer boot up.
For that, as a start I would like to see an e_ample (read the
underscore as the letter before y, as the keyboard here is
defective) C program (along with instructions on how to compile
and use it) that will take over when the computer boots and
print "hello world" and reboots the machine when 'enter' is
pressed. This will be something of a "hello world" OS (for me)
(I have tried to go through the boot.S (not sure about the name)
program in the Linu_ kernel source, but I could not understand
it since I am not familiar with assembly language).

I would be glad if someone could please point me in the right
direction.

Here's a program that waits for an 'x'. Using any other input
for a program like this is dangerous and not standard C.

#include <stdio.h>

int main(void)
{
printf("hello world\n");
while (getchar() != 'y' - 1)
{
}
}

Now explain how is it supposed to work on a freestanding platform, where:

1. The name and interface of the startup function is
implementation-defined.

2. Neither printf nor getchar are available (they typically rely on the
existence of an OS, but there is none in our case).

3. #include <stdio.h> may stop the compilation process with a message like

test.c:1:20: stdio.h: No such file or directory

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de

Nov 14 '05 #13

Dan Pop

In <40****************@news.individual.net> rl*@hoekstra-uitgeverij.nl (Richard Bos) writes:

Case <no@no.no> wrote:
Santanu Chatterjee wrote:
> I would like to know how an OS makes a computer boot up.

Here's a program that waits for an 'x'. Using any other input
for a program like this is dangerous and not standard C.
#include <stdio.h>

int main(void)
{
printf("hello world\n");
while (getchar() != 'y' - 1)
{
}
}

Beautiful. Not only has it nothing whatsoever to do with the question,
it is even wrong. (Hint: what makes you think the ISO C Standard
mandates ASCII?)

It doesn't have to. EBCDIC satisfies the poster's assumption, as well.
Good luck finding a conforming hosted implementation whose execution
character set is not based on (i.e. an extension of) either ASCII
or EBCDIC.
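
For reference, the relevant code points can be laid side by side. The values below are the usual ASCII and EBCDIC (code page 037) encodings of four lowercase letters, hard-coded as data rather than derived from the compiler's own character set. They show that 'x' and 'y' are adjacent in both, while 'i' and 'j' are adjacent only in ASCII. The struct and function names are illustrative.

```c
/* Code points hard-coded from the ASCII and EBCDIC (code page 037) tables. */
struct letter_code {
    char letter;
    unsigned char ascii;
    unsigned char ebcdic;
};

static const struct letter_code codes[] = {
    { 'i', 0x69, 0x89 },
    { 'j', 0x6A, 0x91 },
    { 'x', 0x78, 0xA7 },
    { 'y', 0x79, 0xA8 },
};

/* Nonzero if codes[b] directly follows codes[a] in ASCII. */
int adjacent_in_ascii(int a, int b)
{
    return codes[b].ascii == codes[a].ascii + 1;
}

/* Nonzero if codes[b] directly follows codes[a] in EBCDIC. */
int adjacent_in_ebcdic(int a, int b)
{
    return codes[b].ebcdic == codes[a].ebcdic + 1;
}
```

With indexes 0 and 1 for 'i' and 'j', and 2 and 3 for 'x' and 'y', only the EBCDIC check on the first pair fails, which is the gap between 'i' and 'j' that makes general letter arithmetic non-portable.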

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de

Nov 14 '05 #14

Joona I Palaste

Dan Pop <Da*****@cern.ch> scribbled the following:

In <cb**********@oravannahka.helsinki.fi> Joona I Palaste <pa*****@cc.helsinki.fi> writes:
I figure there is no portable way whatsoever to guarantee an integer
value corresponds to the character 'x' without actually using the
character constant
'x'
either by itself, or as part of an array or string, at some point in
the C source code.

I was about to say that '\U0078' would do in C99, but it appears to be
a constraint violation: you can't use UCNs for the members of the basic
source character set. Only the ASCII characters that aren't part of the
basic source character set can be represented with UCNs: $, @ and `.

Now this question is perhaps off-topic for comp.lang.c, but I don't
understand *why* you can't use UCNs for members of the basic character
set. What is the rationale behind this constraint?

--
/-- Joona Palaste (pa*****@cc.helsinki.fi) ------------- Finland --------\
\-- http://www.helsinki.fi/~palaste --------------------- rules! --------/
"The question of copying music from the Internet is like a two-barreled sword."
- Finnish rap artist Ezkimo

Nov 14 '05 #15

Jeremy Yallop

Joona I Palaste wrote:

Jeremy Yallop <je****@jdyallop.freeserve.co.uk> scribbled the following:
Joona I Palaste wrote:
I figure there is no portable way whatsoever to guarantee an
integer value corresponds to the character 'x' without actually
using the character constant 'x' either by itself, or as part of
an array or string, at some point in the C source code.

#include <string.h>
#include <ctype.h>

int is_x(int c) /* C locale */
{
return islower((unsigned char)c)
&& strchr("abcdefghijklmnopqrstuvwyz", c) == NULL;
}

Quite clever. AFAIK, however, this can be only done for one character in
one C program. If we were trying to use such functions to identify *two*
characters, the best we could achieve would be knowing whether the input
is *either* of them, but we couldn't know *which*.

Okay, here's another way, which doesn't have that restriction:

if (c == tolower('X'))
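
Wrapped into a complete function, the idea looks like this. The source spells only the uppercase constant 'X', and a second letter can be handled the same way with its own uppercase constant, so the two stay distinguishable. This is a sketch assuming the "C" locale, where tolower maps each of the 26 uppercase Latin letters to its lowercase counterpart. The function name is_lower_x is illustrative.

```c
#include <ctype.h>

/* Nonzero if c is the lowercase letter whose uppercase form is 'X'.
   Assumes the "C" locale; the function name is illustrative. */
int is_lower_x(int c)
{
    return c == tolower('X');
}
```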

Jeremy.

Nov 14 '05 #16

Case -

Dan Pop wrote:

In <40***********************@news.xs4all.nl> Case <no@no.no> writes:

Santanu Chatterjee wrote:
Hello all,

I would like to know how an OS makes a computer boot up.
For that, as a start I would like to see an e_ample (read the
underscore as the letter before y, as the keyboard here is
defective) C program (along with instructions on how to compile
and use it) that will take over when the computer boots and
print "hello world" and reboots the machine when 'enter' is
pressed. This will be something of a "hello world" OS (for me)
(I have tried to go through the boot.S (not sure about the name)
program in the Linu_ kernel source, but I could not understand
it since I am not familiar with assembly language).

I would be glad if someone could please point me in the right
direction.

Here's a program that waits for an 'x'. Using any other input
for a program like this is dangerous and not standard C.

Huh, what can he mean here? And, the OP is not even able to type
an 'x'. The question is, at least a bit, off-topic too. Sorry
guys, and especially a sorry to the OP; I'll try not to do it again.

#include <stdio.h>

int main(void)
{
printf("hello world\n");
while (getchar() != 'y' - 1)
{
}
}

Now explain how is it supposed to work on a freestanding platform, where:

1. The name and interface of the startup function is
implementation-defined.

2. Neither printf nor getchar are available (they typically rely on the
existence of an OS, but there is none in our case).

3. #include <stdio.h> may stop the compilation process with a message like

test.c:1:20: stdio.h: No such file or directory

This point 3. has a causality problem. If you're able to
run a C compiler, a C compiler that is able to print, then
I guess stdio.h will be close enough.

Case

Nov 14 '05 #17

Kenneth Brody

Jeremy Yallop wrote:

Joona I Palaste wrote:
I figure there is no portable way whatsoever to guarantee an integer
value corresponds to the character 'x' without actually using the
character constant
'x'
either by itself, or as part of an array or string, at some point in
the C source code.

#include <string.h>
#include <ctype.h>

int is_x(int c) /* C locale */
{
return islower((unsigned char)c)
&& strchr("abcdefghijklmnopqrstuvwyz", c) == NULL;
}

What happens if the user enters a lowercase accented letter, such as 'á'
(which may or may not show up on your system properly, but is an accented
'a' here)?

--
+-------------------------+--------------------+-----------------------------+
| Kenneth J. Brody | www.hvcomputer.com | |
| kenbrody at spamcop.net | www.fptech.com | #include <std_disclaimer.h> |
+-------------------------+--------------------+-----------------------------+

Nov 14 '05 #18

Dan Pop

In <40**********************@dreader2.news.tiscali.nl> Case - <no@no.no> writes:

Dan Pop wrote:
In <40***********************@news.xs4all.nl> Case <no@no.no> writes:

Santanu Chatterjee wrote:

Hello all,

I would like to know how an OS makes a computer boot up.
For that, as a start I would like to see an e_ample (read the
underscore as the letter before y, as the keyboard here is
defective) C program (along with instructions on how to compile
and use it) that will take over when the computer boots and
print "hello world" and reboots the machine when 'enter' is
pressed. This will be something of a "hello world" OS (for me)
(I have tried to go through the boot.S (not sure about the name)
program in the Linu_ kernel source, but I could not understand
it since I am not familiar with assembly language).

I would be glad if someone could please point me in the right
direction.

Here's a program that waits for an 'x'. Using any other input
for a program like this is dangerous and not standard C.
Huh, what can he mean here? And, the OP is not even able to type
an 'x'. The question is, at least a bit, off-topic too. Sorry

More than a bit. It's downright off-topic.

guys, and especially a sorry to the OP; I'll try not to do it again.

#include <stdio.h>

int main(void)
{
printf("hello world\n");
while (getchar() != 'y' - 1)
{
}
}

Now explain how is it supposed to work on a freestanding platform, where:

1. The name and interface of the startup function is
implementation-defined.

2. Neither printf nor getchar are available (they typically rely on the
existence of an OS, but there is none in our case).

3. #include <stdio.h> may stop the compilation process with a message like

test.c:1:20: stdio.h: No such file or directory

This point 3. has a causality problem. If you're able to
run a C compiler, a C compiler that is able to print, then
I guess stdio.h will be close enough.

I'm afraid I can't make any sense out of your incoherent statement.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de

Nov 14 '05 #19

Dan Pop

In <cb**********@oravannahka.helsinki.fi> Joona I Palaste <pa*****@cc.helsinki.fi> writes:

Now this question is perhaps off-topic for comp.lang.c, but I don't
understand *why* you can't use UCNs for members of the basic character
set. What is the rationale behind this constraint?

I have no clue. Try the C99 rationale or ask in comp.std.c. The relevant
chapter and verse is:

6.4.3 Universal character names
....
Constraints

2 A universal character name shall not specify a character whose
short identifier is less than 00A0 other than 0024 ($), 0040 (@),
or 0060 (`), nor one in the range D800 through DFFF inclusive. 61)

____________________

61) The disallowed characters are the characters in the basic
character set and the code positions reserved by ISO/IEC 10646
for control characters, the character DELETE, and the S-zone
(reserved for use by UTF-16).

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de

Nov 14 '05 #20

Jeremy Yallop

Kenneth Brody wrote:

Jeremy Yallop wrote:

Joona I Palaste wrote:
> I figure there is no portable way whatsoever to guarantee an integer
> value corresponds to the character 'x' without actually using the
> character constant
> 'x'
> either by itself, or as part of an array or string, at some point in
> the C source code.

#include <string.h>
#include <ctype.h>

int is_x(int c) /* C locale */
{
return islower((unsigned char)c)
&& strchr("abcdefghijklmnopqrstuvwyz", c) == NULL;
}

What happens if the user enters a lowercase accented letter, such as 'á'
(which may or may not show up on your system properly, but is an accented
'a' here)?

In the "C" locale, islower() only returns true for the 26 lowercase
letters of the Latin alphabet, so is_x() will return false for 'á'.
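
A small check along these lines can confirm the behaviour on a given implementation. It is a sketch that assumes 'á' is represented by the single byte 0xE1, as in ISO 8859-1; the function name is illustrative. Note that it switches the whole program to the "C" locale as a side effect.

```c
#include <ctype.h>
#include <locale.h>

/* Nonzero if byte value c counts as lowercase in the "C" locale.
   Side effect: sets the program's locale to "C". */
int is_lower_in_c_locale(unsigned char c)
{
    setlocale(LC_ALL, "C");   /* already the default at startup; set explicitly here */
    return islower(c) != 0;
}
```

Calling is_lower_in_c_locale(0xE1) should return 0, while is_lower_in_c_locale('a') should return 1.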

Jeremy.

Nov 14 '05 #21

Arthur J. O'Dwyer

On Wed, 30 Jun 2004, Joona I Palaste wrote:

Dan Pop <Da*****@cern.ch> scribbled the following:
I was about to say that '\U0078' would do in C99, but it appears to be
a constraint violation: you can't use UCNs for the members of the basic
source character set. Only the ASCII characters that aren't part of the
basic source character set can be represented with UCNs: $, @ and `.

Now this question is perhaps off-topic for comp.lang.c, but I don't
understand *why* you can't use UCNs for members of the basic character
set. What is the rationale behind this constraint?

I'm not an authority, but I assume the reason is so that implementations
that don't support extended character sets don't have to implement
anything special to parse UCNs (which I think are new in C99?).

Alternatively, it could be a B&D approach to clarity and portability:
if the only way to write 'x' is to actually use the letter 'x', and not
to use arbitrarily complicated arithmetic, then the maintainer has one
less problem to worry about when porting to an EBCDIC system. ;-)

-Arthur

Nov 14 '05 #22

Dan Pop

In <Pi**********************************@unix41.andrew.cmu.edu> "Arthur J. O'Dwyer" <aj*@nospam.andrew.cmu.edu> writes:

On Wed, 30 Jun 2004, Joona I Palaste wrote:

Dan Pop <Da*****@cern.ch> scribbled the following:
> I was about to say that '\U0078' would do in C99, but it appears to be
> a constraint violation: you can't use UCNs for the members of the basic
> source character set. Only the ASCII characters that aren't part of the
> basic source character set can be represented with UCNs: $, @ and `.

Now this question is perhaps off-topic for comp.lang.c, but I don't
understand *why* you can't use UCNs for members of the basic character
set. What is the rationale behind this constraint?

I'm not an authority, but I assume the reason is so that implementations
that don't support extended character sets don't have to implement
anything special to parse UCNs (which I think are new in C99?).

Wrong. UCN support is mandatory:

6.4.2 Identifiers

6.4.2.1 General

Syntax

1 identifier:
identifier-nondigit
identifier identifier-nondigit
identifier digit

identifier-nondigit:
nondigit
universal-character-name
other implementation-defined characters

nondigit: one of
_ a b c d e f g h i j k l m
n o p q r s t u v w x y z
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z

digit: one of
0 1 2 3 4 5 6 7 8 9

An interesting consequence is that f$ is not a valid identifier, but
it becomes a valid identifier if the $ sign is replaced by its UCN:
f\u0024 !

Alternatively, it could be a B&D approach to clarity and portability:
if the only way to write 'x' is to actually use the letter 'x', and not
to use arbitrarily complicated arithmetic, then the maintainer has one
less problem to worry about when porting to an EBCDIC system. ;-)

Wrong again: UCNs have nothing to do with ASCII vs EBCDIC issues.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de

Nov 14 '05 #23

Keith Thompson

Da*****@cern.ch (Dan Pop) writes:
[...]

An interesting consequence is that f$ is not a valid identifier, but
it becomes a valid identifier if the $ sign is replaced by its UCN:
f\u0024 !

I don't think so.

C99 6.4.2.1p3 says:

Each universal character name in an identifier shall designate a
character whose encoding in ISO/IEC 10646 falls into one of the
ranges specified in annex D.

The encoding of '$', 0024, is not within one of the ranges specified
in annex D.

Interestingly, the "shall" in 6.4.2.1p3 is not in a constraint, so
using f\u0024 as an identifier invokes undefined behavior (it doesn't
violate a syntax rule either). I wonder if that was the intent. It
seems to me that it would make more sense for it to be a constraint
violation, requiring a diagnostic. If I'm not mistaken, a conforming
implementation could simply ignore annex D and allow any arbitrary
UCNs in identifiers. (That doesn't make f\u0024 a valid identifier,
it just means the implementation isn't required to diagnose it.)

Another possible oversight: the same paragraph also says

The initial character shall not be a universal character name
designating a digit.

but there's no specification in annex D of which UCNs specify digits.
Presumably ISO/IEC 10646 covers that, but it would be useful to spell
it out in the C standard, perhaps in a footnote.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Nov 14 '05 #24

Arthur J. O'Dwyer

On Thu, 1 Jul 2004, Keith Thompson wrote:

Da*****@cern.ch (Dan Pop) writes:
[...]
An interesting consequence is that f$ is not a valid identifier, but
it becomes a valid identifier if the $ sign is replaced by its UCN:
f\u0024 !

I don't think so.

[...] The encoding of '$', 0024, is not within one of the ranges specified
in annex D.

Interestingly, the "shall" in 6.4.2.1p3 is not in a constraint, so
using f\u0024 as an identifier invokes undefined behavior (it doesn't
violate a syntax rule either). I wonder if that was the intent. It
seems to me that it would make more sense for it to be a constraint
violation, requiring a diagnostic. If I'm not mistaken, a conforming
implementation could simply ignore annex D and allow any arbitrary
UCNs in identifiers. (That doesn't make f\u0024 a valid identifier,
it just means the implementation isn't required to diagnose it.)

I was wrong about implementations' being allowed to not-support UCNs
(all conforming implementations must, I think). But the passage to
which you're referring does seem to support the general conclusion that
UCNs were added grudgingly: there are a lot of other places where
dubious use of UCNs leads to UB rather than a constraint violation
(a couple of places in the preprocessing stages, for example). I
think this is because maybe the Committee realized that nobody was
going to build in full "Unicode"[1] support just for the benefit of
anal-retentive users.
(Non-USAnians may have a better idea, but I'm under the impression that
\u4E00 looks like "backslash, letter u, 4, E, 0, 0" in all major IDEs, so
there's no good reason to use UCNs in C code except inside string literals
anyway. It doesn't let you "write code in your own language" or
anything.)

Another possible oversight: the same paragraph also says

The initial character shall not be a universal character name
designating a digit.

but there's no specification in annex D of which UCNs specify digits.
Presumably ISO/IEC 10646 covers that, but it would be useful to spell
it out in the C standard, perhaps in a footnote.

I thought one of the sections in Annex D was labeled "Extended Digits"
or something like that?

-Arthur

Nov 14 '05 #25

Keith Thompson

"Arthur J. O'Dwyer" <aj*@nospam.andrew.cmu.edu> writes:

On Thu, 1 Jul 2004, Keith Thompson wrote:
[...]
I was wrong about implementations' being allowed to not-support UCNs
(all conforming implementations must, I think). But the passage to
which you're referring does seem to support the general conclusion that
UCNs were added grudgingly: there are a lot of other places where
dubious use of UCNs leads to UB rather than a constraint violation
(a couple of places in the preprocessing stages, for example). I
think this is because maybe the Committee realized that nobody was
going to build in full "Unicode"[1] support just for the benefit of
anal-retentive users.
(Non-USAnians may have a better idea, but I'm under the impression that
\u4E00 looks like "backslash, letter u, 4, E, 0, 0" in all major IDEs, so
there's no good reason to use UCNs in C code except inside string literals
anyway. It doesn't let you "write code in your own language" or
anything.)

Presumably the intent is to allow programmers to use native characters
in identifiers; nobody is expected to write "\u4E00".

In translation phase 1:

Physical source file multibyte characters are mapped, in an
implementation-defined manner, to the source character set ...

I think the sequence "\u4E00" is normally expected to occur only after
translation phase 1; in the actual source file, it should look like
the corresponding Asian ideograph. As the rationale says:

Given the current state of multibyte encodings, this mapping is
specified to be implementation-defined; but an implementation can
provide the users with utility programs that do the conversion
from UCNs to "native" multibytes or vice versa, thus providing a
way to exchange source files between implementations using the UCN
notation.

UCNs are similar to trigraphs, but they seem to work in the opposite
direction. Phase 1 maps trigraphs to their legible single-character
equivalents, but it (optionally?) maps legible native characters to
their illegible UCN equivalents. Trigraphs are intended to be used in
human-readable source code (believe it or not); UCNs are not.

Of course UCNs can be used in source code if the programmer is
sufficiently masochistic; in that case, phase 1 presumably will pass
them through unchanged.

It's quite possible that I've misunderstood this. None of the
characters that require UCNs to represent them appear on my keyboard,
so I don't have much experience with this kind of thing. Corrections
are welcome.

Another possible oversight: the same paragraph also says

The initial character shall not be a universal character name
designating a digit.

but there's no specification in annex D of which UCNs specify digits.
Presumably ISO/IEC 10646 covers that, but it would be useful to spell
it out in the C standard, perhaps in a footnote.

I thought one of the sections in Annex D was labeled "Extended Digits"
or something like that?

You're right. Annex D is two pages long; the last two sections at the
bottom of the second page are "Digits" and "Special characters".
(There's no other mention of "special characters", so I suppose they
can be used in identifiers as if they were letters.)

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Nov 14 '05 #26

Dan Pop

In <ln************@nuthaus.mib.org> Keith Thompson <ks***@mib.org> writes:

Da*****@cern.ch (Dan Pop) writes:
[...]
An interesting consequence is that f$ is not a valid identifier, but
it becomes a valid identifier if the $ sign is replaced by its UCN:
f\u0024 !

I don't think so.

C99 6.4.2.1p3 says:

Each universal character name in an identifier shall designate a
character whose encoding in ISO/IEC 10646 falls into one of the
ranges specified in annex D.

The encoding of '$', 0024, is not within one of the ranges specified
in annex D.

Good point! So, \u0024 can appear only in character constants and
string literals, as expected.
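
As a sketch of the permitted use, the UCN for the dollar sign can be written inside a string literal. This assumes a compiler with C99 UCN support and an execution character set in which that character is the ordinary single-byte '$'; the function name is illustrative.

```c
/* Return a string literal spelled with the UCN for DOLLAR SIGN (U+0024),
   one of the three characters below 00A0 that C99 6.4.3 allows. */
const char *dollar_via_ucn(void)
{
    return "\u0024";
}
```

Under those assumptions the returned string is the same as "$".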

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de

Nov 14 '05 #27
