
C, lexical

Lucas Zimmerman
#1: Nov 15 '05
Is there any Lex code available that describes how to scan C programs?
I'd like to read something related to this. One of my doubts is how C
deals with ambiguities, for example, `a = x/*p;' or `a = x//*...*/-3;'
(considering C99's `//').

thanks in advance,

n.


Irrwahn Grausewitz
#2: Nov 15 '05

re: C, lexical


"Lucas Zimmerman" <netbogus@gmail.com> wrote:
>Is there any Lex code available that describes how to scan C programs?
>I'd like to read something related to this. One of my doubts is how C
>deals with ambiguities, for example, `a = x/*p;' or `a = x//*...*/-3;'
>(considering C99's `//').

Well, it's not C99, but maybe a good starting point:

http://www.lysator.liu.se/c/ANSI-C-grammar-l.html

Best Regards
--
Irrwahn Grausewitz (irrwahn35@freenet.de)
welcome to clc : http://www.ungerhu.com/jxh/clc.welcome.txt
clc faq-list : http://www.faqs.org/faqs/C-faq/faq/
clc frequent answers: http://benpfaff.org/writings/clc.
Lucas Zimmerman
#3: Nov 15 '05

re: C, lexical


Irrwahn Grausewitz wrote:
> "Lucas Zimmerman" <netbogus@gmail.com> wrote:
> >Is there any Lex code available that describes how to scan C programs?
> >I'd like to read something related to this. One of my doubts is how C
> >deals with ambiguities, for example, `a = x/*p;' or `a = x//*...*/-3;'
> >(considering C99's `//').
>
> Well, it's not C99, but maybe a good starting point:
>
> http://www.lysator.liu.se/c/ANSI-C-grammar-l.html
>
> Best Regards

Amazing document! Thanks a lot, Irrwahn.
Interesting how `char x<:N:>;' is valid in C. Is this C99 too?
I'm still learning C after 3 years studying it!! There is always
something new to know about this language.

thanks once again,

n.

Thad Smith
#4: Nov 15 '05

re: C, lexical


Lucas Zimmerman wrote:
> Is there any Lex code available that describes how to scan C programs?
> I'd like to read something related to this. One of my doubts is how C
> deals with ambiguities, for example, `a = x/*p;' or `a = x//*...*/-3;'
> (considering C99's `//').

Those are not ambiguous because C specifies the processing order. The
first example contains the start of a comment. The second example
performs a division in C90 and is the fragment "a = x" in C99, since
C99's `//' starts a comment that runs to the end of the line.

Thad

Lucas Zimmerman
#5: Nov 15 '05

re: C, lexical


Irrwahn Grausewitz wrote:
> "Lucas Zimmerman" <netbogus@gmail.com> wrote:
> >Is there any Lex code available that describes how to scan C programs?
> >I'd like to read something related to this. One of my doubts is how C
> >deals with ambiguities, for example, `a = x/*p;' or `a = x//*...*/-3;'
> >(considering C99's `//').
>
> Well, it's not C99, but maybe a good starting point:
>
> http://www.lysator.liu.se/c/ANSI-C-grammar-l.html
>
> Best Regards

I'm not sure, but I think I found a bug in this code.
....
L?\"(\\.|[^\\"])*\" { count(); return(STRING_LITERAL); }
....

If I'm right, there is one backslash missing, so we would have this:

L?\"(\\.|[^\\\"])*\" { count(); return(STRING_LITERAL); /* right? */ }

instead of the original. It makes sense to me, since '\' is a lex regex
operator.
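
A note on the pattern above, offered as a sketch rather than a definitive answer: inside a lex bracket expression the double quote is not special, and \\ stands for a single backslash, so [^\\"] already means "any character except a backslash or a double quote". Writing [^\\\"] should match exactly the same set, so the original rule does not appear to be missing anything. The hand-written C function below mirrors the pattern branch by branch. The name match_string_literal is hypothetical and is not part of the linked grammar file.

```c
#include <stddef.h>

/*
 * Hand-written equivalent of the lex pattern  L?\"(\\.|[^\\"])*\"
 * Returns the length of the string literal at the start of s,
 * or 0 if s does not begin with a complete literal.
 * match_string_literal is a hypothetical helper for illustration only.
 */
size_t match_string_literal(const char *s)
{
    size_t i = 0;

    if (s[i] == 'L')            /* L?      optional wide-string prefix */
        i++;
    if (s[i] != '"')            /* \"      opening quote */
        return 0;
    i++;

    for (;;) {
        if (s[i] == '\0')       /* input ended before the closing quote */
            return 0;
        if (s[i] == '"')        /* \"      closing quote */
            return i + 1;
        if (s[i] == '\\') {     /* \\.     backslash plus any character */
            /* like the lex dot, refuse a new-line after the backslash */
            if (s[i + 1] == '\0' || s[i + 1] == '\n')
                return 0;
            i += 2;
        } else {                /* [^\\"]  any other character */
            i++;
        }
    }
}
```

A backslash can only be consumed by the two-character branch, so an escaped quote never ends the literal, and the first unescaped quote always does.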

n.

Irrwahn Grausewitz
#6: Nov 15 '05

re: C, lexical


"Lucas Zimmerman" <netbogus@gmail.com> wrote:
<snip>
>Interesting how `char x<:N:>;' is valid in C. Is this c99 too?

Yup, digraphs are still mentioned in the standard, and I do not
expect them to be dropped any time soon.

ISO/IEC 9899:1999 (E) 6.4.6p3:

In all aspects of the language, the six tokens (*)
    <:   :>   <%   %>   %:   %:%:
behave, respectively, the same as the six tokens
    [    ]    {    }    #    ##
except for their spelling.

(*) These tokens are sometimes called "digraphs".

Addition: note that the *trigraphs* are missing from the document I
mentioned upthread.

ISO/IEC 9899:1999 (E) 5.2.1.1p1

All occurrences in a source file of the following sequences of three
characters (called trigraph sequences) are replaced with the
corresponding single character.
    ??=  #        ??)  ]        ??!  |
    ??(  [        ??'  ^        ??>  }
    ??/  \        ??<  {        ??-  ~
No other trigraph sequences exist. Each ? that does not begin one of
the trigraphs listed above is not changed.

Should you ever notice that printf("Huh???/n"); prints Huh? followed
by a new-line, you now know why. :)
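
To make the replacement rule concrete, here is a minimal sketch of the trigraph step in C. The function name replace_trigraphs is hypothetical, and a real compiler does this as part of translation phase 1 rather than on a string. The code deliberately contains no trigraph sequences itself, so it reads the same whether or not the compiler has trigraphs enabled.

```c
#include <stddef.h>

/*
 * Sketch of trigraph replacement (C99 5.2.1.1).
 * replace_trigraphs is a hypothetical helper. It rewrites buf in place;
 * the result is never longer than the input.
 */
void replace_trigraphs(char *buf)
{
    /* Third character of each trigraph and its replacement, pairwise. */
    static const char third[] = "=)!('>/<-";
    static const char repl[]  = "#]|[^}\\{~";
    char *out = buf;

    while (*buf != '\0') {
        if (buf[0] == '?' && buf[1] == '?' && buf[2] != '\0') {
            size_t k;
            for (k = 0; third[k] != '\0'; k++)
                if (buf[2] == third[k])
                    break;
            if (third[k] != '\0') {
                *out++ = repl[k];   /* three characters become one */
                buf += 3;
                continue;
            }
        }
        *out++ = *buf++;    /* a ? that begins no trigraph is not changed */
    }
    *out = '\0';
}
```

Applied to the eight characters H u h ? ? ? / n, the first ? is copied unchanged, the next three characters collapse into a backslash, and the result is Huh? followed by backslash-n, which the compiler then reads as the usual new-line escape. When writing a literal that must keep two adjacent question marks, the \? escape sequence avoids accidental trigraphs.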

Best regards
--
Irrwahn Grausewitz (irrwahn35@freenet.de)
welcome to clc : http://www.ungerhu.com/jxh/clc.welcome.txt
clc faq-list : http://www.faqs.org/faqs/C-faq/faq/
clc frequent answers: http://benpfaff.org/writings/clc.
Simon Biber
#7: Nov 15 '05

re: C, lexical


Lucas Zimmerman wrote:
> Is there any Lex code available that describes how to scan C programs?
> I'd like to read something related to this. One of my doubts is how C
> deals with ambiguities, for example, `a = x/*p;' or `a = x//*...*/-3;'
> (considering C99's `//').

C uses a "greedy" lexer (often called "maximal munch"), i.e. it tries to
make the largest token possible at each point. So, x/*p is always the
start of a comment, not x divided by whatever p points to.

Your second example is equivalent to a = x/ -3; in C89, but equivalent
to a = x (with no semicolon) in C99. One of the stranger ways to tell
the difference at run time is:

[sbiber@eagle c]$ cat version.c
#include <stdio.h>

int main(void)
{
    if(1//**/2
    ) printf("C99\n");
    else printf("C89\n");

    return 0;
}
[sbiber@eagle c]$ c89 version.c && ./a.out
C89
[sbiber@eagle c]$ c99 version.c && ./a.out
C99

Note how the closing parenthesis of the if statement must be on the next
line, so that it is not part of the C99 comment.
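
The difference between the two readings can be sketched with a small comment stripper that takes the language version as a flag. This is only an illustration under stated assumptions: strip_comments is a hypothetical name, and the function ignores string literals, character constants and backslash-newline splicing, all of which a real implementation has to handle.

```c
#include <stddef.h>

/*
 * Sketch of comment removal: each comment is replaced by one space.
 * When c99 is nonzero, // comments are recognised as well.
 * strip_comments is a hypothetical helper; it ignores string literals,
 * character constants and line splicing for brevity.
 * dst must have room for strlen(src) + 1 bytes.
 */
void strip_comments(const char *src, char *dst, int c99)
{
    while (*src != '\0') {
        if (src[0] == '/' && src[1] == '*') {
            src += 2;
            /* skip to the closing star-slash, or to the end of input */
            while (*src != '\0' && !(src[0] == '*' && src[1] == '/'))
                src++;
            if (*src != '\0')
                src += 2;
            *dst++ = ' ';
        } else if (c99 && src[0] == '/' && src[1] == '/') {
            /* runs up to, but does not include, the new-line */
            while (*src != '\0' && *src != '\n')
                src++;
            *dst++ = ' ';
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}
```

With c99 set to 0, the input a = x//*...*/-3; comes out as a = x/ -3; because the first slash is an ordinary character and the second one opens a comment. With c99 set to 1 the same input comes out as a = x followed by a single space. For the test program above, the condition becomes 1/ 2 (integer division, so 0) in C89 and plain 1 in C99, which is why the two branches print different names.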

--
Simon.
Old Wolf
#8: Nov 15 '05

re: C, lexical


Irrwahn Grausewitz wrote:
>
> All occurrences in a source file of the following sequences of three
> characters (called trigraph sequences) are replaced with the
> corresponding single character.
>     ??=  #        ??)  ]        ??!  |
>     ??(  [        ??'  ^        ??>  }
>     ??/  \        ??<  {        ??-  ~
> No other trigraph sequences exist. Each ? that does not begin one
> of the trigraphs listed above is not changed.
>
> Should you ever notice that printf("Huh???/n"); prints Huh?
> followed by a new-line, you now know why. :)

A more insidious example (plagiarized from www.gotw.ca article 86):

#include <stdio.h>

int main(void)
{
    int x = 1;
    int i;
    for( i = 0; i < 100; ++i )
        // What will the next line do? Increment???????????/
        ++x;
    printf("%d\n", x);
}

Charlie Gordon
#9: Nov 15 '05

re: C, lexical


"Lucas Zimmerman" <netbogus@gmail.com> wrote in message
news:1126313198.207926.187920@z14g2000cwz.googlegroups.com...
> Irrwahn Grausewitz wrote:
> > "Lucas Zimmerman" <netbogus@gmail.com> wrote:
> > >Is there any Lex code available that describes how to scan C programs?
> > >I'd like to read something related to this. One of my doubts is how C
> > >deals with ambiguities, for example, `a = x/*p;' or `a = x//*...*/-3;'
> > >(considering C99's `//').
> >
> > Well, it's not C99, but maybe a good starting point:
> >
> > http://www.lysator.liu.se/c/ANSI-C-grammar-l.html
> >
> > Best Regards
>
> Amazing document! Thanks a lot, Irrwahn.
> Interesting how `char x<:N:>;' is valid in C. Is this C99 too?
> I'm still learning C after 3 years studying it!! There is always
> something new to know about this language.

It's been almost 25 years, and I'm still learning as well ;-)

Enjoy!

Chqrlie.


Lucas Zimmerman
#10: Nov 15 '05

re: C, lexical


another question...

I tried to compile the following code with gcc:
------
#include <stdio.h>
@

int main(void) {
    return 0;
}
-------

the output was:
t.c:2: error: syntax error at '@' token

My question then is: why does gcc say `syntax error'? I'm not
sure what is happening here, but I think the lexical analyzer
is passing '@' as a valid token to the parser, and then the parser
says `ok, I'm not expecting a @, so syntax error'.

Am I missing something? I thought lex would be responsible
for giving this error message, since '@' is (AFAIK) not a valid
C token.

thanks a lot in advance once again,

n.

Walter Roberson
#11: Nov 15 '05

re: C, lexical


In article <1126630993.723249.185830@g49g2000cwa.googlegroups.com>,
Lucas Zimmerman <netbogus@gmail.com> wrote:
>I tried to compile the following code with gcc:
>------
>#include <stdio.h>
>@
>
>int main(void) {
>    return 0;
>}
>-------

>the output was:
>t.c:2: error: syntax error at '@' token

>My question then is: why does gcc say `syntax error'?

Why not?

>I'm not
>sure what is happening here, but I think the lexical analyzer
>is passing '@' as a valid token to the parser, and then the parser
>says `ok, I'm not expecting a @, so syntax error'.

>Am I missing something? I thought lex would be responsible
>for giving this error message, since '@' is (AFAIK) not a valid
>C token.

It appears to me that you are assuming that the program 'lex' is
being used to do lexical analysis, and that the result is passed
to gcc. gcc does not, however, use 'lex': it has its own built-in
lexical analyzer as -part- of its processing. gcc doesn't even
have a separate preprocessing program (e.g., "cpp"): it does
everything up to an intermediate code representation in a single
unified program. There might be a bunch of different routines
that that unified program calls upon, but that part is all one
program, so all the error messages are going to appear to be
from the same program.
--
I was very young in those days, but I was also rather dim.
-- Christopher Priest
Keith Thompson
#12: Nov 15 '05

re: C, lexical


roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson) writes:
> In article <1126630993.723249.185830@g49g2000cwa.googlegroups.com>,
> Lucas Zimmerman <netbogus@gmail.com> wrote:
>>I tried to compile the following code with gcc:
>>------
>>#include <stdio.h>
>>@
>>
>>int main(void) {
>>    return 0;
>>}
>>-------
>
>>the output was:
>>t.c:2: error: syntax error at '@' token
>
>>My question then is: why does gcc say `syntax error'?
>
> Why not?
>
>>I'm not
>>sure what is happening here, but I think the lexical analyzer
>>is passing '@' as a valid token to the parser, and then the parser
>>says `ok, I'm not expecting a @, so syntax error'.
>
>>Am I missing something? I thought lex would be responsible
>>for giving this error message, since '@' is (AFAIK) not a valid
>>C token.
>
> It appears to me that you are assuming that the program 'lex' is
> being used to do lexical analysis, and that the result is passed
> to gcc. gcc does not, however, use 'lex': it has its own built-in
> lexical analyzer as -part- of its processing. gcc doesn't even
> have a separate preprocessing program (e.g., "cpp"): it does
> everything up to an intermediate code representation in a single
> unified program. There might be a bunch of different routines
> that that unified program calls upon, but that part is all one
> program, so all the error messages are going to appear to be
> from the same program.

Or perhaps he was using "lex" as an abbreviation of "lexical
analyzer". (In any case, the "lex" program *generates* a lexical
analyzer.)

Some versions of gcc do use a separate preprocessor. For example,
"gcc -v" with version 2.95.2 shows that it invokes "cpp" followed by
"cc1". Later versions just invoke "cc1". (Later phases aren't
invoked if there's a failure in an earlier phase.)

This is off-topic, except that it illustrates that a compiler has a
lot of freedom in how it implements the translation phases described
in section 5.1.1.2 of the standard.

With gcc versions 3.4.4 and 4.0.0, the error message I get is
"error: stray '@' in program".

Also, note that a lone @ character *is* a valid preprocessor token,
though it isn't a valid token. This means that this:

#if 0
@
#endif
int main(void){}

is a legal program, but this:

#if 0
"
#endif
int main(void){}

isn't (it invokes undefined behavior).

The point of all this is that, although the standard defines 8
distinct translation phases, an implementation is not required to
implement them as separate sequential phases. As long as it processes
legal programs correctly and issues diagnostics where required, it can
do whatever it likes.

--
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Walter Roberson
#13: Nov 15 '05

re: C, lexical


In article <lnu0goyccf.fsf@nuthaus.mib.org>,
Keith Thompson <kst-u@mib.org> wrote:
>Also, note that a lone @ character *is* a valid preprocessor token,
>though it isn't a valid token. This means that this:

>#if 0
>@
>#endif
>int main(void){}

>is a legal program,

Keith, I'm not quite sure how you get that? @ is not part of
the basic C character set, so how can its behaviour be well defined?

As the validity of the presence of @ would appear to be an
implementation extension, then that implementation extension could
treat @ as an alias for " for example.
--
Any sufficiently old bug becomes a feature.
Keith Thompson
#14: Nov 15 '05

re: C, lexical


roberson@ibd.nrc-cnrc.gc.ca (Walter Roberson) writes:
> In article <lnu0goyccf.fsf@nuthaus.mib.org>,
> Keith Thompson <kst-u@mib.org> wrote:
>>Also, note that a lone @ character *is* a valid preprocessor token,
>>though it isn't a valid token. This means that this:
>
>>#if 0
>>@
>>#endif
>>int main(void){}
>
>>is a legal program,
>
> Keith, I'm not quite sure how you get that? @ is not part of
> the basic C character set, so how can its behaviour be well defined?
>
> As the validity of the presence of @ would appear to be an
> implementation extension, then that implementation extension could
> treat @ as an alias for " for example.

You're right (at least partly); I didn't think of that.

C99 5.2.1 says that the source character set includes *at least* a
specified set of characters (upper and lower case letters, digits,
space, horizontal tab, vertical tab, form feed, and 29 punctuation
characters, *not* including '@'). But '@' can be, and often is, an
"extended character".

For an implementation that doesn't define '@' as part of the source
character set, any occurrence of @ in a source file invokes undefined
behavior (which, as you say, can include treating it as an alias for ").
But if '@' *is* part of the source character set, then it's a legal
preprocessor token (but not a legal token).
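
As a rough illustration of the character list in C99 5.2.1, the following sketch tests whether a character is one of those the standard guarantees in source files. The name is_basic_source_char is hypothetical, and the function says nothing about which extended characters a particular implementation accepts.

```c
#include <string.h>

/*
 * Returns nonzero if c is in the basic source character set listed in
 * C99 5.2.1p3. is_basic_source_char is a hypothetical helper name.
 * The end-of-line indicator is handled separately by the standard and
 * is not covered here.
 */
int is_basic_source_char(int c)
{
    static const char basic[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789"
        "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~"   /* the 29 graphic characters */
        " \t\v\f";                           /* space, HT, VT, FF */

    /* strchr would also match the terminating null, so exclude it */
    return c != '\0' && strchr(basic, c) != NULL;
}
```

The function returns 0 for '@', and likewise for '$' and the backquote, which are the other printable ASCII characters missing from the list.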

--
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.