On Wed, 24 Aug 2005 06:22:15 -0400, "Steven T. Hatton"
<ch********@germania.sup> wrote in comp.lang.c++:
Victor Bazarov wrote:
Steven T. Hatton wrote:
This is just idle curiosity. I was playing with this code from the
Lex && Yacc book [http://www.oreilly.com/catalog/lex/], and
discovered that it does strange things with strings beginning with
numbers. [...]
C is actually older than Lex, but I suspect the techniques used to
scan early C code were similar to what was incorporated into Lex.
Anybody know about this?
C _language_ is off-topic. Please visit comp.lang.c for that.
C _Library_ is on topic since it's part of C++ Library. Just so
there is no misunderstanding.
Then why does C++ have the rule regarding not beginning an identifier with a
digit? I was intending to imply that this was due to the fact that C
already had that rule.
Because it is the one and only way to avoid imposing more complex
requirements on the remaining characters of the identifier.
Leave aside the prohibitions on symbols reserved for the
implementation due to underscores, which is an issue for the linker
and not the parser. The regular expression for a valid C or C++
identifier is:
[_A-Za-z][_A-Za-z0-9]*
If you allow the first character to be a digit, you have the
apparently simpler:
[_A-Za-z0-9]*
...but that expression accepts, for example, '123' as an identifier
rather than as an integer literal, while '123T' truly is an
identifier. So now you need a rule that states that if the identifier
begins with a digit, it must also include at least one non-digit
character.
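
To make the overlap concrete, here is a minimal sketch that runs both
expressions through std::regex. The function names and the bare
"[0-9]+" stand-in for an integer literal are my own assumptions for
illustration (real literals also have suffixes, octal and hex forms);
this is not how any compiler is actually written.

```cpp
#include <cassert>
#include <regex>
#include <string>

// The rule as it stands: letter or underscore first, then letters,
// digits, or underscores.
bool is_identifier_strict(const std::string& s) {
    static const std::regex re("[_A-Za-z][_A-Za-z0-9]*");
    return std::regex_match(s, re);
}

// The hypothetical relaxed rule: any mix of the same characters,
// with no constraint on the first one.
bool is_identifier_relaxed(const std::string& s) {
    static const std::regex re("[_A-Za-z0-9]*");
    return std::regex_match(s, re);
}

// Simplified stand-in for a decimal integer literal (assumption:
// no suffixes, no octal or hex forms).
bool is_decimal_literal(const std::string& s) {
    static const std::regex re("[0-9]+");
    return std::regex_match(s, re);
}

// Hypothetical demo of the cases discussed above.
void demo_ambiguity() {
    // Under the real rule a leading digit rules out an identifier.
    assert(is_identifier_strict("T123"));
    assert(!is_identifier_strict("123T"));
    assert(!is_identifier_strict("123"));

    // Under the relaxed rule '123T' becomes an identifier...
    assert(is_identifier_relaxed("123T"));
    // ...but '123' now matches both token classes at once.
    assert(is_identifier_relaxed("123") && is_decimal_literal("123"));
}
```

With the strict expression no string can match both an identifier and
a literal, so the lexer never has to choose. With the relaxed one,
'123' matches both and something else has to break the tie.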
OK, what happens to '0x7fff'? Oops, that's an identifier, not a
literal with a value of 32767.
So you could make the rule that if the first character is '0', then
either the second character can't be 'x' or 'X', or the remaining
characters must contain at least one character that is neither a
digit nor one of [a-fA-F].
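
As a rough sketch of what those extra rules look like once written
down, the function below classifies a token that starts with a digit
under the hypothetical relaxed grammar. The enum, the function name,
and the decision to ignore suffixes, octal, and floating-point forms
are all assumptions of mine, made to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Possible readings of a digit-led token under the hypothetical rules.
enum class DigitLed { DecimalLiteral, HexLiteral, Identifier, Invalid };

DigitLed classify_digit_led(const std::string& s) {
    auto is_digit = [](char c) {
        return std::isdigit(static_cast<unsigned char>(c)) != 0;
    };
    auto is_hex = [](char c) {
        return std::isxdigit(static_cast<unsigned char>(c)) != 0;
    };
    auto is_word = [](char c) {
        return c == '_' || std::isalnum(static_cast<unsigned char>(c)) != 0;
    };

    // Only tokens that start with a digit are of interest here.
    if (s.empty() || !is_digit(s[0])) return DigitLed::Invalid;
    if (!std::all_of(s.begin(), s.end(), is_word)) return DigitLed::Invalid;

    // Special case 1: all digits means a decimal literal.
    if (std::all_of(s.begin(), s.end(), is_digit))
        return DigitLed::DecimalLiteral;

    // Special case 2: 0x or 0X followed only by hex digits means a
    // hex literal.
    if (s.size() > 2 && s[0] == '0' && (s[1] == 'x' || s[1] == 'X') &&
        std::all_of(s.begin() + 2, s.end(), is_hex))
        return DigitLed::HexLiteral;

    // Everything else that survived is, by elimination, an identifier.
    return DigitLed::Identifier;
}

// Hypothetical demo: one changed character flips the token class.
void demo_digit_led() {
    assert(classify_digit_led("123") == DigitLed::DecimalLiteral);
    assert(classify_digit_led("123T") == DigitLed::Identifier);
    assert(classify_digit_led("0x7fff") == DigitLed::HexLiteral);
    assert(classify_digit_led("0x7ffg") == DigitLed::Identifier);
}
```

Note that the answer for '0x7ffg' is not known until the last
character has been read, and that '0xBEEF' can never be used as a
name. Those are exactly the special cases that would have to be
explained to every programmer.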
So start rewriting the C or C++ grammar for an identifier that handles
all of these cases. Then start writing the text that explains the
limitations at several levels, from books for beginners to the
normative text of the standard itself.
Then start rewriting the preprocessor so it works according to the
rules, with all the special cases. In the first place, you have to
forget about the tradition that you can completely pp tokenize C
source with one character's worth of ungetc() in a single pass.
Either you have to retain a lot more text to back up through, or you
have to supply some sort of state machine that processes the value as
a number and a symbol simultaneously until the disambiguating
character is encountered.
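
For contrast, here is a minimal sketch of the single-pass approach
that the existing rule makes possible. It uses std::istream::peek()
as a stand-in for one character of ungetc(), and it treats any run of
letters, digits, and underscores that starts with a digit as one
number-like token (real pp-numbers also admit '.', signed exponents,
and so on). The names Kind, Token, and tokenize are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

enum class Kind { Identifier, Number, Other };

struct Token {
    Kind kind;
    std::string text;
};

// True for characters that may continue an identifier or number.
static bool is_word_char(int c) {
    return c != std::istringstream::traits_type::eof() &&
           (c == '_' || std::isalnum(c) != 0);
}

// Single pass, never looking more than one character ahead.
std::vector<Token> tokenize(const std::string& source) {
    const int eof = std::istringstream::traits_type::eof();
    std::istringstream in(source);
    std::vector<Token> tokens;

    for (int c = in.peek(); c != eof; c = in.peek()) {
        if (std::isspace(c)) {
            in.get();  // skip whitespace
            continue;
        }
        if (c == '_' || std::isalpha(c) || std::isdigit(c)) {
            // The first character alone settles the token class.
            Kind kind = std::isdigit(c) ? Kind::Number : Kind::Identifier;
            std::string text;
            while (is_word_char(in.peek()))
                text += static_cast<char>(in.get());
            tokens.push_back({kind, text});
        } else {
            // Anything else becomes a one-character token in this sketch.
            tokens.push_back(
                {Kind::Other, std::string(1, static_cast<char>(in.get()))});
        }
    }
    return tokens;
}

// Hypothetical demo of the first-character decision.
void demo_tokenize() {
    std::vector<Token> t = tokenize("count 0x7fff _t1 123T+");
    assert(t.size() == 5);
    assert(t[0].kind == Kind::Identifier && t[0].text == "count");
    assert(t[1].kind == Kind::Number && t[1].text == "0x7fff");
    assert(t[2].kind == Kind::Identifier && t[2].text == "_t1");
    assert(t[3].kind == Kind::Number && t[3].text == "123T");
    assert(t[4].kind == Kind::Other && t[4].text == "+");
}
```

Here '123T' is scanned as a (malformed) number and rejected later,
without any backing up. Under the relaxed rule the scanner could not
commit to a kind at the first character, which is where the extra
buffering or the dual-purpose state machine comes in.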
And of course, write the diagnostics to issue to the programmer when
he accidentally slips up and delivers a numeric literal where an
identifier is required.
Consider that when the C grammar evolved, it was quite common for
assemblers, which systems programmers used a lot more in those days
than they have since C and to some extent C++ became universal, to
have the same restriction.
Now that you've made a time and money estimate of the effort to make it
work, factor in the additional opportunities for programmer error and
put together a business case. Can you make a convincing argument that
the cost of requiring these changes to every C and C++ compiler in the
world (or even just the C++ compilers) is more than compensated by the
added benefit of allowing programmers to create identifiers starting
with a digit? What's the dollar value of this benefit?
--
Jack Klein
Home:
http://JK-Technology.Com
FAQs for
comp.lang.c
http://www.eskimo.com/~scs/C-faq/top.html
comp.lang.c++
http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.contrib.andrew.cmu.edu/~a...FAQ-acllc.html