Why don't identifiers begin with digits?

Steven T. Hatton

This is just idle curiosity. I was playing with this code from the Lex &
Yacc book [http://www.oreilly.com/catalog/lex/], and discovered that it
does strange things with strings beginning with numbers.
$ cat example.l
%{
/*
* this sample demonstrates (very) simple recognition:
* a verb/not a verb.
*/

%}
%%

[\t ]+ /* ignore white space */ ;

is |
am |
are |
were |
was |
be |
being |
been |
do |
does |
did |
will |
would |
should |
can |
could |
has |
have |
had |
go { printf("%s: is a verb\n", yytext); }

[a-zA-Z]+ { printf("%s: is not a verb\n", yytext); }

.|\n { ECHO; /* normal default anyway */ }
%%

int main(void)
{
    yylex(); /* scan stdin until end of input */
    return 0;
}
###############################
$ ls
example.l
$ flex example.l
$ ls
example.l lex.yy.c
$ gcc -o example lex.yy.c -lfl
$ ls
example example.l lex.yy.c
$ ./example
test
test: is not a verb

is
is: is a verb

is123
is: is a verb
123
123is
123is: is a verb
^D
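
The output above is what flex's matching rules predict: the scanner takes the longest match, and when two rules match the same length, the one listed first wins. None of the rules matches digits except the final catch-all, so in `is123` the letters and the digits are handled separately. A minimal C sketch that mimics the three kinds of rule in example.l (it is not the code flex generates, and the name scan_like_flex is made up for illustration):

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* The verb list from example.l. */
static const char *const verbs[] = {
    "is", "am", "are", "were", "was", "be", "being", "been", "do", "does",
    "did", "will", "would", "should", "can", "could", "has", "have", "had",
    "go"
};

static int is_verb(const char *word, size_t len)
{
    for (size_t i = 0; i < sizeof verbs / sizeof verbs[0]; ++i)
        if (strlen(verbs[i]) == len && memcmp(verbs[i], word, len) == 0)
            return 1;
    return 0;
}

/* Writes to out (capacity cap) what the example scanner would print for in. */
void scan_like_flex(const char *in, char *out, size_t cap)
{
    assert(in != NULL && out != NULL && cap > 0);
    size_t used = 0;
    out[0] = '\0';
    while (*in != '\0' && used + 1 < cap) {
        size_t len = 0;
        int n = 0;
        while (isalpha((unsigned char)in[len]))
            ++len;
        if (len > 0) {
            /* Longest run of letters. A verb of exactly this length wins
               the tie with [a-zA-Z]+ because its rule comes first. */
            n = snprintf(out + used, cap - used, "%.*s: is %sa verb\n",
                         (int)len, in, is_verb(in, len) ? "" : "not ");
        } else if (*in == ' ' || *in == '\t') {
            /* [\t ]+ : white space is skipped silently. */
            while (in[len] == ' ' || in[len] == '\t')
                ++len;
        } else {
            /* .|\n : anything else, digits included, is echoed as-is. */
            len = 1;
            n = snprintf(out + used, cap - used, "%c", *in);
        }
        if (n < 0 || (size_t)n >= cap - used)
            break; /* output buffer is full */
        used += (size_t)n;
        in += len;
    }
}
```

With this sketch, `is123\n` yields `is: is a verb\n123\n` and `123is\n` yields `123is: is a verb\n\n`. The extra blank lines in the session come from the echoed newline.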

C is actually older than Lex, but I suspect the techniques used to scan
early C code were similar to what was incorporated into Lex. Anybody know
about this?

--
NOUN:1. Money or property bequeathed to another by will. 2. Something handed
down from an ancestor or a predecessor or from the past: a legacy of
religious freedom. ETYMOLOGY: MidE legacie, office of a deputy, from OF,
from ML legatia, from L legare, to depute, bequeath. www.bartleby.com/61/

Aug 24 '05 #1

Victor Bazarov

Steven T. Hatton wrote:

> This is just idle curiosity. I was playing with this code from the
> Lex & Yacc book [http://www.oreilly.com/catalog/lex/], and
> discovered that it does strange things with strings beginning with
> numbers. [...]
>
> C is actually older than Lex, but I suspect the techniques used to
> scan early C code were similar to what was incorporated into Lex.
> Anybody know about this?

C _language_ is off-topic. Please visit comp.lang.c for that.
C _Library_ is on topic since it's part of C++ Library. Just so
there is no misunderstanding.

Identifiers don't begin with digits because there would be no way
to tell an identifier from a number literal if only digits are
used, I guess. But I only guess. You might also want to consider
a newsgroup for compiler design for further inquiry.

V

Aug 24 '05 #2

Steven T. Hatton

Victor Bazarov wrote:

> Steven T. Hatton wrote:
>> This is just idle curiosity. I was playing with this code from the
>> Lex & Yacc book [http://www.oreilly.com/catalog/lex/], and
>> discovered that it does strange things with strings beginning with
>> numbers. [...]
>>
>> C is actually older than Lex, but I suspect the techniques used to
>> scan early C code were similar to what was incorporated into Lex.
>> Anybody know about this?
>
> C _language_ is off-topic. Please visit comp.lang.c for that.
> C _Library_ is on topic since it's part of C++ Library. Just so
> there is no misunderstanding.

Then why does C++ have the rule regarding not beginning an identifier with a
digit? I was intending to imply that this was due to the fact that C
already had that rule.

I find it interesting that C, Lex, YACC, and C++ were all products of the
same shop, if I understand correctly. Stroustrup actually says he gave up
on using YACC to produce a formal definition of C++. But in that he seems
to blame C.

> Identifiers don't begin with digits because there would be no way
> to tell an identifier from a number literal if only digits are
> used, I guess. But I only guess. You might also want to consider
> a newsgroup for compiler design for further inquiry.

At this point it's not really very significant to me. I was just a bit
curious about the possibility. Such historical tidbits can, however, shed
new light on how a language works, and can also expose potential pitfalls.

Aug 24 '05 #3

Jack Klein

On Wed, 24 Aug 2005 06:22:15 -0400, "Steven T. Hatton"
<ch********@germania.sup> wrote in comp.lang.c++:

> Victor Bazarov wrote:
>> Steven T. Hatton wrote:
>>> This is just idle curiosity. I was playing with this code from the
>>> Lex & Yacc book [http://www.oreilly.com/catalog/lex/], and
>>> discovered that it does strange things with strings beginning with
>>> numbers. [...]
>>>
>>> C is actually older than Lex, but I suspect the techniques used to
>>> scan early C code were similar to what was incorporated into Lex.
>>> Anybody know about this?
>>
>> C _language_ is off-topic. Please visit comp.lang.c for that.
>> C _Library_ is on topic since it's part of C++ Library. Just so
>> there is no misunderstanding.
>
> Then why does C++ have the rule regarding not beginning an identifier with a
> digit? I was intending to imply that this was due to the fact that C
> already had that rule.

Because it is the one and only way to avoid imposing more complex
requirements on the remaining characters of the identifier.

Leave aside the prohibitions on symbols reserved for the
implementation due to underscores, which is an issue for the linker
and not the parser. The regular expression for a valid C or C++
identifier is:

[_A-Za-z][_A-Za-z0-9]*

If you allow the first character to be a digit, you have the
apparently simpler:

[_A-Za-z0-9]*

...but that expression accepts, for example, '123' as an identifier as
opposed to an integer literal, but '123T' is truly an identifier. So
now you need a rule that states that if the identifier begins with a
digit, it must also include at least one non-digit character.
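
A minimal sketch of the two patterns above as C functions, assuming the default "C" locale so that isalpha() and isalnum() cover only the ASCII letters and digits. The names is_identifier and is_relaxed_identifier are made up for illustration, and the relaxed check requires at least one character, which the `*` in the pattern does not:

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* [_A-Za-z] : characters allowed at the start of an identifier. */
static bool is_ident_start(unsigned char c)
{
    return c == '_' || isalpha(c);
}

/* [_A-Za-z0-9] : characters allowed after the first one. */
static bool is_ident_char(unsigned char c)
{
    return c == '_' || isalnum(c);
}

/* Whole-string match against [_A-Za-z][_A-Za-z0-9]* */
bool is_identifier(const char *s)
{
    assert(s != NULL);
    if (!is_ident_start((unsigned char)s[0]))
        return false;
    for (size_t i = 1; s[i] != '\0'; ++i)
        if (!is_ident_char((unsigned char)s[i]))
            return false;
    return true;
}

/* Hypothetical relaxed rule: one or more characters from [_A-Za-z0-9]. */
bool is_relaxed_identifier(const char *s)
{
    assert(s != NULL);
    if (s[0] == '\0')
        return false;
    for (size_t i = 0; s[i] != '\0'; ++i)
        if (!is_ident_char((unsigned char)s[i]))
            return false;
    return true;
}
```

Here is_identifier("123") and is_identifier("123T") are both false, while is_relaxed_identifier() accepts both, which is exactly the ambiguity with integer literals described above.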

OK, what happens to '0x7fff'? Oops, that's an identifier, not a
literal with a value of 32767.

So you could make the rule that if the first character is '0', then
either the second character can't be 'x' or 'X', or the remaining
characters must contain at least one character that is neither a
digit nor one of [a-fA-F].
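
To see what those extra rules would cost, here is a rough C sketch of a classifier for the hypothetical relaxed grammar. The enum and the function name classify_relaxed are invented for this example, and it deliberately ignores octal validity, suffixes such as L or U, and floating-point literals, each of which would need further special cases:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum token_kind { TOK_INVALID, TOK_INTEGER, TOK_IDENTIFIER };

/* Classifies a complete token under the hypothetical rule that an
   identifier may begin with a digit. */
enum token_kind classify_relaxed(const char *tok)
{
    assert(tok != NULL);
    if (tok[0] == '\0')
        return TOK_INVALID;

    /* A "0x" or "0X" prefix followed by something switches to hex digits. */
    size_t start = 0;
    int hex = 0;
    if (tok[0] == '0' && (tok[1] == 'x' || tok[1] == 'X') && tok[2] != '\0') {
        hex = 1;
        start = 2;
    }

    int all_numeric = 1;
    for (size_t i = 0; tok[i] != '\0'; ++i) {
        unsigned char c = (unsigned char)tok[i];
        if (c != '_' && !isalnum(c))
            return TOK_INVALID;
        /* One character that is not a (hex) digit makes it an identifier. */
        if (i >= start && !(hex ? isxdigit(c) : isdigit(c)))
            all_numeric = 0;
    }
    return all_numeric ? TOK_INTEGER : TOK_IDENTIFIER;
}
```

Under these assumptions '0x7fff' is an integer but '0x7ffg' is an identifier, so the decision cannot be made until the last character has been seen.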

So start rewriting the C or C++ grammar for an identifier that handles
all of these cases. Then start writing the text that explains the
limitations at several levels, from books for beginners to the
normative text of the standard itself.

Then start rewriting the preprocessor so it works according to the
rules, with all the special cases. In the first place, you have to
forget about the tradition that you can completely pp tokenize
C source with one character's worth of ungetc() in a single pass.
Either you have to retain a lot more text to back up through, or you
have to supply some sort of state machine that processes the value as
a number and a symbol simultaneously until the disambiguating
character is encountered.
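
For comparison, this is roughly what the one-character-of-ungetc() tradition looks like under the existing rule, where the first character alone decides the kind of token. It is a simplified sketch, not a real preprocessor: the name next_lexeme and the enum are made up, and a number here is just a digit followed by letters, digits and underscores, loosely in the spirit of the standard's pp-number but without '.', exponent signs or anything else:

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

enum lexeme { LEX_EOF, LEX_NUMBER, LEX_IDENT, LEX_OTHER };

/* Reads the next lexeme from in into buf (capacity cap, NUL-terminated).
   Never looks more than one character past the end of the lexeme. */
enum lexeme next_lexeme(FILE *in, char *buf, size_t cap)
{
    assert(in != NULL && buf != NULL && cap >= 2);
    int c;

    while ((c = getc(in)) != EOF && isspace(c))
        ; /* skip white space */
    if (c == EOF) {
        buf[0] = '\0';
        return LEX_EOF;
    }

    /* The first character settles the question once and for all. */
    enum lexeme kind = isdigit(c) ? LEX_NUMBER
                     : (c == '_' || isalpha(c)) ? LEX_IDENT
                     : LEX_OTHER;
    size_t n = 0;
    buf[n++] = (char)c;

    if (kind != LEX_OTHER) {
        while ((c = getc(in)) != EOF && (c == '_' || isalnum(c)))
            if (n + 1 < cap)
                buf[n++] = (char)c; /* characters that do not fit are dropped */
        if (c != EOF)
            ungetc(c, in); /* the single character of pushback */
    }
    buf[n] = '\0';
    return kind;
}
```

Fed the text `is123 123is`, this returns an identifier `is123` and then a number-like lexeme `123is`, whose spelling a later stage would reject. With identifiers allowed to start with a digit, the `isdigit(c)` test on the first character could no longer pick the branch.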

And of course, write the diagnostics to issue to the programmer when
he accidentally slips up and delivers a numeric literal where an
identifier is required.

Consider that when the C grammar evolved, it was quite common for
assemblers to have the same restriction, and in those days systems
programmers used assemblers a lot more than they have since C, and to
some extent C++, became universal.

Now that you've made a time and money estimate of the effort to make it
work, factor in the additional opportunities for programmer error and
put together a business case. Can you make a convincing argument that
the cost of requiring these changes to every C and C++ compiler in the
world (or even just the C++ compilers) is more than compensated by the
added benefit of allowing programmers to create identifiers starting
with a digit? What's the dollar value of this benefit?

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://www.eskimo.com/~scs/C-faq/top.html
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.contrib.andrew.cmu.edu/~a...FAQ-acllc.html

Aug 25 '05 #4

Julián Albo

Jack Klein wrote:

> put together a business case. Can you make a convincing argument that
> the cost of requiring these changes to every C and C++ compiler in the
> world (or even just the C++ compilers) is more than compensated by the
> added benefit of allowing programmers to create identifiers starting
> with a digit? What's the dollar value of this benefit?

The Obfuscated C contests will probably be more fun.

--
Salu2

Aug 25 '05 #5

Steven T. Hatton

Jack Klein wrote:

> Now that you've made a time and money estimate of the effort to make it
> work, factor in the additional opportunities for programmer error and
> put together a business case. Can you make a convincing argument that
> the cost of requiring these changes to every C and C++ compiler in the
> world (or even just the C++ compilers) is more than compensated by the
> added benefit of allowing programmers to create identifiers starting
> with a digit? What's the dollar value of this benefit?

Hmmmm.... I guess that's kind of what I said in my first post to this
thread. Just not in so many words, and I was specifically talking about
lex (well, flex, to be exact), not the general idea of regular expressions.
Thanks for the partial confirmation.

Aug 25 '05 #6
