preprocessor tokenization whitespace?

Walter Roberson

I have run into a peculiarity with SGI's C compiler (7.3.1.2m). I have been
reading carefully over the ANSI X3.159-1989 specification, but I cannot
seem to find a justification for the behaviour. Could someone point
me to the appropriate section, or else confirm the behaviour as a bug?

For a particular project, I am using the C preprocessor phase only.
I am not using the standalone program 'cpp' because proper functioning
of my project depends upon being able to splice preprocessor tokens,
which is not supported in the standalone 'cpp'.

I am having the compiler stop after preprocessing by using SGI's C -P
option:

-P Runs only the preprocessor and puts the result for each source
file in a corresponding .i file. The .i file has no inline
directives in it.

It should be noted that my source is *not* C code -- I am using the
preprocessor to generate data files based upon templates.

The point I am having trouble with can be illustrated fairly simply,
by running these lines through the preprocessing phase:

#define eye L@@K
I eye

$ cpp -P look.c
I L@@K

That's with the standalone cpp program, and is the output I expect. But,

$ cc -P look.c
$ cat look.i
I L@ @K

And

$ cat look2.c
I L@@K
$ cc -P look2.c
$ cat look2.i
I L@@K

In short, certain combinations of symbols, when macro-replaced into
source, get separated by single space characters. Not every combination
is so treated: -~ and ~$ are left alone, for example. It is not
operator based, as it happens especially for ` and @ and $ .

The work around I have found is:

$ cat look3.c
#define eye L@##@K
I eye
$ cc -P look3.c
$ cat look3.i
I L@@K

The closest I have found to the whitespace-introducing behaviour is
the ANSI description of translation phases, 2.1.1.2, for phase 3:

3. The source file is decomposed into preprocessing tokens and
sequences of white-space characters (including comments). A source
file shall not end in a partial preprocessing token or comment.
Each comment is replaced by one space character. New-line characters
are retained. Whether each nonempty sequence of white-space characters
other than new-line is retained or replaced by one space character
is implementation-defined.

Okay, so there's implementation-defined behaviour for a *nonempty* sequence
of white-space characters, but L@@K has only the -empty- sequence
between the two @.

I see nothing in the discussion of macro replacement that would
lead to spaces being introduced {other than the behaviour of # in
function-like macro replacements.}

The only excuse I can think of is that as ` and @ and $ are not
C operators, that outside of character strings and character literals
they are perhaps not considered to be valid preprocessor tokens,
in which case the behaviour would become undefined ?
--
I've been working on a kernel
All the livelong night.
I've been working on a kernel
And it still won't work quite right. -- J. Benson & J. Doll

Nov 14 '05 #1
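
For readers unfamiliar with the ## operator used in the workaround above, here is a minimal sketch in C. The macro names GLUE, STR and STR_ and the two functions are illustrative assumptions, not anything from the original post. The sketch pastes identifier fragments, which yields a valid preprocessing token. As far as I can tell, the standard leaves the behaviour undefined when the result of ## is not a valid preprocessing token, so the L@##@K workaround probably depends on what this particular compiler happens to do.

```c
/* Two-level stringification: STR expands its argument, STR_ then quotes it. */
#define STR_(x) #x
#define STR(x) STR_(x)

/* Hypothetical helper macro: splice two fragments into a single token. */
#define GLUE(a, b) a##b

/* GLUE(LOO, K) is spliced into the one identifier LOOK before quoting. */
static const char *glued_spelling(void)
{
    return STR(GLUE(LOO, K));
}

/* Without ##, the fragments remain two separate tokens. */
static const char *unglued_spelling(void)
{
    return STR(LOO K);
}
```

Calling the two functions from a small main and printing the results is enough to compare a spliced token with two adjacent ones.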

Alex Fraser

"Walter Roberson" <ro******@ibd.nrc-cnrc.gc.ca> wrote in message
news:ct**********@canopus.cc.umanitoba.ca...
[snip: using a C preprocessor on non-C files gives unexpected results]

> I see nothing in the discussion of macro replacement that would
> lead to spaces being introduced {other than the behaviour of # in
> function-like macro replacements.}
>
> The only excuse I can think of is that as ` and @ and $ are not
> C operators, that outside of character strings and character literals
> they are perhaps not considered to be valid preprocessor tokens,
> in which case the behaviour would become undefined ?

Sounds extremely likely; I'm not going to check. The solution is to use
something other than a C preprocessor, eg m4.

Alex

Nov 14 '05 #2

CBFalconer

Walter Roberson wrote:

.... snip ...

> It should be noted that my source is *not* C code -- I am using
> the preprocessor to generate data files based upon templates.
>
> The point I am having trouble with can be illustrated fairly
> simply, by running these lines through the preprocessing phase:

.... snip ...

And thus is off-topic for c.l.c. You need to find a group that
deals with your particular compiler. F'ups set.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson

Nov 14 '05 #3

Eric Sosman

Walter Roberson wrote:

> I have run into a peculiarity with SGI's C compiler (7.3.1.2m). I have been
> reading carefully over the ANSI X3.159-1989 specification, but I cannot
> seem to find a justification for the behaviour. Could someone point
> me to the appropriate section, or else confirm the behaviour as a bug?

[preprocessor output isn't as expected]

No bug, as far as the C Standard is concerned. Check
with SGI to see whether it's a bug from their perspective.

First problem: The Standard doesn't promise that the
preprocessor will produce any kind of output at all; as far
as the Standard is concerned the preprocessor is merely
"translation phase 4." (You don't expect access to the
output of phase 2 or phase 5; what's special about 4?) If
phase 4 produces any incidental output, the Standard doesn't
specify what it should look like.

Second problem: The Standard describes what a translator
(of which the preprocessor is a part) must do with C source
code, but the only requirement on what it does with non-C is
that some kinds of aberrations require a diagnostic. You're
trying to (ab?)use the preprocessor as a general-purpose
macro machine, which is a bit like driving nails with a
crescent wrench: You may be able to do it, sort of, but if
things don't work out it's not the wrench's fault.

Third problem: By the time phase 4 operates most of the
source text of the program has disappeared. Phases 1 through 3
transform the source into "preprocessing tokens" and white
space which phase 4 then shuffles around; phase 4 manipulates
tokens, not text. (The distinction is usually blurred, but its
effects can be seen here and there: consider the non-recursive
nature of macro expansion, for example.) The consequence is that
if phase 4 produces output what it must actually do is generate
a textual approximation of the internal token sequence. There
was a thread some time ago involving C source that meant one
thing if fed to a translator but something entirely different
if preprocessed first and then fed into the translator (alas,
I can't recall the details; perhaps you can find the thread on
Google). Sometimes the preprocessor cannot turn hamburger back
into cow.

It seems to me you're (mostly) running afoul of the first
two issues, with the third a looming but distant threat. What
to do? Well, it seems that your C implementation (like many)
allows you to run phases 1-4 separately from the rest of the
translator, and when you do so you get the output you want;
it's only when you run the entire translator (with a special
switch) that the output is unsatisfactory. Well then, why
don't you just use the variant that happens to give what you
want? Alternatively, use a full-fledged macro processor (m4
is often mentioned; I've never used it myself) instead of
trying to get the C translator to do something it wasn't really
designed for.

--
Eric Sosman
es*****@acm-dot-org.invalid

Nov 14 '05 #4
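
The non-recursive nature of macro expansion mentioned above can be seen in a small C sketch. The names counter, STR, STR_ and expanded_once are made up for illustration. The point is only that the macro's own name, once produced by its expansion, is left alone as a token instead of being substituted again as text would be.

```c
/* Two-level stringification so the argument is macro-expanded before quoting. */
#define STR_(x) #x
#define STR(x) STR_(x)

/* The macro's own name appears in its replacement list... */
#define counter counter + 1

/* ...but is not expanded a second time, so the result is "counter + 1". */
static const char *expanded_once(void)
{
    return STR(counter);
}
```

A purely textual search-and-replace would have to decide when to stop; the token-based rule stops after one replacement of the name.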

Lawrence Kirby

On Thu, 03 Feb 2005 12:00:25 +0000, Walter Roberson wrote:

> I have run into a peculiarity with SGI's C compiler (7.3.1.2m). I have been
> reading carefully over the ANSI X3.159-1989 specification, but I cannot
> seem to find a justification for the behaviour. Could someone point
> me to the appropriate section, or else confirm the behaviour as a bug?
>
> For a particular project, I am using the C preprocessor phase only.

The C standard does not define the output of the preprocessor as a text
stream. It is not possible to validate such a text stream for correctness
against the standard. Such a text output is a *representation* of a
sequence of tokens and white-space. Since there is no formatting
specification for this representation, different compilers can and do
produce different output.

> I am not using the standalone program 'cpp' because proper functioning
> of my project depends upon being able to splice preprocessor tokens,
> which is not supported in the standalone 'cpp'.
>
> I am having the compiler stop after preprocessing by using SGI's C -P
> option:
>
> -P Runs only the preprocessor and puts the result for each source
> file in a corresponding .i file. The .i file has no inline
> directives in it.
>
> It should be noted that my source is *not* C code -- I am using the
> preprocessor to generate data files based upon templates.

That's the basic problem: the C preprocessor isn't a general macro
language; it is specifically for C and can make assumptions based on
knowledge of the language.

> The point I am having trouble with can be illustrated fairly simply, by
> running these lines through the preprocessing phase:
>
> #define eye L@@K
> I eye
>
> $ cpp -P look.c
> I L@@K
>
> That's with the standalone cpp program, and is the output I expect. But,
>
> $ cc -P look.c
> $ cat look.i
> I L@ @K
>
> And
>
> $ cat look2.c
> I L@@K
> $ cc -P look2.c
> $ cat look2.i
> I L@@K
>
> In short, certain combinations of symbols, when macro-replaced into
> source, get separated by single space characters. Not every combination
> is so treated: -~ and ~$ are left alone, for example. It is not operator
> based, as it happens especially for ` and @ and $ .
>
> The work around I have found is:
>
> $ cat look3.c
> #define eye L@##@K
> I eye
> $ cc -P look3.c
> $ cat look3.i
> I L@@K

It looks like in some cases the preprocessor inserts spaces where it
considers it would otherwise be unclear where the token boundaries are. It
is quite reasonable for it to do this; indeed it may have to if the
compiler is capable of taking this text output and completing the
compilation process on it.

> The closest I have found to the whitespace-introducing behaviour is the
> ANSI description of translation phases, 2.1.1.2, for phase 3:
>
> 3. The source file is decomposed into preprocessing tokens and
> sequences of white-space characters (including comments). A source
> file shall not end in a partial preprocessing token or comment. Each
> comment is replaced by one space character. New-line characters are
> retained. Whether each nonempty sequence of white-space characters
> other than new-line is retained or replaced by one space character is
> implementation-defined.

The output of the "preprocessor" is the input to translation phase 7 which
says:

"White-space characters separating tokens are no longer significant"

So adding white-space between tokens is not a problem for the translation
process.

> Okay, so there's implementation-defined behaviour for a *nonempty*
> sequence of white-space characters, but L@@K has only the -empty-
> sequence between the two @.
>
> I see nothing in the discussion of macro replacement that would lead to
> spaces being introduced {other than the behaviour of # in function-like
> macro replacements.}

I grant you that the behaviour difference between L@@K being part of a
macro replacement and not is odd, but there's nothing that you can say is
wrong with the output in either case.

> The only excuse I can think of is that as ` and @ and $ are not C
> operators, that outside of character strings and character literals they

They aren't even required to exist in the character set which makes their
use, even in character constants and string literals, non-portable.

> are perhaps not considered to be valid preprocessor tokens, in which
> case the behaviour would become undefined ?

In the grammar for a preprocessing-token there is

preprocessing-token:
...
each non-white-space character that cannot be one of the above

They would cause a constraint violation when the pp-token is converted to
a token in translation phase 7.

Lawrence

Nov 14 '05 #5
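
The remark above about unclear token boundaries can be illustrated with valid C. This is a minimal sketch; the macro MINUS and the function negate_twice are hypothetical names chosen for the example. After macro replacement the compiler sees two separate '-' tokens, so any textual rendering of that token sequence has to keep them apart (for instance as "- -x"), or the text would read back as the decrement operator.

```c
/* Hypothetical macro that expands to a single '-' token. */
#define MINUS -

/* After expansion the token sequence is  -  -  x,  i.e. -(-x).
   Gluing the text together as "--x" would instead decrement x. */
static int negate_twice(int x)
{
    return -MINUS x;
}
```

Here negate_twice(7) yields 7, which is only true because the two minus signs remain distinct tokens.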

Walter Roberson

In article <42***************@yahoo.com>,
CBFalconer <cb********@worldnet.att.net> wrote:
:Walter Roberson wrote:

:> It should be noted that my source is *not* C code -- I am using
:> the preprocessor to generate data files based upon templates.

:> The point I am having trouble with can be illustrated fairly
:> simply, by running these lines through the preprocessing phase:

:And thus is off-topic for c.l.c. You need to find a group that
:deals with your particular compiler. F'ups set.

Was the question not one pertaining to the details of translation
phases in ANSI C? I pointed to particular clauses in the standard,
acknowledged that I did not know them thoroughly, and asked for
assistance from those who understand them better; I even included
simple ways to reproduce the behaviour. Why then was c.l.c
not a suitable place to have asked?

If, hypothetically, this were comp.dcom.ethernet and I were to ask a
question that involved the detailed specifications of Cat5e wiring,
which I was [e.g.] interested in using to transmit a digital signal
that did not happen to meet the ethernet frame format, then would you
have said "Wrong newsgroup, you will have to find one that deals with
the manufacturer of your particular brand of cable!", even though the
question was squarely one about what digital signal propagation
characteristics one could expect with -any- cable rated as Cat5e?

Nov 14 '05 #6

Walter Roberson

In article <xu********************@comcast.com>,
Eric Sosman <es*****@acm-dot-org.invalid> wrote:
: First problem: The Standard doesn't promise that the
:preprocessor will produce any kind of output at all;

That's a good point.

: Second problem: The Standard describes what a translator
:(of which the preprocessor is a part) must do with C source
:code, but the only requirement on what it does with non-C is
:that some kinds of aberrations require a diagnostic.

Hmmm, I think I would have to disagree with that point. The
standard describes very particular steps about what is required,
legal or invalid when the preprocessor is used. The standard makes
it clear that semantic analysis does not occur until phase 7,
so by the end of phase 4, the internal representation of the
source must not have undergone any changes that are dependent
upon the semantics of C, other than the precisely defined changes
about splicing lines together, replacement of comments with a single
blank, detection of character literals and string boundaries, and
so on as set out in phases 1-4.

: You're
:trying to (ab?)use the preprocessor as a general-purpose
:macro machine, which is a bit like driving nails with a
:crescent wrench: You may be able to do it, sort of, but if
:things don't work out it's not the wrench's fault.

A closer analogy, I would say, would be trying to use a
Robertson screw driver with a Phillips screw in a situation that
depended upon the details of the physics of Phillips screws.

: Third problem: By the time phase 4 operates most of the
:source text of the program has disappeared. Phases 1 through 3
:transform the source into "preprocessing tokens" and white
:space which phase 4 then shuffles around; phase 4 manipulates
:tokens, not text.

True in one respect, but not true in another: ANSI goes to
a lot of trouble to detail that certain preprocessor operations
involve not the token itself but the "spelling" of the token,
so the preprocessor must carry around the original [whitespace-
squished] text even if (as is likely) it creates an internal
data structure that ascribes some kind of meaning to the text
sequences it is carrying around.

: It seems to me you're (mostly) running afoul of the first
:two issues, with the third a looming but distant threat. What
:to do? Well, it seems that your C implementation (like many)
:allows you to run phases 1-4 separately from the rest of the
:translator, and when you do so you get the output you want;
:it's only when you run the entire translator (with a special
:switch) that the output is unsatisfactory.

Unfortunately not; 'cpp' is the K&R preprocessor, a distinct
standalone program that cannot do the transformations
I need [I actively use the ANSI ## preprocessor token-splicing operator.]

:Alternatively, use a full-fledged macro processor (m4
:is often mentioned; I've never used it myself) instead of
:trying to get the C translator to do something it wasn't really
:designed for.

The details of the ANSI C preprocessor are incorporated by
reference into the standards for some other languages, so it
is fair and meaningful to ask about the details even if one is
not compiling C code.

I appreciate your comments; they are good points to think about
even if I happen to split hairs a slightly different way than you.

Nov 14 '05 #7
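
The point about the preprocessor having to carry each token's spelling can be seen with the # operator. The following is a small sketch, with SPELLING and the two functions being hypothetical names for the example. The string produced reflects how the argument was written, with each run of white space between tokens reduced to one space, and not any value the tokens might later have.

```c
/* Hypothetical macro: # turns the argument's spelling into a string literal. */
#define SPELLING(x) #x

/* The constant keeps its written form "0x10"; it is not turned into "16". */
static const char *hex_spelling(void)
{
    return SPELLING(0x10);
}

/* Runs of white space between tokens collapse to one space: "a + b". */
static const char *spaced_spelling(void)
{
    return SPELLING(a   +   b);
}
```

So the preprocessor works on tokens, but each token still remembers the characters it was made from.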

Keith Thompson

ro******@mts.net (Walter Roberson) writes:

> In article <xu********************@comcast.com>,
> Eric Sosman <es*****@acm-dot-org.invalid> wrote:
> : First problem: The Standard doesn't promise that the
> :preprocessor will produce any kind of output at all;
>
> That's a good point.
>
> : Second problem: The Standard describes what a translator
> :(of which the preprocessor is a part) must do with C source
> :code, but the only requirement on what it does with non-C is
> :that some kinds of aberrations require a diagnostic.
>
> Hmmm, I think I would have to disagree with that point. The
> standard describes very particular steps about what is required,
> legal or invalid when the preprocessor is used. The standard makes
> it clear that semantic analysis does not occur until phase 7,
> so by the end of phase 4, the internal representation of the
> source must not have undergone any changes that are dependent
> upon the semantics of C, other than the precisely defined changes
> about splicing lines together, replacement of comments with a single
> blank, detection of character literals and string boundaries, and
> so on as set out in phases 1-4.

Your input was something like:

#define eye L@@K
I eye

you expected:

I L@@K

but your preprocessor produced:

I L@ @K

Since L@@K is not a valid C token or a sequence of valid C tokens, the
preprocessor's behavior isn't going to affect any valid C program.
Later phases of the C compiler are going to produce a syntax error
message whether the preprocessor produces "I L@@K" or "I L@ @K".

I've also seen problems using a C preprocessor on input containing
apostrophes. If there's a single apostrophe on a line, the
preprocessor is going to treat it as an incomplete character constant.
(The same thing applies to quotation marks, but standalone apostrophes
are more common.)

For anything that's going to be flagged as an error by later phases,
different C preprocessors are likely to behave differently -- and if
you manage to get your project working with the quirks of whatever
preprocessor you're currently using, it's likely to break with a later
version.

A C preprocessor, even if it happens to have the (non-required)
ability to produce text output, is really designed to work on C source
code.

You may find that m4 is more suitable for your purposes (there's a GNU
implementation).

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Nov 14 '05 #8
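
Since a lone apostrophe or quotation mark on a line can be read as the start of an unfinished character constant or string literal, one defensive option is to check template lines before handing them to a C preprocessor. The following C sketch is only an illustration: has_unterminated_quote is a hypothetical helper, and it uses a simplified reading of C's quoting rules (a backslash inside a literal escapes the next character; comments and line splicing are ignored).

```c
#include <stdbool.h>

/* Hypothetical helper: returns true if `line` leaves a ' or " literal open.
   Simplified rules: inside a literal, a backslash escapes the next character. */
static bool has_unterminated_quote(const char *line)
{
    char open = '\0';                 /* quote character currently open, or 0 */

    for (const char *p = line; *p != '\0'; ++p) {
        if (open != '\0') {
            if (*p == '\\' && p[1] != '\0') {
                ++p;                  /* skip the escaped character */
            } else if (*p == open) {
                open = '\0';          /* the literal is closed */
            }
        } else if (*p == '\'' || *p == '"') {
            open = *p;                /* a literal is opened */
        }
    }
    return open != '\0';
}
```

A line such as "don't panic" is flagged, while "say 'x' twice" is not. What a given preprocessor actually does with a flagged line is up to that implementation.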

Walter Roberson

In article <pa****************************@netactive.co.uk>,
Lawrence Kirby <lk****@netactive.co.uk> wrote:
:> The only excuse I can think of is that as ` and @ and $ are not C
:> operators, that outside of character strings and character literals they

:They aren't even required to exist in the character set which makes their
:use, even in character constants and string literals, non-portable.

Checking around, I see that you are correct that those 3 characters
are not part of the minimal environment. It seems odd to think
that even the most elementary financial program using the North
American currency symbol would be technically non-portable, but
that does appear to be the case.

Nov 14 '05 #9
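
To make the character set point concrete, here is a minimal C sketch of a membership test for the basic source character set as I understand it from C89/C99: the 52 letters, 10 digits, 29 graphic characters, and space, horizontal tab, vertical tab and form feed. The function name in_basic_source_set is hypothetical, and treating '\n' as the end-of-line indicator is an assumption of the sketch. Later revisions of the standard may differ.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: is `c` in the C89/C99 basic source character set? */
static bool in_basic_source_set(char c)
{
    static const char basic[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789"
        "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~"   /* the 29 graphic characters */
        " \t\v\f"                            /* space, HT, VT, FF */
        "\n";                                /* assumed end-of-line indicator */

    /* strchr would also match the terminating NUL, so exclude it first. */
    return c != '\0' && strchr(basic, c) != NULL;
}
```

With this table, '$', '@' and '`' are all reported as outside the set, while '#' and '~' are inside it.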

Michael Wojcik

[Followups set to comp.lang.c.]

In article <7B********************@news1.mts.net>, ro******@mts.net (Walter Roberson) writes:

> In article <pa****************************@netactive.co.uk>,
> Lawrence Kirby <lk****@netactive.co.uk> wrote:
> :> The only excuse I can think of is that as ` and @ and $ are not C
> :> operators, that outside of character strings and character literals they
>
> :They aren't even required to exist in the character set which makes their
> :use, even in character constants and string literals, non-portable.
>
> Checking around, I see that you are correct that those 3 characters
> are not part of the minimal environment. It seems odd to think
> that even the most elementary financial program using the North
> American currency symbol would be technically non-portable, but
> that does appear to be the case.

The standard aims to accommodate implementations on platforms where
that symbol may not be conveniently available. I don't think that's
odd at all. Chances are, if you're writing a program that requires
that symbol, it will be conveniently available to you as an
implementation extension to the standard, and you ought to use that
extension just as you might use any other. Very, very few C programs
do not depend on any implementation extensions whatsoever.

--
Michael Wojcik mi************@microfocus.com

The antics which have been drawn together in this book are huddled here
for mutual protection like sheep. If they had half a wit apiece each
would bound off in many directions, to unsimplify the target. -- Walt Kelly

Nov 14 '05 #10
