How to remove // comments

Recently, a heated debate started because poor Mr. Heathfield
was unable to compile a program with // comments.

Here is a utility for him, so that he can (at last) compile my
programs :-)

More seriously, this code takes 560 bytes. Amazing, isn't it? C is very
compact; you can do great things in a few bytes.

Obviously I have avoided here, in consideration for his pedantic
compiler flags, any C99 issues, so it will compile in obsolete
compilers, and with only ~600 bytes you can run it in the toaster!

--------------------------------------------------------------cut here

/* This program reads a C source file and writes it, modified, to stdout.
   All // comments are replaced by C89-style block comments, to ease
   porting to old environments or posting to Usenet, where
   // comments can be broken across several lines and messed up.
*/

#include <stdio.h>
#include <stdlib.h>   /* EXIT_FAILURE */

/* This function reads a character and writes it to stdout */
static int Fgetc(FILE *f)
{
    int c = fgetc(f);
    if (c != EOF)
        putchar(c);
    return c;
}

/* This function skips strings */
static int ParseString(FILE *f)
{
    int c = Fgetc(f);
    while (c != EOF && c != '"') {
        if (c == '\\')
            c = Fgetc(f);
        if (c != EOF)
            c = Fgetc(f);
    }
    if (c == '"')
        c = Fgetc(f);
    return c;
}

/* Skips multi-line comments */
static int ParseComment(FILE *f)
{
    int c = Fgetc(f);

    while (1) {
        while (c != '*') {
            c = Fgetc(f);
            if (c == EOF)
                return EOF;
        }
        c = Fgetc(f);
        if (c == '/')
            break;
    }
    return Fgetc(f);
}

/* Skips // comments. Note that we use fgetc here and NOT Fgetc, */
/* since we want to modify the output before it gets echoed. */
static int ParseCppComment(FILE *f)
{
    int c = fgetc(f);

    while (c != EOF && c != '\n') {
        putchar(c);
        c = fgetc(f);
    }
    /* Close the comment even if the file ends without a newline. */
    puts(" */");
    if (c == '\n')
        c = Fgetc(f);
    return c;
}

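/* Checks whether a comment follows a '/' char */
static int CheckComment(int c, FILE *f)
{
    if (c == '/') {
        c = fgetc(f);
        if (c == '*') {
            putchar('*');
            c = ParseComment(f);
        }
        else if (c == '/') {
            putchar('*');
            c = ParseCppComment(f);
        }
        else if (c != EOF) {
            /* Not a comment: echo the character and hand it back to the
               caller, so that a following quote is still parsed as a
               string or character literal. */
            putchar(c);
        }
    }
    return c;
}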

/* Skips chars between single quotes */
static int ParseQuotedChar(FILE *f)
{
    int c = Fgetc(f);
    while (c != EOF && c != '\'') {
        if (c == '\\')
            c = Fgetc(f);
        if (c != EOF)
            c = Fgetc(f);
    }
    if (c == '\'')
        c = Fgetc(f);
    return c;
}

int main(int argc, char *argv[])
{
    FILE *f;
    int c;

    if (argc == 1) {
        fprintf(stderr, "Usage: %s <file.c>\n", argv[0]);
        return EXIT_FAILURE;
    }
    f = fopen(argv[1], "r");
    if (f == NULL) {
        fprintf(stderr, "Can't find %s\n", argv[1]);
        return EXIT_FAILURE;
    }
    c = Fgetc(f);
    while (c != EOF) {
        /* Note that each of the switches must advance the character */
        /* read so that we avoid an infinite loop. */
        switch (c) {
        case '"':
            c = ParseString(f);
            break;
        case '/':
            c = CheckComment(c, f);
            break;
        case '\'':
            c = ParseQuotedChar(f);
            break;
        default:
            c = Fgetc(f);
        }
    }
    fclose(f);
    return 0;
}

Oct 19 '06
Mark McIntyre wrote:
> On Fri, 20 Oct 2006 20:52:19 +0200, in comp.lang.c, jacob navia
> <ja***@jacob.remcomp.fr> wrote:
>
>> This is NONSENSE
>
> Have you noticed that by making a series of pointless throwaway
> inflammatory remarks, you have diverted all attention from your code?
>
> Nobody is bothering to read it any more. That's a shame as it might
> have been interesting.

You are right.

Excuse me for this polemic.

jacob
Oct 20 '06 #51
jacob navia <ja***@jacob.remcomp.fr> writes:
> Mark McIntyre wrote:
>> On Fri, 20 Oct 2006 20:52:19 +0200, in comp.lang.c, jacob navia
>> <ja***@jacob.remcomp.fr> wrote:
>>> This is NONSENSE
>>
>> Have you noticed that by making a series of pointless throwaway
>> inflammatory remarks, you have diverted all attention from your code?
>> Nobody is bothering to read it any more. That's a shame as it might
>> have been interesting.
>
> You are right.
>
> Excuse me for this polemic.

jacob, this is the second time recently that I've seen you admit to an
error or misjudgement. I just wanted to say, with no sarcasm or
criticism intended, that this is A Good Thing. Thank you.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Oct 20 '06 #52
jacob navia wrote:
> Walter Bright wrote:
>> .... snip ...
>>
>> but of course I've never seen trigraphs outside of a test suite.
>
> Me neither. But I do not support trigraphs anyway. They are an
> unnecessary feature. We had several lengthy discussions about
> this in comp.std.c.

Consider the following scenario. Joe Q Customer has this large
monstrous set of source files, containing a few hundred K lines.
It is C89 compatible, and was used on IBMery or some such without
those characters, so it uses trigraphs throughout. It compiles and
executes correctly on any standards compliant C system.

Now Joe wants to port it to a PC, and he unsuspectingly gets your
compiler to do the job. Many thousands of errors, about 3 or 4 per
line. Will Joe turn to you for any future business? Or will he
run around in circles badmouthing your system? Or do you think he
will laboriously revise all that source to satisfy your peculiar
attitude?

--
Chuck F (cbfalconer at maineline dot net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net>

Oct 20 '06 #53
Richard Heathfield wrote:
> Walter Bright said:
>> Trigraphs are a worthless feature.
>
> This "worthless feature" is sometimes the only way you can get C code to
> compile on a particular implementation, because the native character set of
> the implementation doesn't contain such fancy characters as { or [ - so to
> dismiss it as worthless is to display mere parochialism. I've worked on a
> system that had no end of trouble with [ and ] but was quite at home with
> ??( and ??)

EBCDIC is parochialism, not ASCII. ASCII covers 99.99999% of the systems
out there. No sane person is going to invent a new character encoding
that doesn't include ASCII.

Trigraphs would be great if they solved the problem you mentioned. But
they don't. People overwhelmingly write C code using fancy characters {
and [, and that source code fails on EBCDIC systems. You're going to
have to run the source through a translator whether trigraphs are in the
standard or not.

So what have trigraphs in the Standard bought you? Nothing. They don't
even work with RADIX50.

Nevertheless, they are in the standard and C compilers should implement
them. Digital Mars C does.

Walter Bright
www.digitalmars.com C, C++, D programming language compilers
Oct 20 '06 #54
jacob navia wrote:
> Why should *I* bother about that?

Because:

1) It's only about 10-15 lines of code to implement, and that's far
easier than arguing about it.

2) Because standards compliance is important, even if one doesn't agree
with all of it.
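
To put a rough size on point 1, here is a minimal sketch of a trigraph decoder working on an in-memory string. It is not taken from any compiler; the names `trigraph_value` and `untrigraph` are made up for illustration, and it ignores everything else that translation phase 1 has to do.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: maps the third character of a trigraph to its
   replacement, or returns 0 if two question marks followed by c do not
   form one of the nine trigraphs. */
static char trigraph_value(char c)
{
    static const char from[] = "=()<>!'-/";
    static const char to[]   = "#[]{}|^~\\";
    const char *p = (c != '\0') ? strchr(from, c) : NULL;
    return p ? to[p - from] : 0;
}

/* Copies in to out, replacing each trigraph by the single character it
   denotes. out must have room for strlen(in) + 1 bytes. */
static void untrigraph(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        char r = 0;
        if (in[0] == '?' && in[1] == '?')
            r = trigraph_value(in[2]);
        if (r != 0) {
            *out++ = r;     /* three input characters become one */
            in += 3;
        } else {
            *out++ = *in++; /* anything else is copied unchanged */
        }
    }
    *out = '\0';
}
```

Under these assumptions, `untrigraph` turns the text `int a??(8??);` into `int a[8];`, and scans left to right so that `???=` becomes `?#`.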
Oct 20 '06 #55
jxh

jacob navia wrote:
> Recently, a heated debate started because poor Mr. Heathfield
> was unable to compile a program with // comments.
>
> Here is a utility for him, so that he can (at last) compile my
> programs :-)

The code below is considerably larger, but it should get the job done.
It actually removes all comments.

--
James

/*
* cstripc: A C program to strip comments from C files.
* Usage:
* cstripc [file [...]]
* cstripc [-t]
*
* The '-t' option is used for testing. It prints some pointers to strings
* that are interlaced with comment characters.
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*****************/
/**** GLOBALS ****/
/*****************/

static const char *progname;
static int debug_flag;

/**********************/
/**** MAIN PROGRAM ****/
/**********************/

static void print_usage(void);
static void print_test(void);

static FILE * open_input_file(const char *filename);
static void close_input_file(FILE *infile);
static void parse_input_file(FILE *infile);

int
main(int argc, char *argv[])
{
progname = argv[0];
if (progname == 0) {
progname = "cstripc";
}

while (argc > 1) {

if ((*argv[1] != '-') || (strcmp(argv[1], "-") == 0)) {
break;
}

if (strcmp(argv[1], "-t") == 0) {
print_test();
exit(0);
} else if (strcmp(argv[1], "-d") == 0) {
debug_flag = 1;
} else {
fprintf(stderr, "%s: Unrecognized option '%s'\n",
progname, argv[1]);
print_usage();
exit(EXIT_FAILURE);
}

--argc;
++argv;
}

if (argc <= 1) {
parse_input_file(stdin);
exit(0);
}

while (argc > 1) {
FILE *infile;

parse_input_file(infile = open_input_file(argv[1]));
close_input_file(infile);

--argc;
++argv;
}
return 0;
}

/**************************/
/**** PRINT USAGE/TEST ****/
/**************************/

static const char *usage_string =
"%s: A C program to strip comments from C files.\n"
"Usage:\n"
" %s [file [...]]\n"
" %s [-t]\n"
"\n"
"The '-t' option is used for testing. It prints some pointers to strings\n"
"that are interlaced with comment characters.\n"
;

static void
print_usage(void)
{
fprintf(stderr, usage_string, progname, progname, progname);
}

static const char *a;
static const char *b;
static const char *c;

static void
print_test(void)
{
if (a) puts(a);
if (b) puts(b);
if (c) puts(c);
}

/*******************************/
/**** OPEN/CLOSE INPUT FILE ****/
/*******************************/

static const char *input_file_name;

static FILE *
open_input_file(const char *filename)
{
FILE *infile;

input_file_name = filename;

if (filename == 0) {
return 0;
}

if (strcmp(filename, "-") == 0) {
return stdin;
}

infile = fopen(filename, "r");
if (infile == 0) {
fprintf(stderr, "%s: Could not open '%s' for reading.\n",
progname, filename);
}

return infile;
}

static void
close_input_file(FILE *infile)
{
if (infile) {
if (infile != stdin) {
if (fclose(infile) == EOF)
fprintf(stderr, "%s, Could not close '%s'.\n",
progname, input_file_name);
} else {
clearerr(stdin);
}
}
}

/**************************/
/**** PARSE INPUT FILE ****/
/**************************/

typedef struct scan_state scan_state;
typedef struct scan_context scan_context;

struct scan_context {
const scan_state *ss;
char *sbuf;
unsigned sbufsz;
unsigned sbufcnt;
};

struct scan_state {
const scan_state *(*scan)(scan_context *ctx, int input);
const char *name;
};

static scan_context initial_scan_context;

static void
parse_input_file(FILE *infile)
{
int c;
scan_context ctx;

if (infile == 0) {
return;
}

ctx = initial_scan_context;

while ((c = fgetc(infile)) != EOF) {
if (debug_flag) {
fprintf(stderr, "%s\n", ctx.ss->name);
}
ctx.ss = ctx.ss->scan(&ctx, c);
}
}

/***********************/
/**** STATE MACHINE ****/
/***********************/

/*
*
***************************************************************************
* Assume input is a syntactically correct C program.
*
* The basic algorithm is:
* Scan character by character:
* Treat trigraphs as a single character.
* If the sequence does not start a comment, emit the sequence.
* Otherwise,
* Scan character by character:
* Treat trigraphs as a single character.
* Treat the sequence '\\' '\n' as no character.
* If the sequence does not end a comment, continue consuming.
* Otherwise, emit a space, and loop back to top.
***************************************************************************
*
*/

#define SCAN_STATE_DEFINE(name) \
static const scan_state * name##_func(scan_context *ctx, int input); \
static const scan_state name##_state = { name##_func, #name }

SCAN_STATE_DEFINE(normal);
SCAN_STATE_DEFINE(normal_maybe_tri_1);
SCAN_STATE_DEFINE(normal_maybe_tri_2);
SCAN_STATE_DEFINE(string);
SCAN_STATE_DEFINE(string_maybe_tri_1);
SCAN_STATE_DEFINE(string_maybe_tri_2);
SCAN_STATE_DEFINE(string_maybe_splice);
SCAN_STATE_DEFINE(char);
SCAN_STATE_DEFINE(char_maybe_tri_1);
SCAN_STATE_DEFINE(char_maybe_tri_2);
SCAN_STATE_DEFINE(char_maybe_splice);
SCAN_STATE_DEFINE(slash);
SCAN_STATE_DEFINE(slash_maybe_tri_1);
SCAN_STATE_DEFINE(slash_maybe_tri_2);
SCAN_STATE_DEFINE(slash_maybe_splice);
SCAN_STATE_DEFINE(slashslash);
SCAN_STATE_DEFINE(slashslash_maybe_tri_1);
SCAN_STATE_DEFINE(slashslash_maybe_tri_2);
SCAN_STATE_DEFINE(slashslash_maybe_splice);
SCAN_STATE_DEFINE(slashsplat);
SCAN_STATE_DEFINE(slashsplat_splat);
SCAN_STATE_DEFINE(slashsplat_splat_maybe_tri_1);
SCAN_STATE_DEFINE(slashsplat_splat_maybe_tri_2);
SCAN_STATE_DEFINE(slashsplat_splat_maybe_splice);

#define SCAN_STATE(name) (&name##_state)

static scan_context initial_scan_context = { SCAN_STATE(normal), 0, 0, 0 };

static void sbuf_append_char(scan_context *ctx, int c);
static void sbuf_append_string(scan_context *ctx, char *s);
static void sbuf_clear(scan_context *ctx);
static void sbuf_emit(scan_context *ctx);

static const scan_state *
normal_func(scan_context *ctx, int input)
{
switch (input) {
case '?': sbuf_emit(ctx);
sbuf_append_char(ctx, input);
return SCAN_STATE(normal_maybe_tri_1);
case '"': sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(string);
case '\'': sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(char);
case '/': sbuf_emit(ctx);
sbuf_append_char(ctx, input);
return SCAN_STATE(slash);
default: sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(normal);
}
}

static const scan_state *
normal_maybe_tri_1_func(scan_context *ctx, int input)
{
switch (input) {
case '?': sbuf_append_char(ctx, input);
return SCAN_STATE(normal_maybe_tri_2);
default: sbuf_emit(ctx);
return SCAN_STATE(normal)->scan(ctx, input);
}
}

static const scan_state *
normal_maybe_tri_2_func(scan_context *ctx, int input)
{
switch (input) {
case '?': putchar(input);
return SCAN_STATE(normal_maybe_tri_2);
case '=':
case '(':
case ')':
case '<':
case '>':
case '!':
case '\'':
case '-':
case '/': sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(normal);
default: sbuf_emit(ctx);
return SCAN_STATE(normal)->scan(ctx, input);
}
}

static const scan_state *
string_func(scan_context *ctx, int input)
{
switch (input) {
case '?': sbuf_emit(ctx);
sbuf_append_char(ctx, input);
return SCAN_STATE(string_maybe_tri_1);
case '"': sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(normal);
case '\\': sbuf_emit(ctx);
sbuf_append_char(ctx, input);
return SCAN_STATE(string_maybe_splice);
default: sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(string);
}
}

static const scan_state *
string_maybe_tri_1_func(scan_context *ctx, int input)
{
switch (input) {
case '?': sbuf_append_char(ctx, input);
return SCAN_STATE(string_maybe_tri_2);
default: sbuf_emit(ctx);
return SCAN_STATE(string)->scan(ctx, input);
}
}

static const scan_state *
string_maybe_tri_2_func(scan_context *ctx, int input)
{
switch (input) {
case '?': putchar(input);
return SCAN_STATE(string_maybe_tri_2);
case '/': sbuf_append_char(ctx, input);
return SCAN_STATE(string_maybe_splice);
case '=':
case '(':
case ')':
case '<':
case '>':
case '!':
case '\'':
case '-': sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(string);
default: sbuf_emit(ctx);
return SCAN_STATE(string)->scan(ctx, input);
}
}

static const scan_state *
string_maybe_splice_func(scan_context *ctx, int input)
{
switch (input) {
case '\n':
default: sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(string);
}
}

static const scan_state *
char_func(scan_context *ctx, int input)
{
switch (input) {
case '?': sbuf_emit(ctx);
sbuf_append_char(ctx, input);
return SCAN_STATE(char_maybe_tri_1);
case '\'': sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(normal);
case '\\': sbuf_emit(ctx);
sbuf_append_char(ctx, input);
return SCAN_STATE(char_maybe_splice);
default: sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(char);
}
}

static const scan_state *
char_maybe_tri_1_func(scan_context *ctx, int input)
{
switch (input) {
case '?': sbuf_append_char(ctx, input);
return SCAN_STATE(char_maybe_tri_2);
default: sbuf_emit(ctx);
return SCAN_STATE(char)->scan(ctx, input);
}
}

static const scan_state *
char_maybe_tri_2_func(scan_context *ctx, int input)
{
switch (input) {
case '?': putchar(input);
return SCAN_STATE(char_maybe_tri_2);
case '/': sbuf_append_char(ctx, input);
return SCAN_STATE(char_maybe_splice);
case '=':
case '(':
case ')':
case '<':
case '>':
case '!':
case '\'':
case '-': sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(char);
default: sbuf_emit(ctx);
return SCAN_STATE(char)->scan(ctx, input);
}
}

static const scan_state *
char_maybe_splice_func(scan_context *ctx, int input)
{
switch (input) {
case '\n':
default: sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(char);
}
}

static const scan_state *
slash_func(scan_context *ctx, int input)
{
switch (input) {
case '?': sbuf_append_char(ctx, input);
return SCAN_STATE(slash_maybe_tri_1);
case '\\': sbuf_append_char(ctx, input);
return SCAN_STATE(slash_maybe_splice);
case '/': sbuf_clear(ctx);
return SCAN_STATE(slashslash);
case '*': sbuf_clear(ctx);
return SCAN_STATE(slashsplat);
default: sbuf_emit(ctx);
return SCAN_STATE(normal)->scan(ctx, input);
}
}

static const scan_state *
slash_maybe_tri_1_func(scan_context *ctx, int input)
{
switch (input) {
case '?': return SCAN_STATE(slash_maybe_tri_2);
default: sbuf_emit(ctx);
return SCAN_STATE(normal)->scan(ctx, input);
}
}

static const scan_state *
slash_maybe_tri_2_func(scan_context *ctx, int input)
{
switch (input) {
case '?': sbuf_emit(ctx);
sbuf_append_string(ctx, "??");
return SCAN_STATE(normal_maybe_tri_2);
case '/': sbuf_append_char(ctx, '?');
sbuf_append_char(ctx, input);
return SCAN_STATE(slash_maybe_splice);
case '=':
case '(':
case ')':
case '<':
case '>':
case '!':
case '\'':
case '-': sbuf_append_char(ctx, '?');
sbuf_append_char(ctx, input);
sbuf_emit(ctx);
return SCAN_STATE(normal);
default: sbuf_append_char(ctx, '?');
sbuf_emit(ctx);
return SCAN_STATE(normal)->scan(ctx, input);
}
}

static const scan_state *
slash_maybe_splice_func(scan_context *ctx, int input)
{
switch (input) {
case '\n': sbuf_append_char(ctx, input);
return SCAN_STATE(slash);
default: sbuf_emit(ctx);
return SCAN_STATE(normal)->scan(ctx, input);
}
}

static const scan_state *
slashslash_func(scan_context *ctx, int input)
{
/* UNUSED */ ctx = ctx;
switch (input) {
case '?': return SCAN_STATE(slashslash_maybe_tri_1);
case '\\': return SCAN_STATE(slashslash_maybe_splice);
case '\n': putchar(' ');
putchar(input);
return SCAN_STATE(normal);
default: return SCAN_STATE(slashslash);
}
}

static const scan_state *
slashslash_maybe_tri_1_func(scan_context *ctx, int input)
{
switch (input) {
case '?': return SCAN_STATE(slashslash_maybe_tri_2);
default: return SCAN_STATE(slashslash)->scan(ctx, input);
}
}

static const scan_state *
slashslash_maybe_tri_2_func(scan_context *ctx, int input)
{
switch (input) {
case '?': return SCAN_STATE(slashslash_maybe_tri_2);
case '/': return SCAN_STATE(slashslash_maybe_splice);
case '=':
case '(':
case ')':
case '<':
case '>':
case '!':
case '\'':
case '-': return SCAN_STATE(slashslash);
default: return SCAN_STATE(slashslash)->scan(ctx, input);
}
}

static const scan_state *
slashslash_maybe_splice_func(scan_context *ctx, int input)
{
switch (input) {
case '\n': return SCAN_STATE(slashslash);
default: return SCAN_STATE(slashslash)->scan(ctx, input);
}
}

static const scan_state *
slashsplat_func(scan_context *ctx, int input)
{
/* UNUSED */ ctx = ctx;
switch (input) {
case '*': return SCAN_STATE(slashsplat_splat);
default: return SCAN_STATE(slashsplat);
}
}

static const scan_state *
slashsplat_splat_func(scan_context *ctx, int input)
{
switch (input) {
case '?': return SCAN_STATE(slashsplat_splat_maybe_tri_1);
case '\\': return SCAN_STATE(slashsplat_splat_maybe_splice);
case '/': putchar(' ');
return SCAN_STATE(normal);
default: return SCAN_STATE(slashsplat)->scan(ctx, input);
}
}

static const scan_state *
slashsplat_splat_maybe_tri_1_func(scan_context *ctx, int input)
{
switch (input) {
case '?': return SCAN_STATE(slashsplat_splat_maybe_tri_2);
default: return SCAN_STATE(slashsplat)->scan(ctx, input);
}
}

static const scan_state *
slashsplat_splat_maybe_tri_2_func(scan_context *ctx, int input)
{
switch (input) {
case '/': return SCAN_STATE(slashsplat_splat_maybe_splice);
case '=':
case '(':
case ')':
case '<':
case '>':
case '!':
case '\'':
case '-': return SCAN_STATE(slashsplat);
default: return SCAN_STATE(slashsplat)->scan(ctx, input);
}
}

static const scan_state *
slashsplat_splat_maybe_splice_func(scan_context *ctx, int input)
{
switch (input) {
case '\n': return SCAN_STATE(slashsplat_splat);
default: return SCAN_STATE(slashsplat)->scan(ctx, input);
}
}

/*************************/
/**** BUFFER HANDLING ****/
/*************************/

static void
sbuf_append_char(scan_context *ctx, int c)
{
/* Keep room for c and for the terminating '\0'. */
if (ctx->sbuf == 0 || ctx->sbufcnt + 1 >= ctx->sbufsz) {
unsigned newsz = (ctx->sbuf == 0) ? 128 : ctx->sbufsz * 2;
char *p = realloc(ctx->sbuf, newsz);
if (p == 0) {
fprintf(stderr, "%s: memory allocation failure\n",
progname);
exit(EXIT_FAILURE);
}
ctx->sbuf = p;
ctx->sbufsz = newsz;
}

ctx->sbuf[ctx->sbufcnt++] = c;
ctx->sbuf[ctx->sbufcnt] = '\0';
}

static void
sbuf_append_string(scan_context *ctx, char *s)
{
while (*s != '\0') {
sbuf_append_char(ctx, *s++);
}
}

static void
sbuf_clear(scan_context *ctx)
{
ctx->sbufcnt = 0;
if (ctx->sbuf) {
ctx->sbuf[ctx->sbufcnt] = '\0';
}
}

static void
sbuf_emit(scan_context *ctx)
{
if (ctx->sbuf == 0 || ctx->sbufcnt == 0) {
return;
}

printf("%s", ctx->sbuf);
sbuf_clear(ctx);
}

/********************/
/**** TEST CASES ****/
/********************/

/* a comment */
/\
* a comment split */
/\
\
* a comment split twice */
/*
block comment
*/
/* comment, trailing delimiter split *\
/
/* comment, trailing delimiter split twice *\
\
/
/* comment, trailing delimiter split once, and again by trigraph *\
??/
/

static const char *a = /* comment in code line "*/"Hello, "/**/"World!";
static const char *b = /\
* comment on code line split */ "Hello, " /\
\
* comment on code line split twice */ "World!";

#define FOO ??/* this does not start a comment */

#if defined(__STDC__) && (__STDC__ == 1)
#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L)
//*** MORE TEST CASES ***//
/\
/ // comment split
/\
\
/ // comment split twice
static const char *c = // // comment on code line
"Hello, " /\
/ // comment on code line split
"World!" /\
\
/ // comment on code line split twice.
;

#define BAR ??// this does not start a comment

// This is a // comment \
on two lines

#else
static const char *c = "STDC without STD_VERSION";
#endif
#endif

Oct 21 '06 #56
jxh wrote:
> jacob navia wrote:
>> Recently, a heated debate started because poor Mr. Heathfield
>> was unable to compile a program with // comments.
>>
>> Here is a utility for him, so that he can (at last) compile my
>> programs :-)
>
> The code below is considerably larger, but it should get the job
> done. It actually removes all comments.
> .... snip code ...

If you just want to delete all comments, my public domain uncmnt.c
is considerably shorter. 109 lines in place of your 740 odd. It
doesn't handle trigraphs. It does maintain the original line
numbering. See:

<http://cbfalconer.home.att.net/download/>

It should be fairly easily modified to convert the comments.
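
For readers who want to see the general shape without fetching the file, here is a minimal sketch of a comment stripper that keeps the original line numbering. It is not uncmnt.c (whose source is not shown in this thread); the name `strip_comments` and the in-memory interface are assumptions made for illustration. Like the program described above, it does not handle trigraphs, and it also ignores backslash-newline splices.

```c
#include <assert.h>
#include <string.h>

enum state { CODE, STRING, CHARLIT, LINE_COMMENT, BLOCK_COMMENT };

/* Copies in to out with comments removed. Every newline is kept, so line
   numbers in the output match the input. out needs strlen(in) + 1 bytes. */
static void strip_comments(const char *in, char *out)
{
    enum state st = CODE;
    assert(in != NULL && out != NULL);

    while (*in != '\0') {
        char c = *in++;
        switch (st) {
        case CODE:
            if (c == '/' && *in == '/') {
                st = LINE_COMMENT;
                in++;
            } else if (c == '/' && *in == '*') {
                st = BLOCK_COMMENT;
                in++;
                *out++ = ' ';      /* a block comment becomes one space */
            } else {
                if (c == '"')  st = STRING;
                if (c == '\'') st = CHARLIT;
                *out++ = c;
            }
            break;
        case STRING:
        case CHARLIT:
            *out++ = c;
            if (c == '\\' && *in != '\0')
                *out++ = *in++;    /* copy the escaped character verbatim */
            else if (c == (st == STRING ? '"' : '\''))
                st = CODE;
            break;
        case LINE_COMMENT:
            if (c == '\n') {
                *out++ = c;        /* the newline ends the comment */
                st = CODE;
            }
            break;
        case BLOCK_COMMENT:
            if (c == '\n') {
                *out++ = c;        /* keep line breaks inside the comment */
            } else if (c == '*' && *in == '/') {
                in++;
                st = CODE;
            }
            break;
        }
    }
    *out = '\0';
}
```

Converting comments instead of deleting them would mean emitting replacement text in the two comment states rather than dropping the characters.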

--
Chuck F (cbfalconer at maineline dot net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net>
Oct 21 '06 #57
Walter Bright said:
> Richard Heathfield wrote:
>> Walter Bright said:
>>> Trigraphs are a worthless feature.
>> This "worthless feature" is sometimes the only way you can get C code to
>> compile on a particular implementation, because the native character set
>> of the implementation doesn't contain such fancy characters as { or [ -
>> so to dismiss it as worthless is to display mere parochialism. I've
>> worked on a system that had no end of trouble with [ and ] but was quite
>> at home with ??( and ??)
>
> EBCDIC is parochialism, not ASCII.

I didn't say ASCII was parochialism. I said that an attitude that assumes it
is.

> ASCII covers 99.99999% of the systems
> out there.

Nevertheless, there are still an awful lot of mainframes around, and they
are a very important part of the C world.

> No sane person is going to invent a new character encoding
> that doesn't include ASCII.

...unless it makes business sense or technical sense to do that, which it
might, one day. (The Microsoft Office guys had much the same opinion of int
- "the compiler guys wouldn't change the size of an int on us - they know
it'd break all our code", but the compiler guys changed it anyway.)

> Trigraphs would be great if they solved the problem you mentioned. But
> they don't. People overwhelmingly write C code using fancy characters {
> and [, and that source code fails on EBCDIC systems. You're going to
> have to run the source through a translator whether trigraphs are in the
> standard or not.

That's mostly true, yes, although I did work on one site which required the
programmers to use trigraphs in their code (which was written and debugged
on PCs before being moved up to the mainframe for testing).

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)
Oct 21 '06 #58
Keith Thompson wrote:
> "Jalapeno" <ja*******@mac.com> writes:
>> Walter Bright wrote:
>>> but of course I've never seen trigraphs outside of a test suite.
>>
>> Haven't worked in a z/OS shop before, huh? (or a Sys 370 one either)
>>
>> It only takes an hour or two of working with int a??(8??); to get used
>> to them (and they become second nature quickly when you see them all
>> day long).
>
> Fascinating. There have been raging arguments about trigraphs both
> here and in comp.std.c for years. I think you're the first person
> I've seen who actually *uses* them.

Old Mac programmers (pre OS-X) certainly knew of the ??' trigraph
because it cropped up in the multibyte character constant '????' that was
used as a default file type. Even though such code is obviously platform
specific, you would still see the better quality programs using '???\?' to
avoid potential trigraph translation.
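
The same escape works in string literals. As a minimal sketch (the function name `escape_question_marks` is made up for illustration), a code generator could spell every question mark as `\?` when it writes out the body of a literal, so that no trigraph can form regardless of compiler settings. It only deals with question marks; quotes and backslashes in the input would need their own escaping.

```c
#include <assert.h>
#include <string.h>

/* Writes in to out as source text in which every '?' is spelled as a
   backslash followed by '?'. Two unescaped question marks are then never
   adjacent, so the result is safe to paste into a string literal.
   out needs 2 * strlen(in) + 1 bytes. */
static void escape_question_marks(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    for (; *in != '\0'; in++) {
        if (*in == '?')
            *out++ = '\\';
        *out++ = *in;
    }
    *out = '\0';
}
```

Escaping every question mark is more than strictly necessary (the '???\?' form above escapes only the last one), but it avoids having to reason about which sequences are trigraphs.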

--
Peter

Oct 21 '06 #59
Richard Heathfield wrote:
> Walter Bright said:
>> No sane person is going to invent a new character encoding
>> that doesn't include ASCII.
>
> ...unless it makes business sense or technical sense to do that, which it
> might, one day.

So what if some future encoding doesn't have a '?' ? Then trigraphs
won't work. If C is concerned about such a possibility, why does it
require the '?' character to exist, or any other character? '?' isn't a
valid character in the (once popular) RADIX50 encoding.

> (The Microsoft Office guys had much the same opinion of int
> - "the compiler guys wouldn't change the size of an int on us - they know
> it'd break all our code", but the compiler guys changed it anyway.

Programmers knew that ints were going from 16 to 32 bits, and it was
useless to resist such a change. If it really was going to make life
impossible for the Office guys, I'm sure they had the clout to get a
special compiler built just for them. Microsoft wasn't going to endanger
the revenue stream from Office.

(The above is reasonable speculation on my part, I don't have any inside
knowledge of Microsoft.)

>> Trigraphs would be great if they solved the problem you mentioned. But
>> they don't. People overwhelmingly write C code using fancy characters {
>> and [, and that source code fails on EBCDIC systems. You're going to
>> have to run the source through a translator whether trigraphs are in the
>> standard or not.
>
> That's mostly true, yes, although I did work on one site which required the
> programmers to use trigraphs in their code (which was written and debugged
> on PCs before being moved up to the mainframe for testing).

It's a silly requirement, because a trigraph translator program is about
as trivial as it gets. Heck, CR-LF translation is routine. A viable
EBCDIC system these days has already got to be doing a lot of
translation of ASCII <-> EBCDIC in order to deal with an ASCII world.
Those PC C programs you wrote had to be translated *anyway* to move them
to the mainframe.

Oct 21 '06 #60
Richard Heathfield wrote:
>
> .... snip ...
>
> That's mostly true, yes, although I did work on one site which
> required the programmers to use trigraphs in their code (which
> was written and debugged on PCs before being moved up to the
> mainframe for testing).

A useful pair of filter utilities would be:

entrigph
untrigph

I don't know if it is possible to cater to all possible source.
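
As a starting point, the encoding direction could look like the minimal sketch below. The function name `entrigraph` and the in-memory interface are assumptions for illustration; a real filter would read and write streams. It blindly replaces the nine characters everywhere, which is exactly where the doubt about "all possible source" comes in: text that already contains two question marks followed by one of the trigraph characters would not survive a round trip unchanged.

```c
#include <assert.h>
#include <string.h>

/* Copies in to out, replacing each of the nine characters that have a
   trigraph spelling by that trigraph. out needs 3 * strlen(in) + 1 bytes. */
static void entrigraph(const char *in, char *out)
{
    static const char from[] = "#[]{}|^~\\";
    static const char to[]   = "=()<>!'-/";

    assert(in != NULL && out != NULL);
    for (; *in != '\0'; in++) {
        const char *p = strchr(from, *in);
        if (p != NULL) {
            *out++ = '?';
            *out++ = '?';
            *out++ = to[p - from];  /* third character of the trigraph */
        } else {
            *out++ = *in;
        }
    }
    *out = '\0';
}
```

With this sketch, `int a[8];` comes out as `int a??(8??);`.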

--
Chuck F (cbfalconer at maineline dot net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net>
Oct 21 '06 #61
Walter Bright said:
> Richard Heathfield wrote:
>> The Microsoft Office guys had much the same opinion of int
>> - "the compiler guys wouldn't change the size of an int on us - they know
>> it'd break all our code", but the compiler guys changed it anyway.
>
> Programmers knew that ints were going from 16 to 32 bits, and it was
> useless to resist such a change. If it really was going to make life
> impossible for the Office guys, I'm sure they had the clout to get a
> special compiler built just for them. Microsoft wasn't going to endanger
> the revenue stream from Office.
>
> (The above is reasonable speculation on my part, I don't have any inside
> knowledge of Microsoft.)

My source is "Writing Solid Code", written by Steve Maguire and published by
Microsoft Press. If the anecdote were not true, I'm sure Microsoft had the
clout to refuse to publish it.

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)
Oct 21 '06 #62
On Fri, 20 Oct 2006 16:40:37 -0700, in comp.lang.c, Walter Bright
<wa****@digitalmars-nospamm.com> wrote:
> EBCDIC is parochialism, not ASCII. ASCII covers 99.99999% of the systems
> out there.

Counted 'em have you? And did you do it by number of boxes, users or
compute power, revenue generated, importance to GDP or what?

> No sane person is going to invent a new character encoding
> that doesn't include ASCII.

Apparently nobody told IBM.

> Trigraphs would be great if they solved the problem you mentioned. But
> they don't. People overwhelmingly write C code using fancy characters {
> and [, and that source code fails on EBCDIC systems.

Apparently nobody told the thousands of C programmers using IBM
mainframes for the last 40 years.

> Nevertheless, they are in the standard and C compilers should implement
> them. Digital Mars C does.

Glad we agree about that.
--
Mark McIntyre

"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it."
--Brian Kernighan
Oct 21 '06 #63
Mark McIntyre <ma**********@spamcop.net> writes:
> On Fri, 20 Oct 2006 16:40:37 -0700, in comp.lang.c, Walter Bright
> <wa****@digitalmars-nospamm.com> wrote:
> [...]
>> No sane person is going to invent a new character encoding
>> that doesn't include ASCII.
>
> Apparently nobody told IBM.

It's unlikely *now* that anyone would invent a new encoding that's not
based on ASCII. This wasn't the case when IBM invented EBCDIC.

But we shouldn't assume anything.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Oct 21 '06 #64
Richard Heathfield wrote:
> Walter Bright said:
>> Richard Heathfield wrote:
>>> The Microsoft Office guys had much the same opinion of int
>>> - "the compiler guys wouldn't change the size of an int on us - they know
>>> it'd break all our code", but the compiler guys changed it anyway.
>>
>> Programmers knew that ints were going from 16 to 32 bits, and it was
>> useless to resist such a change. If it really was going to make life
>> impossible for the Office guys, I'm sure they had the clout to get a
>> special compiler built just for them. Microsoft wasn't going to endanger
>> the revenue stream from Office.
>>
>> (The above is reasonable speculation on my part, I don't have any inside
>> knowledge of Microsoft.)
>
> My source is "Writing Solid Code", written by Steve Maguire and published by
> Microsoft Press. If the anecdote were not true, I'm sure Microsoft had the
> clout to refuse to publish it.

Microsoft Windows and Office bring in the bulk of Microsoft's revenues.
There's just no way management would ever allow a few rogue compiler
guys to jeopardize that.

Going to 32 bit ints offers real advantages to programs, and I bet that
the Office guys realized those advantages outweighed going through and
fixing a few bugs here and there.
Oct 21 '06 #65
Mark McIntyre wrote:
On Fri, 20 Oct 2006 16:40:37 -0700, in comp.lang.c , Walter Bright
<wa****@digitalmars-nospamm.com> wrote:
>EBCDIC is parochialism, not ASCII. ASCII covers 99.99999% of the systems
out there.
Counted 'em have you? And did you do it by number of boxes, users or
compute power, revenue generated, importance to GDP or what?
Count 'em any way you want.

When was the last time you saw trigraphs (outside of a test case or
obfuscated C code entry) in:

1) a programming magazine listing
2) a book on programming
3) a project on sourceforge
4) a posting in this n.g.
5) a C programming web page
6) a paper submitted at a programming conference

? I haven't seen any in 25 years of being in the C business.
>No sane person is going to invent a new character encoding
that doesn't include ASCII.
Apparently nobody told IBM.
IBM isn't going to, either. EBCDIC is a legacy encoding going back to
punchcard machines. It was a legacy encoding back when I started in the
early 70's. Modern EBCDIC machines are loaded up with software, and even
hardware CPU instructions, to translate ASCII <=> EBCDIC. And even IBM
has gone with Unicode (an ASCII superset) to deal with multilingual text.
>Trigraphs would be great if they solved the problem you mentioned. But
they don't. People overwhelmingly write C code using fancy characters {
and [, and that source code fails on EBCDIC systems.
Apparently nobody told the thousands of C programmers using IBM
mainframes for the last 40 years.
Thousands of C programmers over 40 years, ok. Just Digital Mars has
shipped over half a million C compilers over the last 6 years. How many
C compilers would you guess Microsoft has shipped? Sun? gcc? Borland?
Watcom? Intel? Apple? Green Hills?

In 25 years I've never seen a tty, printer, or modem that supported
EBCDIC, from an ASR-33 to an HP laserprinter. I've worked on embedded
systems from Mattel Intellivision cartridges to phones. None did EBCDIC.
>Nevertheless, they are in the standard and C compilers should implement
them. Digital Mars C does.

Glad we agree about that.
Oct 21 '06 #66
Walter Bright <wa****@digitalmars-nospamm.com> writes:
Mark McIntyre wrote:
>On Fri, 20 Oct 2006 16:40:37 -0700, in comp.lang.c , Walter Bright
<wa****@digitalmars-nospamm.com> wrote:
>>EBCDIC is parochialism, not ASCII. ASCII covers 99.99999% of the
systems out there.
Counted 'em have you? And did you do it by number of boxes, users or
compute power, revenue generated, importance to GDP or what?

Count 'em any way you want.

When was the last time you saw trigraphs (outside of a test case or
obfuscated C code entry) in:

1) a programming magazine listing
2) a book on programming
3) a project on sourceforge
4) a posting in this n.g.
5) a C programming web page
6) a paper submitted at a programming conference

? I haven't seen any in 25 years of being in the C business.
Earlier this week. Search this newsgroup for postings by "Jalapeno".

[...]
In 25 years I've never seen a tty, printer, or modem that supported
EBCDIC, from an ASR-33 to an HP laserprinter. I've worked on embedded
systems from Mattel Intellivision cartridges to phones. None did
EBCDIC.
Neither have I, but reliable sources tell us that EBCDIC is still
being used.

My suspicion is that IBM mainframe programmers mostly keep to
themselves. We don't see them here in comp.lang.c because they rarely
post here, not because they don't exist. They're just a separate
community.

I'd *like* to find a solution to the trigraph problem that (a) lets
anyone who still needs them continue using them (or something better,
if we can come up with it), but (b) doesn't impose "accidental
trigraphs" on the rest of us ("Huh??!"). But that's not going to
happen if we assume that trigraph users don't exist.
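
Until something like that exists, the "accidental trigraph" half of the problem can be sidestepped by hand. A minimal sketch, not code from this thread (the function names are made up for illustration): break up the pair of question marks so the compiler never sees a three-character trigraph sequence, whether or not it honours trigraphs.

```c
/* Returns a message ending in two question marks and an exclamation
 * mark. The escape sequence \? is an ordinary question mark, but it
 * separates the two question marks in the source text, so no trigraph
 * is formed even on a compiler that replaces them. */
const char *error_message(void)
{
    return "Unexpected error, what happened?\?!";
}

/* Same result using adjacent string literals. Trigraph replacement
 * happens before adjacent literals are concatenated, so the two
 * halves are never seen as one three-character sequence. */
const char *error_message_split(void)
{
    return "Unexpected error, what happened?" "?!";
}
```

Both functions return the same text; which spelling reads better is a matter of taste.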

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Oct 21 '06 #67
Keith Thompson wrote:
Mark McIntyre <ma**********@spamcop.net> writes:
>On Fri, 20 Oct 2006 16:40:37 -0700, in comp.lang.c , Walter Bright
<wa****@digitalmars-nospamm.com> wrote:
[...]
>>No sane person is going to invent a new character encoding
that doesn't include ASCII.
Apparently nobody told IBM.

It's unlikely *now* that anyone would invent a new encoding that's not
based on ASCII. This wasn't the case when IBM invented EBCDIC.

But we shouldn't assume anything.
You're forced to make assumptions when writing a spec for a language.
The C standard, for example, assumes that the character '?' will always
exist.
Oct 21 '06 #68
Walter Bright <wa****@digitalmars-nospamm.com> writes:
Keith Thompson wrote:
>Mark McIntyre <ma**********@spamcop.net> writes:
>>On Fri, 20 Oct 2006 16:40:37 -0700, in comp.lang.c , Walter Bright
<wa****@digitalmars-nospamm.com> wrote:
[...]
>>>No sane person is going to invent a new character encoding that
doesn't include ASCII.
Apparently nobody told IBM.
It's unlikely *now* that anyone would invent a new encoding that's
not based on ASCII. This wasn't the case when IBM invented EBCDIC.
But we shouldn't assume anything.

You're forced to make assumptions when writing a spec for a
language. The C standard, for example, assumes that the character '?'
will always exist.
You're right, I overstated it.

We shouldn't make *unnecessary* assumptions. It's necessary to assume
that some core set of characters exists; when the C standard was
introduced, the best assumption was the intersection of EBCDIC (in all
its variants), ASCII, and the various national ASCII-based sets. I
think we can safely assume that future character sets will include at
least that core set.

I *don't* think it's necessary, or safe, to assume much more than
that.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Oct 21 '06 #69
Keith Thompson wrote:
I'd *like* to find a solution to the trigraph problem that (a) lets
anyone who still needs them continue using them (or something better,
if we can come up with it), but (b) doesn't impose "accidental
trigraphs" on the rest of us ("Huh??!"). But that's not going to
happen if we assume that trigraph users don't exist.
I wouldn't have put trigraphs in the standard because it doesn't solve
the problem for EBCDIC users, as pointed out before. But nevertheless,
it is in the standard and should be implemented. As a C tool vendor, I
support the standard. As a programmer, I don't worry about being EBCDIC
compatible.
Oct 21 '06 #70
Walter Bright said:
Richard Heathfield wrote:
>Walter Bright said:
>>Richard Heathfield wrote:
The Microsoft Office guys had much the same opinion of int
- "the compiler guys wouldn't change the size of an int on us - they
know it'd break all our code", but the compiler guys changed it anyway.
Programmers knew that ints were going from 16 to 32 bits, and it was
useless to resist such a change. If it really was going to make life
impossible for the Office guys, I'm sure they had the clout to get a
special compiler built just for them. Microsoft wasn't going to endanger
the revenue stream from Office.

(The above is reasonable speculation on my part, I don't have any inside
knowledge of Microsoft.)

My source is "Writing Solid Code", written by Steve Maguire and published
by Microsoft Press. If the anecdote were not true, I'm sure Microsoft had
the clout to refuse to publish it.

Microsoft Windows and Office bring in the bulk of Microsoft's revenues.
There's just no way management would ever allow a few rogue compiler
guys to jeopardize that.
Where did "rogue compiler guys" come from? The migration of Visual C++ to
32-bit was not a maverick operation.
Going to 32 bit ints offers real advantages to programs, and I bet that
the Office guys realized those advantages outweighed going through and
fixing a few bugs here and there.
If I come across the book in the next day or two, I'll cite the relevant
passage.

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)
Oct 22 '06 #71
Richard Heathfield wrote:
Walter Bright said:
>Richard Heathfield wrote:
>>Walter Bright said:

Richard Heathfield wrote:
The Microsoft Office guys had much the same opinion of int
- "the compiler guys wouldn't change the size of an int on us - they
know it'd break all our code", but the compiler guys changed it anyway.
Programmers knew that ints were going from 16 to 32 bits, and it was
useless to resist such a change. If it really was going to make life
impossible for the Office guys, I'm sure they had the clout to get a
special compiler built just for them. Microsoft wasn't going to endanger
the revenue stream from Office.

(The above is reasonable speculation on my part, I don't have any inside
knowledge of Microsoft.)
My source is "Writing Solid Code", written by Steve Maguire and published
by Microsoft Press. If the anecdote were not true, I'm sure Microsoft had
the clout to refuse to publish it.
Microsoft Windows and Office bring in the bulk of Microsoft's revenues.
There's just no way management would ever allow a few rogue compiler
guys to jeopardize that.

Where did "rogue compiler guys" come from? The migration of Visual C++ to
32-bit was not a maverick operation.
Employees doing what they want to regardless of the best interests of
the corporation are known as rogues. A few rogues are good for the
health of a large organization, as they tend to shake things out of
complacency. Too many, and the organization comes unglued.
>Going to 32 bit ints offers real advantages to programs, and I bet that
the Office guys realized those advantages outweighed going through and
fixing a few bugs here and there.

If I come across the book in the next day or two, I'll cite the relevant
passage.
I'll look forward to it.
Oct 22 '06 #72
On Sat, 21 Oct 2006 19:35:34 GMT, in comp.lang.c , Keith Thompson
<ks***@mib.org> wrote:
>Mark McIntyre <ma**********@spamcop.net> writes:
>On Fri, 20 Oct 2006 16:40:37 -0700, in comp.lang.c , Walter Bright
<wa****@digitalmars-nospamm.com> wrote:
[...]
>>>No sane person is going to invent a new character encoding
that doesn't include ASCII.

Apparently nobody told IBM.

It's unlikely *now* that anyone would invent a new encoding that's not
based on ASCII.
I'm not even sure that's true. I can see the Chinese deciding on some
totally new encoding scheme more suitable for their needs.
--
Mark McIntyre

"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it."
--Brian Kernighan
Oct 22 '06 #73
On Sat, 21 Oct 2006 13:41:55 -0700, in comp.lang.c , Walter Bright
<wa****@digitalmars-nospamm.com> wrote:
>Mark McIntyre wrote:
>On Fri, 20 Oct 2006 16:40:37 -0700, in comp.lang.c , Walter Bright
<wa****@digitalmars-nospamm.com> wrote:
>>EBCDIC is parochialism, not ASCII. ASCII covers 99.99999% of the systems
out there.
Counted 'em have you? And did you do it by number of boxes, users or
compute power, revenue generated, importance to GDP or what?

Count 'em any way you want.
Okay, let's count by excluding all non-commercial uses and by counting
CPU cycles.
>When was the last time you saw trigraphs
Earlier this week.
>Thousands of C programmers over 40 years, ok. Just Digital Mars has
shipped over half a million C compilers over the last 6 years. How many
C compilers would you guess Microsoft has shipped? Sun? gcc? Borland?
Watcom? Intel? Apple? Green Hills?
I've no idea and nor is it germane. Someone shipping for a mainframe,
or even a mini, is probably servicing hundreds, if not thousands, of
users with a single instance.

--
Mark McIntyre

"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it."
--Brian Kernighan
Oct 22 '06 #74
Mark McIntyre wrote:
On Sat, 21 Oct 2006 19:35:34 GMT, in comp.lang.c , Keith Thompson
<ks***@mib.org> wrote:
>Mark McIntyre <ma**********@spamcop.net> writes:
>>On Fri, 20 Oct 2006 16:40:37 -0700, in comp.lang.c , Walter Bright
<wa****@digitalmars-nospamm.com> wrote:
[...]
>>>No sane person is going to invent a new character encoding
that doesn't include ASCII.
Apparently nobody told IBM.
It's unlikely *now* that anyone would invent a new encoding that's not
based on ASCII.

I'm not even sure that's true. I can see the Chinese deciding on some
totally new encoding scheme more suitable for their needs.
If their needs don't include communicating with the rest of the world or
the internet or using the C, C++, Perl, Java, Ruby, Python, or D
programming languages, then they should go for it.
Oct 23 '06 #75
Walter Bright wrote:
Mark McIntyre wrote:
>Keith Thompson <ks***@mib.org> wrote:
.... snip ...
>>>
It's unlikely *now* that anyone would invent a new encoding
that's not based on ASCII.

I'm not even sure that's true. I can see the Chinese deciding on
some totally new encoding scheme more suitable for their needs.

If their needs don't include communicating with the rest of the
world or the internet or using the C, C++, Perl, Java, Ruby,
Python, or D programming languages, then they should go for it.
But that is precisely the point. With the existing C99 standard
they could go ahead and implement such a character set and program
with it in C. No sweat.

--
Chuck F (cbfalconer at maineline dot net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net>

Oct 23 '06 #76
Walter Bright said:
Richard Heathfield wrote:
>Walter Bright said:
>>Richard Heathfield wrote:
Walter Bright said:
Richard Heathfield wrote:
[whoa! :-) ]
>>>>>The Microsoft Office guys had much the same opinion of int
>- "the compiler guys wouldn't change the size of an int on us - they
>know it'd break all our code", but the compiler guys changed it
>anyway.
Programmers knew that ints were going from 16 to 32 bits, and it was
useless to resist such a change. If it really was going to make life
impossible for the Office guys, I'm sure they had the clout to get a
special compiler built just for them. Microsoft wasn't going to
endanger the revenue stream from Office.
>
(The above is reasonable speculation on my part, I don't have any
inside knowledge of Microsoft.)
My source is "Writing Solid Code", written by Steve Maguire and
published by Microsoft Press. If the anecdote were not true, I'm sure
Microsoft had the clout to refuse to publish it.
Microsoft Windows and Office bring in the bulk of Microsoft's revenues.
There's just no way management would ever allow a few rogue compiler
guys to jeopardize that.

Where did "rogue compiler guys" come from? The migration of Visual C++ to
32-bit was not a maverick operation.

Employees doing what they want to regardless of the best interests of
the corporation are known as rogues. A few rogues are good for the
health of a large organization, as they tend to shake things out of
complacency. Too many, and the organization comes unglued.
>>Going to 32 bit ints offers real advantages to programs, and I bet that
the Office guys realized those advantages outweighed going through and
fixing a few bugs here and there.

If I come across the book in the next day or two, I'll cite the relevant
passage.

I'll look forward to it.
The following extract is taken from "Writing Solid Code", Steve Maguire,
Microsoft Press, 1993. (Steve Maguire was hired by Microsoft in 1986 to
work on Macintosh Excel.)

'Owning the Compiler Is Not Enough

Some applications groups at Microsoft are now finding that they have to
review and clean up their code because so much of it is littered with
things like +2 instead of +sizeof(int), the comparison of unsigned values
to 0xFFFF instead of to something like UINT_MAX, and the use of int in data
structures when they really meant to use a 16-bit data type.
It may seem to you that the original programmers were being sloppy, but
they thought they had good reason for thinking they could safely use +2
instead of +sizeof(int). Microsoft writes its own compilers, and that gave
programmers a false sense of security. As one programmer put it a couple of
years ago, "The compiler group would never change something that would
break all of our code."
That programmer was wrong.
The compiler group changed the size of ints (and a number of other things)
to generate faster and smaller code for Intel's 80386 and newer processors.
The compiler group didn't want to break internal code, but it was far more
important for them to remain competitive in the marketplace. After all, it
wasn't their fault that some Microsoft programmers made erroneous
assumptions.'
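
To make the excerpt concrete, here is a minimal sketch of the portable spellings it recommends. It is not code from the book; the function names are hypothetical and only `<limits.h>` and `<stddef.h>` are assumed.

```c
#include <limits.h>
#include <stddef.h>

/* Advance past one int stored in a byte buffer.
 * The form criticised in the excerpt would be p + 2, which is only
 * right when sizeof(int) happens to be 2. */
const unsigned char *skip_int(const unsigned char *p)
{
    return p + sizeof(int);
}

/* Test whether every bit of an unsigned int is set.
 * Comparing against 0xFFFF is only right when int is 16 bits wide;
 * UINT_MAX follows whatever width the compiler uses. */
int is_all_ones(unsigned int v)
{
    return v == UINT_MAX;
}
```

For the third case in the excerpt, a structure field that really must be 16 bits wide, C90 has no exact-width type; C99 added int16_t in <stdint.h> on implementations that can provide it.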

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)
Oct 23 '06 #77
CBFalconer wrote:
Walter Bright wrote:
>Mark McIntyre wrote:
>>Keith Thompson <ks***@mib.org> wrote:
... snip ...
>>>It's unlikely *now* that anyone would invent a new encoding
that's not based on ASCII.
I'm not even sure that's true. I can see the Chinese deciding on
some totally new encoding scheme more suitable for their needs.
If their needs don't include communicating with the rest of the
world or the internet or using the C, C++, Perl, Java, Ruby,
Python, or D programming languages, then they should go for it.

But that is precisely the point. With the existing C99 standard
they could go ahead and implement such a character set and program
with it in C. No sweat.
Ok, what if that new character set doesn't include a '?' character?
Oct 23 '06 #78
Walter Bright wrote:
CBFalconer wrote:
Walter Bright wrote:
Mark McIntyre wrote:
Keith Thompson <ks***@mib.org> wrote:
... snip ...
>>It's unlikely *now* that anyone would invent a new encoding
that's not based on ASCII.
I'm not even sure that's true. I can see the Chinese deciding on
some totally new encoding scheme more suitable for their needs.
If their needs don't include communicating with the rest of the
world or the internet or using the C, C++, Perl, Java, Ruby,
Python, or D programming languages, then they should go for it.
But that is precisely the point. With the existing C99 standard
they could go ahead and implement such a character set and program
with it in C. No sweat.

Ok, what if that new character set doesn't include a '?' character?
Then either no conforming C implementation will be available using that
character set, or another character takes the function of '?' (similar
to how '¥' takes the function of '\' in certain character sets).

Oct 23 '06 #79
Harald van Dijk wrote:
Walter Bright wrote:
>CBFalconer wrote:
>>Walter Bright wrote:
Mark McIntyre wrote:
Keith Thompson <ks***@mib.org> wrote:
>
... snip ...
>It's unlikely *now* that anyone would invent a new encoding
>that's not based on ASCII.
I'm not even sure that's true. I can see the Chinese deciding on
some totally new encoding scheme more suitable for their needs.
If their needs don't include communicating with the rest of the
world or the internet or using the C, C++, Perl, Java, Ruby,
Python, or D programming languages, then they should go for it.
But that is precisely the point. With the existing C99 standard
they could go ahead and implement such a character set and program
with it in C. No sweat.
Ok, what if that new character set doesn't include a '?' character?

Then either no conforming C implementation will be available using that
character set, or another character takes the function of '?' (similar
to how '¥' takes the function of '\' in certain character sets).
1) So trigraphs don't future proof C against future arbitrary encodings.

2) I've programmed Japanese computers with the '¥' for '\'. It's not so
bad for one or two characters. But for more, it rapidly becomes unusable
(otherwise why didn't EBCDIC users go this route?). Would anyone want to
program in C if every character was represented by some arbitrary
squiggle that happens to have the same bit pattern? That wouldn't even
make sense for the implementors of that encoding.

3) Please explain how C99 makes it possible to make a conforming C
implementation for RADIX50 encoding, http://en.wikipedia.org/wiki/RADIX-50.
Oct 23 '06 #80
On Sun, 22 Oct 2006 18:54:58 -0700, in comp.lang.c , Walter Bright
<wa****@digitalmars-nospamm.com> wrote:
>Mark McIntyre wrote:
>On Sat, 21 Oct 2006 19:35:34 GMT, in comp.lang.c , Keith Thompson
<ks***@mib.org> wrote:
>>Mark McIntyre <ma**********@spamcop.net> writes:
On Fri, 20 Oct 2006 16:40:37 -0700, in comp.lang.c , Walter Bright
<wa****@digitalmars-nospamm.com> wrote:
[...]
No sane person is going to invent a new character encoding
that doesn't include ASCII.
Apparently nobody told IBM.
It's unlikely *now* that anyone would invent a new encoding that's not
based on ASCII.

I'm not even sure that's true. I can see the Chinese deciding on some
totally new encoding scheme more suitable for their needs.

If their needs don't include communicating with the rest of the world
One could argue, that since there's more of them than us, we should
adapt...
>or the internet
Puhleeze. There are already many thousands of websites which are paged
entirely exclusively in non-ASCII. In a few years, I predict a
majority of websites will have non-ASCII names.
>or using the C, C++, Perl, Java, Ruby, Python, or D
programming languages, then they should go for it.
It may surprise you to learn this, but nations using Western lettering
are in a minority.
--
Mark McIntyre

"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it."
--Brian Kernighan
Oct 23 '06 #81
Keith Thompson wrote:
"Jalapeno" <ja*******@mac.com> writes:
Walter Bright wrote:
Peter Nilsson wrote:
Some test cases for you to consider...

int c = a //* ... */
b;
int d = '??''; // this is a // comment, is it translated?

A trigraph case:

char* d = "??/""; // "

but of course I've never seen trigraphs outside of a test suite.
Haven't worked in a z/OS shop before, huh? (or a Sys 370 one either)

It only takes an hour or two of working with int a??(8??); to get used
to them (and they become second nature quickly when you see them all
day long).

Fascinating. There have been raging arguments about trigraphs both
here and in comp.std.c for years. I think you're the first person
I've seen who actually *uses* them. Maybe mainframe users just don't
post to Usenet very often?
The first? Wow. I can't speak for anyone but myself. I came to Usenet
looking for information and people interested in old hardware, and
"discovered" comp.lang.c as a side effect. C isn't and never was the
most popular way to program mainframes. There are large code bases but
they are minuscule compared to the COBOL and PL/I code bases. In the
early 1980's we started using Pascal but it died fairly quickly. I have
seen a lot of C code on mainframes that is nothing more than "portable
assembler". The specific nature of the coding techniques for the MVS
system would make comp.lang.c fairly useless as a resource to those
programmers, I suppose.

In my own experience, and that of most people here, trigraphs have
caused far more problems than they solve; if a trigraph appears in a C
source file, it's far more likely to be accidental than intentional
(unless the code is deliberately obfuscated). For example:

fprintf(stderr, "Unexpected error, what happened??!\n");
When I first started in C octal numbers caused some subtle bugs. ;o)
Since there is currently no active effort to publish a new C standard,
it looks like we're stuck with the current situation for the
foreseeable future, but some of us are still trying to come up with a
better solution. For example, I've proposed *disabling* trigraphs by
default, but enabling them if there's some unique marker at the top of
the file.

For any change like this, there's a danger of breaking existing code,
but for those of us outside the IBM mainframe world, it would probably
accidentally *fix* more code than it would break.
I have neither a love nor hate for trigraphs. They are just the syntax
used. I originally responded to a poster who said he had never seen
trigraphs outside of a test suite. I have. That doesn't mean I advocate
using them. But they are in use.
Also, why do you use trigraphs rather than digraphs? They were added
in a 1995 update to the standard (I think that's right); you could
write a[8] as a<:8:> rather than as a??(8??).

Any thoughts?
Well, why didn't you tell me in 1995? ;o) Looking at the docs for
the compiler (which is C92 compliant, i.e. ANSI/ISO 9899:1990[1992]
(formerly ANSI X3.159-1989 C)) digraphs are available but the default
compiler switch is NODIGRAPH. So, since apparently nobody who has
worked here knew of digraphs, the compiler switch was never turned on.
IBM claims their newest compiler is C99 compliant, but it requires an
operating system upgrade to at least z/OS 1.7 to use that compiler. We
won't be upgrading the OS for at least another year.
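
For anyone who has not met digraphs, a minimal sketch (the function name is made up, and it assumes a compiler that accepts the 1995 amendment's digraphs, which on the compiler described above would mean turning the switch on). The spellings <: :> <% %> are alternative tokens for [ ] { }; unlike trigraphs they are left alone inside string literals.

```c
/* Sum an array that is declared and indexed with digraphs only.
 * <: and :> stand for [ and ], <% and %> stand for { and }. */
int sum_with_digraphs(void)
<%
    int a<:3:> = <% 1, 2, 3 %>;
    int total = 0;
    int i;

    for (i = 0; i < 3; i++)
        total += a<:i:>;
    return total;
%>
```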

Really, it is all just syntax. I got used to them and can go back and
forth without any trouble. YMMV, of course. Like anything in C, if you
know the pitfalls, it's easier to avoid them.

Oct 23 '06 #82

Walter Bright wrote:
Jalapeno wrote:
Character translation is only necessary if the text originates on an
ASCII system. Since all the "home grown" code here (and that supplied
by IBM) originates on EBCDIC systems absolutly no translations are
necessary and trigraphs are useful. All the world is not a PC. The
standard acknowledges that. I also understand that you don't find much
reason to have trigraphs supported. Some people use them, a lot. IBM's
Mainframes haven't disappeared, they've just been renamed "Servers" ;o).

I understand that. My (badly explained) point was that since trigraphs
failed to make C source code portable, trigraphs shouldn't have been
part of the C standard.
I am not sure I understand your point. Portability is supposed to be a
two way street.

On the IBM mainframe, the 3270 terminal (really 91.9% is terminal
emulation on windows these days) does not have certain characters from
the C basic execution character set. The 3270 has many (IMO) better
characters.

EBCDIC however has, for instance, the '[' and ']' symbols in its set of
characters. They are there. It isn't a translation problem per se. It
is just that when the C standard was being formulated there was no way
to type them from a 3270 terminal.

There was absolutely no problem taking C source code from Unix or
Windows, for example, and translating the ASCII to EBCDIC and compiling
the source. Trigraphs mean that source typed in on a 3270 can be sent
to a Unix system via EBCDIC to ASCII translation and still compile
without having to edit the source. (system specific parts excepted)
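
What the receiving compiler does with such source can be sketched in a few lines. This is an illustration of the nine standard trigraph replacements, not code from any real compiler; the function names are hypothetical.

```c
#include <stddef.h>

/* Map the third character of a trigraph to its replacement, or
 * return 0 if the sequence is not one of the nine standard ones. */
static char trigraph_replacement(char third)
{
    switch (third) {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Copy src to dst, replacing every trigraph with its single character.
 * dst must have room for strlen(src) + 1 bytes; the result is never
 * longer than the input. Returns the length of the result. */
size_t decode_trigraphs(const char *src, char *dst)
{
    size_t n = 0;

    while (*src != '\0') {
        char r = 0;

        /* src[2] is only read after src[0] and src[1] are both '?',
         * so the read never goes past the terminating '\0'. */
        if (src[0] == '?' && src[1] == '?')
            r = trigraph_replacement(src[2]);
        if (r != 0) {
            dst[n++] = r;
            src += 3;
        } else {
            dst[n++] = *src++;
        }
    }
    dst[n] = '\0';
    return n;
}
```

Fed the declaration from earlier in the thread, int a??(8??);, this yields int a[8];. When writing such a test string in C itself, spell the question marks as ?\? so the compiler building the test does not replace them first.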

I am not advocating trigraphs. I do see your point. There were
realities in the hardware in the 1980's and 1990's that were there. I
am sure IBM had a presence with the Standards committee.

Just understand that my whole existence in this thread is because you
said you had never seen trigraphs outside a test suite. They do exist.
It is legacy code, I know, but it is there. And it is updated
periodically.

Oct 23 '06 #83
Walter Bright wrote:
>
.... snip ...
>
3) Please explain how C99 makes it possible to make a conforming C
implementation for RADIX50 encoding,
http://en.wikipedia.org/wiki/RADIX-50.
Assuming you meant 'impossible', RADIX-50 can only hold 40
characters, 26 alpha, 10 numeric, space, and three others. No room
for the fundamental C char set.
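
A minimal sketch of the packing makes the limit visible. The character table below is the PDP-11 ordering as best recalled, so treat the exact table as an assumption; the function name is made up. Three characters fit in one 16-bit word because 40 * 40 * 40 = 64000.

```c
#include <string.h>

/* The 40-character repertoire, in code order (assumed PDP-11 variant):
 * space, A-Z, three punctuation marks, 0-9. */
static const char rad50_chars[] = " ABCDEFGHIJKLMNOPQRSTUVWXYZ$.%0123456789";

/* Pack exactly three characters into one word:
 * word = (c0 * 40 + c1) * 40 + c2, at most 63999.
 * Returns -1 if any character is outside the repertoire. */
long rad50_pack(const char s[3])
{
    long word = 0;
    int i;

    for (i = 0; i < 3; i++) {
        const char *p;

        /* strchr would match the table's own terminator, so reject
         * '\0' before looking it up. */
        if (s[i] == '\0')
            return -1;
        p = strchr(rad50_chars, s[i]);
        if (p == NULL)
            return -1;
        word = word * 40 + (long)(p - rad50_chars);
    }
    return word;
}
```

Anything with a lowercase letter, a bracket, a brace or a '?' is rejected, which is the point being made about C source.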

--
Chuck F (cbfalconer at maineline dot net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net>

Oct 23 '06 #84
jxh

CBFalconer wrote:
jxh wrote:
jacob navia wrote:
Recently, a heated debate started because of poor mr heathfield
was unable to compile a program with // comments.

Here is a utility for him, so that he can (at last) compile my
programs :-)
The code below is considerably larger, but it should get the job
done. It actually removes all comments.

... snip code ...

If you just want to delete all comments, my public domain uncmnt.c
is considerably shorter. ...

<http://cbfalconer.home.att.net/download/>
Very nice. It doesn't handle other cases besides trigraphs, though.

--
-- James

Oct 23 '06 #85
Mark McIntyre <ma**********@spamcop.net> writes:
On Sun, 22 Oct 2006 18:54:58 -0700, in comp.lang.c , Walter Bright
<wa****@digitalmars-nospamm.com> wrote:
>>Mark McIntyre wrote:
>>On Sat, 21 Oct 2006 19:35:34 GMT, in comp.lang.c , Keith Thompson
<ks***@mib.org> wrote:

Mark McIntyre <ma**********@spamcop.net> writes:
On Fri, 20 Oct 2006 16:40:37 -0700, in comp.lang.c , Walter Bright
<wa****@digitalmars-nospamm.com> wrote:
[...]
>No sane person is going to invent a new character encoding
>that doesn't include ASCII.
Apparently nobody told IBM.
It's unlikely *now* that anyone would invent a new encoding that's not
based on ASCII.

I'm not even sure that's true. I can see the Chinese deciding on some
totally new encoding scheme more suitable for their needs.

If their needs don't include communicating with the rest of the world

One could argue, that since there's more of them than us, we should
adapt...
>>or the internet

Puhleeze. There are already many thousands of websites which are paged
entirely exclusively in non-ASCII. In a few years, I predict a
majority of websites will have non-ASCII names.
Obviously the encodings used for Chinese and/or Japanese characters
are non-ASCII, but are they necessarily *incompatible* with ASCII?
Chinese in particular has *lots* of characters it has to represent;
reserving the first 128 codes for ASCII (including digits and
punctuation marks, which can be used in Chinese text) doesn't seem too
onerous.

Unicode is a superset of ASCII, and it can represent Chinese
characters easily enough. *If* it catches on world-wide, we can
continue to assume that the ASCII subset needed by C will be
available.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Oct 23 '06 #86
On Mon, 23 Oct 2006 20:14:19 GMT, in comp.lang.c , Keith Thompson
<ks***@mib.org> wrote:
>Obviously the encodings used for Chinese and/or Japanese characters
are non-ASCII, but are they necessarily *incompatible* with ASCII?
Quite possibly not, although people have in the past been known to
deliberately write for incompatibility, due to personal, commercial or
nationalistic reasons. This is however probably offtopic in CLC...
--
Mark McIntyre

"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it."
--Brian Kernighan
Oct 23 '06 #87
"Jalapeno" <ja*******@mac.com> writes:
[...]
EBCDIC however has, for instance, the '[' and ']' symbols in its set of
characters. They are there. It isn't a translation problem per se. It
is just that when the C standard was being formulated there was no way
to type them from a 3270 terminal.
Really? My understanding is that there are multiple versions of
EBCDIC, some of which *don't* have '[' and ']' characters. Wikipedia
<http://en.wikipedia.org/wiki/EBCDIC> shows a table of something
called CCSID 500, which does have '[' and ']', along with accented
characters (which, if I understand correctly, "classic" EBCDIC didn't
have).

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Oct 23 '06 #88
Mark McIntyre <ma**********@spamcop.net> writes:
On Mon, 23 Oct 2006 20:14:19 GMT, in comp.lang.c , Keith Thompson
<ks***@mib.org> wrote:
>>Obviously the encodings used for Chinese and/or Japanese characters
are non-ASCII, but are they necessarily *incompatible* with ASCII?

Quite possibly not, although people have in the past been known to
deliberately write for incompatibility, due to personal, commercial or
nationalistic reasons. This is however probably offtopic in CLC...
It's not entirely off-topic. The future evolution of character sets
could have a major effect on future C standards. If we can't assume,
for example, that the '?' character will always be available, we'll
have to think about alternatives. Though there's probably not much
point in inventing specific solutions until and unless we see an
actual character set that *doesn't* have '?', and that people want to
use to write C programs.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Oct 23 '06 #89
Mark McIntyre wrote:
On Sun, 22 Oct 2006 18:54:58 -0700, in comp.lang.c , Walter Bright
>>I'm not even sure that's true. I can see the Chinese deciding on some
totally new encoding scheme more suitable for their needs.
If their needs don't include communicating with the rest of the world
One could argue, that since there's more of them than us, we should
adapt...
You can argue that. But don't expect to be taken seriously. The Chinese
and Japanese regularly mix in western letters in their web pages, books,
and magazines.

You're suggesting that we (and the Chinese) should throw out the entire
computer infrastructure, and rewrite/rebuild everything from scratch.
>or the internet
Puhleeze. There are already many thousands of websites which are paged
entirely exclusively in non-ASCII. In a few years, I predict a
majority of websites will have non-ASCII names.
The internet encodings are all supersets of ascii. That is not going to
change.
>or using the C, C++, Perl, Java, Ruby, Python, or D
programming languages, then they should go for it.
It may surprise you to learn this, but nations using Western lettering
are in a minority.
How can a C99 compiler work with totally non-western lettering?
Oct 23 '06 #90
Jalapeno wrote:
Just understand that my whole existence in this thread is because you
said you had never seen trigraphs outside a test suite. They do exist.
It is legacy code, I know, but it is there. And it is updated
periodically.
I am not advocating removing trigraphs from the standard - what's done
is done. And I appreciate you joined in to say there are real trigraph
uses in the wild.
Oct 23 '06 #91
CBFalconer wrote:
Walter Bright wrote:
... snip ...
>3) Please explain how C99 makes it possible to make a conforming C
implementation for RADIX50 encoding,
http://en.wikipedia.org/wiki/RADIX-50.

Assuming you meant 'impossible', RADIX-50 can only hold 40
characters, 26 alpha, 10 numeric, space, and three others. No room
for the fundamental C char set.
Exactly. Trigraphs don't make C future proofed against arbitrary future
character encodings that don't have ascii as a subset.
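
A rough sketch of why RADIX-50 cannot carry C source. It assumes the PDP-11 flavour of the code (space, A-Z, '$', '.', '%', 0-9); other variants differ in the three non-alphanumeric characters. The names r50_chars, r50_pack and r50_missing_c_punct are made up for this illustration.

```c
#include <string.h>

/* Assumed repertoire: 40 characters, so three of them fit in
   40 * 40 * 40 = 64000 values, i.e. one 16-bit word. */
static const char r50_chars[] = " ABCDEFGHIJKLMNOPQRSTUVWXYZ$.%0123456789";

/* Packs exactly three characters into a value in 0..63999.
   Returns -1 if any of them is outside the repertoire. */
long r50_pack(const char *three)
{
    long word = 0;
    int i;

    for (i = 0; i < 3; i++) {
        /* strchr would also match the terminating '\0', so test it first */
        const char *p = three[i] ? strchr(r50_chars, three[i]) : 0;
        if (p == 0)
            return -1;
        word = word * 40 + (long)(p - r50_chars);
    }
    return word;
}

/* Counts how many of the 29 punctuation characters C requires
   have no code at all in the assumed repertoire. */
int r50_missing_c_punct(void)
{
    static const char punct[] = "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~";
    int missing = 0;
    size_t i;

    for (i = 0; punct[i] != '\0'; i++) {
        if (strchr(r50_chars, punct[i]) == 0)
            missing++;
    }
    return missing;
}
```

With this repertoire '?' itself is among the missing characters, so a trigraph could not even be spelled, and lower-case letters are absent as well.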
Oct 23 '06 #92
jxh wrote:
CBFalconer wrote:
.... snip ...
>>
If you just want to delete all comments, my public domain uncmnt.c
is considerably shorter. ...

<http://cbfalconer.home.att.net/download/>

Very nice. It doesn't handle other cases besides trigraphs, though.
What do you see missing? Apart from trigraphs.

--
Chuck F (cbfalconer at maineline dot net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net>
Oct 23 '06 #93
Keith Thompson wrote:
"Jalapeno" <ja*******@mac.com> writes:
[...]
EBCDIC however has, for instance, the '[' and ']' symbols in its set of
characters. They are there. It isn't a translation problem per se. It
is just that when the C standard was being formulated there was no way
to type them from a 3270 terminal.

Really? My understanding is that there are multiple versions of
EBCDIC, some of which *don't* have '[' and ']' characters. Wikipedia
<http://en.wikipedia.org/wiki/EBCDIC> shows a table of something
called CCSID 500, which does have '[' and ']', along with accented
characters (which, if I understand correctly, "classic" EBCDIC didn't
have).
Now you're making me get out the archives I see :o) Ok, the oldest
"green card" I have is from when I graduated from college and got my
first job in an IBM mainframe shop. Jan of 1979. I don't know if this
table is what you'd call "classic" EBCDIC but it was the version being
used in the USA for IBM, Amdahl, Hitachi, and National Advanced Systems
mainframes in January of 1979. It clearly shows '[' as decimal 173 and
hex AD, and ']' as decimal 189 and hex BD in the EBCDIC column of the
table. It has no characters in the BCDIC column in that range and
nothing in the ASCII column. My archives don't go back any farther than
that but 1979 is clearly prior to the formation of the standards
committee. So EBCDIC had the characters in its set at least since
01/1979. The 3270 still doesn't have them on its keyboard. However,
having said that, I do not have access to an APL keyboard anymore so it
is possible that EBCDIC having those characters in its set may be
related to APL and its history. Someone else will have to answer that
question :o) Wikipedia says this:

http://en.wikipedia.org/wiki/APL_%28...ng_language%29

So based on that article, which clearly shows the '[' and ']' I am
going to guess that by "classic" EBCDIC you may have meant BCDIC.

I never did program in APL on an IBM mainframe but I used to see at
least one or two APL keyboards in every shop I worked in in the '80's
and '90's. I did have one class in APL in college but it was on a
CYBER, not an IBM.

Oct 23 '06 #94
"Jalapeno" <ja*******@mac.com> writes:
[...]
Now you're making me get out the archives I see :o) Ok, the oldest
"green card" I have is from when I graduated from college and got my
first job in an IBM mainframe shop. Jan of 1979. I don't know if this
table is what you'd call "classic" EBCDIC but it was the version being
used in the USA for IBM, Amdahl, Hitachi, and National Advanced Systems
mainframes in January of 1979. It clearly shows '[' as decimal 173 and
hex AD, and ']' as decimal 189 and hex BD in the EBCDIC column of the
table. It has no characters in the BCDIC column in that range and
nothing in the ASCII column. My archives don't go back any farther than
that but 1979 is clearly prior to the formation of the standards
committee. So EBCDIC had the characters in its set at least since
01/1979. The 3270 still doesn't have them on its keyboard. However,
having said that, I do not have access to an APL keyboard anymore so it
is possible that EBCDIC having those characters in its set may be
related to APL and its history. Someone else will have to answer that
question :o) Wikipedia says this:

http://en.wikipedia.org/wiki/APL_%28...ng_language%29

So based on that article, which clearly shows the '[' and ']' I am
going to guess that by "classic" EBCDIC you may have meant BCDIC.
I actually have very little idea of what I meant by "classic" EBCDIC;
your guess is probably better than mine. My ignorance on this topic
is vast.

In this context, I suppose the most relevant version is whatever
influenced the ANSI C committee back in the 1980s. But at that time,
I think alternate ASCII-oid codes were at least as significant in
influencing the introduction of trigraphs; some national character
sets replaced some of the ASCII punctuation marks with things like
accented characters and currency symbols. I think these now have
largely been replaced by the 8-bit ISO-8859-* encodings, and by
Unicode et al.

One more data point: Unix-like systems have a command called "dd" that
converts and copies files. Some of the conversions it specifies are:

`ascii'
Convert EBCDIC to ASCII, using the conversion table specified
by POSIX. This provides a 1:1 translation for all 256 bytes.

`ebcdic'
Convert ASCII to EBCDIC. This is the inverse of the `ascii'
conversion.

`ibm'
Convert ASCII to alternate EBCDIC, using the alternate
conversion table specified by POSIX. This is not a 1:1
translation, but reflects common historical practice for `~',
`[', and `]'.

The `ascii', `ebcdic', and `ibm' conversions are mutually
exclusive.

"dd conv=ebcdic" translates '[' and ']' to 0x4a and 0x5a, respectively.
"dd conv=ibm" translates '[' and ']' to 0xad and 0xbd, respectively.

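The two bracket mappings reported above fit in a few lines of C. This is only a sketch: it knows the four byte values quoted in this post and nothing else about either table, and the enum and function names are invented for the example. It also assumes the program itself is compiled on an ASCII host.

```c
/* Which ASCII-to-EBCDIC table to use; the names mirror dd's conv= operands. */
enum ebcdic_table { CONV_EBCDIC, CONV_IBM };

/* Returns the EBCDIC byte for '[' or ']' under the chosen table,
   or -1 for any other character (only the four values quoted in
   the post are known to this sketch). */
int bracket_to_ebcdic(int ascii, enum ebcdic_table table)
{
    switch (ascii) {
    case '[':
        return table == CONV_EBCDIC ? 0x4a : 0xad;
    case ']':
        return table == CONV_EBCDIC ? 0x5a : 0xbd;
    default:
        return -1;
    }
}
```

The same source character landing on two different bytes is exactly the ambiguity that a C compiler on such a system has to be told about.
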
--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Oct 24 '06 #95
2006-10-23 <ln************@nuthaus.mib.org>,
Keith Thompson wrote:
Mark McIntyre <ma**********@spamcop.net> writes:
>On Mon, 23 Oct 2006 20:14:19 GMT, in comp.lang.c , Keith Thompson
<ks***@mib.org> wrote:
>>>Obviously the encodings used for Chinese and/or Japanese characters
are non-ASCII, but are they necessarily *incompatible* with ASCII?

Quite possibly not, although people have in the past been known to
deliberately write for incompatibility, due to personal, commercial or
nationalistic reasons. This is however probably offtopic in CLC...

It's not entirely off-topic. The future evolution of character sets
could have a major effect on future C standards. If we can't assume,
for example, that the '?' character will always be available, we'll
have to think about alternatives. Though there's probably not much
point in inventing specific solutions until and unless we see an
actual character set that *doesn't* have '?', and that people want to
use to write C programs.
The C standard does not define the graphical representation of any of
the characters the language uses. So for it to be an issue, we would
have to see an actual character set that has fewer than 100 characters.

Suppose a character set lacked ? but had $ - then we could define the
$ character as having the meaning of the ? in C. In source code
interchange, the string literal "Hello?" might become @Hello$@

62 a-zA-Z0-9
29 !#%^&*()[]{};':",.<>/?~\|-_=+
9 \a\b\f\n\r\t\v space \0

62+29+9 = 100 unique values required for C. And since C requires an [at
least] 8-bit type for char anyway, any system that used less wouldn't be
able to use its native character representation for C purposes anyway.
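
The tally can be checked mechanically. A minimal sketch, with the three groups copied from the lists above (letters and digits, punctuation, and the control characters plus space) and the null character counted separately; the array and function names are illustrative.

```c
#include <limits.h>

static const char *const groups[3] = {
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789",
    "!#%^&*()[]{};':\",.<>/?~\\|-_=+",
    "\a\b\f\n\r\t\v "
};

/* Returns the number of distinct character values in the three groups,
   plus one for the null character that terminates strings. */
int required_char_count(void)
{
    unsigned char seen[UCHAR_MAX + 1] = { 0 };
    int g, count = 1;                 /* start at 1 for '\0' */

    seen[0] = 1;
    for (g = 0; g < 3; g++) {
        const char *p;
        for (p = groups[g]; *p != '\0'; p++) {
            unsigned char uc = (unsigned char)*p;
            if (!seen[uc]) {          /* count each value only once */
                seen[uc] = 1;
                count++;
            }
        }
    }
    return count;
}
```

Because duplicates are skipped, the result is the number of unique values, whatever codes the host character set assigns to them.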
Oct 24 '06 #96

Keith Thompson wrote:
>
The `ascii', `ebcdic', and `ibm' conversions are mutually
exclusive.
I looked up the CCSID numbers to see what the 500 code page was that
was in the wikipedia article. Based on this link below, I think the
acronym EBCDIC means many things :o)

http://www-306.ibm.com/software/glob...registered.jsp

I am done now :o)

Oct 24 '06 #97
jxh wrote:
CBFalconer wrote:
.... snip ...
>>
If you just want to delete all comments, my public domain
uncmnt.c is considerably shorter. ...

<http://cbfalconer.home.att.net/download/>

Very nice. It doesn't handle other cases besides trigraphs, though.
I posted recently asking where it failed, and got no replies. I
did discover one case and corrected that. The revised code has
been posted at the above URL. It should be easily revised to
convert comments to the portable format, and I plan to do that
sometime real soon now.

As it stands I believe it is useful in generating cloaked source.
It can remove comments, id2id-20 (at same url) can revise names,
and a further utility (justify, not published) can handle the
rest. As it stands justify doesn't detect quoted strings, which
could cause problems. I may create justifyc to handle this, when
cloaking will reduce to a supervisory script. So far I have used
these things to create valid but obscure answers to homework
requests. :-)

One more useful thing for cloaking would be filters to entrigph and
detrigph.
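
A "detrigph" filter could be as small as the following sketch. It is not taken from uncmnt.c or any posted code; the function names are hypothetical, and it works on a string in memory rather than on a stream.

```c
#include <stddef.h>
#include <string.h>

/* Returns the character that a trigraph ending in `third` stands for,
   or 0 if `third` does not complete one of the nine trigraphs. */
int trigraph_value(int third)
{
    static const char from[] = "=(/)'<!>-";
    static const char to[]   = "#[\\]^{|}~";
    const char *p = third ? strchr(from, third) : 0;

    return p ? to[p - from] : 0;
}

/* Replaces every trigraph in s with its single character, in place.
   Returns the new length of the string. */
size_t detrigraph(char *s)
{
    size_t r = 0, w = 0;

    while (s[r] != '\0') {
        int c;

        if (s[r] == '?' && s[r + 1] == '?'
            && (c = trigraph_value(s[r + 2])) != 0) {
            s[w++] = (char)c;         /* three characters become one */
            r += 3;
        } else {
            s[w++] = s[r++];          /* copy anything else unchanged */
        }
    }
    s[w] = '\0';
    return w;
}
```

When feeding it literals from C source, write the question marks as ?\? so that a trigraph-aware compiler does not translate them before the function sees them. The reverse filter would walk the same two tables in the other direction.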

All of this points out the advantage of writing fully portable
source to the C90 standard. Without that you have very few
guarantees that the eventual output source remains valid on the
purchaser's system.

--
Chuck F (cbfalconer at maineline dot net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net>

Oct 24 '06 #98
jxh
CBFalconer wrote:
jxh wrote:
CBFalconer wrote:
... snip ...
>
If you just want to delete all comments, my public domain
uncmnt.c is considerably shorter. ...

<http://cbfalconer.home.att.net/download/>
Very nice. It doesn't handle other cases besides trigraphs, though.

I posted recently asking where it failed, and got no replies. ...
It fails the split comment cases, such as these:

/\
* this is a comment */

/\
/ this is a comment too

Also from the previous thread, I learned about not messing with the
preprocessor directives, so both yours and mine failed cases like:

#define COMMENT_START /* blah blah blah
#define COMMENT_END blah blah blah */

Of course, keep in mind corner cases like:

/* hey */ #define FOO \
/* bzzt */

I have fixed my program to properly deal with preprocessor directives.
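
The split comment cases come from line splicing: a backslash immediately followed by a newline is deleted before comments are recognised. A minimal sketch of that one step, working in place on a string (the function name is made up, and the ??/ trigraph spelling of the backslash is deliberately not handled here):

```c
#include <stddef.h>

/* Deletes every backslash that is immediately followed by a newline,
   together with that newline, in place. Returns the new length. */
size_t splice_lines(char *s)
{
    size_t r = 0, w = 0;

    while (s[r] != '\0') {
        if (s[r] == '\\' && s[r + 1] == '\n')
            r += 2;                   /* drop the pair, joining the lines */
        else
            s[w++] = s[r++];
    }
    s[w] = '\0';
    return w;
}
```

After this step the first example above reads as an ordinary block comment opener followed by its text, and the second as a // comment, which is why a stripper that looks for the two-character openers in the raw input misses them.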

--
- James

/*
* cstripc: A C program to strip comments from C files.
* Usage:
* cstripc [file [...]]
* cstripc [-t]
*
* The '-t' option is used for testing. It prints some pointers
* to strings that are interlaced with comment characters.
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*****************/
/**** GLOBALS ****/
/*****************/

static const char *progname;
static int debug_flag;

/**********************/
/**** MAIN PROGRAM ****/
/**********************/

static void print_usage(void);
static void print_test(void);

static FILE * open_input_file(const char *filename);
static void close_input_file(FILE *infile);
static void parse_input_file(FILE *infile);

int
main(int argc, char *argv[])
{
progname = argv[0];
if (progname == 0) {
progname = "cstripc";
}

while (argc > 1) {

if ((*argv[1] != '-') || (strcmp(argv[1], "-") == 0)) {
break;
}

if (strcmp(argv[1], "-t") == 0) {
print_test();
exit(0);
} else if (strcmp(argv[1], "-d") == 0) {
debug_flag = 1;
} else {
fprintf(stderr, "%s: Unrecognized option '%s'\n",
progname, argv[1]);
print_usage();
exit(EXIT_FAILURE);
}

--argc;
++argv;
}

if (argc <= 1) {
parse_input_file(stdin);
exit(0);
}

while (argc > 1) {
FILE *infile;

parse_input_file(infile = open_input_file(argv[1]));
close_input_file(infile);

--argc;
++argv;
}

return 0;
}

/**************************/
/**** PRINT USAGE/TEST ****/
/**************************/

static const char *usage_string =
"%s: A C program to strip comments from C files.\n"
"Usage:\n"
" %s [file [...]]\n"
" %s [-t]\n"
"\n"
"The '-t' option is used for testing. "
"It prints some pointers to strings\n"
"that are interlaced with comment characters.\n"
;

static void
print_usage(void)
{
fprintf(stderr, usage_string, progname, progname, progname);
}

static const char *a;
static const char *b;
static const char *c;

static void
print_test(void)
{
if (a) puts(a);
if (b) puts(b);
if (c) puts(c);
}

/*******************************/
/**** OPEN/CLOSE INPUT FILE ****/
/*******************************/

static const char *input_file_name;

static FILE *
open_input_file(const char *filename)
{
FILE *infile;

input_file_name = filename;

if (filename == 0) {
return 0;
}

if (strcmp(filename, "-") == 0) {
return stdin;
}

infile = fopen(filename, "r");
if (infile == 0) {
fprintf(stderr, "%s: Could not open '%s' for reading.\n",
progname, filename);
}

return infile;
}

static void
close_input_file(FILE *infile)
{
if (infile) {
if (infile != stdin) {
if (fclose(infile) == EOF)
fprintf(stderr, "%s: Could not close '%s'.\n",
progname, input_file_name);
} else {
clearerr(stdin);
}
}
}

/**************************/
/**** PARSE INPUT FILE ****/
/**************************/

typedef struct scan_state scan_state;
typedef struct scan_context scan_context;

struct scan_context {
const scan_state *ss;
char *sbuf;
unsigned sbufsz;
unsigned sbufcnt;
int bol;
};

struct scan_state {
const scan_state *(*scan)(scan_context *ctx, int input);
const char *name;
};

static scan_context initial_scan_context;

static void
parse_input_file(FILE *infile)
{
int c;
scan_context ctx;

if (infile == 0) {
return;
}

ctx = initial_scan_context;

while ((c = fgetc(infile)) != EOF) {
if (debug_flag) {
fprintf(stderr, "%s\n", ctx.ss->name);
}
ctx.ss = ctx.ss->scan(&ctx, c);
}
}

/***********************/
/**** STATE MACHINE ****/
/***********************/

/*
*
*********************************************************************
* Assume input is a syntactically correct C program.
*
* The basic algorithm is:
* Scan character by character:
* Treat trigraphs as a single character.
* If the sequence does not start a comment, emit the sequence.
* Otherwise,
* Scan character by character:
* Treat trigraphs as a single character.
* Treat the sequence '\\' '\n' as no character.
* If the sequence does not end a comment, continue consuming.
* Otherwise, emit a space, and loop back to top.
*********************************************************************
*
*/

#define SCAN_STATE_DEFINE(name) \
static const scan_state * name##_func(scan_context *ctx, int input); \
static const scan_state name##_state = { name##_func, #name }

SCAN_STATE_DEFINE(normal);
SCAN_STATE_DEFINE(normal_maybe_tri_1);
SCAN_STATE_DEFINE(normal_maybe_tri_2);
SCAN_STATE_DEFINE(normal_maybe_splice);
SCAN_STATE_DEFINE(string);
SCAN_STATE_DEFINE(string_maybe_tri_1);
SCAN_STATE_DEFINE(string_maybe_tri_2);
SCAN_STATE_DEFINE(string_maybe_splice);
SCAN_STATE_DEFINE(char);
SCAN_STATE_DEFINE(char_maybe_tri_1);
SCAN_STATE_DEFINE(char_maybe_tri_2);
SCAN_STATE_DEFINE(char_maybe_splice);
SCAN_STATE_DEFINE(slash);
SCAN_STATE_DEFINE(slash_maybe_tri_1);
SCAN_STATE_DEFINE(slash_maybe_tri_2);
SCAN_STATE_DEFINE(slash_maybe_splice);
SCAN_STATE_DEFINE(slashslash);
SCAN_STATE_DEFINE(slashslash_maybe_tri_1);
SCAN_STATE_DEFINE(slashslash_maybe_tri_2);
SCAN_STATE_DEFINE(slashslash_maybe_splice);
SCAN_STATE_DEFINE(slashsplat);
SCAN_STATE_DEFINE(slashsplat_splat);
SCAN_STATE_DEFINE(slashsplat_splat_maybe_tri_1);
SCAN_STATE_DEFINE(slashsplat_splat_maybe_tri_2);
SCAN_STATE_DEFINE(slashsplat_splat_maybe_splice);
SCAN_STATE_DEFINE(preproc);
SCAN_STATE_DEFINE(preproc_maybe_tri_1);
SCAN_STATE_DEFINE(preproc_maybe_tri_2);
SCAN_STATE_DEFINE(preproc_maybe_splice);

#define SCAN_STATE(name) (&name##_state)

static scan_context initial_scan_context = {
SCAN_STATE(normal), 0, 0, 0, 1
};

static void sbuf_append_char(scan_context *ctx, int c);
static void sbuf_append_string(scan_context *ctx, char *s);
static void sbuf_clear(scan_context *ctx);
static void sbuf_emit(scan_context *ctx);

static const scan_state *
normal_func(scan_context *ctx, int input)
{
switch (input) {
case '#': sbuf_emit(ctx);
putchar(input);
return ctx->bol ? SCAN_STATE(preproc)
: SCAN_STATE(normal);
case '?': sbuf_emit(ctx);
sbuf_append_char(ctx, input);
return SCAN_STATE(normal_maybe_tri_1);
case '"': ctx->bol = 0;
sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(string);
case '\'': ctx->bol = 0;
sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(char);
case '/': sbuf_emit(ctx);
sbuf_append_char(ctx, input);
return SCAN_STATE(slash);
case '\\': sbuf_emit(ctx);
sbuf_append_char(ctx, input);
return SCAN_STATE(normal_maybe_splice);
case '\n': ctx->bol = 1;
/* fallthrough */
case ' ':
case '\t': sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(normal);
default: ctx->bol = 0;
sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(normal);
}
}

static const scan_state *
normal_maybe_tri_1_func(scan_context *ctx, int input)
{
switch (input) {
case '?': sbuf_append_char(ctx, input);
return SCAN_STATE(normal_maybe_tri_2);
default: ctx->bol = 0;
sbuf_emit(ctx);
return SCAN_STATE(normal)->scan(ctx, input);
}
}

static const scan_state *
normal_maybe_tri_2_func(scan_context *ctx, int input)
{
switch (input) {
case '?': ctx->bol = 0;
putchar(input);
return SCAN_STATE(normal_maybe_tri_2);
case '=': sbuf_emit(ctx);
putchar(input);
return ctx->bol ? SCAN_STATE(preproc)
: SCAN_STATE(normal);
case '(':
case ')':
case '<':
case '>':
case '!':
case '\'':
case '-': ctx->bol = 0;
sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(normal);
case '/': sbuf_append_char(ctx, input);
return SCAN_STATE(normal_maybe_splice);
default: sbuf_emit(ctx);
return SCAN_STATE(normal)->scan(ctx, input);
}
}

static const scan_state *
normal_maybe_splice_func(scan_context *ctx, int input)
{
switch (input) {
default: ctx->bol = 0;
/* fallthrough */
case '\n': sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(normal);
}
}

static const scan_state *
string_func(scan_context *ctx, int input)
{
switch (input) {
case '?': sbuf_emit(ctx);
sbuf_append_char(ctx, input);
return SCAN_STATE(string_maybe_tri_1);
case '"': sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(normal);
case '\\': sbuf_emit(ctx);
sbuf_append_char(ctx, input);
return SCAN_STATE(string_maybe_splice);
default: sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(string);
}
}

static const scan_state *
string_maybe_tri_1_func(scan_context *ctx, int input)
{
switch (input) {
case '?': sbuf_append_char(ctx, input);
return SCAN_STATE(string_maybe_tri_2);
default: sbuf_emit(ctx);
return SCAN_STATE(string)->scan(ctx, input);
}
}

static const scan_state *
string_maybe_tri_2_func(scan_context *ctx, int input)
{
switch (input) {
case '?': putchar(input);
return SCAN_STATE(string_maybe_tri_2);
case '/': sbuf_append_char(ctx, input);
return SCAN_STATE(string_maybe_splice);
case '=':
case '(':
case ')':
case '<':
case '>':
case '!':
case '\'':
case '-': sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(string);
default: sbuf_emit(ctx);
return SCAN_STATE(string)->scan(ctx, input);
}
}

static const scan_state *
string_maybe_splice_func(scan_context *ctx, int input)
{
switch (input) {
case '\n':
default: sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(string);
}
}

static const scan_state *
char_func(scan_context *ctx, int input)
{
switch (input) {
case '?': sbuf_emit(ctx);
sbuf_append_char(ctx, input);
return SCAN_STATE(char_maybe_tri_1);
case '\'': sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(normal);
case '\\': sbuf_emit(ctx);
sbuf_append_char(ctx, input);
return SCAN_STATE(char_maybe_splice);
default: sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(char);
}
}

static const scan_state *
char_maybe_tri_1_func(scan_context *ctx, int input)
{
switch (input) {
case '?': sbuf_append_char(ctx, input);
return SCAN_STATE(char_maybe_tri_2);
default: sbuf_emit(ctx);
return SCAN_STATE(char)->scan(ctx, input);
}
}

static const scan_state *
char_maybe_tri_2_func(scan_context *ctx, int input)
{
switch (input) {
case '?': putchar(input);
return SCAN_STATE(char_maybe_tri_2);
case '/': sbuf_append_char(ctx, input);
return SCAN_STATE(char_maybe_splice);
case '=':
case '(':
case ')':
case '<':
case '>':
case '!':
case '\'':
case '-': sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(char);
default: sbuf_emit(ctx);
return SCAN_STATE(char)->scan(ctx, input);
}
}

static const scan_state *
char_maybe_splice_func(scan_context *ctx, int input)
{
switch (input) {
case '\n':
default: sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(char);
}
}

static const scan_state *
slash_func(scan_context *ctx, int input)
{
switch (input) {
case '?': sbuf_append_char(ctx, input);
return SCAN_STATE(slash_maybe_tri_1);
case '\\': sbuf_append_char(ctx, input);
return SCAN_STATE(slash_maybe_splice);
case '/': sbuf_clear(ctx);
return SCAN_STATE(slashslash);
case '*': sbuf_clear(ctx);
return SCAN_STATE(slashsplat);
default: sbuf_emit(ctx);
return SCAN_STATE(normal)->scan(ctx, input);
}
}

static const scan_state *
slash_maybe_tri_1_func(scan_context *ctx, int input)
{
switch (input) {
case '?': return SCAN_STATE(slash_maybe_tri_2);
default: sbuf_emit(ctx);
return SCAN_STATE(normal)->scan(ctx, input);
}
}

static const scan_state *
slash_maybe_tri_2_func(scan_context *ctx, int input)
{
switch (input) {
case '?': sbuf_emit(ctx);
sbuf_append_string(ctx, "??");
return SCAN_STATE(normal_maybe_tri_2);
case '/': sbuf_append_char(ctx, '?');
sbuf_append_char(ctx, input);
return SCAN_STATE(slash_maybe_splice);
case '=':
case '(':
case ')':
case '<':
case '>':
case '!':
case '\'':
case '-': sbuf_append_char(ctx, '?');
sbuf_append_char(ctx, input);
sbuf_emit(ctx);
return SCAN_STATE(normal);
default: sbuf_append_char(ctx, '?');
sbuf_emit(ctx);
return SCAN_STATE(normal)->scan(ctx, input);
}
}

static const scan_state *
slash_maybe_splice_func(scan_context *ctx, int input)
{
switch (input) {
case '\n': sbuf_append_char(ctx, input);
return SCAN_STATE(slash);
default: sbuf_emit(ctx);
return SCAN_STATE(normal)->scan(ctx, input);
}
}

static const scan_state *
slashslash_func(scan_context *ctx, int input)
{
/* UNUSED */ ctx = ctx;
switch (input) {
case '?': return SCAN_STATE(slashslash_maybe_tri_1);
case '\\': return SCAN_STATE(slashslash_maybe_splice);
case '\n': putchar(' ');
putchar(input);
return SCAN_STATE(normal);
default: return SCAN_STATE(slashslash);
}
}

static const scan_state *
slashslash_maybe_tri_1_func(scan_context *ctx, int input)
{
switch (input) {
case '?': return SCAN_STATE(slashslash_maybe_tri_2);
default: return SCAN_STATE(slashslash)->scan(ctx, input);
}
}

static const scan_state *
slashslash_maybe_tri_2_func(scan_context *ctx, int input)
{
switch (input) {
case '?': return SCAN_STATE(slashslash_maybe_tri_2);
case '/': return SCAN_STATE(slashslash_maybe_splice);
case '=':
case '(':
case ')':
case '<':
case '>':
case '!':
case '\'':
case '-': return SCAN_STATE(slashslash);
default: return SCAN_STATE(slashslash)->scan(ctx, input);
}
}

static const scan_state *
slashslash_maybe_splice_func(scan_context *ctx, int input)
{
switch (input) {
case '\n': return SCAN_STATE(slashslash);
default: return SCAN_STATE(slashslash)->scan(ctx, input);
}
}

static const scan_state *
slashsplat_func(scan_context *ctx, int input)
{
/* UNUSED */ ctx = ctx;
switch (input) {
case '*': return SCAN_STATE(slashsplat_splat);
default: return SCAN_STATE(slashsplat);
}
}

static const scan_state *
slashsplat_splat_func(scan_context *ctx, int input)
{
switch (input) {
case '?': return SCAN_STATE(slashsplat_splat_maybe_tri_1);
case '\\': return SCAN_STATE(slashsplat_splat_maybe_splice);
case '/': putchar(' ');
return SCAN_STATE(normal);
default: return SCAN_STATE(slashsplat)->scan(ctx, input);
}
}

static const scan_state *
slashsplat_splat_maybe_tri_1_func(scan_context *ctx, int input)
{
switch (input) {
case '?': return SCAN_STATE(slashsplat_splat_maybe_tri_2);
default: return SCAN_STATE(slashsplat)->scan(ctx, input);
}
}

static const scan_state *
slashsplat_splat_maybe_tri_2_func(scan_context *ctx, int input)
{
switch (input) {
case '/': return SCAN_STATE(slashsplat_splat_maybe_splice);
case '=':
case '(':
case ')':
case '<':
case '>':
case '!':
case '\'':
case '-': return SCAN_STATE(slashsplat);
default: return SCAN_STATE(slashsplat)->scan(ctx, input);
}
}

static const scan_state *
slashsplat_splat_maybe_splice_func(scan_context *ctx, int input)
{
switch (input) {
case '\n': return SCAN_STATE(slashsplat_splat);
default: return SCAN_STATE(slashsplat)->scan(ctx, input);
}
}

static const scan_state *
preproc_func(scan_context *ctx, int input)
{
switch (input) {
case '\\': sbuf_emit(ctx);
sbuf_append_char(ctx, input);
return SCAN_STATE(preproc_maybe_splice);
case '\n': sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(normal);
default: sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(preproc);
}
}

static const scan_state *
preproc_maybe_tri_1_func(scan_context *ctx, int input)
{
switch (input) {
case '?': sbuf_append_char(ctx, input);
return SCAN_STATE(preproc_maybe_tri_2);
default: sbuf_emit(ctx);
return SCAN_STATE(preproc)->scan(ctx, input);
}
}

static const scan_state *
preproc_maybe_tri_2_func(scan_context *ctx, int input)
{
switch (input) {
case '?': putchar(input);
return SCAN_STATE(preproc_maybe_tri_2);
case '/': sbuf_append_char(ctx, input);
return SCAN_STATE(preproc_maybe_splice);
case '=':
case '(':
case ')':
case '<':
case '>':
case '!':
case '\'':
case '-': sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(preproc);
default: sbuf_emit(ctx);
return SCAN_STATE(preproc)->scan(ctx, input);
}
}

static const scan_state *
preproc_maybe_splice_func(scan_context *ctx, int input)
{
switch (input) {
case '\n':
default: sbuf_emit(ctx);
putchar(input);
return SCAN_STATE(preproc);
}
}

/*************************/
/**** BUFFER HANDLING ****/
/*************************/

static void
sbuf_append_char(scan_context *ctx, int c)
{
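if (ctx->sbuf == 0) {
ctx->sbuf = malloc(ctx->sbufsz = 128);
if (ctx->sbuf == 0) {
fprintf(stderr, "%s: memory allocation failure\n",
progname);
exit(EXIT_FAILURE);
}
} else if (ctx->sbufcnt + 1 >= ctx->sbufsz) {
/* Grow while one slot is still free so the trailing '\0' always fits. */
char *p = realloc(ctx->sbuf, ctx->sbufsz *= 2);
if (p == 0) {
fprintf(stderr, "%s: memory allocation failure\n",
progname);
exit(EXIT_FAILURE);
}
ctx->sbuf = p;
}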

ctx->sbuf[ctx->sbufcnt++] = c;
ctx->sbuf[ctx->sbufcnt] = '\0';
}

static void
sbuf_append_string(scan_context *ctx, char *s)
{
while (*s != '\0') {
sbuf_append_char(ctx, *s++);
}
}

static void
sbuf_clear(scan_context *ctx)
{
ctx->sbufcnt = 0;
if (ctx->sbuf) {
ctx->sbuf[ctx->sbufcnt] = '\0';
}
}

static void
sbuf_emit(scan_context *ctx)
{
if (ctx->sbuf == 0 || ctx->sbufcnt == 0) {
return;
}

printf("%s", ctx->sbuf);
sbuf_clear(ctx);
}

/********************/
/**** TEST CASES ****/
/********************/

/* a comment */
/\
* a comment split */
/\
\
* a comment split twice */
/*
block comment
*/
/* comment, trailing delimiter split *\
/
/* comment, trailing delimiter split twice *\
\
/
/* comment, trailing delimiter split once, and again by trigraph *\
??/
/

static const char *a = /* comment in code "*/"Hello, "/**/"World!";
static const char *b = /\
* comment on code line split */ "Hello, " /\
\
* comment on code line split twice */ "World!";

#if 0
??/* this does not start a comment */
#endif

#define FOO1 /* don't touch this */
#define FOO2 \
/* don't touch this */

/* comment */ #define FOO3 /* don't touch this */

#define FOO4 /* don't touch
#define FOO5 this */

#if defined(__STDC__) && (__STDC__ == 1)
#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L)
//*** MORE TEST CASES ***//
/\
/ // comment split
/\
\
/ // comment split twice
static const char *c = // // comment on code line
"Hello, " /\
/ // comment on code line split
"World!" /\
\
/ // comment on code line split twice.
;

#if 0
??// this does not start a comment
#endif

// This is a // comment \
on two lines

#else
static const char *c = "STDC without STD_VERSION";
#endif
#endif
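
For anyone who wants the shape of the state machine without the trigraph and splice states, here is a cut-down sketch of the same pattern: each state is a struct holding a function pointer, and each function returns the next state. It only removes /* ... */ comments and knows nothing about strings, character constants, // comments or preprocessor lines, so it is an illustration of the technique and not a replacement for the program above. All names are invented for the example.

```c
#include <stddef.h>

typedef struct state state;
typedef struct context context;

struct context {
    char *out;      /* output buffer; caller provides strlen(input) + 1 bytes */
    size_t len;     /* characters written so far */
};

struct state {
    const state *(*step)(context *ctx, int input);
};

static const state *normal_step(context *ctx, int input);
static const state *slash_step(context *ctx, int input);
static const state *comment_step(context *ctx, int input);
static const state *star_step(context *ctx, int input);

static const state normal_state  = { normal_step };
static const state slash_state   = { slash_step };
static const state comment_state = { comment_step };
static const state star_state    = { star_step };

static void emit(context *ctx, int c)
{
    ctx->out[ctx->len++] = (char)c;
}

static const state *normal_step(context *ctx, int input)
{
    if (input == '/')
        return &slash_state;          /* hold the slash until we know more */
    emit(ctx, input);
    return &normal_state;
}

static const state *slash_step(context *ctx, int input)
{
    if (input == '*')
        return &comment_state;        /* the held slash opened a comment */
    emit(ctx, '/');                   /* it was an ordinary slash after all */
    return normal_step(ctx, input);   /* reprocess the current character */
}

static const state *comment_step(context *ctx, int input)
{
    (void)ctx;
    return input == '*' ? &star_state : &comment_state;
}

static const state *star_step(context *ctx, int input)
{
    if (input == '/') {
        emit(ctx, ' ');               /* a comment is replaced by one space */
        return &normal_state;
    }
    return comment_step(ctx, input);  /* "**" keeps us one step from the end */
}

/* Copies in to out with block comments replaced by a space.
   Returns the length of the result. */
size_t strip_block_comments(const char *in, char *out)
{
    context ctx;
    const state *st = &normal_state;

    ctx.out = out;
    ctx.len = 0;
    for (; *in != '\0'; in++)
        st = st->step(&ctx, (unsigned char)*in);
    if (st == &slash_state)
        emit(&ctx, '/');              /* flush a slash held at end of input */
    out[ctx.len] = '\0';
    return ctx.len;
}
```

Adding a feature means adding states and a few transitions, which is how the full program grows its maybe_tri and maybe_splice variants for each context.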

Oct 24 '06 #99
jxh wrote:
CBFalconer wrote:
.... snip ...
>>
I posted recently asking where it failed, and got no replies. ...

It fails the split comment cases, such as these:

/\
* this is a comment */

/\
/ this is a comment too

Also from the previous thread, I learned about not messing with the
preprocessor directives, so both yours and mine failed cases like:

#define COMMENT_START /* blah blah blah
#define COMMENT_END blah blah blah */

Of course, keep in mind corner cases like:

/* hey */ #define FOO \
/* bzzt */
Thanks, I will look into those. The #defines don't seem to be a
problem, since the second #define is within the comment and should
be ignored. i.e. you can't do that.

--
Chuck F (cbfalconer at maineline dot net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net>
Oct 25 '06 #100
