473,396 Members | 1,773 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,396 software developers and data experts.

Token Parser

I'm trying to write a function to parse a Reverse Polish Notation string
from stdin and return 1 token at a time. For those of you who are unaware
an RPN string looks like this:

1 2 + 4 * 3 +

With each number being read and pushed onto a stack and, when an operator
is encountered, each number on the stack is popped off and calculated
using that operator and the result pushed onto the stack.

The problem I'm having is differentiating between for example, the +
character and the number 43. Here's what I have so far:

#include <stdio.h>

int read_token(void) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == ' ' || token == '\n') {
continue;
}
if (token == '+' || token == '-' || token == '*' || token == '/') {
break;
} else {
while ((next_token = getchar()) != ' ' && next_token != EOF) {
token *= 10;
token += next_token;
}
break;
}
}

return token;
}

What's the conventional wisdom on overcoming the ambiguity between
characters and numbers in C?

Thanks.

Nov 15 '05 #1
14 6297
Simon Morgan wrote:
I'm trying to write a function to parse a Reverse Polish Notation string
from stdin and return 1 token at a time. For those of you who are unaware
an RPN string looks like this:

1 2 + 4 * 3 +

With each number being read and pushed onto a stack and, when an operator
is encountered, each number on the stack is popped off and calculated
using that operator and the result pushed onto the stack.

The problem I'm having is differentiating between for example, the +
character and the number 43. Here's what I have so far:
A man who provides a problem description and the problem code. A rarity
that it is pleasant to encounter :-)
#include <stdio.h>

int read_token(void) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == ' ' || token == '\n') {
continue;
}
if (token == '+' || token == '-' || token == '*' || token == '/') {
break;
} else {
You probably want to check that it was a digit here as opposed to a
letter or some other punctuation and error if it was not. The standard
isdigit function could be of use!
while ((next_token = getchar()) != ' ' && next_token != EOF) {
Again here.
token *= 10;
token += next_token;
}
break;
}
}

return token;
}
What about explicit positive and negative numbers? e.g.
+43 -3 *
What's the conventional wisdom on overcoming the ambiguity between
characters and numbers in C?


I would say that returning a token say the best option is to return a
token value indicating a number (possibly separate tokens for integer,
real etc) and return the value separately. You could either return a
structure with both items (not unreasonable) or return one of them be
passing a pointer to where it should be stored.
--
Flash Gordon
Living in interesting times.
Although my email address says spam, it is real and I read it.
Nov 15 '05 #2
On Sat, 23 Jul 2005 09:55:04 GMT, Simon Morgan
<me@privacy.net> wrote:
What's the conventional wisdom on overcoming the ambiguity between
characters and numbers in C?


Well, it's obvious that you aren't returning enough information. You
need to return the type of the token (operator or number) as well as its
value (type of operator, value of the number). Generally, either return
the type and vakue separately, or decide that certain values are "out of
bounds" (negative numbers, for instance) and assign the operators in
that range.

In your case you seem to handle only positive integers, so you could say
for instance:

#define PLUS (-'+')
#define MINUS (-'-')
#define TIMES (-'*')

etc. then when you get an operator set token = -token.

Your code is also not at all safe. You are saying that anything not one
of your 'known' tokens or space is a digit, that will give you very
strange results with "abc def +" or even "123.567". You really need to
decide whether the token is:

An operator
A number
Something else

and possibly give an error if it's "something else".

Chris C
Nov 15 '05 #3
Simon Morgan wrote:
I'm trying to write a function to parse a Reverse Polish Notation string
from stdin and return 1 token at a time. For those of you who are unaware
an RPN string looks like this:

1 2 + 4 * 3 +

With each number being read and pushed onto a stack and, when an operator
is encountered, each number on the stack is popped off and calculated
using that operator and the result pushed onto the stack.

The problem I'm having is differentiating between for example, the +
character and the number 43. Here's what I have so far:

#include <stdio.h>

int read_token(void) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == ' ' || token == '\n') {
continue;
}
if (token == '+' || token == '-' || token == '*' || token == '/') {
break;
} else {
while ((next_token = getchar()) != ' ' && next_token != EOF) {
token *= 10;
token += next_token;
}
break;
}
}

return token;
}


There are several problems here. First, you are confusing tokens
with input characters. While a token can consist of a single
character, it is really a higher-level lexical object that has
both a type and a value. Your code tries to encode both type and
value in a single integer, but since one of your token types is
integer, you are bound to have problems. The usual solution is
to handle the token type and its value separately.

You are also confusing the characters '0'-'9' with the numbers 0-9.

You should take a look at lex or flex to get some ideas on
lexing. You should also stop by comp.compilers.

Here is a sample to give you the idea. It is hardly robust, but
should get you started.

/* --------------- snip --------------- */
#include <stdio.h>
#include <ctype.h>

#define UNKNOWN 1
#define NUMBER 2
#define OP 3

union val {
int num;
char op;
char ch;
};

int token(union val *val)
{
int tok = 0;
int c;

while ( (c = getchar()) != EOF ) {
if ( isspace(c) ) {
/* do nothing */
}
else if ( isdigit(c) ) {
tok = NUMBER;
val->num = c - '0';
while ( isdigit(c = getchar()) ) {
val->num *= 10;
val->num += c - '0';
}
if ( !isdigit(c) ) {
ungetc(c, stdin);
}
break;
}
else if ( c == '+' || c == '*' || c == '-' || c == '/' ) {
tok = OP;
val->op = c;
break;
}
else {
tok = UNKNOWN;
val->ch = c;
break;
}
}

return tok;
}

int main(int argc, char *argv[])
{
union val val;
int tok;

while ( (tok = token(&val)) ) {
switch (tok) {
case NUMBER:
printf("NUMBER = %d\n", val.num);
break;
case OP:
printf("OP = %c\n", val.op);
break;
case UNKNOWN:
printf("unexpected character: %c\n", val.ch);
break;
default:
printf("unknown token type: %d\n", tok);
}
}

return 0;
}
/* --------------- snip --------------- */
--
Thomas M. Sommers -- tm*@nj.net -- AB2SB

Nov 15 '05 #4
"T.M. Sommers" wrote:
Simon Morgan wrote:
I'm trying to write a function to parse a Reverse Polish Notation
string from stdin and return 1 token at a time. For those of you
who are unaware an RPN string looks like this:

1 2 + 4 * 3 +

With each number being read and pushed onto a stack and, when an
operator is encountered, each number on the stack is popped off
and calculated using that operator and the result pushed onto the
stack.

The problem I'm having is differentiating between for example,
the + character and the number 43. Here's what I have so far:
.... snip ...
There are several problems here. First, you are confusing tokens
with input characters. While a token can consist of a single
character, it is really a higher-level lexical object that has
both a type and a value. Your code tries to encode both type and
value in a single integer, but since one of your token types is
integer, you are bound to have problems. The usual solution is
to handle the token type and its value separately.

You are also confusing the characters '0'-'9' with the numbers 0-9.

You should take a look at lex or flex to get some ideas on
lexing. You should also stop by comp.compilers.

Here is a sample to give you the idea. It is hardly robust, but
should get you started.

/* --------------- snip --------------- */


Here is the beginnings of a simplified lexer I am designing. When
done I expect it will be usable for various languages (if I ever
get around to finishing it). The accompanying code is around 350
lines so far, so I am refraining from inflicting it on the
newsgroup. I could be persuaded to publish the code if someone
really wants it. Besides it doesn't work to my satisfaction yet.
When done it will return reals as strings with an appropriate
identifier, and I am not yet anywhere near satisfied about this.
So far it also doesn't handle escaped characters, trigraphs, and
such. The lexer will be the sole communication between the input
file and the program.

It is intended to work with languages that can resolve things on
the basis of one symbol and one char lookahead.

/* File lexer.h
This is intended to be a language adjustable module that will
open a specified file on initialization, and then return tokens
read from that file. The tokens may be identifiers, reserved
words, punctuation or EOF. They are accompanied by a
'spelling'.

The initialization phase specifies the language, which in turn
specifies the form of comments (to be ignored).

Users of the lexer can detect identifiers that are actually
reserved words, and take appropriate action.
*/

/* opaque type for internal use */
struct lexcontrol;

/* The user needs to free this and the contained ident */
typedef struct lextoken {
int nextch;
int sym; /* Also defines string delimiter */
int idlgh; /* A non-zero in these signals */
int numlgh; /* that an id, number, or string */
int stglgh; /* is present in ident */
char *ident; /* always return fresh in a new token */
} lextoken, *lextokenp;

typedef enum lang {unknown, C89, C99, Pascal} lexlang;

/* Initialize the system with an opened text file */
struct lexcontrol* lexinit(FILE *f, lexlang lang);

/* Return the next token, which is a pointer to a record */
lextokenp lexnext(struct lexcontrol* lx);

--
Chuck F (cb********@yahoo.com) (cb********@worldnet.att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!
Nov 15 '05 #5

"Simon Morgan" <me@privacy.net> wrote in message
news:pa***************************@privacy.net...
I'm trying to write a function to parse a Reverse Polish Notation string
from stdin and return 1 token at a time. For those of you who are unaware
an RPN string looks like this:

1 2 + 4 * 3 +

With each number being read and pushed onto a stack and, when an operator
is encountered, each number on the stack is popped off and calculated
using that operator and the result pushed onto the stack.

The problem I'm having is differentiating between for example, the +
character and the number 43. Here's what I have so far:

#include <stdio.h>

int read_token(void) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == ' ' || token == '\n') {
continue;
}
if (token == '+' || token == '-' || token == '*' || token == '/') {
break;
} else {
while ((next_token = getchar()) != ' ' && next_token != EOF) {
token *= 10;
token += next_token;
}
break;
}
}

return token;
}

What's the conventional wisdom on overcoming the ambiguity between
characters and numbers in C?

Thanks.

I suggest that as an approach to implementing this problem you look into
using a state machine. It's an elegant way to solve problems that are
combinatorily restricted.
Nov 15 '05 #6
On Sat, 23 Jul 2005 09:55:04 +0000, Simon Morgan wrote:
What's the conventional wisdom on overcoming the ambiguity between
characters and numbers in C?


Thanks to everybody for their help, some of it is a bit above me at the
moment because I'm still learning the language and programming in general
(who isn't?) but it's food for thought.

I decided to go with the passing in a pointer and setting it to indicate
the type approach because structures haven't been covered in the book I'm
working from yet.

The code I have is far from perfect but obviously works a lot better than
what I had before. For those who are interested:

#include <stdio.h>

int read_token(char *type) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == '+' || token == '-' || token == '*' || token == '/') {
*type = 'c';
break;
}
if (isdigit(token)) {
*type = 'n';
token -= 48;
while (isdigit(next_token = getchar())) {
token *= 10;
token += next_token - 48;
}
break;
}
}

return token;
}

Nov 15 '05 #7
Simon Morgan wrote:
On Sat, 23 Jul 2005 09:55:04 +0000, Simon Morgan wrote:

What's the conventional wisdom on overcoming the ambiguity between
characters and numbers in C?

Thanks to everybody for their help, some of it is a bit above me at the
moment because I'm still learning the language and programming in general
(who isn't?) but it's food for thought.

I decided to go with the passing in a pointer and setting it to indicate
the type approach because structures haven't been covered in the book I'm
working from yet.

The code I have is far from perfect but obviously works a lot better than
what I had before. For those who are interested:

#include <stdio.h>

int read_token(char *type) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == '+' || token == '-' || token == '*' || token == '/') {
*type = 'c';
break;
}
if (isdigit(token)) {
*type = 'n';
token -= 48;


token -= '0'; /* ASCII is not the only character set */
while (isdigit(next_token = getchar())) {
token *= 10;
token += next_token - 48;
token += next_token - '0'; /* ASCII is not the only
character set */
}
break;
}
else {
/* Put some code here to handle bad input */
}
}
return token;
}

--
Flash Gordon
Living in interesting times.
Although my email address says spam, it is real and I read it.
Nov 15 '05 #8
Simon Morgan wrote:
On Sat, 23 Jul 2005 09:55:04 +0000, Simon Morgan wrote:
What's the conventional wisdom on overcoming the ambiguity between
characters and numbers in C?

.... snip ...
The code I have is far from perfect but obviously works a lot
better than what I had before. For those who are interested:

#include <stdio.h>

int read_token(char *type) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == '+' || token == '-' || token == '*' || token == '/') {
*type = 'c';
break;
}
if (isdigit(token)) {
*type = 'n';
token -= 48;
while (isdigit(next_token = getchar())) {
token *= 10;
token += next_token - 48;
}
break;
}
}
return token;
}


That is pretty clean newbie code - nice job. I have two comments.
First, get rid of the magic number 48. You should use '0' here,
which will not vary with the underlying char set. Second is exit
conditions.

You are basically ignoring all input except the stuff that
interests you. When you get a symbol such as '+' you return that,
and whatever follows is available for a further call. That means
you can resolve input such as "+123" correctly. But what about
"123+4", for example. You have discarded the char that follows 123
forever.

C has a handy mechanism for handling this, called ungetc. You use
it to put a char back for future reading. It is limited to a
single level, i.e. you can't push back more than one char between
gets. So add the line:

ungetc(next_token, stdin);

just before the break in the number accumulation routine. Remember
that getchar is really a call on getc(stdin) without bothering to
specify stdin file.

Other comments:

Later you can worry about detecting overflows in numeric input.

You may find that interchanging the methods of returning type and
value is more useful later. You will generally base the next
decision on the returned type, not the value. So the prototype may
work better as:

int next_token_type(int *token_read);

--
Chuck F (cb********@yahoo.com) (cb********@worldnet.att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!

Nov 15 '05 #9
Simon Morgan wrote:

I decided to go with the passing in a pointer and setting it to indicate
the type approach because structures haven't been covered in the book I'm
working from yet.
You might find it worthwhile to read ahead a bit.
The code I have is far from perfect but obviously works a lot better than
what I had before. For those who are interested:

#include <stdio.h>
Don't forget:

#include <ctype.h> /* For isdigit(). */
int read_token(char *type) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == '+' || token == '-' || token == '*' || token == '/') {
*type = 'c';
break;
}
if (isdigit(token)) {
*type = 'n';
token -= 48;
while (isdigit(next_token = getchar())) {


Use this instead:

while ( (c = getchar()) != EOF && isdigit(c) ) {

This was an error in the code I posted earlier. You don't want
to isdigit(EOF).

--
Thomas M. Sommers -- tm*@nj.net -- AB2SB

Nov 15 '05 #10
"T.M. Sommers" wrote:
Simon Morgan wrote:
.... snip ...
while ((token = getchar()) != EOF) {
if (token == '+' || token == '-' || token == '*' || token == '/') {
*type = 'c';
break;
}
if (isdigit(token)) {
*type = 'n';
token -= 48;
while (isdigit(next_token = getchar())) {


Use this instead:

while ( (c = getchar()) != EOF && isdigit(c) ) {

This was an error in the code I posted earlier. You don't want
to isdigit(EOF).


This is an error. He wants to find the non-digit operators also.
There is no problem passing EOF to isdigit, it will reply 'no'.

--
Chuck F (cb********@yahoo.com) (cb********@worldnet.att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!

Nov 15 '05 #11
CBFalconer wrote:
"T.M. Sommers" wrote:
Use this instead:

while ( (c = getchar()) != EOF && isdigit(c) ) {

This was an error in the code I posted earlier. You don't want
to isdigit(EOF).


This is an error. He wants to find the non-digit operators also.
There is no problem passing EOF to isdigit, it will reply 'no'.


You are quite right. I was under the impression that the
argument had to be an unsigned char, but the standard also allows
EOF.

--
Thomas M. Sommers -- tm*@nj.net -- AB2SB

Nov 15 '05 #12
"T.M. Sommers" wrote:
CBFalconer wrote:
"T.M. Sommers" wrote:
Use this instead:

while ( (c = getchar()) != EOF && isdigit(c) ) {

This was an error in the code I posted earlier. You don't want
to isdigit(EOF).


This is an error. He wants to find the non-digit operators also.
There is no problem passing EOF to isdigit, it will reply 'no'.


You are quite right. I was under the impression that the
argument had to be an unsigned char, but the standard also allows
EOF.


The error part was that you changed the logic so that it could no
longer return the various operators, skipblanks, etc. The other is
unnecessary, rather than an error.

--
Chuck F (cb********@yahoo.com) (cb********@worldnet.att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!

Nov 15 '05 #13
CBFalconer wrote:
"T.M. Sommers" wrote:
CBFalconer wrote:
"T.M. Sommers" wrote:

Use this instead:

while ( (c = getchar()) != EOF && isdigit(c) ) {

This was an error in the code I posted earlier. You don't want
to isdigit(EOF).

This is an error. He wants to find the non-digit operators also.
There is no problem passing EOF to isdigit, it will reply 'no'.


You are quite right. I was under the impression that the
argument had to be an unsigned char, but the standard also allows
EOF.


The error part was that you changed the logic so that it could no
longer return the various operators, skipblanks, etc. The other is
unnecessary, rather than an error.


I cut and pasted from my own code, which used a different
variable for the input character ('c' vice 'next_token'), which
was an error, but other than that I don't see any error in the
above line (except that one should perhaps guarantee that c can
be cast to unsigned char). The loop controlled by the while is
over digits after the first, and will stop when and only when c
is not a digit. The only thing the EOF test changes is that
isdigit() will not be called for EOF, which, as you pointed out,
is unnecessary.

--
Thomas M. Sommers -- tm*@nj.net -- AB2SB

Nov 15 '05 #14
On Mon, 25 Jul 2005 16:28:42 +0000, CBFalconer wrote:
That is pretty clean newbie code - nice job.
Thanks.
I have two comments.


<snip>

Thanks for the heads up on the bug and ungetc and I really should know
better than hard coding the ASCII value!

--
"Being a social outcast helps you stay | 'f*******@tznvy.pbz'.encode('rot-13')
concentrated on the really important |
things, like thinking and hacking." | [ RENT THIS SPACE ]
- Eric S. Raymond |

Nov 15 '05 #15

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

5
by: Andrew James | last post by:
Gentlemen, I'm running into a problem whilst testing the parsing of a language I've created with TPG . It seems that for some reason, TPG balks when I try to parse an expression whose first...
9
by: Peter Ammon | last post by:
As we know, due to C++'s "longest match" rule, the >> token causes headaches when working with nested templates, e.g. vector<vector<int>> will not parse correctly without inserting a space...
4
by: learning_C++ | last post by:
Hi, I try to use input = new istrstream(argv,strlen(argv)); in my code, but it always says error: " error: parse error before `(' token" please help me! Thanks, #include <map>
4
by: lisa | last post by:
I have an XML file that starts like this: <?xml version="1.0" encoding="ISO-8859-1" xmlns:fn="http://www.w3.org/2005/xpath-functions"?> <Authors> <Author> <ID>2</ID>...
4
by: jvictor118 | last post by:
I've been using the xml.sax.handler module to do event-driven parsing of XML files in this python application I'm working on. However, I keep having really pesky invalid token exceptions....
2
by: Shrutisinha | last post by:
I am getting this error in unix .. syntax error near unexpected token `(' when i am running my parser , can anybody help me plzz .. i am new to all this thanks
1
by: matias.cornejo | last post by:
I was looking de db2diag.log (DB2 V7, AIX) and I have this error and It's creating every 5 minutes the file .000 PID:312050(db2agent (idle)) Appid:none drda_as sqljp_ddm_dict_search Pdrobe:413...
2
by: =?Utf-8?B?Y2FzaGRlc2ttYWM=?= | last post by:
Hi, has anyone come across this error before: This is an unexpected token. The expected token is 'NAME' I am getting it when trying to send an xml file across a web service layer. The...
4
by: =?Utf-8?B?TmljayBBcmRsaWU=?= | last post by:
Is there a way to access the PSVI value of xs:token types using C# (.NET Framework 2.0)? This issue has been around for some time (see...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.