
Tokenizer interface

Do you see any counter-indication to a token extractor in the
following form?

typedef int token;
token ExtractToken(const char** String, char* Symbol);

If it finds a valid token at the start of *String, it copies it into
Symbol (unless Symbol==NULL, in which case it doesn't copy it
anywhere); it makes *String point to the first character *after* the
token, and returns a token identifier (0 if no valid token found).

Of course, there is also a function
void CreateToken(const char* Symbol, token Id);
to add tokens to the set.

The only other function I know of that behaves like this (as far as the
char** is concerned) is strtol(), which can change a pointer to point
to the first character after the parsed number.

by LjL
lj****@tiscali.it
Nov 13 '05
lj****@tiscalinet.it (Lorenzo J. Lucchini) wrote:
qe*@pobox.com (Paul Hsieh) wrote:
I'm not really a big fan of your approach. I would instead do it like
this:

int IterateTokens (const char * string,
                   int (*cb) (void * parm, size_t offset, size_t len),
                   void * parm);

With the idea that the callback function cb would be called on each
successive token. (The purpose of void * parm is to let you use a
universal callback function while retaining context specific to the
particular string you are tokenizing; i.e., IterateTokens just
passes on the parm it was given to the cb function.) This allows you
to be flexible about how you consume your tokens.
But see this expression:
$AF+MyVar+0x1234==0

"$AF" is a token in my parsing token, all is fine. So is "+".
"MyVar", though, is not a token: on it, my ExtractToken() function
returns 0 (synonym for "no [significant] token"), and Symbol becomes
"".


Huh? Then how do you continue to parse? I assume you use some other logic
to match the successive characters to a previously defined variable? But in
that case, don't you need a grammar for your variables? And does your
variable matching algorithm correctly differentiate between "My" and "MyVar"?
At this point, my parsing loop decides that what follows the "+" must
be a variable (which it treats like a "free form" string), and so it
calls ExtractFreeform(), which extracts everything up to the next
valid token it encounters ("+" in this case).
Ok, well, I use a more "classical" definition of token in which all
successive sub-portions are considered tokens, which are then given
differentiating attributes describing what they are. So let's call my kind of
tokens "meta-tokens" just to avoid confusion (so you could rename my function
IterateMetaTokens().)

The idea then would be that the callback function would first look into the
string and decide whether it's one of what *you* call tokens, or something else
(a variable, I suppose?). This logic would be analogous to whatever logic you
used to decide that a ... uh ... *symbol* becomes "". Then you would branch
into two cases -- one which parses your "tokens", and another which calls
a variation on ExtractFreeform(), I suppose.

So IterateMetaTokens() would never return something analogous to "" (i.e.,
setting len to 0 in the callback), which assumes that an empty string is
never valid in your grammar. So you would have put the simple logic of
scanning ahead the length of a variable ("Freeform"?) into
IterateMetaTokens() for when a "non-token" is detected.
Now, I could make variables like "MyVar" tokens to avoid embarrassing
ExtractToken() and having to use this ExtractFreeform() function.
Well, "MyVar" would be a meta-token. Your callback would then decide how it
further sub-classified.
But besides not working well with my program, this approach couldn't
handle the next part of the expression: "0x1234" is not a token that
ExtractToken() will ever be able to extract, so I *have* to call
ExtractFreeform() here.
I don't understand. What makes you think "0x1234" couldn't be parsed into
a meta-token in the IterateMetaTokens() function?
As I see it, your approach would choke on "non-tokens" that are
currently handled by a call to ExtractFreeform() following a failed
(return value == 0, Symbol is "") call to ExtractToken().
No -- if "token"-wise parsing fails within IterateMetaTokens(), you would
immediately try to extract a free-form token. The role of the callback
function would be to track the semantics of whatever it is that it received.
For example, if cb returns with a negative number, or other "fail
code", then you can use that to halt the iterator early (for example,
you might be able to detect that a syntax error has occurred prior to
tokenizing the whole string.)


Or I could use it to simulate the behavior of ExtractFreeform().
But this would force the calling loop (which, with your approach,
wouldn't probably be a loop at all) to:
1) do ExtractFreeform()'s job in a way not consistent with how
IterateTokens() works (it would even have to compute where in the
string parsing has stopped, unless IterateTokens returns precisely
this information)


IterateMetaTokens() would never *modify* anything (take notice of the const
in the parameter list), so it's not really relevant whether it retains its
place after returning, whether it failed or not. You can still know the last
successful point of parsing before it failed by tracking the last offset + len
passed to the callback function. You can store this info in an arbitrary
struct which is pointed to by the cookie variable: parm.
2) after parsing the free-form part of the expression, "resurrect"
IterateTokens to parse from after the location of the free-form stuff

This looks awfully complicated...
Well, parsing in general is complicated no matter what you do. There are
entire books written about it. The method I am describing to you follows the
general notions of a typical tokenizable grammar.
You can also make IterateTokens just return the number of successful
tokens that were passed to the callback, or a negative number to
indicate some kind of failure.

It also might not be necessary to '\0'-terminate the tokens as you
encounter them. Certainly if you were using bstrlib
(http://bstring.sf.net) you wouldn't want to. This gives you a way to
perform "tokenization by reference", if it's possible for you to avoid
copying the tokens for further processing (for example, if you have a
totally incremental parsing algorithm.)


I *do* already "tokenize by reference", in a sense.
Most of the time, ExtractToken() is called with Symbol==NULL, i.e. it
doesn't copy the token anywhere. All that is needed is the int value
ExtractToken() returns: this value uniquely identifies the token most
of the time, unless it is 0, in which case the parser does have to
check Symbol to see if it encountered a "" (end of string / unparsable
token) or a " " (space, or other separator, whose token id is 0 as
well).


Yes, because most of your grammar meta-tokens are 1 character in length,
right? Well, that just means that in my solution most of the time len will be
equal to 1, in which case you do a simple character test (a switch with a
default) to figure out if it's a "token" versus a variable.
Look at this code:

Operator = ExtractToken(OpTokens, Expression, NULL);
if (Operator == OP_NONE) {
    if (strlen(*Expression) == 0) Operator = OP_END; else {
        Register = ExtractToken(RegTokens, Expression, NULL);
        if (Register == REG_NONE) {
            char *Symbol = malloc(strlen(*Expression) + 1);
            ExtractFreeform(OpTokens, Expression, Symbol);
            /* ... */
/* This is taken from a part of my program, but things not relevant
   to the discussion have been stripped out. */
Well the call-back with my design would look something like:

int categorizeMetaToken (void * parm, size_t offset, size_t len) {
    if (len == 1) switch (((char *)parm)[offset]) {
        /* A token has been received? */
        case '+': /* ... etc */ ; return 0;
        case '*': /* ... etc */ ; return 0;
        /* ... etc */
        /* Nope! It's just a 1-character long variable */
        default : break;
    }
    {
        struct tagbstring Symbol;
        blk2tbstr (Symbol, (char *)parm + offset, len);
        /* Symbol now contains a tagbstring of the variable/symbol */
        /* ... etc */
        return 0; /* Success */
    }
    return -1; /* Some kind of error. */
}

If it bothers you that you have to determine that something is a one-character
token twice (once in the outer loop, and once in the callback function), you
could pass an additional parameter to the callback -- something like "int
category" or "int type". That way you wouldn't need a "default" case, and each
case of the switch could break, then go to a common "return 0" exit point.
OpTokens and RegTokens are handles to different tokenization contexts;
for simplicity, I omitted this part in the declaration I posted.
OpTokens' tokens are operators like "+" and "-", while RegTokens'
tokens are register specifiers like "$AF" and "$HL".

The algorithm basically goes:

1) Extract next token. If it's an operator, fine, otherwise
2) See if it's a register specification. If it's not even that, then
3) It must be a variable (numeric literals omitted from the code
above)

You can see that, in 1) and 2), the extracted lexeme (Symbol) isn't
copied anywhere, because such a copy is not needed to identify the
token.
Only when a token can *not* be extracted do we need to analyze an
actual string (which is extracted by ExtractFreeform(), not even
ExtractToken()).
Right -- but notice how in my solution no mallocs or copies are made *no
matter what*. But it assumes that you also use "the better string library" to
be able to achieve this.
This is not the only place or the only way in my program where I use
the tokenization function, but it should give a picture.
If you actually need '\0'
terminated strings without damaging the input string, then doing a
precisely sized malloc, memcpy, and storing a terminating '\0' in your
callback function is trivial and will do the trick
Not if IterateTokens() doesn't *know* where the token ends.


Well, the idea would be that it *should* know. But I am making classical
parsing algorithm assumptions. Though I cannot quite see how your conditions
would fail such an approach.
-- it remains for
you to then design a stack or list or vector or whatever you like to
retain all these strings (all the work being done in the callback
function).

The point is that, with a mechanism like this, you are flexible enough
to employ any policy of how you want your results from the process of
tokenization. It also separates the process of actual parsing from
the process of dealing with the results.


Yours is an interesting approach that can - I think - be most useful
in situations where the lexeme, I mean the token in its string form,
must often be copied, because an integral value (my "token id") cannot
easily identify it uniquely. It is perhaps even more useful when all
or most of the parsed tokens must be kept (in string form) in a list
somewhere.
In the latter case, your function would behave similarly to Morris
Dovey's function, except that the actual work of putting the extracted
tokens in a vector is done by the callback function in your approach,
instead of being embedded in the iterating token extractor.

But you haven't convinced me that your approach is superior in my
case.


That's because I think we have different concepts of how parsing should be
done. With me, I'm concerned that I don't know whether I want to follow a
strategy of storing the parsed metaTokens, or act on them as they are
consumed. Using a general callback mechanism like this allows me to change
that decision on a whim without modifying the code that does the actual
work of parsing metaTokens.

To me it's just a program maintenance and flexibility issue. Any strategy can
be *forced* to work. But if you find out you have to change some of your
strategy, which do you think will retain more code investment? A solution
which clearly separates the parsing of metaTokens from what those tokens
mean, or a solution which intermingles the two notions?
[...] Using a callback function would be nice to eliminate the outer
parsing loop and make the code concentrate on analysing tokens individually;
however, the outer loop would eventually still be there, together with
other strange creatures, to handle the equivalent of
ExtractFreeform().

In addition, I also use my tokenizer to parse simple interactive
commands like "run" or "info stack", besides interpreting more complex
expressions like the example I made above.
Right -- so long as the metaTokenization semantics are the same, one can
simply plug in a different callback for the different contexts (in your case
top-level command line versus in-code, presumably.)
[...] For this task, it seems to
me that the (now fairly simple) structure of my interactive command
interpreter would be sort of obfuscated by introducing a callback
function and somewhere to keep parsed tokens (or, alternatively, a
*complicated* callback function without having to keep parsed tokens
somewhere).

P.S. A question: for such a short signature as the one below, would
you find it necessary to use a standard signature marker?


No. In fact, I don't use a standard signature file or anything like that. I
actually type it by hand every time. That's because Google doesn't have any
provisions for signatures that I can tell. Also, I sometimes change my
signature if I believe it's appropriate for the context of the discussion.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/
Nov 13 '05 #11
