
My first Python program -- a lexer

Hello,

I started to write a lexer in Python -- my first attempt to do something
useful with Python (rather than trying out snippets from tutorials). It
is not complete yet, but I would like some feedback -- I'm a Python
newbie and it seems that, with Python, there is always a simpler and
better way to do it than you think.

### Begin ###

import re

class Lexer(object):
    def __init__( self, source, tokens ):
        # Normalise all line endings (CRLF and lone CR) to "\n".
        self.source = re.sub( r"\r\n|\r", "\n", source )
        self.tokens = tokens
        self.offset = 0
        self.result = []
        self.line = 1
        self._compile()
        self._tokenize()

    def _compile( self ):
        # Replace each pattern string by its compiled regex object.
        for name, regex in self.tokens.iteritems():
            self.tokens[name] = re.compile( regex, re.M )

    def _tokenize( self ):
        # Try every token regex at the current offset; the for/else
        # raises if none of them matches there.
        while self.offset < len( self.source ):
            for name, regex in self.tokens.iteritems():
                match = regex.match( self.source, self.offset )
                if not match: continue
                self.offset += len( match.group(0) )
                self.result.append( ( name, match, self.line ) )
                self.line += match.group(0).count( "\n" )
                break
            else:
                raise Exception(
                    'Syntax error in source at offset %s' %
                    str( self.offset ) )

    def __str__( self ):
        return "\n".join(
            [ "[L:%s]\t[O:%s]\t[%s]\t'%s'" %
              ( str( line ), str( match.pos ), name, match.group(0) )
              for name, match, line in self.result ] )

# Test Example

source = r"""
Name: "Thomas", # just a comment
Age: 37
"""

tokens = {
    'T_IDENTIFIER' : r'[A-Za-z_][A-Za-z0-9_]*',
    'T_NUMBER'     : r'[+-]?\d+',
    'T_STRING'     : r'"(?:\\.|[^\\"])*"',
    'T_OPERATOR'   : r'[=:,;]',
    'T_NEWLINE'    : r'\n',
    'T_LWSP'       : r'[ \t]+',
    'T_COMMENT'    : r'(?:\#|//).*$' }

print Lexer( source, tokens )

### End ###
Greetings,
Thomas

--
Just because many people are wrong doesn't mean they are right!
(Coluche)
Nov 8 '08 #1
14 Replies


Thomas Mlynarczyk <th****@mlynarczyk-webdesign.de> writes:
Hello,

I started to write a lexer in Python -- my first attempt to do
something useful with Python (rather than trying out snippets from
tutorials). It is not complete yet, but I would like some feedback --
I'm a Python newbie and it seems that, with Python, there is always a
simpler and better way to do it than you think.
Hi,

Adding to John's comments, I wouldn't have source as a member of the
Lexer object but as an argument of the tokenise() method (which I would
make public). The tokenise method would return what you currently call
self.result. So it would be used like this.
>>> mylexer = Lexer(tokens)
>>> mylexer.tokenise(source)
# Later:
>>> mylexer.tokenise(another_source)
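
For illustration, here is a minimal sketch of that interface, reusing
the matching logic from your post (untested, so take it as a sketch
only, not as a drop-in replacement):

import re

class Lexer(object):

    def __init__(self, tokens):
        # Compile the patterns once, up front.
        self.tokens = dict((name, re.compile(regex, re.M))
                           for name, regex in tokens.iteritems())

    def tokenise(self, source):
        # Returns what you currently store in self.result.
        result = []
        offset, line = 0, 1
        while offset < len(source):
            for name, regex in self.tokens.iteritems():
                match = regex.match(source, offset)
                if not match:
                    continue
                result.append((name, match, line))
                line += match.group(0).count("\n")
                offset += len(match.group(0))
                break
            else:
                raise Exception('Syntax error in source at offset %d' % offset)
        return result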
--
Arnaud
Nov 9 '08 #2

Arnaud Delobelle schrieb:
Adding to John's comments, I wouldn't have source as a member of the
Lexer object but as an argument of the tokenise() method (which I would
make public). The tokenise method would return what you currently call
self.result. So it would be used like this.
>>> mylexer = Lexer(tokens)
>>> mylexer.tokenise(source)
>>> mylexer.tokenise(another_source)
At a later stage, I intend to have the source tokenised not all at once,
but token by token, "just in time" when the parser (yet to be written)
accesses the next token:

token = mylexer.next( 'FOO_TOKEN' )
if not token: raise Exception( 'FOO token expected.' )
# continue doing something useful with token

Where next() would return the next token (and advance an internal
pointer) *if* it is a FOO_TOKEN, otherwise it would return False. This
way, the total number of regex matchings would be reduced: Only that
which is expected is "tried out".

But otherwise, upon reflection, I think you are right and it would
indeed be more appropriate to do as you suggest.

Thanks for your feedback.

Greetings,
Thomas

Nov 9 '08 #3

On Nov 9, 8:34 pm, Dennis Lee Bieber <wlfr...@ix.netcom.com> wrote:
On Sun, 09 Nov 2008 23:33:30 +0100, Thomas Mlynarczyk
<tho...@mlynarczyk-webdesign.de> declaimed the following in
comp.lang.python:
Of course. For the actual message I would use at least the line number.
Still, the offset could be used to compute line/column in case of an
error, so I wouldn't really need to store line/column with each token,
but only the offset. And provide a method to "convert" offset values
into line/column tuples.

Are you forcing the use of fixed length lines then?

Otherwise, by what algorithm will you convert:

>>> data = """
... one
... two
... three
... four
... five
... supercalifragilisticexpialidocious
... seven
... eight
... nine"""
>>> ix = data.index("list")
>>> ix
39

loc = data.index("list")
print data[:loc].count("\n")-1
print loc-data[:loc].rindex("\n")-1

prints 5,14

I'm sure it's non-optimal, but it *is* an algorithm that does not
require keeping track of the start of every line...
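
Wrapped up as a helper, it might look like this (just a sketch; unlike
the interactive example above it does not subtract one for the leading
newline in the sample data, and it copes with offsets on the very first
line, where rfind() returns -1):

def offset_to_linecol(data, offset):
    # 0-based line: count the newlines before the offset.
    line = data.count("\n", 0, offset)
    # 0-based column: distance from the last newline before the offset;
    # rfind() returns -1 on the first line, so the +1 handles that too.
    col = offset - (data.rfind("\n", 0, offset) + 1)
    return line, col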

-- Paul
Nov 10 '08 #4

On Sun, 09 Nov 2008 15:53:01 +0100, Thomas Mlynarczyk wrote:
Arnaud Delobelle schrieb:
>Adding to John's comments, I wouldn't have source as a member of the
Lexer object but as an argument of the tokenise() method (which I would
make public). The tokenise method would return what you currently call
self.result. So it would be used like this.
>>> mylexer = Lexer(tokens)
>>> mylexer.tokenise(source)
>>> mylexer.tokenise(another_source)

At a later stage, I intend to have the source tokenised not all at once,
but token by token, "just in time" when the parser (yet to be written)
accesses the next token:
You don't have to introduce a `next` method to your Lexer class. You
could just transform your `tokenize` method into a generator by replacing
``self.result.append`` with `yield`. It gives you the just-in-time part
for free, without picking your algorithm apart into tiny unrelated pieces.
token = mylexer.next( 'FOO_TOKEN' )
if not token: raise Exception( 'FOO token expected.' )
# continue doing something useful with token

Where next() would return the next token (and advance an internal
pointer) *if* it is a FOO_TOKEN, otherwise it would return False. This
way, the total number of regex matchings would be reduced: Only that
which is expected is "tried out".
Python generators recently (2.5) grew a `send` method. You could use
`next` for unconditional tokenization and ``mytokenizer.send("expected
token")`` whenever you expect a special token.

See http://www.python.org/dev/peps/pep-0342/ for details.
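
A sketch of the generator version of the method -- the same matching
loop as in your original post, assuming ``self.tokens`` already holds
the compiled regexes:

def tokenize(self, source):
    # Same loop as before, but each token is handed out with ``yield``
    # as soon as it is found, instead of being appended to a list.
    offset, line = 0, 1
    while offset < len(source):
        for name, regex in self.tokens.iteritems():
            match = regex.match(source, offset)
            if not match:
                continue
            yield name, match, line
            line += match.group(0).count("\n")
            offset += len(match.group(0))
            break
        else:
            raise Exception('Syntax error in source at offset %d' % offset)

The parser then just iterates: ``for name, match, line in
mylexer.tokenize(source): ...`` -- the ``send()``-based variant would
additionally need the generator to read the value of the ``yield``
expression, as described in the PEP.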

HTH,

--
Robert "Stargaming" Lehmann
Nov 10 '08 #5

Robert Lehmann schrieb:
You don't have to introduce a `next` method to your Lexer class. You
could just transform your `tokenize` method into a generator by replacing
``self.result.append`` with `yield`. It gives you the just-in-time part
for free, without picking your algorithm apart into tiny unrelated pieces.
Python generators recently (2.5) grew a `send` method. You could use
`next` for unconditional tokenization and ``mytokenizer.send("expected
token")`` whenever you expect a special token.
See http://www.python.org/dev/peps/pep-0342/ for details.
I will try this. Thank you for the suggestion.

Greetings,
Thomas

Nov 10 '08 #6

Paul McGuire schrieb:
loc = data.index("list")
print data[:loc].count("\n")-1
print loc-data[:loc].rindex("\n")-1

prints 5,14

I'm sure it's non-optimal, but it *is* an algorithm that does not
require keeping track of the start of every line...
Yes, I was thinking of something like this. As long as the line/column
are only needed in case of an error (i.e. at most once per script run),
I consider this more performant than keeping track of line/column for
every token.

Greetings,
Thomas

Nov 10 '08 #7

On Nov 10, 7:29 am, Thomas Mlynarczyk <tho...@mlynarczyk-webdesign.de>
wrote:
Paul McGuire schrieb:
loc = data.index("list")
print data[:loc].count("\n")-1
print loc-data[:loc].rindex("\n")-1
prints 5,14
I'm sure it's non-optimal, but it *is* an algorithm that does not
require keeping track of the start of every line...

Yes, I was thinking of something like this. As long as the line/column
are only needed in case of an error (i.e. at most once per script run),
I consider this more performant than keeping track of line/column for
every token.

Greetings,
Thomas

Just be sure to account for tabs when computing the column, which this
simple-minded algorithm does not do.
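
One way to handle them (a sketch, assuming a fixed tab size such as 8)
is to expand tabs on the offending line only before measuring:

def column(data, offset, tabsize=8):
    # Start of the line containing the offset.
    start = data.rfind("\n", 0, offset) + 1
    # Expanding tabs in just this slice gives the visual column, since
    # the slice begins at the start of the line.
    return len(data[start:offset].expandtabs(tabsize))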

-- Paul
Nov 10 '08 #8


Paul McGuire schrieb:
Just be sure to account for tabs when computing the column, which this
simple-minded algorithm does not do.
Another thing I had not thought of -- thanks for the hint.

Greetings,
Thomas

Nov 10 '08 #10

John Machin schrieb:
Single-character tokens like "<" may be more efficiently handled by
doing a dict lookup after failing to find a match in the list of
(name, regex) tuples.
Yes, I will keep that in mind. For the time being, I will use only
regexes to keep the code simpler. Later, or when the need for a speedup
arises, I can use your suggestion for optimizing my code. Depending on
the tokens, it might even be better to use the dict lookup right away
and the regex as a secondary means for more complex stuff.
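
Something like this rough sketch is what I have in mind (the names are
made up, and the single-character table would of course be derived from
the token definitions):

# Single-character tokens, keyed by the character itself.
SINGLE_CHARS = { '=': 'T_OPERATOR', ':': 'T_OPERATOR',
                 ',': 'T_OPERATOR', ';': 'T_OPERATOR' }

def match_at( self, source, offset ):
    ch = source[offset]
    if ch in SINGLE_CHARS:
        # O(1) dict lookup, no regex involved.
        return SINGLE_CHARS[ch], ch
    for name, regex in self.token_list:    # (name, compiled regex) pairs
        match = regex.match( source, offset )
        if match:
            return name, match.group(0)
    return None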

[Mutually exclusive tokens = no ambiguities = same input -> same output]
So what? That is useless knowledge.
For the lexer, perhaps. Not for the user. An ambiguous lexer will be of
no use.
It is the ambiguous cases that you
need to be concerned with.
Exactly. In which way does that contradict what I said?

[Using dict]
No, not at all. The point is that you were not *using* any of the
mapping functionality of the dict object, only ancillary methods like
iteritems -- hence, you should not have been using a dict at all.
I /could/ have done it with a list of tuples. I use no functionality
that /only/ a dict can do. So using a dict here is like using a truck
for transporting a single sheet of paper?

Greetings,
Thomas

Nov 10 '08 #11

On Nov 11, 8:35 am, Thomas Mlynarczyk <tho...@mlynarczyk-webdesign.de>
wrote:
[Using dict]
No, not at all. The point is that you were not *using* any of the
mapping functionality of the dict object, only ancillary methods like
iteritems -- hence, you should not have been using a dict at all.

I /could/ have done it with a list of tuples. I use no functionality
that /only/ a dict can do. So using a dict here is like using a truck
for transporting a single sheet of paper?
You are getting closer. A better analogy is that using a dict is like
transporting passengers along an autobahn in an aeroplane or
helicopter that never leaves the ground.
Nov 10 '08 #12

John Machin schrieb:
You are getting closer. A better analogy is that using a dict is like
transporting passengers along an autobahn in an aeroplane or
helicopter that never leaves the ground.
It is not a bad idea to transport passengers in an airplane, but then
the airplane should not follow an autobahn, but use the shortest way --
at an appropriate altitude. Translated back to Python dicts, this would
mean that using a dict for my purposes is a good idea, but that I do not
make use of its full capabilities -- in other words, I should rewrite my
code -- still using dict but in a better way. Or do you mean that for 10
kilometers of autobahn, an airplane would be overkill?

Maybe I am a bit biased by my PHP background, but { name: regex, ... }
looks simpler to me than [ ( name, regex ), ... ], because the former is
not a nested structure, while the latter would be a 2D-array in PHP.

Suppose I use the dict and I want to access the regex associated with
the token named "tokenname" (that is, no iteration, but a single
access). I could simply write tokendict["tokenname"]. But with the list
of tuples, I can't think of an equally easy way to do that. But then, as
a beginner, I might be underestimating Python.

Greetings,
Thomas

Nov 11 '08 #13

Thomas Mlynarczyk wrote:
John Machin schrieb:
>You are getting closer. A better analogy is that using a dict is like
transporting passengers along an autobahn in an aeroplane or
helicopter that never leaves the ground.

It is not a bad idea to transport passengers in an airplane, but then
the airplane should not follow an autobahn, but use the shortest way --
at an appropriate altitude. Translated back to Python dicts, this would
mean that using a dict for my purposes is a good idea, but that I do not
make use of its full capabilities -- in other words, I should rewrite my
code -- still using dict but in a better way. Or do you mean that for 10
kilometers of autobahn, an airplane would be overkill?
The latter.
Maybe I am a bit biased by my PHP background, but { name: regex, ... }
looks simpler to me than [ ( name, regex ), ... ], because the former is
not a nested structure, while the latter would be a 2D-array in PHP.
But it's a question of the properties of the objects. Because lists are
ordered it's appropriate to use them when you want to iterate over them
and no random access is required.

If you need random access by key and no particular ordering is required
for the iteration then you should use a dict.

Forget "simplicity" until you know how the different objects are
implemented under the hood.

A good first rule is that simple, readable Python will usually be
efficient. As far as execution efficiency goes I'd be very surprised if
the dict solution was faster, but even if it were I would still prefer
the list-based solution because it's more appropriate to the problem.

"Premature optimization is the root of all evil" ...
Suppose I use the dict and I want to access the regex associated with
the token named "tokenname" (that is, no iteration, but a single
access). I could simply write tokendict["tokenname"]. But with the list
of tuples, I can't think of an equally easy way to do that. But then, as
a beginner, I might be underestimating Python.
But when do you want to do that? There's no point inventing use cases -
they should be demonstrated needs.

The only time your original code needs to use the token name to retrieve
the value is for your _compile() method, but assuming the passed-in
tokens were of the form [(name, pattern), ...] you could just as easily
throw the method away and in __init__() write

self.tokens = [(name, re.compile(pat)) for name, pat in tokens]

Or simply pass compiled token patterns in in the first place when they
are necessary ... then the caller has the option of not bothering to
optimize in the first place!

The advantage of the list-based approach is that it at least allows of
the possibility that you can observe the relative frequencies of the
tokens' appearance in real code, and then optimize the ordering to give
best (mean) performance (by putting the commonest token first) should
tokenization become a program hotspot.

With a dict you have no such opportunity, because the ordering is
determined by the implementation and not by your data structure.
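
For example (a sketch, assuming you have collected per-token counts from
representative input):

def reorder_tokens(token_list, counts):
    # Put the most frequently seen token names first, so the common
    # cases match on the earliest regex attempts.
    return sorted(token_list,
                  key=lambda pair: counts.get(pair[0], 0),
                  reverse=True)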

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/

Nov 11 '08 #14

Steve Holden schrieb:
>Suppose I use the dict and I want to access the regex associated with
the token named "tokenname" (that is, no iteration, but a single
access). I could simply write tokendict["tokenname"]. But with the list
of tuples, I can't think of an equally easy way to do that. But then, as
a beginner, I might be underestimating Python.
But when do you want to do that? There's no point inventing use cases -
they should be demonstrated needs.
Well, I had been thinking about further reducing the number of regex
matchings needed. So I wanted to modify my lexer not to tokenize the
whole input at once, but only try to grab the next token from the input
"just in time" / on demand. For that I was thinking of having a next()
method like this:

def next( self, nameOfExpectedToken ):
    regex = self.getRegexByTokenName( nameOfExpectedToken )
    match = regex.match( self.source, self.offset )
    if not match: return False
    line = self.line
    self.line += match.group(0).count( "\n" )
    self.offset += len( match.group(0) )
    return ( nameOfExpectedToken, match, line )

I'm not sure if this is a good idea, but it looks like one to me. The
problem is the first line of the method which retrieves the regex
associated with the given token name. Using a dict, I could simply write

regex = self.tokendict[nameOfExpectedToken]

But with a list I suppose I wouldn't get away without a loop, which I
assume is more expensive than the dict lookup.
Or simply pass compiled token patterns in in the first place when they
are necessary ... then the caller has the option of not bothering to
optimize in the first place!
That would be an option. But shouldn't it be the lexer that takes care
of optimizing its own work as much as it can without the caller's
assistance? After all, the caller should not need to know about the
internal workings of the lexer.

[Optimizing performance by putting most frequent tokens first]
With a dict you have no such opportunity, because the ordering is
determined by the implementation and not by your data structure.
True. Still, I should be able to gain even better performance with my
above approach using a next() function, as this would completely
eliminate all "useless" matching (like trying to match FOO where no foo
is allowed).
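
(Thinking about it some more, I could probably have both: keep the
ordered list of (name, regex) pairs for full scans and build a
name -> regex dict from it once in the constructor. A sketch, reusing
the names from my code above:)

import re

class Lexer(object):

    def __init__( self, tokens ):
        # 'tokens' is a list of (name, pattern) pairs, in matching order.
        self.token_list = [ ( name, re.compile( pat, re.M ) )
                            for name, pat in tokens ]
        # By-name index for next() -- a dict built from the same pairs.
        self.tokendict = dict( self.token_list )
        self.offset = 0
        self.line = 1

    def getRegexByTokenName( self, name ):
        return self.tokendict[name]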

Greetings,
Thomas

Nov 12 '08 #15
