473,802 Members | 1,996 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

My first Python program -- a lexer

Hello,

I started to write a lexer in Python -- my first attempt to do something
useful with Python (rather than trying out snippets from tutorials). It
is not complete yet, but I would like some feedback -- I'm a Python
newbie and it seems that, with Python, there is always a simpler and
better way to do it than you think.

### Begin ###

import re

class Lexer(object):
def __init__( self, source, tokens ):
self.source = re.sub( r"\r?\n|\r\n ", "\n", source )
self.tokens = tokens
self.offset = 0
self.result = []
self.line = 1
self._compile()
self._tokenize( )

def _compile( self ):
for name, regex in self.tokens.ite ritems():
self.tokens[name] = re.compile( regex, re.M )

def _tokenize( self ):
while self.offset < len( self.source ):
for name, regex in self.tokens.ite ritems():
match = regex.match( self.source, self.offset )
if not match: continue
self.offset += len( match.group(0) )
self.result.app end( ( name, match, self.line ) )
self.line += match.group(0). count( "\n" )
break
else:
raise Exception(
'Syntax error in source at offset %s' %
str( self.offset ) )

def __str__( self ):
return "\n".join(
[ "[L:%s]\t[O:%s]\t[%s]\t'%s'" %
( str( line ), str( match.pos ), name, match.group(0) )
for name, match, line in self.result ] )

# Test Example

source = r"""
Name: "Thomas", # just a comment
Age: 37
"""

tokens = {
'T_IDENTIFIER' : r'[A-Za-z_][A-Za-z0-9_]*',
'T_NUMBER' : r'[+-]?\d+',
'T_STRING' : r'"(?:\\.|[^\\"])*"',
'T_OPERATOR' : r'[=:,;]',
'T_NEWLINE' : r'\n',
'T_LWSP' : r'[ \t]+',
'T_COMMENT' : r'(?:\#|//).*$' }

print Lexer( source, tokens )

### End ###
Greetings,
Thomas

--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)
Nov 8 '08
14 1847
John Machin schrieb:
Single-character tokens like "<" may be more efficiently handled by
doing a dict lookup after failing to find a match in the list of
(name, regex) tuples.
Yes, I will keep that in mind. For the time being, I will use only
regexes to keep the code simpler. Later, or when the need for a speedup
arises, I can use your suggestion for optimizing my code. Depending on
the tokens, it might even be better to use the dict lookup right away
and the regex as a secondary means for more complex stuff.

[Mutually exclusive tokens = no ambiguities = same input -same output]
So what? That is useless knowledge.
For the lexer, perhaps. Not for the user. An ambiguous lexer will be of
no use.
It is the ambiguous cases that you
need to be concerned with.
Exactly. In which way does that contradict what I said?

[Using dict]
No, not at all. The point is that you were not *using* any of the
mapping functionality of the dict object, only ancillary methods like
iteritems -- hence, you should not have been using a dict at all.
I /could/ have done it with a list of tuples. I use no functionality
that /only/ a dict can do. So using a dict here is like using a truck
for transporting a single sheet of paper?

Greetings,
Thomas

--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)
Nov 10 '08 #11
On Nov 11, 8:35*am, Thomas Mlynarczyk <tho...@mlynarc zyk-webdesign.de>
wrote:
[Using dict]
No, not at all. The point is that you were not *using* any of the
mapping functionality of the dict object, only ancillary methods like
iteritems -- hence, you should not have been using a dict at all.

I /could/ have done it with a list of tuples. I use no functionality
that /only/ a dict can do. So using a dict here is like using a truck
for transporting a single sheet of paper?
You are getting closer. A better analogy is that using a dict is like
transporting passengers along an autobahn in an aeroplane or
helicopter that never leaves the ground.
Nov 10 '08 #12
John Machin schrieb:
You are getting closer. A better analogy is that using a dict is like
transporting passengers along an autobahn in an aeroplane or
helicopter that never leaves the ground.
It is not a bad idea to transport passengers in an airplane, but then
the airplane should not follow an autobahn, but use the shortest way --
at an appropriate altitude. Translated back to Python dicts, this would
mean that using a dict for my purposes is a good idea, but that I do not
make use of its full capabilities -- in other words, I should rewrite my
code -- still using dict but in a better way. Or do you mean that for 10
kilometers of autobahn, an airplane would be overkill?

Maybe I am a bit biased by my PHP background, but { name: regex, ... }
looks simpler to me than [ ( name, regex ), ... ], because the former is
not a nested structure, while the latter would be a 2D-array in PHP.

Suppose I use the dict and I want to access the regex associatetd with
the token named "tokenname" (that is, no iteration, but a single
access). I could simple write tokendict["tokenname"]. But with the list
of tuples, I can't think of an equally easy way to do that. But then, as
a beginner, I might be underestimating Python.

Greetings,
Thomas

--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)
Nov 11 '08 #13
Thomas Mlynarczyk wrote:
John Machin schrieb:
>You are getting closer. A better analogy is that using a dict is like
transporting passengers along an autobahn in an aeroplane or
helicopter that never leaves the ground.

It is not a bad idea to transport passengers in an airplane, but then
the airplane should not follow an autobahn, but use the shortest way --
at an appropriate altitude. Translated back to Python dicts, this would
mean that using a dict for my purposes is a good idea, but that I do not
make use of its full capabilities -- in other words, I should rewrite my
code -- still using dict but in a better way. Or do you mean that for 10
kilometers of autobahn, an airplane would be overkill?
The latter.
Maybe I am a bit biased by my PHP background, but { name: regex, ... }
looks simpler to me than [ ( name, regex ), ... ], because the former is
not a nested structure, while the latter would be a 2D-array in PHP.
But it's a question of the properties of the objects. Because lists are
ordered it's appropriate to use them when you want to iterate over them
and no random access is required.

If you need random access by key and no particular ordering is required
for the iteration then you should use a dict.

Forget "simplicity " until you know how the different objects are
implemented under the hood.

A good first rule is that simple, readable Python will usually be
efficient. As far as execution efficiency goes I'd be very surprised if
the dict solution was faster, but even if it were I would still prefer
the list-based solution because it's more appropriate to the problem.

"Premature optimization is the root of all evil" ...
Suppose I use the dict and I want to access the regex associatetd with
the token named "tokenname" (that is, no iteration, but a single
access). I could simple write tokendict["tokenname"]. But with the list
of tuples, I can't think of an equally easy way to do that. But then, as
a beginner, I might be underestimating Python.
But when do you want to do that? There's no point inventing use cases -
they should be demonstrated needs.

The only time your original code needs to use the token name to retrieve
the value is for your compile() method, but assuming the passed-in
tokens was of the form [(name, pattern), ...] you could just as easily
throw the method away and in __init__() write

self.tokens = [(name, re.compile(pat) ) for name, pat in tokens]

Or simply pass compiled token patterns in in the first place when they
are necessary ... then the caller has the option of not bothering to
optimize in the first place!

The advantage of the list-based approach is that it at least allows of
the possibility that you can observe the relative frequencies of the
tokens' appearance in real code, and then optimize the ordering to give
best (mean) performance (by putting the commonest token first) should
tokenization become a program hotspot.

With a dict you have no such opportunity, because the ordering is
determined by the implementation and not by your data structure.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/

Nov 11 '08 #14
Steve Holden schrieb:
>Suppose I use the dict and I want to access the regex associatetd with
the token named "tokenname" (that is, no iteration, but a single
access). I could simple write tokendict["tokenname"]. But with the list
of tuples, I can't think of an equally easy way to do that. But then, as
a beginner, I might be underestimating Python.
But when do you want to do that? There's no point inventing use cases -
they should be demonstrated needs.
Well, I had been thinking about further reducing the number of regex
matchings needed. So I wanted to modify my lexer not to tokenize the
whole input at once, but only try to grab the next token from the input
"just in time" / on demand. For that I was thinking of having a next()
method like this:

def next( self, nameOfExpectedT oken ):
regex = self.getRegexBy TokenName( nameOfExpectedT oken )
match = regex.match( self.source, self.offset )
if not match: return False
line = self.line
self.line += match.group(0). count( "\n" )
self.offset += len( match.group(0) )
return ( nameOfExpectedT oken, match, line )

I'm not sure if this is a good idea, but it looks like one to me. The
problem is the first line of the method which retrieves the regex
associated with the given token name. Using a dict, I could simply write

regex = self.tokendict[nameOfExpectedT oken]

But with a list I suppose I wouldn't get away without a loop. Which I
assume is more expensive that the dict.
Or simply pass compiled token patterns in in the first place when they
are necessary ... then the caller has the option of not bothering to
optimize in the first place!
That would be an option. But shouldn't it be the lexer who takes care of
optimizing its own work as much as it can do without the caller's
assistance? After all, the caller should not need to know about the
internal workings of the lexer.

[Optimizing performance by putting most frequent tokens first]
With a dict you have no such opportunity, because the ordering is
determined by the implementation and not by your data structure.
True. Still, I should be able to gain even better performance with my
above approach using a next() function, as this would completely
eliminate all "useless" matching (like trying to match FOO where no foo
is allowed).

Greetings,
Thomas

--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)
Nov 12 '08 #15

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
3505
by: John J. Lee | last post by:
Are there any parser / lexer generators useable from both CPython and Java? I don't mind much if the Python-useable output is in Python or C (as long as the C can be wrapped automatically, of course). If I could run the same source files through two different tools, that would be just as good. I'm aware that I could use ANTLR from Jython or perhaps Python+JPE, but I was looking for something accessible from CPython without having Java...
699
34272
by: mike420 | last post by:
I think everyone who used Python will agree that its syntax is the best thing going for it. It is very readable and easy for everyone to learn. But, Python does not a have very good macro capabilities, unfortunately. I'd like to know if it may be possible to add a powerful macro system to Python, while keeping its amazing syntax, and if it could be possible to add Pythonistic syntax to Lisp or Scheme, while keeping all of the...
3
5516
by: Simon Foster | last post by:
Anyone have any experience or pointers to how to go about creating a parser lexer for assemble in Python. I was thinking of using PLY but wonder whether it's too heavyweight for what I want. Anyone have any thoughts? -- Simon Foster Somewhere in the West of England
14
2642
by: Viktor Rosenfeld | last post by:
Hi, I need to create a parser for a Python project, and I'd like to use process kinda like lex/yacc. I've looked at various parsing packages online, but didn't find anything useful for me: - PyLR seems promising but is for Python 1.5 - Yappy seems promising, but I couldn't get it to work. It doesn't even compile the main example in it's documentation - mxTexttools is way complicated. I'd like something that I can give a BNF
75
3908
by: David MacQuigg | last post by:
Seems like we need a simple way to extend Python syntax that doesn't break existing syntax or clash with any other syntax in Python, is easy to type, easy to read, and is clearly distinct from the "base" syntax. Seems like we could put the @ symbol to good use in these situations. Examples: print @(separator = None) x, y, z @x,y:x*x+y*y -- anonymous function
2
1418
by: Limin Fu | last post by:
Hello, Is there any technical description on internet of how python is designed? Or can somebody give a short description about this? I'm just curious. Thanks in advance, Limin
4
7146
by: Jerry Sievers | last post by:
Dear Pythonists; Curious if there exists in Python package(s) for use as lexer/parser for implementation of language grammars? Already using cmd.py from the standard distro for it's basic features but wishing for much more advanced capability. As such, I refer to flex/bison because though complex they are general purpose and very useful.
6
1347
by: vedrandekovic | last post by:
Hello, I am trying to make a program for 3D modelling with "programming".And I want make my own program commands, for example when user type code in my program: "<<koristiti>OS"- (THIS IS MY IMAGINARY EXAMPLE OF KEYWORD), my program must write this code in some user file, but my
8
1180
by: MartinRinehart | last post by:
I've got a pointer to a position in a line of code that contains either a digit or a period (decimal point). I've got this comment: Numbers are one of these: integers: digit+ 0xhex_digit+ decimals: digit+.digit*digit+] .digit+digit+]
0
9562
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
10536
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
0
10304
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
10285
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
10063
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
9114
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
1
7598
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
5494
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
1
4270
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.