Bytes | Software Development & Data Engineering Community

My first Python program -- a lexer

Hello,

I started to write a lexer in Python -- my first attempt to do something
useful with Python (rather than trying out snippets from tutorials). It
is not complete yet, but I would like some feedback -- I'm a Python
newbie and it seems that, with Python, there is always a simpler and
better way to do it than you think.

### Begin ###

import re

class Lexer(object):

    def __init__( self, source, tokens ):
        self.source = re.sub( r"\r?\n|\r", "\n", source )
        self.tokens = tokens
        self.offset = 0
        self.result = []
        self.line   = 1
        self._compile()
        self._tokenize()

    def _compile( self ):
        for name, regex in self.tokens.iteritems():
            self.tokens[name] = re.compile( regex, re.M )

    def _tokenize( self ):
        while self.offset < len( self.source ):
            for name, regex in self.tokens.iteritems():
                match = regex.match( self.source, self.offset )
                if not match: continue
                self.offset += len( match.group(0) )
                self.result.append( ( name, match, self.line ) )
                self.line += match.group(0).count( "\n" )
                break
            else:
                raise Exception(
                    'Syntax error in source at offset %s' %
                    str( self.offset ) )

    def __str__( self ):
        return "\n".join(
            [ "[L:%s]\t[O:%s]\t[%s]\t'%s'" %
              ( str( line ), str( match.pos ), name, match.group(0) )
              for name, match, line in self.result ] )

# Test Example

source = r"""
Name: "Thomas", # just a comment
Age: 37
"""

tokens = {
    'T_IDENTIFIER' : r'[A-Za-z_][A-Za-z0-9_]*',
    'T_NUMBER'     : r'[+-]?\d+',
    'T_STRING'     : r'"(?:\\.|[^\\"])*"',
    'T_OPERATOR'   : r'[=:,;]',
    'T_NEWLINE'    : r'\n',
    'T_LWSP'       : r'[ \t]+',
    'T_COMMENT'    : r'(?:\#|//).*$' }

print Lexer( source, tokens )

### End ###
Greetings,
Thomas

--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)
Nov 8 '08 #1
Thomas Mlynarczyk <th****@mlynarczyk-webdesign.de> writes:
Hello,

I started to write a lexer in Python -- my first attempt to do
something useful with Python (rather than trying out snippets from
tutorials). It is not complete yet, but I would like some feedback --
I'm a Python newbie and it seems that, with Python, there is always a
simpler and better way to do it than you think.
Hi,

Adding to John's comments, I wouldn't have source as a member of the
Lexer object but as an argument of the tokenise() method (which I would
make public). The tokenise method would return what you currently call
self.result. So it would be used like this.
>>> mylexer = Lexer(tokens)
>>> mylexer.tokenise(source)
# Later:
>>> mylexer.tokenise(another_source)
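A minimal sketch of that refactoring (names follow the original post; written for modern Python, so `dict.items` and `print()` stand in for the Python 2 idioms):

```python
import re

class Lexer(object):
    def __init__(self, tokens):
        # Compile once; the same Lexer can then tokenise many sources.
        self.tokens = {name: re.compile(regex, re.M)
                       for name, regex in tokens.items()}

    def tokenise(self, source):
        # Normalise line endings, then scan left to right.
        source = re.sub(r"\r\n|\r", "\n", source)
        offset, line, result = 0, 1, []
        while offset < len(source):
            for name, regex in self.tokens.items():
                match = regex.match(source, offset)
                if not match:
                    continue
                offset += len(match.group(0))
                result.append((name, match, line))
                line += match.group(0).count("\n")
                break
            else:
                raise SyntaxError("Syntax error in source at offset %d" % offset)
        return result

mylexer = Lexer({'T_WORD': r'[A-Za-z]+', 'T_SPACE': r'[ \t]+'})
print([name for name, match, line in mylexer.tokenise("hello world")])
# → ['T_WORD', 'T_SPACE', 'T_WORD']
```

The lexer now holds only the compiled token table, so one instance can be reused across sources.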
--
Arnaud
Nov 9 '08 #2
Arnaud Delobelle schrieb:
Adding to John's comments, I wouldn't have source as a member of the
Lexer object but as an argument of the tokenise() method (which I would
make public). The tokenise method would return what you currently call
self.result. So it would be used like this.
>>> mylexer = Lexer(tokens)
>>> mylexer.tokenise(source)
>>> mylexer.tokenise(another_source)
At a later stage, I intend to have the source tokenised not all at once,
but token by token, "just in time" when the parser (yet to be written)
accesses the next token:

token = mylexer.next( 'FOO_TOKEN' )
if not token: raise Exception( 'FOO token expected.' )
# continue doing something useful with token

Where next() would return the next token (and advance an internal
pointer) *if* it is a FOO_TOKEN, otherwise it would return False. This
way, the total number of regex matchings would be reduced: Only that
which is expected is "tried out".
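A minimal sketch of that interface, assuming the token table from the original post (note that a method named `next` shadows the iterator-protocol method name in Python 2, so `expect` might be a safer choice; written for modern Python here):

```python
import re

class Lexer(object):
    def __init__(self, source, tokens):
        self.source = source
        self.tokens = {name: re.compile(regex, re.M)
                       for name, regex in tokens.items()}
        self.offset = 0

    def next(self, name):
        """Return a match for token `name` at the current position and
        advance past it; return False without consuming anything if the
        named token's regex does not match here."""
        match = self.tokens[name].match(self.source, self.offset)
        if not match:
            return False
        self.offset = match.end()
        return match

lexer = Lexer("Age: 37", {'T_IDENTIFIER': r'[A-Za-z_][A-Za-z0-9_]*',
                          'T_OPERATOR': r'[=:,;]',
                          'T_LWSP': r'[ \t]+',
                          'T_NUMBER': r'[+-]?\d+'})
token = lexer.next('T_IDENTIFIER')
if not token:
    raise Exception('T_IDENTIFIER token expected.')
print(token.group(0))   # → Age
```

Only the one expected regex is tried per call, which is exactly the saving described above.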

But otherwise, upon reflection, I think you are right and it would
indeed be more appropriate to do as you suggest.

Thanks for your feedback.

Greetings,
Thomas

--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)
Nov 9 '08 #3
On Nov 9, 8:34 pm, Dennis Lee Bieber <wlfr...@ix.netcom.com> wrote:
On Sun, 09 Nov 2008 23:33:30 +0100, Thomas Mlynarczyk
<tho...@mlynarczyk-webdesign.de> declaimed the following in
comp.lang.python:
Of course. For the actual message I would use at least the line number.
Still, the offset could be used to compute line/column in case of an
error, so I wouldn't really need to store line/column with each token,
but only the offset. And provide a method to "convert" offset values
into line/column tuples.

        Are you forcing the use of fixed length lines then?

        Otherwise, by what algorithm will you convert:

>>> data = """
... one
... two
... three
... four
... five
... supercalifragilisticexpialidocious
... seven
... eight
... nine"""
>>> ix = data.index("list")
>>> ix
39

loc = data.index("list")
print data[:loc].count("\n")-1
print loc-data[:loc].rindex("\n")-1

prints 5,14

I'm sure it's non-optimal, but it *is* an algorithm that does not
require keeping track of the start of every line...

-- Paul
Nov 10 '08 #4
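Wrapped up as a helper, Paul's computation might look like this (a sketch; `offset_to_line_col` is an invented name, and it reports 1-based coordinates rather than the 0-based ones in the session above):

```python
def offset_to_line_col(source, offset):
    """Map a character offset to a 1-based (line, column) pair.
    Scans the source up to `offset`, so it costs O(offset) -- fine if
    it runs at most once, when a syntax error is reported."""
    before = source[:offset]
    line = before.count("\n") + 1
    last_nl = before.rfind("\n")   # -1 when offset is on the first line
    col = offset - last_nl         # the -1 case still yields column 1+
    return line, col

data = "one\ntwo\nsupercalifragilisticexpialidocious"
print(offset_to_line_col(data, data.index("list")))   # → (3, 15)
```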
On Sun, 09 Nov 2008 15:53:01 +0100, Thomas Mlynarczyk wrote:
Arnaud Delobelle schrieb:
>Adding to John's comments, I wouldn't have source as a member of the
Lexer object but as an argument of the tokenise() method (which I would
make public). The tokenise method would return what you currently call
self.result. So it would be used like this.
>>>>mylexer = Lexer(tokens)
mylexer.tok enise(source)
mylexer.tok enise(another_s ource)

At a later stage, I intend to have the source tokenised not all at once,
but token by token, "just in time" when the parser (yet to be written)
accesses the next token:
You don't have to introduce a `next` method to your Lexer class. You
could just transform your `tokenize` method into a generator by replacing
``self.result.append`` with `yield`. It gives you the just-in-time part
for free without picking your algorithm apart into tiny unrelated pieces.
token = mylexer.next( 'FOO_TOKEN' )
if not token: raise Exception( 'FOO token expected.' )
# continue doing something useful with token

Where next() would return the next token (and advance an internal
pointer) *if* it is a FOO_TOKEN, otherwise it would return False. This
way, the total number of regex matchings would be reduced: Only that
which is expected is "tried out".
Python generators recently (2.5) grew a `send` method. You could use
`next` for unconditional tokenization and ``mytokenizer.send("expected
token")`` whenever you expect a special token.

See http://www.python.org/dev/peps/pep-0342/ for details.
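A sketch of how the two pieces could combine (the `expected` handshake below is one possible reading of the PEP 342 idea, not the only one; written for modern Python):

```python
import re

def tokenise(source, tokens):
    """Generator form of the lexer loop. Plain `next(gen)` yields the
    next token, whatever it is; `gen.send(name)` asks it to try only the
    named token's regex at the current position and yields None if that
    token is not there (the caller must then send/next again)."""
    compiled = {name: re.compile(regex, re.M) for name, regex in tokens.items()}
    offset, expected = 0, None
    while offset < len(source):
        candidates = [expected] if expected else compiled
        for name in candidates:
            match = compiled[name].match(source, offset)
            if match:
                offset = match.end()
                expected = yield (name, match.group(0))
                break
        else:
            if expected:                 # expected token not found here
                expected = yield None
            else:
                raise SyntaxError("no token matches at offset %d" % offset)

gen = tokenise("Age: 37", {'T_IDENTIFIER': r'[A-Za-z_]\w*',
                           'T_OPERATOR': r'[=:,;]',
                           'T_LWSP': r'[ \t]+',
                           'T_NUMBER': r'[+-]?\d+'})
print(next(gen))                 # → ('T_IDENTIFIER', 'Age')
print(gen.send('T_OPERATOR'))    # → ('T_OPERATOR', ':')
```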

HTH,

--
Robert "Stargaming " Lehmann
Nov 10 '08 #5
Robert Lehmann schrieb:
You don't have to introduce a `next` method to your Lexer class. You
could just transform your `tokenize` method into a generator by replacing
``self.result.append`` with `yield`. It gives you the just-in-time part
for free without picking your algorithm apart into tiny unrelated pieces.
Python generators recently (2.5) grew a `send` method. You could use
`next` for unconditional tokenization and ``mytokenizer.send("expected
token")`` whenever you expect a special token.
See http://www.python.org/dev/peps/pep-0342/ for details.
I will try this. Thank you for the suggestion.

Greetings,
Thomas

--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)
Nov 10 '08 #6
Paul McGuire schrieb:
loc = data.index("list")
print data[:loc].count("\n")-1
print loc-data[:loc].rindex("\n")-1

prints 5,14

I'm sure it's non-optimal, but it *is* an algorithm that does not
require keeping track of the start of every line...
Yes, I was thinking of something like this. As long as the line/column
are only needed in case of an error (i.e. at most once per script run),
I consider this more performant than keeping track of line/column for
every token.

Greetings,
Thomas

--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)
Nov 10 '08 #7
On Nov 10, 7:29 am, Thomas Mlynarczyk <tho...@mlynarczyk-webdesign.de>
wrote:
Paul McGuire schrieb:
loc = data.index("list")
print data[:loc].count("\n")-1
print loc-data[:loc].rindex("\n")-1
prints 5,14
I'm sure it's non-optimal, but it *is* an algorithm that does not
require keeping track of the start of every line...

Yes, I was thinking of something like this. As long as the line/column
are only needed in case of an error (i.e. at most once per script run),
I consider this more performant than keeping track of line/column for
every token.

Greetings,
Thomas

--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)
Just be sure to account for tabs when computing the column, which this
simple-minded algorithm does not do.

-- Paul
Nov 10 '08 #8
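A tab-aware variant of the column computation might advance to the next tab stop instead of counting each character as one (a sketch; the tab width of 8 is an assumption, and `column_with_tabs` is an invented name):

```python
def column_with_tabs(source, offset, tabsize=8):
    """1-based display column of `offset`, counting each tab as a jump
    to the next multiple of `tabsize`, the way most editors render it."""
    start = source.rfind("\n", 0, offset) + 1   # start of offset's line
    col = 0                                     # 0-based while counting
    for ch in source[start:offset]:
        col = col + tabsize - col % tabsize if ch == "\t" else col + 1
    return col + 1

print(column_with_tabs("a\tb", 2))   # → 9 : the tab jumps from col 2 to col 9
```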
Paul McGuire schrieb:
Just be sure to account for tabs when computing the column, which this
simple-minded algorithm does not do.
Another thing I had not thought of -- thanks for the hint.

Greetings,
Thomas

--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)
Nov 10 '08 #10
