Bytes | Software Development & Data Engineering Community

My first Python program -- a lexer

Hello,

I started to write a lexer in Python -- my first attempt to do something
useful with Python (rather than trying out snippets from tutorials). It
is not complete yet, but I would like some feedback -- I'm a Python
newbie and it seems that, with Python, there is always a simpler and
better way to do it than you think.

### Begin ###

import re

class Lexer(object):

    def __init__( self, source, tokens ):
        self.source = re.sub( r"\r?\n|\r", "\n", source )
        self.tokens = tokens
        self.offset = 0
        self.result = []
        self.line   = 1
        self._compile()
        self._tokenize()

    def _compile( self ):
        for name, regex in self.tokens.iteritems():
            self.tokens[name] = re.compile( regex, re.M )

    def _tokenize( self ):
        while self.offset < len( self.source ):
            for name, regex in self.tokens.iteritems():
                match = regex.match( self.source, self.offset )
                if not match: continue
                self.offset += len( match.group(0) )
                self.result.append( ( name, match, self.line ) )
                self.line += match.group(0).count( "\n" )
                break
            else:
                raise Exception(
                    'Syntax error in source at offset %s' %
                    str( self.offset ) )

    def __str__( self ):
        return "\n".join(
            [ "[L:%s]\t[O:%s]\t[%s]\t'%s'" %
              ( str( line ), str( match.pos ), name, match.group(0) )
              for name, match, line in self.result ] )

# Test Example

source = r"""
Name: "Thomas", # just a comment
Age: 37
"""

tokens = {
    'T_IDENTIFIER' : r'[A-Za-z_][A-Za-z0-9_]*',
    'T_NUMBER'     : r'[+-]?\d+',
    'T_STRING'     : r'"(?:\\.|[^\\"])*"',
    'T_OPERATOR'   : r'[=:,;]',
    'T_NEWLINE'    : r'\n',
    'T_LWSP'       : r'[ \t]+',
    'T_COMMENT'    : r'(?:\#|//).*$' }

print Lexer( source, tokens )

### End ###
Greetings,
Thomas

--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)
Nov 8 '08 #1
Thomas Mlynarczyk <th****@mlynarczyk-webdesign.de> writes:
Hello,

I started to write a lexer in Python -- my first attempt to do
something useful with Python (rather than trying out snippets from
tutorials). It is not complete yet, but I would like some feedback --
I'm a Python newbie and it seems that, with Python, there is always a
simpler and better way to do it than you think.
Hi,

Adding to John's comments, I wouldn't have source as a member of the
Lexer object but as an argument of the tokenise() method (which I would
make public). The tokenise method would return what you currently call
self.result. So it would be used like this.
>>> mylexer = Lexer(tokens)
>>> mylexer.tokenise(source)
# Later:
>>> mylexer.tokenise(another_source)
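A minimal sketch of that refactoring (names follow the original post; written for modern Python, so `dict.items` and `print()` stand in for the Python 2 idioms):

```python
import re

class Lexer(object):
    def __init__(self, tokens):
        # Compile once; the same Lexer can then tokenise many sources.
        self.tokens = {name: re.compile(regex, re.M)
                       for name, regex in tokens.items()}

    def tokenise(self, source):
        # Normalise line endings, then scan left to right.
        source = re.sub(r"\r\n|\r", "\n", source)
        offset, line, result = 0, 1, []
        while offset < len(source):
            for name, regex in self.tokens.items():
                match = regex.match(source, offset)
                if not match:
                    continue
                offset += len(match.group(0))
                result.append((name, match, line))
                line += match.group(0).count("\n")
                break
            else:
                raise SyntaxError("Syntax error in source at offset %d" % offset)
        return result

mylexer = Lexer({'T_WORD': r'[A-Za-z]+', 'T_SPACE': r'[ \t]+'})
print([name for name, match, line in mylexer.tokenise("hello world")])
# → ['T_WORD', 'T_SPACE', 'T_WORD']
```

The lexer now holds only the compiled token table, so one instance can be reused across sources.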
--
Arnaud
Nov 9 '08 #2
Arnaud Delobelle schrieb:
Adding to John's comments, I wouldn't have source as a member of the
Lexer object but as an argument of the tokenise() method (which I would
make public). The tokenise method would return what you currently call
self.result. So it would be used like this.
>>> mylexer = Lexer(tokens)
>>> mylexer.tokenise(source)
>>> mylexer.tokenise(another_source)
At a later stage, I intend to have the source tokenised not all at once,
but token by token, "just in time" when the parser (yet to be written)
accesses the next token:

token = mylexer.next( 'FOO_TOKEN' )
if not token: raise Exception( 'FOO token expected.' )
# continue doing something useful with token

Where next() would return the next token (and advance an internal
pointer) *if* it is a FOO_TOKEN, otherwise it would return False. This
way, the total number of regex matchings would be reduced: Only that
which is expected is "tried out".
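A minimal sketch of that interface, assuming the token table from the original post (note that a method named `next` shadows the iterator-protocol method name in Python 2, so `expect` might be a safer choice; written for modern Python here):

```python
import re

class Lexer(object):
    def __init__(self, source, tokens):
        self.source = source
        self.tokens = {name: re.compile(regex, re.M)
                       for name, regex in tokens.items()}
        self.offset = 0

    def next(self, name):
        """Return a match for token `name` at the current position and
        advance past it; return False without consuming anything if the
        named token's regex does not match here."""
        match = self.tokens[name].match(self.source, self.offset)
        if not match:
            return False
        self.offset = match.end()
        return match

lexer = Lexer("Age: 37", {'T_IDENTIFIER': r'[A-Za-z_][A-Za-z0-9_]*',
                          'T_OPERATOR': r'[=:,;]',
                          'T_LWSP': r'[ \t]+',
                          'T_NUMBER': r'[+-]?\d+'})
token = lexer.next('T_IDENTIFIER')
if not token:
    raise Exception('T_IDENTIFIER token expected.')
print(token.group(0))   # → Age
```

Only the one expected regex is tried per call, which is exactly the saving described above.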

But otherwise, upon reflection, I think you are right and it would
indeed be more appropriate to do as you suggest.

Thanks for your feedback.

Greetings,
Thomas

--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)
Nov 9 '08 #3
On Nov 9, 8:34 pm, Dennis Lee Bieber <wlfr...@ix.netcom.com> wrote:
On Sun, 09 Nov 2008 23:33:30 +0100, Thomas Mlynarczyk
<tho...@mlynarczyk-webdesign.de> declaimed the following in
comp.lang.python:
Of course. For the actual message I would use at least the line number.
Still, the offset could be used to compute line/column in case of an
error, so I wouldn't really need to store line/column with each token,
but only the offset. And provide a method to "convert" offset values
into line/column tuples.

        Are you forcing the use of fixed length lines then?

        Otherwise, by what algorithm will you convert:

>>> data = """
... one
... two
... three
... four
... five
... supercalifragilisticexpialidocious
... seven
... eight
... nine"""
>>> ix = data.index("list")
>>> ix
39

loc = data.index("list")
print data[:loc].count("\n")-1
print loc-data[:loc].rindex("\n")-1

prints 5,14

I'm sure it's non-optimal, but it *is* an algorithm that does not
require keeping track of the start of every line...

-- Paul
Nov 10 '08 #4
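Wrapped up as a helper, Paul's computation might look like this (a sketch; `offset_to_line_col` is an invented name, and it reports 1-based coordinates rather than the 0-based ones in the session above):

```python
def offset_to_line_col(source, offset):
    """Map a character offset to a 1-based (line, column) pair.
    Scans the source up to `offset`, so it costs O(offset) -- fine if
    it runs at most once, when a syntax error is reported."""
    before = source[:offset]
    line = before.count("\n") + 1
    last_nl = before.rfind("\n")   # -1 when offset is on the first line
    col = offset - last_nl         # the -1 case still yields column 1+
    return line, col

data = "one\ntwo\nsupercalifragilisticexpialidocious"
print(offset_to_line_col(data, data.index("list")))   # → (3, 15)
```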
On Sun, 09 Nov 2008 15:53:01 +0100, Thomas Mlynarczyk wrote:
Arnaud Delobelle schrieb:
>Adding to John's comments, I wouldn't have source as a member of the
Lexer object but as an argument of the tokenise() method (which I would
make public). The tokenise method would return what you currently call
self.result. So it would be used like this.
>>>>mylexer = Lexer(tokens)
mylexer.tok enise(source)
mylexer.tok enise(another_s ource)

At a later stage, I intend to have the source tokenised not all at once,
but token by token, "just in time" when the parser (yet to be written)
accesses the next token:
You don't have to introduce a `next` method to your Lexer class. You
could just transform your `tokenize` method into a generator by replacing
``self.result.append`` with `yield`. It gives you the just-in-time part
for free without picking your algorithm apart into tiny unrelated pieces.
token = mylexer.next( 'FOO_TOKEN' )
if not token: raise Exception( 'FOO token expected.' )
# continue doing something useful with token

Where next() would return the next token (and advance an internal
pointer) *if* it is a FOO_TOKEN, otherwise it would return False. This
way, the total number of regex matchings would be reduced: Only that
which is expected is "tried out".
Python generators recently (2.5) grew a `send` method. You could use
`next` for unconditional tokenization and ``mytokenizer.send("expected
token")`` whenever you expect a special token.

See http://www.python.org/dev/peps/pep-0342/ for details.
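A sketch of how the two pieces could combine (the `expected` handshake below is one possible reading of the PEP 342 idea, not the only one; written for modern Python):

```python
import re

def tokenise(source, tokens):
    """Generator form of the lexer loop. Plain `next(gen)` yields the
    next token, whatever it is; `gen.send(name)` asks it to try only the
    named token's regex at the current position and yields None if that
    token is not there (the caller must then send/next again)."""
    compiled = {name: re.compile(regex, re.M) for name, regex in tokens.items()}
    offset, expected = 0, None
    while offset < len(source):
        candidates = [expected] if expected else compiled
        for name in candidates:
            match = compiled[name].match(source, offset)
            if match:
                offset = match.end()
                expected = yield (name, match.group(0))
                break
        else:
            if expected:                 # expected token not found here
                expected = yield None
            else:
                raise SyntaxError("no token matches at offset %d" % offset)

gen = tokenise("Age: 37", {'T_IDENTIFIER': r'[A-Za-z_]\w*',
                           'T_OPERATOR': r'[=:,;]',
                           'T_LWSP': r'[ \t]+',
                           'T_NUMBER': r'[+-]?\d+'})
print(next(gen))                 # → ('T_IDENTIFIER', 'Age')
print(gen.send('T_OPERATOR'))    # → ('T_OPERATOR', ':')
```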

HTH,

--
Robert "Stargaming " Lehmann
Nov 10 '08 #5
Robert Lehmann schrieb:
You don't have to introduce a `next` method to your Lexer class. You
could just transform your `tokenize` method into a generator by replacing
``self.result.append`` with `yield`. It gives you the just-in-time part
for free without picking your algorithm apart into tiny unrelated pieces.
Python generators recently (2.5) grew a `send` method. You could use
`next` for unconditional tokenization and ``mytokenizer.send("expected
token")`` whenever you expect a special token.
See http://www.python.org/dev/peps/pep-0342/ for details.
I will try this. Thank you for the suggestion.

Greetings,
Thomas

--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)
Nov 10 '08 #6
Paul McGuire schrieb:
loc = data.index("list")
print data[:loc].count("\n")-1
print loc-data[:loc].rindex("\n")-1

prints 5,14

I'm sure it's non-optimal, but it *is* an algorithm that does not
require keeping track of the start of every line...
Yes, I was thinking of something like this. As long as the line/column
are only needed in case of an error (i.e. at most once per script run),
I consider this more performant than keeping track of line/column for
every token.

Greetings,
Thomas

--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)
Nov 10 '08 #7
On Nov 10, 7:29 am, Thomas Mlynarczyk <tho...@mlynarczyk-webdesign.de>
wrote:
Paul McGuire schrieb:
loc = data.index("list")
print data[:loc].count("\n")-1
print loc-data[:loc].rindex("\n")-1
prints 5,14
I'm sure it's non-optimal, but it *is* an algorithm that does not
require keeping track of the start of every line...

Yes, I was thinking of something like this. As long as the line/column
are only needed in case of an error (i.e. at most once per script run),
I consider this more performant than keeping track of line/column for
every token.

Greetings,
Thomas

--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)
Just be sure to account for tabs when computing the column, which this
simple-minded algorithm does not do.

-- Paul
Nov 10 '08 #8
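A tab-aware variant of the column computation might advance to the next tab stop instead of counting each character as one (a sketch; the tab width of 8 is an assumption, and `column_with_tabs` is an invented name):

```python
def column_with_tabs(source, offset, tabsize=8):
    """1-based display column of `offset`, counting each tab as a jump
    to the next multiple of `tabsize`, the way most editors render it."""
    start = source.rfind("\n", 0, offset) + 1   # start of offset's line
    col = 0                                     # 0-based while counting
    for ch in source[start:offset]:
        col = col + tabsize - col % tabsize if ch == "\t" else col + 1
    return col + 1

print(column_with_tabs("a\tb", 2))   # → 9 : the tab jumps from col 2 to col 9
```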
Paul McGuire schrieb:
Just be sure to account for tabs when computing the column, which this
simple-minded algorithm does not do.
Another thing I had not thought of -- thanks for the hint.

Greetings,
Thomas

--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)
Nov 10 '08 #10
