Hello,
I started to write a lexer in Python -- my first attempt to do something
useful with Python (rather than trying out snippets from tutorials). It
is not complete yet, but I would like some feedback -- I'm a Python
newbie and it seems that, with Python, there is always a simpler and
better way to do it than you think.
### Begin ###
import re

class Lexer(object):

    def __init__( self, source, tokens ):
        self.source = re.sub( r"\r?\n|\r", "\n", source )
        self.tokens = tokens
        self.offset = 0
        self.result = []
        self.line = 1
        self._compile()
        self._tokenize()

    def _compile( self ):
        for name, regex in self.tokens.iteritems():
            self.tokens[name] = re.compile( regex, re.M )

    def _tokenize( self ):
        while self.offset < len( self.source ):
            for name, regex in self.tokens.iteritems():
                match = regex.match( self.source, self.offset )
                if not match: continue
                self.offset += len( match.group(0) )
                self.result.append( ( name, match, self.line ) )
                self.line += match.group(0).count( "\n" )
                break
            else:
                raise Exception(
                    'Syntax error in source at offset %s' %
                    str( self.offset ) )

    def __str__( self ):
        return "\n".join(
            [ "[L:%s]\t[O:%s]\t[%s]\t'%s'" %
              ( str( line ), str( match.pos ), name, match.group(0) )
              for name, match, line in self.result ] )
# Test Example
source = r"""
Name: "Thomas", # just a comment
Age: 37
"""
tokens = {
'T_IDENTIFIER' : r'[A-Za-z_][A-Za-z0-9_]*',
'T_NUMBER' : r'[+-]?\d+',
'T_STRING' : r'"(?:\\.|[^\\"])*"',
'T_OPERATOR' : r'[=:,;]',
'T_NEWLINE' : r'\n',
'T_LWSP' : r'[ \t]+',
'T_COMMENT' : r'(?:\#|//).*$' }
print Lexer( source, tokens )
### End ###
Greetings,
Thomas
--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)
Thomas Mlynarczyk <th****@mlynarczyk-webdesign.de> writes:
> Hello,
> I started to write a lexer in Python -- my first attempt to do
> something useful with Python (rather than trying out snippets from
> tutorials). It is not complete yet, but I would like some feedback --
> I'm a Python newbie and it seems that, with Python, there is always a
> simpler and better way to do it than you think.
Hi,
Adding to John's comments, I wouldn't have source as a member of the
Lexer object but as an argument of the tokenise() method (which I would
make public). The tokenise method would return what you currently call
self.result. So it would be used like this.
>>> mylexer = Lexer(tokens)
>>> mylexer.tokenise(source)
# Later:
>>> mylexer.tokenise(another_source)
--
Arnaud
Arnaud Delobelle schrieb:
> Adding to John's comments, I wouldn't have source as a member of the
> Lexer object but as an argument of the tokenise() method (which I would
> make public). The tokenise method would return what you currently call
> self.result. So it would be used like this.
>
> >>> mylexer = Lexer(tokens)
> >>> mylexer.tokenise(source)
> >>> mylexer.tokenise(another_source)
At a later stage, I intend to have the source tokenised not all at once,
but token by token, "just in time" when the parser (yet to be written)
accesses the next token:
token = mylexer.next( 'FOO_TOKEN' )
if not token: raise Exception( 'FOO token expected.' )
# continue doing something useful with token
Where next() would return the next token (and advance an internal
pointer) *if* it is a FOO_TOKEN, otherwise it would return False. This
way, the total number of regex matchings would be reduced: Only that
which is expected is "tried out".
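That idea might look roughly like this (a sketch only, not the posted code: the class shape, method names, and the `input` method are hypothetical, and it is written in Python 3 syntax):

```python
import re

class ExpectingLexer(object):
    """Sketch of a lexer whose next() tries only the expected token type."""

    def __init__(self, tokens):
        # Compile each token regex once, up front.
        self.tokens = dict((name, re.compile(regex, re.M))
                           for name, regex in tokens.items())
        self.source = ""
        self.offset = 0

    def input(self, source):
        self.source = source
        self.offset = 0

    def next(self, name):
        # Match only the expected token type at the current offset:
        # one regex attempt instead of one per token type.
        match = self.tokens[name].match(self.source, self.offset)
        if not match:
            return False
        self.offset = match.end()
        return (name, match.group(0))

lexer = ExpectingLexer({'T_IDENTIFIER': r'[A-Za-z_][A-Za-z0-9_]*',
                        'T_OPERATOR': r'[=:,;]'})
lexer.input("Name:")
print(lexer.next('T_IDENTIFIER'))  # ('T_IDENTIFIER', 'Name')
print(lexer.next('T_OPERATOR'))    # ('T_OPERATOR', ':')
print(lexer.next('T_OPERATOR'))    # False -- end of input
```

On the happy path this really does cost only one regex match per token; the trade-off is that a failed expectation returns False without consuming input, so the parser must decide what to try next.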
But otherwise, upon reflection, I think you are right and it would
indeed be more appropriate to do as you suggest.
Thanks for your feedback.
Greetings,
Thomas
--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)
On Nov 9, 8:34 pm, Dennis Lee Bieber <wlfr...@ix.netcom.com> wrote:
On Sun, 09 Nov 2008 23:33:30 +0100, Thomas Mlynarczyk
<tho...@mlynarczyk-webdesign.de> declaimed the following in
comp.lang.python:
Of course. For the actual message I would use at least the line number.
Still, the offset could be used to compute line/column in case of an
error, so I wouldn't really need to store line/column with each token,
but only the offset. And provide a method to "convert" offset values
into line/column tuples.
        Are you forcing the use of fixed length lines then?
        Otherwise, by what algorithm will you convert:
>>> data = """
... one
... two
... three
... four
... five
... supercalifragilisticexpialidocious
... seven
... eight
... nine"""
>>> ix = data.index("list")
>>> ix
39
loc = data.index("list")
print data[:loc].count("\n")-1
print loc-data[:loc].rindex("\n")-1
prints 5, 14
I'm sure it's non-optimal, but it *is* an algorithm that does not
require keeping track of the start of every line...
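Wrapped up as a helper (the function name is hypothetical; it mirrors Paul's arithmetic exactly, including the -1 that compensates for the leading blank line in his data, so line and column come out zero-based):

```python
def offset_to_linecol(data, offset):
    # Line: count the newlines before the offset (minus one for the
    # leading blank line in this particular data layout).
    line = data[:offset].count("\n") - 1
    # Column: distance from the last newline before the offset.
    col = offset - data[:offset].rindex("\n") - 1
    return line, col

data = ("\none\ntwo\nthree\nfour\nfive\n"
        "supercalifragilisticexpialidocious\nseven\neight\nnine")
print(offset_to_linecol(data, data.index("list")))  # (5, 14)
```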
-- Paul
On Sun, 09 Nov 2008 15:53:01 +0100, Thomas Mlynarczyk wrote:
> Arnaud Delobelle schrieb:
> > Adding to John's comments, I wouldn't have source as a member of the
> > Lexer object but as an argument of the tokenise() method (which I
> > would make public). The tokenise method would return what you
> > currently call self.result. So it would be used like this.
> >
> > >>> mylexer = Lexer(tokens)
> > >>> mylexer.tokenise(source)
> > >>> mylexer.tokenise(another_source)
> At a later stage, I intend to have the source tokenised not all at once,
> but token by token, "just in time" when the parser (yet to be written)
> accesses the next token:
You don't have to introduce a `next` method to your Lexer class. You
could just transform your `tokenize` method into a generator by replacing
``self.result.append`` with `yield`. It gives you the just-in-time part
for free without picking your algorithm apart into tiny unrelated pieces.
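That transformation might look like this (a sketch based on the _tokenize method from the original post, written in Python 3 syntax, with the source moved into the method as Arnaud suggested):

```python
import re

class Lexer(object):
    def __init__(self, tokens):
        # Compile the token regexes once, up front.
        self.tokens = dict((name, re.compile(regex, re.M))
                           for name, regex in tokens.items())

    def tokenize(self, source):
        # Generator version of _tokenize: yield replaces
        # self.result.append, so tokens are produced lazily.
        offset, line = 0, 1
        while offset < len(source):
            for name, regex in self.tokens.items():
                match = regex.match(source, offset)
                if not match:
                    continue
                offset = match.end()
                yield (name, match, line)
                line += match.group(0).count("\n")
                break
            else:
                raise Exception('Syntax error in source at offset %d' % offset)

for name, match, line in Lexer({'T_WORD': r'\w+',
                                'T_SPACE': r'\s+'}).tokenize("hello world"):
    print(name, repr(match.group(0)), line)
```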
> token = mylexer.next( 'FOO_TOKEN' )
> if not token: raise Exception( 'FOO token expected.' )
> # continue doing something useful with token
> Where next() would return the next token (and advance an internal
> pointer) *if* it is a FOO_TOKEN, otherwise it would return False. This
> way, the total number of regex matchings would be reduced: Only that
> which is expected is "tried out".
Python generators recently (2.5) grew a `send` method. You could use
`next` for unconditional tokenization and ``mytokenizer.send("expected
token")`` whenever you expect a special token.
See http://www.python.org/dev/peps/pep-0342/ for details.
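A sketch of how `send` could drive that (a hypothetical free-standing generator, not the Lexer class from the original post):

```python
import re

def tokenize(tokens, source):
    """Sketch: next(gen) yields the next token of any type; gen.send(name)
    restricts matching to the expected token type and yields False if that
    type does not match, without consuming any input."""
    patterns = dict((name, re.compile(regex))
                    for name, regex in tokens.items())
    offset = 0
    expected = None
    while offset < len(source):
        names = [expected] if expected else list(patterns)
        for name in names:
            match = patterns[name].match(source, offset)
            if match:
                offset = match.end()
                # The value passed to send() (or None for plain next())
                # becomes the expectation for the following token.
                expected = yield (name, match.group(0))
                break
        else:
            # The expected type did not match at this offset
            # (or, with no expectation, nothing matched at all).
            expected = yield False

tok = tokenize({'WORD': r'\w+', 'COLON': r':'}, "Name:Age")
print(next(tok))          # ('WORD', 'Name')
print(tok.send('COLON'))  # ('COLON', ':')
print(tok.send('COLON'))  # False -- an identifier comes next
print(next(tok))          # ('WORD', 'Age')
```

Note that a failed expectation leaves the offset untouched, so the caller can retry with a different expectation or fall back to plain next().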
HTH,
--
Robert "Stargaming " Lehmann
Robert Lehmann schrieb:
> You don't have to introduce a `next` method to your Lexer class. You
> could just transform your `tokenize` method into a generator by
> replacing ``self.result.append`` with `yield`. It gives you the
> just-in-time part for free without picking your algorithm apart into
> tiny unrelated pieces.
> Python generators recently (2.5) grew a `send` method. You could use
> `next` for unconditional tokenization and ``mytokenizer.send("expected
> token")`` whenever you expect a special token.
> See http://www.python.org/dev/peps/pep-0342/ for details.
I will try this. Thank you for the suggestion.
Greetings,
Thomas
--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)
Paul McGuire schrieb:
> loc = data.index("list")
> print data[:loc].count("\n")-1
> print loc-data[:loc].rindex("\n")-1
> prints 5, 14
> I'm sure it's non-optimal, but it *is* an algorithm that does not
> require keeping track of the start of every line...
Yes, I was thinking of something like this. As long as the line/column
are only needed in case of an error (i.e. at most once per script run),
I consider this more performant than keeping track of line/column for
every token.
Greetings,
Thomas
--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)
On Nov 10, 7:29 am, Thomas Mlynarczyk <tho...@mlynarczyk-webdesign.de>
wrote:
> Paul McGuire schrieb:
> > loc = data.index("list")
> > print data[:loc].count("\n")-1
> > print loc-data[:loc].rindex("\n")-1
> > prints 5, 14
> > I'm sure it's non-optimal, but it *is* an algorithm that does not
> > require keeping track of the start of every line...
> Yes, I was thinking of something like this. As long as the line/column
> are only needed in case of an error (i.e. at most once per script run),
> I consider this more performant than keeping track of line/column for
> every token.
> Greetings,
> Thomas
Just be sure to account for tabs when computing the column, which this
simple-minded algorithm does not do.
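A tab-aware column computation could be sketched like this (hypothetical helper; it assumes a fixed tab stop, here 8, which may not match the conventions of the source being lexed):

```python
def column(data, offset, tabsize=8):
    # Take the text from the start of the current line up to the offset
    # and expand tabs before measuring the width.
    start = data.rfind("\n", 0, offset) + 1
    return len(data[start:offset].expandtabs(tabsize))

print(column("a\tb", 2))      # 8: 'a' plus a tab reaching the next tab stop
print(column("x\t\nab", 5))   # 2: tabs on earlier lines don't matter
```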
-- Paul
Paul McGuire schrieb:
> Just be sure to account for tabs when computing the column, which this
> simple-minded algorithm does not do.
Another thing I had not thought of -- thanks for the hint.
Greetings,
Thomas
--
Ce n'est pas parce qu'ils sont nombreux à avoir tort qu'ils ont raison!
(Coluche)