469,285 Members | 2,554 Online
Bytes | Developer Community
New Post

Home Posts Topics Members FAQ

Post your question to a community of 469,285 developers. It's quick & easy.

A simple lexer

I'm a royal n00b to writing translators, but you have to start
someplace.

In my Python project, I've decided that writing the dispatch code
to sit between the Glulx virtual machine and the Glk API will be
best done automatically, using the handy prototypes.

Below is the prototype of the lexer, and I'd like some comments
in case I'm doing something silly already.

My particular concern are:

The loop checks for each possible lexeme one at a time, and has
rather lame error checking.

I made regexes for matching a couple of really trivial cases for
the sake of consistency. In general, is there a better way to use
re module for lexing.

Ultimately, I'm going to need to build up an AST from the lines,
and use that to generate Python code to dispatch Glk functions. I
realize I'm throwing away the id of the lexeme right now;
suggestions on the best way to store that information are
welcome.

I do know of and have experimented with PyParsing, but for now I
want to use the standard modules. After I understand what I'm
doing, I think a PyParsing solution will be easy to write.

import re

def match(regex, proto, ix, lexed_line):
m = regex.match(proto, ix)
if m:
lexed_line.append(m.group())
ix = m.end()
return ix

def parse(proto):
""" Return a lexed version of the prototype string. See the
Glk specification, 0.7.0, section 11.1.4
>>parse('0:')
['0', ':']
>>parse('1:Cu')
['1', ':', 'Cu']
>>parse('2<Qb:Cn')
['2', '<', 'Qb', ':', 'Cn']
>>parse('4Iu&#![2SF]>+Iu:Is')
['4', 'Iu', '&#!', '[', '2', 'S', 'F', ']', '>+', 'Iu', ':', 'Is']
"""
arg_count = re.compile('\d+')
qualifier = re.compile('[&<>][+#!]*')
type_name = re.compile('I[us]|C[nus]|[SUF]|Q[a-z]')
o_bracket = re.compile('\\[')
c_bracket = re.compile('\\]')
colon = re.compile(':')
ix = 0
lexed_line = []
m = lambda regex, ix: match(regex, proto, ix, lexed_line)
while ix < len(proto):
old = ix
ix = m(arg_count, ix)
ix = m(qualifier, ix)
ix = m(type_name, ix)
ix = m(o_bracket, ix)
ix = m(c_bracket, ix)
ix = m(colon, ix)
if ix == old:
print "Parse error at %s of %s" % (proto[ix:], proto)
ix = len(proto)
return lexed_line

if __name__ == "__main__":
import doctest
doctest.testmod()

--
Neil Cerutti
We dispense with accuracy --sign at New York drug store
Jan 9 '07 #1
0 1096

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

3 posts views Thread by Simon Foster | last post: by
1 post views Thread by Thorsten Claus | last post: by
6 posts views Thread by Mike C# | last post: by
26 posts views Thread by jacob navia | last post: by
4 posts views Thread by Thomas | last post: by
14 posts views Thread by Thomas Mlynarczyk | last post: by
1 post views Thread by CARIGAR | last post: by
reply views Thread by zhoujie | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.