Bytes | Software Development & Data Engineering Community

A vote for re scanner

Every couple of months I have a use for the experimental 'scanner'
object in the re module, and when I do, as I did this morning, it's
really handy. So if anyone is counting votes for making it a standard
part of the module, here's my vote:

+1

-- Wade Leftwich
Ithaca, NY
Jul 18 '05 #1
18 replies, 2386 views
wa**@lightlink.com (Wade Leftwich) wrote in message news:<5b**************************@posting.google.com>...
Every couple of months I have a use for the experimental 'scanner'
object in the re module, and when I do, as I did this morning, it's
really handy. So if anyone is counting votes for making it a standard
part of the module, here's my vote:


While I don't think they're still accepting votes :), you've pointed
me to something I didn't know about until now. What kinds of things
have you been using re.Scanner for?

Jeremy
Jul 18 '05 #2
tw*********@hotmail.com (Jeremy Fincher) wrote in message news:<69**************************@posting.google.com>...
wa**@lightlink.com (Wade Leftwich) wrote in message news:<5b**************************@posting.google.com>...
Every couple of months I have a use for the experimental 'scanner'
object in the re module, and when I do, as I did this morning, it's
really handy. So if anyone is counting votes for making it a standard
part of the module, here's my vote:


While I don't think they're still accepting votes :), you've pointed
me to something I didn't know about until now. What kinds of things
have you been using re.Scanner for?

Jeremy


A scanner is constructed from a regex object and a string to be
scanned. Each call to the scanner's search() method returns the next
match object of the regex on the string. So to work on a string that
has multiple matches, it's the bee's roller skates.
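For readers who haven't met it: the object in question comes from the undocumented scanner() method on compiled patterns. A minimal sketch of the loop described above, in modern syntax (this method is internal CPython API, so treat it as subject to change):

```python
import re

# Each search() on the scanner resumes where the previous match ended,
# returning one match object per call and None once the string is exhausted.
pattern = re.compile(r'\d+')
scanner = pattern.scanner('ab12cd345ef')

matches = []
while True:
    m = scanner.search()
    if m is None:
        break
    matches.append(m.group())

print(matches)  # ['12', '345']
```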
Jul 18 '05 #3
On 12 Nov 2003 13:04:36 -0800, wa**@lightlink.com (Wade Leftwich)
wrote:
tw*********@hotmail.com (Jeremy Fincher) wrote in message news:<69**************************@posting.google.com>...
wa**@lightlink.com (Wade Leftwich) wrote in message news:<5b**************************@posting.google.com>...
> Every couple of months I have a use for the experimental 'scanner'
> object in the re module, and when I do, as I did this morning, it's
> really handy. So if anyone is counting votes for making it a standard
> part of the module, here's my vote:


While I don't think they're still accepting votes :), you've pointed
me to something I didn't know about until now. What kinds of things
have you been using re.Scanner for?

Jeremy


A scanner is constructed from a regex object and a string to be
scanned. Each call to the scanner's search() method returns the next
match object of the regex on the string. So to work on a string that
has multiple matches, it's the bee's roller skates.


Or in Eric's case, *the* roller skate.
--dang
Jul 18 '05 #4
Wade Leftwich wrote:
...
A scanner is constructed from a regex object and a string to be
scanned. Each call to the scanner's search() method returns the next
match object of the regex on the string. So to work on a string that
has multiple matches, it's the bee's roller skates.


...if that method were named 'next' (and an appropriate __iter__
were also present) it might be even cooler, though...
Alex

Jul 18 '05 #5
Alex Martelli <al***@aleax.it> wrote:
Wade Leftwich wrote:
...
A scanner is constructed from a regex object and a string to be
scanned. Each call to the scanner's search() method returns the next
match object of the regex on the string. So to work on a string that
has multiple matches, it's the bee's roller skates.


...if that method's name was 'next' (and an appropriate __iter__
also present) it might be even cooler, though...
Alex


Indeed:
>>> import re
>>> class CoolerScanner(object):
...     def __init__(self, regex, s):
...         self.scanner = regex.scanner(s)
...     def next(self):
...         m = self.scanner.search()
...         if m:
...             return m
...         else:
...             raise StopIteration
...     def __iter__(self):
...         while 1:
...             yield self.next()
...
>>> regex = re.compile(r'(?P<before>.)a(?P<after>.)')
>>> s = '1ab2ac3ad'
>>> for m in CoolerScanner(regex, s):
...     print m.group('before'), m.group('after')
...
1 b
2 c
3 d


-- Wade
Jul 18 '05 #6
Wade Leftwich wrote:
>>> regex = re.compile(r'(?P<before>.)a(?P<after>.)')
>>> s = '1ab2ac3ad'
>>> for m in CoolerScanner(regex, s):
...     print m.group('before'), m.group('after')
...
1 b
2 c
3 d

>>> for m in regex.finditer(s):
...     print m.group('before'), m.group('after')
...
1 b
2 c
3 d

</F>


Jul 18 '05 #7
Alex Martelli wrote:
Wade Leftwich wrote:
...
A scanner is constructed from a regex object and a string to be
scanned. Each call to the scanner's search() method returns the next
match object of the regex on the string. So to work on a string that
has multiple matches, it's the bee's roller skates.


...if that method's name was 'next' (and an appropriate __iter__
also present) it might be even cooler, though...


re.finditer

</F>


Jul 18 '05 #8
Fredrik Lundh wrote:
Alex Martelli wrote:
Wade Leftwich wrote:
...
> A scanner is constructed from a regex object and a string to be
> scanned. Each call to the scanner's search() method returns the next
> match object of the regex on the string. So to work on a string that
> has multiple matches, it's the bee's roller skates.


...if that method's name was 'next' (and an appropriate __iter__
also present) it might be even cooler, though...


re.finditer


Yep. So the scanner isn't warranted any longer, right?
Alex

Jul 18 '05 #9
"Fredrik Lundh" <fr*****@pythonware.com> wrote in message news:<ma************************************@pytho n.org>...
Wade Leftwich wrote:
>>> regex = re.compile(r'(?P<before>.)a(?P<after>.)')
>>> s = '1ab2ac3ad'
>>> for m in CoolerScanner(regex, s):
...     print m.group('before'), m.group('after')
...
1 b
2 c
3 d

>>> for m in regex.finditer(s):
...     print m.group('before'), m.group('after')
...
1 b
2 c
3 d

</F>


There I go, reimplementing the wheel again. Guess I didn't pay enough
attention to "What's New In 2.2". Thanks for the pointer. It appears
we don't need that scanner() method after all.

However, from my point of view it was a good exercise, because now I
know how easy it is to make an iterator.
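For what it's worth, the same iterator can come from a plain generator function; this is a sketch against the standard re module in modern syntax, not the exact class from earlier in the thread:

```python
import re

# Generator equivalent of the CoolerScanner class shown earlier:
# walk the string, yielding one match object per hit.
def iter_matches(regex, s):
    pos = 0
    while True:
        m = regex.search(s, pos)
        if m is None:
            return
        yield m
        # Resume after this match; a zero-width match would need extra
        # care here to avoid looping forever.
        pos = m.end()

regex = re.compile(r'(?P<before>.)a(?P<after>.)')
pairs = [(m.group('before'), m.group('after'))
         for m in iter_matches(regex, '1ab2ac3ad')]
print(pairs)  # [('1', 'b'), ('2', 'c'), ('3', 'd')]
```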

Thanks again

-- Wade
Jul 18 '05 #10
Alex Martelli wrote:
...if that method's name was 'next' (and an appropriate __iter__
also present) it might be even cooler, though...


re.finditer


Yep. So the scanner isn't warranted any longer, right?


if you remove it, you'll break re.Scanner.

</F>


Jul 18 '05 #11
I'm new with python so bear with me.

I'm looking for a way to elegantly parse fixed-width text data (as opposed
to CSV) and save the parsed data into a database. The text data comes
from an old ISAM-format table, and each line may have a different record
structure depending on key fields in the line.

Regexes with match and split are of interest, but it's been too long since
I've dabbled with REs to judge whether using them will make the
problem more complex.

Here's a sample of the records I need to parse:

01508390019002 11284361000002SUGARPLUM
015083915549 SHORT ON LAST ORDER
0150839220692 000002EA BMC 15 KG 001400

1st Line is a (portion of a) header record.
2nd Line is a text instruction record.
3rd Line is a Transaction Line Item record.

Each type of record has a different structure, but these lines all
appear in the same table.
Any ideas would be greatly appreciated.

Allan
Jul 18 '05 #12
On Wed, 04 Feb 2004 19:35:52 GMT, allanc
<ka***********@nospamyahoo.ca> wrote:
I'm new with python so bear with me.

I'm looking for a way to elegantly parse fixed-width text data (as opposed
to CSV) and saving the parsed data unto a database. The text data comes
from an old ISAM-format table and each line may be a different record
structure depending on key fields in the line.

RegExp with match and split are of interest but it's been too long since
I've dabbled with RE to be able to judge whether its use will make the
problem more complex.

Here's a sample of the records I need to parse:

01508390019002 11284361000002SUGARPLUM
015083915549 SHORT ON LAST ORDER
0150839220692 000002EA BMC 15 KG 001400

1st Line is a (portion of) header record.
2nd Line is an text instruction record.
3rd Line is a Transaction Line Item record.

Each type of record has a different structure. But these set of lines
appear in the one table.


Are the key fields in fixed positions? If so, pluck them out and use
them as an index into a dictionary of functions to call. I can't tell
from your example where the keys are, so I'm assuming the first 8 are
simply a line number and the next 4 are the key.

Maybe something along these lines:

def header(x):
    print 'header: %s' % x  # process header

def testinstruction(x):
    print 'test instruction: %s' % x  # process test instruction

def lineitem(x):
    print 'lineitem: %s' % x  # process line item

ptable = {'0190': header, '5549': testinstruction, '2069': lineitem}

for line in file("data.dat"):
    ptable[line[8:12]](line)

--dang
Jul 18 '05 #13
allanc wrote:
Here's a sample of the records I need to parse:

01508390019002 11284361000002SUGARPLUM
015083915549 SHORT ON LAST ORDER
0150839220692 000002EA BMC 15 KG 001400

1st Line is a (portion of) header record.
2nd Line is an text instruction record.
3rd Line is a Transaction Line Item record.


I've written many programs to parse data very similar to this,
until I generalized the algorithm (a line-oriented state machine)
into a module. You can find the module (internally documented)
at http://docutils.sf.net/docutils/statemachine.py.

Hope it helps!

--
David Goodger http://python.net/~goodger
For hire: http://python.net/~goodger/cv
Jul 18 '05 #14


allanc wrote:
I'm new with python so bear with me.

I'm looking for a way to elegantly parse fixed-width text data (as opposed
to CSV) and saving the parsed data unto a database. The text data comes
from an old ISAM-format table and each line may be a different record
structure depending on key fields in the line.

RegExp with match and split are of interest but it's been too long since
I've dabbled with RE to be able to judge whether its use will make the
problem more complex.

Here's a sample of the records I need to parse:

01508390019002 11284361000002SUGARPLUM
015083915549 SHORT ON LAST ORDER
0150839220692 000002EA BMC 15 KG 001400

1st Line is a (portion of) header record.
2nd Line is an text instruction record.
3rd Line is a Transaction Line Item record.

Each type of record has a different structure. But these set of lines
appear in the one table.
Any ideas would be greatly appreciated.

Allan

allanc,
- slices, as in s[0:5], s[5:], or s[5:-1], get pieces of a string
- you'll probably want to strip leading/trailing spaces; see the string docs
- you may need to cast/convert:
    _int = int("55")
    _float = float("4.2")
wes

Jul 18 '05 #15


allanc wrote:
I'm new with python so bear with me.

I'm looking for a way to elegantly parse fixed-width text data (as opposed
to CSV) and saving the parsed data unto a database. The text data comes
from an old ISAM-format table and each line may be a different record
structure depending on key fields in the line.

RegExp with match and split are of interest but it's been too long since
I've dabbled with RE to be able to judge whether its use will make the
problem more complex.

Here's a sample of the records I need to parse:

01508390019002 11284361000002SUGARPLUM
015083915549 SHORT ON LAST ORDER
0150839220692 000002EA BMC 15 KG 001400

1st Line is a (portion of) header record.
2nd Line is an text instruction record.
3rd Line is a Transaction Line Item record.

Each type of record has a different structure. But these set of lines
appear in the one table.
Any ideas would be greatly appreciated.

Allan


Allan,
Maybe this will help more:
line = "015083915549 SHORT ON LAST ORDER 0150839220692"
print line[0:10] 0150839155 print line [:10] 0150839155 print line[5:10] 39155 print line[-10:-1] 083922069 print int(line[-10:-1]) 83922069 print " xyz ".strip()

xyz

wes

Jul 18 '05 #16
"allanc" <ka***********@nospamyahoo.ca> wrote in message
news:Xn******************************@198.161.157. 145...
I'm new with python so bear with me.

I'm looking for a way to elegantly parse fixed-width text data (as opposed
to CSV) and saving the parsed data unto a database. The text data comes
from an old ISAM-format table and each line may be a different record
structure depending on key fields in the line.

RegExp with match and split are of interest but it's been too long since
I've dabbled with RE to be able to judge whether its use will make the
problem more complex.

Here's a sample of the records I need to parse:

01508390019002 11284361000002SUGARPLUM
015083915549 SHORT ON LAST ORDER
0150839220692 000002EA BMC 15 KG 001400

1st Line is a (portion of) header record.
2nd Line is an text instruction record.
3rd Line is a Transaction Line Item record.

Each type of record has a different structure. But these set of lines
appear in the one table.
Any ideas would be greatly appreciated.

Allan

Allan -

Let me put in a plug for pyparsing. I think your problem is tailor-made for
pyparsing's easy-to-use grammar definitions and execution. There's no special
lex/yacc-like syntax or RE symbology to master; you assemble your grammar
from simply named classes (such as Literal, OneOrMore, Word(wordchars),
Optional, etc.) and intuitive operators (+ for sequence, | for greedy
alternation, ^ for longest-match alternation, ~ for, um, Not-tion).

A grammar to parse "Hello, World!" might look like:
helloGrammar = Word(alphas) + "," + Word(alphas) + oneOf(". ! ? !! !!!")
which could then parse any of:
Hello, World!
Hello , World !
Hello,World!
Yo, Adrian!!!
Hey, man.
Whattup, dude?

You can associate field names with specific parse elements, so that the
fields can be extracted from the results such as:
helloGrammar = Word(alphas).setResultsName("greeting") + "," + \
Word(alphas).setResultsName("to") + oneOf(". ! ? !! !!!")
results = helloGrammar.parseString( greetingstring )
print results.greeting
print results.to

You can associate parse actions (a la SAX) to fire when matching parse
elements are matched in the input.

You can find the pyparsing home page at http://pyparsing.sourceforge.net.

-- Paul McGuire
Jul 18 '05 #17
I think one of the easiest ways to do this is to
write a class that knows how to parse each of the
unique lines. As you are reading through the file/table
and encounter a line like the first, create a new
class instance and pass it the line's contents. The
__init__ method of the class can parse the line and
place each of the field values in an attribute of the
class.

Something like (this is pseudocode):

class linetype01:
    #
    # Define a list that describes how to parse a single linetype.
    # Each entry is (fieldname, beginning column, field length).
    #
    _parsinginfo = [('recnum', 0, 8),
                    ('linetype', 8, 3),
                    ('dataitem2', 11, 3),
                    ...]

    def __init__(self, linetext):
        self.linetext = linetext
        for fieldname, begincol, fieldlength in self._parsinginfo:
            self.__dict__[fieldname] = linetext[begincol:begincol + fieldlength]
        return

you would define a class like this for each unique linetype

in main program:

import sys

#
# Insert code to open file/table here
#
for line in table:
    #
    # See which linetype it is
    #
    linetype = line[8:10]
    if linetype == "01":
        pline = linetype01(line)
        #
        # Now you can extract the values by accessing attributes of
        # the class.
        #
        recordnum = pline.recnum
        tlinetype = pline.linetype
        #
        # Do something with the values
        #
    elif linetype == "55":
        pline = linetype55(line)
    elif linetype == "20":
        pline = linetype20(line)
    else:
        print "ERROR - illegal linetype encountered"
        sys.exit(2)
Just one of many ways to solve this problem.

-Larry
"allanc" <ka***********@nospamyahoo.ca> wrote in message
news:Xn******************************@198.161.157. 145...
I'm new with python so bear with me.

I'm looking for a way to elegantly parse fixed-width text data (as opposed
to CSV) and saving the parsed data unto a database. The text data comes
from an old ISAM-format table and each line may be a different record
structure depending on key fields in the line.

RegExp with match and split are of interest but it's been too long since
I've dabbled with RE to be able to judge whether its use will make the
problem more complex.

Here's a sample of the records I need to parse:

01508390019002 11284361000002SUGARPLUM
015083915549 SHORT ON LAST ORDER
0150839220692 000002EA BMC 15 KG 001400

1st Line is a (portion of) header record.
2nd Line is an text instruction record.
3rd Line is a Transaction Line Item record.

Each type of record has a different structure. But these set of lines
appear in the one table.
Any ideas would be greatly appreciated.

Allan

Jul 18 '05 #18
> 01508390019002 11284361000002SUGARPLUM
015083915549 SHORT ON LAST ORDER
0150839220692 000002EA BMC 15 KG 001400


Is the above the format of all possible lines (aside from empty lines)?

- Josiah
Jul 18 '05 #19

This discussion thread is closed

Replies have been disabled for this discussion.

