pyparsing: match empty line

Marek Kubica

Hi,

I am trying to get this stuff working, but I still fail.

I have a format which consists of three elements:
\d{4}M?-\d (4 numbers, optional M, dash, another number)
EMPTY (the <EMPTY> token)
[Empty line] (the <PAGEBREAK> token. The line may contain whitespace,
but nothing else)

While the ``watchname`` and ``leaveempty`` were trivial, I cannot get
``pagebreak`` to work properly.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

from pyparsing import (Word, Literal, Optional, Group, OneOrMore, Regex,
    Combine, ParserElement, nums, LineStart, LineEnd, White,
    replaceWith)

ParserElement.setDefaultWhitespaceChars(' \t\r')

watchseries = Word(nums, exact=4)
watchrev = Word(nums, exact=1)

watchname = Combine(watchseries + Optional('M') + '-' + watchrev)

leaveempty = Literal('EMPTY')

def breaks(s, loc, tokens):
    print(repr(tokens[0]))
    #return ['<PAGEBREAK>' for token in tokens[0]]
    return ['<PAGEBREAK>']

#pagebreak = Regex('^\s*$').setParseAction(breaks)
pagebreak = LineStart() + LineEnd().setParseAction(
    replaceWith('<PAGEBREAK>'))

parser = OneOrMore(watchname ^ pagebreak ^ leaveempty)

tests = [
    "2134M-2",
    """3245-3
3456M-5""",
    """3256-4

4563-4""",
    """4562M-6
EMPTY
3246-5"""
]

for test in tests:
    print(parser.parseString(test))

The output should be:
['2134M-2']
['3245-3', '3456M-5']
['3256-4', '<PAGEBREAK>', '4563-4']
['4562M-6', '<EMPTY>', '3246-5']

Thanks in advance!
regards,
Marek

Sep 2 '08 #1

Paul McGuire

On Sep 2, 11:38 am, Marek Kubica <ma...@xivilization.net> wrote:

> Hi,
>
> I am trying to get this stuff working, but I still fail.
>
> I have a format which consists of three elements:
> \d{4}M?-\d (4 numbers, optional M, dash, another number)
> EMPTY (the <EMPTY> token)
> [Empty line] (the <PAGEBREAK> token. The line may contain whitespace,
> but nothing else)
>
> <snip>

Marek -

Here are some refinements to your program that will get you closer to
your posted results.

1) Well done in resetting the default whitespace characters, since you
are doing some parsing that is dependent on the presence of line
ends. When you do this, it is useful to define an expression for end
of line so that you can reference it where you explicitly expect to
find line ends:

EOL = LineEnd().suppress()

2) Your second test fails because there is an EOL between the two
watchnames. Since you have removed EOL from the set of default
whitespace characters (that is, whitespace that pyparsing will
automatically skip over), then pyparsing will stop after reading the
first watchname. I think that you want EOLs to get parsed if nothing
else matches, so you can add it to the end of your grammar definition:

parser = OneOrMore(watchname ^ pagebreak ^ leaveempty ^ EOL)

This will now permit the second test to pass.

3) Your definition of pagebreak looks okay now, but I don't understand
why your test containing 2 blank lines is only supposed to generate a
single <PAGEBREAK>.

pagebreak = (LineStart() +
    LineEnd().setParseAction(replaceWith('<PAGEBREAK>')))

If you really want to only get a single <PAGEBREAK> from your test
case, then change pagebreak to:

pagebreak = OneOrMore(LineStart() +
    LineEnd()).setParseAction(replaceWith('<PAGEBREAK>'))

4) leaveempty probably needs this parse action to be attached to it:

leaveempty = Literal('EMPTY').setParseAction(replaceWith('<EMPTY>'))

5) (optional) Your definition of parser uses '^' operators, which
translate into Or expressions. Or expressions evaluate all the
alternatives, and then choose the longest match. The expressions you
have don't really have any ambiguity to them, and could be evaluated
using:

parser = OneOrMore(watchname | pagebreak | leaveempty | EOL)

'|' operators generate MatchFirst expressions. MatchFirst will do
short-circuit evaluation - the first expression that matches will be
the one chosen as the matching alternative.
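
The difference between the two operators only shows up when one alternative is a prefix of another. A minimal sketch, assuming a hypothetical integer/real pair that is not part of the watch grammar:

```python
from pyparsing import Word, Combine, nums

# Hypothetical expressions: 'integer' matches a prefix of whatever 'real' matches.
integer = Word(nums)
real = Combine(Word(nums) + '.' + Word(nums))

# '^' builds an Or: every alternative is tried and the longest match wins.
longest = (integer ^ real).parseString('3.14').asList()

# '|' builds a MatchFirst: the first alternative that matches wins.
first = (integer | real).parseString('3.14').asList()

print(longest)  # ['3.14']
print(first)    # ['3']
```

In the watch grammar none of the alternatives overlap like this, so both operators give the same tokens and '|' simply does less work.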

If you have more pyparsing questions, you can also post them on the
pyparsing wiki - the Discussion tab on the wiki Home page has become a
running support forum - and there is also a Help/Discussion mailing
list.

Cheers,
-- Paul

Sep 3 '08 #2

Marek Kubica

Hi,

First of all a big thank you for your excellent library and of course
also for your extensive and enlightening answer!

> 1) Well done in resetting the default whitespace characters, since you
> are doing some parsing that is dependent on the presence of line ends.
> When you do this, it is useful to define an expression for end of line
> so that you can reference it where you explicitly expect to find line
> ends:
>
> EOL = LineEnd().suppress()

Ok, I didn't think about this. But as my program is not only a parser but
a long-running process, and setDefaultWhitespaceChars modifies a global
variable, I don't feel too comfortable with it. I could set the whitespace
on every element, but that is, as you surely agree, quite ugly. Do you
accept patches? I'm thinking about some kind of factory class which would
automatically set the whitespace:

>>> factory = TokenFactory(' \t\r')
>>> word = factory.Word(alphas)

That way, one wouldn't need to set a global value which might interfere
with other pyparsers running in the same process.

> parser = OneOrMore(watchname ^ pagebreak ^ leaveempty ^ EOL)
>
> This will now permit the second test to pass.

Right. Seems that working with whitespace requires a bit better
understanding than I had.

> 3) Your definition of pagebreak looks okay now, but I don't understand
> why your test containing 2 blank lines is only supposed to generate a
> single <PAGEBREAK>.

No, it should be one <PAGEBREAK> per blank line; now it works as expected.

> 4) leaveempty probably needs this parse action to be attached to it:
>
> leaveempty = Literal('EMPTY').setParseAction(replaceWith('<EMPTY>'))

I added this in the meantime. replaceWith is really a handy helper.

> parser = OneOrMore(watchname | pagebreak | leaveempty | EOL)
>
> '|' operators generate MatchFirst expressions. MatchFirst will do
> short-circuit evaluation - the first expression that matches will be the
> one chosen as the matching alternative.

Okay, adjusted it.
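
The thread never shows the finished program. With the five refinements applied, it might look like the sketch below. This is an assembly of the snippets quoted above, not code posted by either author; the `results` list is added only to make the outcome easy to inspect.

```python
from pyparsing import (Word, Literal, Optional, OneOrMore, Combine,
                       ParserElement, nums, LineStart, LineEnd, replaceWith)

# Must run before any expression below is created, so newlines stay significant.
ParserElement.setDefaultWhitespaceChars(' \t\r')

# Explicit end-of-line expression; suppressed so it leaves no token behind.
EOL = LineEnd().suppress()

watchseries = Word(nums, exact=4)
watchrev = Word(nums, exact=1)
watchname = Combine(watchseries + Optional('M') + '-' + watchrev)

leaveempty = Literal('EMPTY').setParseAction(replaceWith('<EMPTY>'))

# One <PAGEBREAK> per empty line: a line end found right at a line start.
pagebreak = LineStart() + LineEnd().setParseAction(replaceWith('<PAGEBREAK>'))

# MatchFirst ('|') is enough here because the alternatives do not overlap.
parser = OneOrMore(watchname | pagebreak | leaveempty | EOL)

tests = [
    "2134M-2",
    "3245-3\n3456M-5",
    "3256-4\n\n4563-4",
    "4562M-6\nEMPTY\n3246-5",
]

results = [parser.parseString(test).asList() for test in tests]
for result in results:
    print(result)
```

On these four inputs the sketch is expected to reproduce the lists requested in the first post. Blank lines that contain spaces or tabs, and input with a trailing newline, may behave differently between pyparsing versions, so those cases are worth testing separately.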

> If you have more pyparsing questions, you can also post them on the
> pyparsing wiki - the Discussion tab on the wiki Home page has become a
> running support forum - and there is also a Help/Discussion mailing
> list.

Which of these two would you prefer?

Thanks again, it works now just as I imagined!

regards,
Marek

Sep 3 '08 #3

Paul McGuire

On Sep 3, 4:26 am, Marek Kubica <ma...@xivilization.net> wrote:

> Hi,
>
> First of all a big thank you for your excellent library and of course
> also for your extensive and enlightening answer!

I'm glad pyparsing has been of help to you. Pyparsing is building its
own momentum these days. I have a new release in SVN that I'll put
out in the next week or so.

> Ok, I didn't think about this. But as my program is not only a parser but
> a long-running process and setDefaultWhitespaceChars modifies a global
> variable I don't feel too comfortable with it.

Pyparsing isn't really all that thread-friendly. You definitely
should not have multiple threads using the same grammar. The
approaches I've seen people use in multithread applications are: 1)
synchronize access to a single parser across multiple threads, and 2)
create a parser per-thread, or use a pool of parsers. Pyparsing
parsers can be pickled, so a quick way to reconstitute a parser is to
create the parser at startup time and pickle it to a string, then
unpickle a new parser as needed.
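
Approach 2, one parser per thread, can be sketched with `threading.local`. The names `build_watchname`, `get_parser` and `worker` are hypothetical helpers for this illustration, and the grammar is reduced to the watchname expression from the first post.

```python
import threading
from pyparsing import Word, Combine, Optional, nums

def build_watchname():
    """Build a fresh expression; intended to be called once per thread."""
    return Combine(Word(nums, exact=4) + Optional('M') + '-' + Word(nums, exact=1))

# Each thread sees its own attributes on this object.
_local = threading.local()

def get_parser():
    """Return the calling thread's private parser, creating it on first use."""
    parser = getattr(_local, 'parser', None)
    if parser is None:
        parser = _local.parser = build_watchname()
    return parser

def worker(text, out, index):
    # Store the parser object too, so the caller can see that threads do not share it.
    parser = get_parser()
    out[index] = (parser, parser.parseString(text).asList())

inputs = ['2134M-2', '3245-3']
results = [None] * len(inputs)
threads = [threading.Thread(target=worker, args=(text, results, i))
           for i, text in enumerate(inputs)]
for t in threads:
    t.start()
for t in threads:
    t.join()

for _, tokens in results:
    print(tokens)
```

Building the grammar inside a function keeps every thread's expressions fully separate, at the cost of constructing the grammar once per thread.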

> I could set the whitespace
> on every element, but that is as you surely agree quite ugly. Do you
> accept patches? I'm thinking about some kind of factory-class which would
> automatically set the whitespaces:
>
> >>> factory = TokenFactory(' \t\r')
> >>> word = factory.Word(alphas)
>
> That way, one wouldn't need to set a global value which might interfere
> with other pyparsers running in the same process.

I tried to prototype up your TokenFactory class, but once I got as far
as implementing __getattribute__ to return the corresponding pyparsing
class, I couldn't see how to grab the object generated for that class,
and modify its whitespace values. I did cook up this, though:

class SetWhitespace(object):
    def __init__(self, whitespacechars):
        self.whitespacechars = whitespacechars

    def __call__(self, pyparsing_expr):
        # the ParserElement method is named setWhitespaceChars
        pyparsing_expr.setWhitespaceChars(self.whitespacechars)
        return pyparsing_expr

noNLskipping = SetWhitespace(' \t\r')
word = noNLskipping(Word(alphas))

I'll post this on the wiki and see what kind of comments we get.
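
The wrapper boils down to one `setWhitespaceChars` call per expression. Since that method returns the expression itself, it can also be chained directly. A small sketch of the effect, using a made-up three-word input:

```python
from pyparsing import Word, OneOrMore, alphas

text = 'ab cd\nef'

# Default behaviour: newlines count as skippable whitespace.
default_word = Word(alphas)
default_tokens = OneOrMore(default_word).parseString(text).asList()

# Per-expression override: only spaces, tabs and carriage returns are skipped,
# so matching stops at the newline. No global setting is touched.
restricted_word = Word(alphas).setWhitespaceChars(' \t\r')
restricted_tokens = OneOrMore(restricted_word).parseString(text).asList()

print(default_tokens)     # ['ab', 'cd', 'ef']
print(restricted_tokens)  # ['ab', 'cd']
```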

By the way, setDefaultWhitespaceChars only updates global variables that
are used at parser definition time, *not* at parser parse time. So,
again, you can manage this class attribute at the initialization of
your program, before any incoming requests need to make use of one
parser or another.
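
A small sketch of that definition-time behaviour: an expression created before the call keeps the whitespace set it was built with, and only expressions created afterwards pick up the new default. The variable names are made up for the illustration.

```python
from pyparsing import Word, ParserElement, ParseException, alphas

# Created while the stock default (' \n\t\r') is in effect.
word_before = Word(alphas)

ParserElement.setDefaultWhitespaceChars(' \t\r')

# Created after the change: newlines are no longer skipped.
word_after = Word(alphas)

text = 'ab\ncd'

# Still skips the newline, although the default has changed since.
before = (word_before + word_before).parseString(text).asList()

# Stops at the newline, so the second word cannot be matched.
try:
    after = (word_after + word_after).parseString(text).asList()
except ParseException:
    after = None

# Restore the stock default so later definitions are unaffected.
ParserElement.setDefaultWhitespaceChars(' \n\t\r')

print(before)  # ['ab', 'cd']
print(after)   # None
```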

> > 4) leaveempty probably needs this parse action to be attached to it:
> >
> > leaveempty = Literal('EMPTY').setParseAction(replaceWith('<EMPTY>'))
>
> I added this in the meantime. replaceWith is really a handy helper.

After I released replaceWith, I received a parser from someone who
hadn't read down to the 'R's yet in the documentation, and he
implemented the same thing with this simple format:

leaveempty = Literal('EMPTY').setParseAction(lambda : '<EMPTY>')

These are pretty much equivalent, I was just struck at how easy Python
makes things for us, too!
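
A quick side-by-side check of the two forms, as a sketch; pyparsing accepts parse actions that take fewer than the full three arguments, which is why the zero-argument lambda works:

```python
from pyparsing import Literal, replaceWith

# Helper form: replaceWith builds the parse action.
with_helper = Literal('EMPTY').setParseAction(replaceWith('<EMPTY>'))

# Hand-written form: a zero-argument lambda returning the replacement.
with_lambda = Literal('EMPTY').setParseAction(lambda: '<EMPTY>')

helper_tokens = with_helper.parseString('EMPTY').asList()
lambda_tokens = with_lambda.parseString('EMPTY').asList()

print(helper_tokens)  # ['<EMPTY>']
print(lambda_tokens)  # ['<EMPTY>']
```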

> > If you have more pyparsing questions, you can also post them on the
> > pyparsing wiki - the Discussion tab on the wiki Home page has become a
> > running support forum - and there is also a Help/Discussion mailing
> > list.
>
> Which of these two would you prefer?

They are equivalent, I monitor them both, and you can browse through
previous discussions using the Discussion tab online threads, or the
mailing list archive on SF. Use whichever is easier for you to work
with.

Cheers, and Welcome to Pyparsing!
-- Paul

Sep 3 '08 #4

Marek Kubica

On Wed, 03 Sep 2008 06:12:47 -0700, Paul McGuire wrote:

On Sep 3, 4:26 am, Marek Kubica <ma...@xivilization.net> wrote:

> > I could set the whitespace
> > on every element, but that is as you surely agree quite ugly. Do you
> > accept patches? I'm thinking about some kind of factory-class which
> > would automatically set the whitespaces:
> >
> > >>> factory = TokenFactory(' \t\r')
> > >>> word = factory.Word(alphas)
> >
> > That way, one wouldn't need to set a global value which might interfere
> > with other pyparsers running in the same process.
>
> I tried to prototype up your TokenFactory class, but once I got as far
> as implementing __getattribute__ to return the corresponding pyparsing
> class, I couldn't see how to grab the object generated for that class,
> and modify its whitespace values.

I had the same problem until I remembered that I can fake __init__
using a function closure.

I have imported pyparsing.py into a hg repository with a patchstack, here
is my first patch:

diff -r 12e2bbff259e pyparsing.py
--- a/pyparsing.py Wed Sep 03 09:40:09 2008 +0000
+++ b/pyparsing.py Wed Sep 03 14:08:15 2008 +0000
@@ -1400,9 +1400,38 @@
     def __req__(self,other):
         return self == other
 
+class TokenFinder(type):
+    """Collects all classes that are derived from Token"""
+    token_classes = dict()
+    def __init__(cls, name, bases, dict):
+        # save the class
+        TokenFinder.token_classes[cls.__name__] = cls
+
+class WhitespaceTokenFactory(object):
+    def __init__(self, whitespace):
+        self._whitespace = whitespace
+
+    def __getattr__(self, name):
+        """Get an attribute of this class"""
+        # check whether there is such a Token
+        if name in TokenFinder.token_classes:
+            token = TokenFinder.token_classes[name]
+            # construct a closure which fakes the constructor
+            def _callable(*args, **kwargs):
+                obj = token(*args, **kwargs)
+                # set the whitespace on the token
+                obj.setWhitespaceChars(self._whitespace)
+                return obj
+            # return the function which returns an instance of the Token
+            return _callable
+        else:
+            raise AttributeError("'%s' object has no attribute '%s'" % (
+                WhitespaceTokenFactory.__name__, name))
 
 class Token(ParserElement):
     """Abstract ParserElement subclass, for defining atomic matching patterns."""
+    __metaclass__ = TokenFinder
+
     def __init__( self ):

I used metaclasses for getting all Token-subclasses so new classes that
are created are automatically accessible via the factory, without any
additional registration.
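
The `__metaclass__` attribute in the patch is Python 2 syntax, which Python 3 ignores. A sketch of the same registry-plus-closure idea for Python 3 follows, using `__init_subclass__` instead of a metaclass. To stay self-contained it uses stand-in `Token` and `Word` classes rather than pyparsing's own; only the mechanism is meant to carry over.

```python
class Token:
    """Stand-in base class (not pyparsing's); records every subclass by name."""
    registry = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Register the new subclass automatically, with no extra call needed.
        Token.registry[cls.__name__] = cls

    def __init__(self):
        self.whitespace = ' \n\t\r'

    def setWhitespaceChars(self, chars):
        self.whitespace = chars
        return self


class Word(Token):
    """Stand-in token type, used only to exercise the factory."""
    def __init__(self, chars):
        super().__init__()
        self.chars = chars


class WhitespaceTokenFactory:
    def __init__(self, whitespace):
        self._whitespace = whitespace

    def __getattr__(self, name):
        # Only called for names not found normally, e.g. factory.Word.
        try:
            token = Token.registry[name]
        except KeyError:
            raise AttributeError("'%s' object has no attribute '%s'" % (
                type(self).__name__, name)) from None

        # Closure that fakes the constructor and then sets the whitespace.
        def _callable(*args, **kwargs):
            return token(*args, **kwargs).setWhitespaceChars(self._whitespace)
        return _callable


factory = WhitespaceTokenFactory(' \t\r')
word = factory.Word('abc')
print(type(word).__name__, repr(word.whitespace))
```

Classes defined later are picked up as soon as they subclass `Token`, which is the property the metaclass in the patch was after.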

Oh and yes, more patches will follow. I'm currently editing the second
patch, but I better mail it directly to you as it is not really
interesting for this list.

regards,
Marek

Sep 3 '08 #5
