Regex help needed

rh0dium

Hi all,

I am using Python to drive another tool using pexpect. I would like to
automatically put the values I get back into a list if there is more
than one return value. The tool signals that the data is a set by
wrapping it in parentheses.

This is all generated, as I said, using pexpect. Here is how I use it:

child = pexpect.spawn( _buildCadenceExe(), timeout=timeout)
child.sendline("somefunction()")
child.expect("> ")
data=child.before

This data can take on several shapes:

Single return value -- THIS IS THE ONE I CAN'T GET TO WORK..
data = 'somefunction()\r\n"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"\r\n'

Multiple return value
data = 'somefunction()\r\n("." "~" "/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile")\r\n'

It may take up several lines...
data = 'somefunction()\r\n("." "~" \r\n"/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile"\r\n"foo")\r\n'

So if you're still reading this, I want to parse out data. Here are the
rules...
- Line 1 is ALWAYS the calling function; whatever is there (except
"\r\n") should be kept as "original".
- Anything may occur inside the quotations - I don't care what's in
there per se, but it must be maintained.
- Parenthesised items I want to be pushed into a list. I haven't run
into a case where you have nested parens, but that's not to say it won't
happen...
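
For the flat (non-nested) case, the rules above can be sketched with the standard library alone. This is a minimal sketch written for Python 3, not the approach from any reply in this thread; `parse_reply` and `QUOTED` are hypothetical names, and it assumes quoted values never contain an embedded double quote.

```python
import re

# One double-quoted string; the group captures its contents.
# Assumption: values contain no embedded or escaped double quotes.
QUOTED = re.compile(r'"([^"]*)"')

def parse_reply(data):
    """Return (original call, list of quoted values) from pexpect output."""
    lines = data.splitlines()        # splits on \r\n as well as \n
    original = lines[0]              # line 1 is kept verbatim as "original"
    rest = " ".join(lines[1:])       # values may span several lines
    return original, QUOTED.findall(rest)

single = ('somefunction()\r\n"@(#)$CDS: icfb.exe version 5.1.0 '
          '05/22/2005 23:36 (cicln01) $"\r\n')
multi = ('somefunction()\r\n("." "~" \r\n'
         '"/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile"'
         '\r\n"foo")\r\n')

print(parse_reply(single))
print(parse_reply(multi))
```

Because the pattern only looks at quote characters, parentheses inside a quoted value such as "(cicln01)" are left alone. The sketch always returns a list, even for a single value, and it ignores the grouping parentheses entirely, so it cannot represent nesting.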

So here is my code.. Pardon my hack job..

import os,re

def main(data=None):

    # Get rid of the annoying \r's
    dat=data.split("\r")
    data="".join(dat)

    # Remove the first line - that is the original call
    dat = data.split("\n")
    original=dat[0]
    del dat[0]

    print "Original", original
    # Now join all of the remaining lines
    retl="".join(dat)

    # self.logger.debug("Original = \'%s\'" % original)

    try:
        # Get rid of the parenthesis
        parmatcher = re.compile( r'\(([^()]*)\)' )
        parmatch = parmatcher.search(retl)

        # Get rid of the first and last quotes
        qrmatcher = re.compile( r'\"([^()]*)\"' )
        qrmatch = qrmatcher.search(parmatch.group(1))

        # Split the items
        qmatch=re.compile(r'\"\s+\"')
        results = qmatch.split(qrmatch.group(1))
    except:
        qrmatcher = re.compile( r'\"([^()]*)\"' )
        qrmatch = qrmatcher.search(retl)

        # Split the items
        qmatch=re.compile(r'\"\s+\"')
        results = qmatch.split(qrmatch.group(1))

    print "Orig", original, "Results", results
    return original,results

# General run..
if __name__ == '__main__':
    # data = 'someFunction\r\n "test" "foo"\r\n'
    # data = 'someFunction\r\n "test foo"\r\n'
    data = 'getVersion()\r\n"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"\r\n'
    # data = 'someFunction\r\n ("test" "test1" "foo aasdfasdf"\r\n "newline" "test2")\r\n'

    main(data)

CAN SOMEONE PLEASE CLEAN THIS UP?
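
A short check of why the single-value case fails may help here. It is a sketch written for Python 3, and it assumes the parenthesis pattern in main() is meant to be `\(([^()]*)\)`; the variable names are only for illustration.

```python
import re

# Assumed forms of the two patterns used in main().
paren = re.compile(r'\(([^()]*)\)')
quoted = re.compile(r'"([^()]*)"')

# What main() sees after dropping the first line and the line endings.
retl = '"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"'

# The paren pattern latches onto "(#)" inside the quoted value.
inner = paren.search(retl).group(1)
print(repr(inner))

# No quoted string inside that fragment, so the try branch hits None.group(1).
print(quoted.search(inner))

# The fallback pattern forbids parentheses between the quotes, so it fails too.
print(quoted.search(retl))
```

Both branches end up calling `.group(1)` on `None`, and the second failure happens inside the `except` block where nothing catches it. Parentheses that appear inside a quoted value are what break both patterns.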

Jan 10 '06 #1

Paul McGuire

"rh0dium" wrote:

Hi all,

I am using python to drive another tool using pexpect. The values
which I get back I would like to automatically put into a list if there
is more than one return value. They provide me a way to see that the
data is in set by parenthesising it.

<snip>

Well, you asked for regex help, but a pyparsing rendition may be easier to
read and maintain.

-- Paul
(Download pyparsing at http://pyparsing.sourceforge.net.)
# test data strings
test1 = """somefunction()
"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"
"""

test2 = """somefunction()
("." "~"
"/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile"
"foo")
"""

test3 = """somefunctionWithNestedlist()
("." "~"
"/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile"
("Hey!"
"this is a nested"
"list")
"foo")
"""

"""
So if you're still reading this I want to parse out data. Here are the
rules...
- Line 1 ALWAYS is the calling function whatever is there (except
"\r\n") should be kept as "original"
- Anything may occur inside the quotations - I don't care what's in
there per se but it must be maintained.
- Parenthesed items I want to be pushed into a list. I haven't run
into a case where you have nested paren's but that not to say it won't
happen...
"""

from pyparsing import Literal, Word, alphas, alphanums, \
    dblQuotedString, OneOrMore, Group, Forward

LPAR = Literal("(")
RPAR = Literal(")")

# assume function identifiers must start with alphas, followed by zero or more
# alphas, numbers, or '_' - expand this defn as needed
ident = Word(alphas,alphanums+"_")

# define a list as one or more quoted strings, inside ()'s - we'll tackle nesting
# in a minute
quoteList = Group( LPAR.suppress() +
                   OneOrMore(dblQuotedString) +
                   RPAR.suppress() )

# define format of a line of data - don't bother with \n's or \r's,
# pyparsing just skips 'em
dataFormat = ident + LPAR + RPAR + ( dblQuotedString | quoteList )

def test(t):
    print dataFormat.parseString(t)

print "Parse flat lists"
test(test1)
test(test2)

# modifications for nested lists
quoteList = Forward()
quoteList << Group( LPAR.suppress() +
                    OneOrMore(dblQuotedString | quoteList) +
                    RPAR.suppress() )
dataFormat = ident + LPAR + RPAR + ( dblQuotedString | quoteList )

print
print "Parse using nested lists"
test(test1)
test(test2)
test(test3)

Parsing results:
Parse flat lists
['somefunction', '(', ')', '"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005
23:36 (cicln01) $"']
['somefunction', '(', ')', ['"."', '"~"',
'"/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile"', '"foo"']]

Parse using nested lists
['somefunction', '(', ')', '"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005
23:36 (cicln01) $"']
['somefunction', '(', ')', ['"."', '"~"',
'"/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile"', '"foo"']]
['somefunctionWithNestedlist', '(', ')', ['"."', '"~"',
'"/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile"', ['"Hey!"',
'"this is a nested"', '"list"'], '"foo"']]

Jan 10 '06 #2

rh0dium

Paul McGuire wrote:

-- Paul
(Download pyparsing at http://pyparsing.sourceforge.net.)

Done.
Hey this is pretty cool! I have one small problem that I don't know
how to resolve. I want the entire contents (whatever it is) of line 1
to be the ident. Digging into the code showed a method line,
lineno, and LineStart/LineEnd. I tried to use all three but it didn't
work for a few reasons (line - type issues; lineno - I needed the data
and couldn't get it to work; LineStart/LineEnd - I think it matches every
line and I need to scope it to line 1).

So here is my rendition of the code - But this is REALLY slick..

I think the problem is the parens on line one....

def main(data=None):

    LPAR = Literal("(")
    RPAR = Literal(")")

    # assume function identifiers must start with alphas, followed by zero or more
    # alphas, numbers, or '_' - expand this defn as needed
    ident = LineStart + LineEnd

    # define a list as one or more quoted strings, inside ()'s - we'll tackle nesting
    # in a minute
    quoteList = Group( LPAR.suppress() + OneOrMore(dblQuotedString) +
                       RPAR.suppress())

    # define format of a line of data - don't bother with \n's or \r's,
    # pyparsing just skips 'em
    dataFormat = ident + ( dblQuotedString | quoteList )

    return dataFormat.parseString(data)

# General run..
if __name__ == '__main__':
    # data = 'someFunction\r\n "test" "foo"\r\n'
    # data = 'someFunction\r\n "test foo"\r\n'
    data = 'getVersion()\r\n"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"\r\n'
    # data = 'someFunction\r\n ("test" "test1" "foo aasdfasdf"\r\n "newline" "test2")\r\n'

    foo = main(data)

    print foo

Jan 10 '06 #3

Paul McGuire

"rh0dium" wrote:

I want the entire contents (whatever it is) of line 1
to be the ident.

<snip>

ident = LineStart + LineEnd

<snip>

LineStart() + LineEnd() will only match an empty line.

If you describe in words what you want ident to be, it may be more natural
to translate to pyparsing.

"A word starting with an alpha, followed by zero or more alphas, numbers, or
'_'s, with a trailing pair of parens"

ident = Word(alphas,alphanums+"_") + LPAR + RPAR

If you want the ident all combined into a single token, use:

ident = Combine( Word(alphas,alphanums+"_") + LPAR + RPAR )

LineStart and LineEnd are geared more for line-oriented or
whitespace-sensitive grammars. Your example doesn't really need them, I
don't think.

If you *really* want everything on the first line to be the ident, try this:

ident = Word(alphas,alphanums+"_") + restOfLine
or
ident = Combine( Word(alphas,alphanums+"_") + restOfLine )

Now the next step is to assign field names to the results:

dataFormat = ident.setResultsName("ident") + ( dblQuotedString |
    quoteList ).setResultsName("contents")

test = "blah blah test string"

results = dataFormat.parseString(test)
print results.ident, results.contents

I'm glad pyparsing is working out for you! There should be a number of
examples that ship with pyparsing that may give you some more ideas on how
to proceed from here.

-- Paul

Jan 10 '06 #4

Michael Spencer

rh0dium wrote:

Hi all,

I am using python to drive another tool using pexpect. The values
which I get back I would like to automatically put into a list if there
is more than one return value. They provide me a way to see that the
data is in set by parenthesising it.
....

CAN SOMEONE PLEASE CLEAN THIS UP?

How about using the Python tokenizer rather than re:

>>> import cStringIO, tokenize
>>> def get_tokens(source):
...     allowed_tokens = (tokenize.STRING, tokenize.OP)
...     src = cStringIO.StringIO(source).readline
...     src = tokenize.generate_tokens(src)
...     return (token[1] for token in src if token[0] in allowed_tokens)
...
>>> def rest_eval(tokens):
...     output = []
...     for token in tokens:
...         if token == "(":
...             output.append(rest_eval(tokens))
...         elif token == ")":
...             return output
...         else:
...             output.append(token[1:-1])
...     return output
...
>>> def parse(source):
...     source = source.splitlines()
...     original, rest = source[0], "\n".join(source[1:])
...     return original, rest_eval(get_tokens(rest))
...
>>> sources = [
...     'someFunction\r\n "test" "foo"\r\n',
...     'someFunction\r\n "test foo"\r\n',
...     'getVersion()\r\n"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"\r\n',
...     'someFunction\r\n ("test" "test1" "foo aasdfasdf"\r\n "newline" "test2")\r\n']
>>> for data in sources: parse(data)
...
('someFunction', ['test', 'foo'])
('someFunction', ['test foo'])
('getVersion()', ['@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $'])
('someFunction', [['test', 'test1', 'foo aasdfasdf', 'newline', 'test2']])

Cheers

Michael

Jan 10 '06 #5

rh0dium

Paul McGuire wrote:

ident = Combine( Word(alphas,alphanums+"_") + LPAR + RPAR )

This will only work for a word followed by parentheses (i.e. somefunction()).

If you *really* want everything on the first line to be the ident, try this:

ident = Word(alphas,alphanums+"_") + restOfLine
or
ident = Combine( Word(alphas,alphanums+"_") + restOfLine )

This nicely grabs the "\r".. How can I get around it?

Now the next step is to assign field names to the results:

dataFormat = ident.setResultsName("ident") + ( dblQuotedString |
    quoteList ).setResultsName("contents")

This is super cool!!

So let's take this for example:

test= 'fprintf( outFile "leSetInstSelectable( t )\n" )\r\n ("test" "test1" "foo aasdfasdf"\r\n "newline" "test2")\r\n'

Now I want the ident to pull out 'fprintf( outFile
"leSetInstSelectable( t )\n" )', so I tried this:

ident = Forward()
ident << Group( Word(alphas,alphanums) + LPAR + ZeroOrMore(
    dblQuotedString | ident | Word(alphas,alphanums) ) + RPAR)

Borrowing from the example listed previously. But it bombs out because
it wants a ")" even though it has one.. Forward() ROCKS!!

Also, how does it know to do this for just the first line? It would
seem that this will work for every line - No?

Jan 10 '06 #6

rh0dium

Michael Spencer wrote:

>>> def parse(source):
...     source = source.splitlines()
...     original, rest = source[0], "\n".join(source[1:])
...     return original, rest_eval(get_tokens(rest))

This is a very clean and elegant way to separate them - Very nice!! I
like this a lot - I will definitely use this in the future!!

Jan 10 '06 #7

Paul McGuire

"rh0dium" wrote:

<snip>

This nicely grabs the "\r".. How can I get around it?

<snip>

ident = Forward()
ident << Group( Word(alphas,alphanums) + LPAR + ZeroOrMore(
    dblQuotedString | ident | Word(alphas,alphanums) ) + RPAR)

Borrowing from the example listed previously. But it bombs out cause
it wants a ")" but it has one..

Also how does it know to do this for just the first line? It would
seem that this will work for every line - No?

This works for me:

test4 = r"""fprintf( outFile "leSetInstSelectable( t )\n" )
("test"
"test1" "foo aasdfasdf"
"newline" "test2")
"""

ident = Forward()
ident << Group( Word(alphas,alphanums) + LPAR + ZeroOrMore(
    dblQuotedString | ident | Word(alphas,alphanums) ) + RPAR)
dataFormat = ident + ( dblQuotedString | quoteList )

print dataFormat.parseString(test4)

Prints:
[['fprintf', '(', 'outFile', '"leSetInstSelectable( t )\\n"', ')'],
['"test"', '"test1"', '"foo aasdfasdf"', '"newline"', '"test2"']]

1. Is there supposed to be a real line break in the string
"leSetInstSelectable( t )\n", or just a slash-n at the end? pyparsing
quoted strings do not accept multiline quotes, but they do accept escaped
characters such as "\t" "\n", etc. That is, to pyparsing:

"\n this is a valid \t \n string"

"this is not
a valid string"

Part of the confusion is that your examples include explicit \r\n
characters. I'm assuming this is to reflect what you see when listing out
the Python variable containing the string. (Are you opening a text file
with "rb" to read in binary? Try opening with just "r", and this may
resolve your \r\n problems.)

2. If restOfLine is still giving you \r's at the end, you can redefine
restOfLine to not include them, or to include and suppress them. Or (this
is easier) define a parse action for restOfLine that strips trailing \r's:

def stripTrailingCRs(st,loc,toks):
    try:
        if toks[0][-1] == '\r':
            return toks[0][:-1]
    except:
        pass

restOfLine.setParseAction( stripTrailingCRs )

3. How does it know to only do it for the first line? Presumably you told
it to do so. pyparsing's parseString method starts at the beginning of the
input string, and matches expressions until it finds a mismatch, or runs out
of expressions to match - even if there is more input string to process,
pyparsing does not continue. To search through the whole file looking for
idents, try using scanString which returns a generator; for each match, the
generator gives a tuple containing:
- tokens - the matched tokens
- start - the start location of the match
- end - the end location of the match

If your input file consists *only* of these constructs, you can also just
expand dataFormat.parseString to OneOrMore(dataFormat).parseString.
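
The distinction in point 1, between an escaped "\n" inside the quotes and a real line break, can be illustrated with a plain regular expression. This is only a standard-library sketch written for Python 3, and the pattern is an assumption chosen to mirror the rule described above, not pyparsing's own definition of dblQuotedString.

```python
import re

# A double-quoted string made of either ordinary characters (anything except
# a quote, a backslash, or a newline) or a backslash followed by one character.
QUOTED_STRING = re.compile(r'"(?:[^"\\\n]|\\.)*"')

# Raw string: the backslash-n and backslash-t are two characters each.
escaped = r'"\n this is a valid \t \n string"'

# Ordinary string: \n here is a real line break inside the quotes.
broken = '"this is not\na valid string"'

print(QUOTED_STRING.fullmatch(escaped) is not None)
print(QUOTED_STRING.fullmatch(broken) is not None)
```

The first check succeeds and the second does not, because a real newline is excluded both from the ordinary-character class and from what `.` matches after a backslash.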
-- Paul

Jan 11 '06 #8

Michael Spencer

rh0dium wrote:

This is a very clean and elegant way to separate them - Very nice!! I
like this a lot - I will definitely use this in the future!!

On reflection, this simplifies further (to 9 lines), at least for the test cases
you provide, which don't involve any nested parens:

>>> import cStringIO, tokenize
>>> def get_tokens2(source):
...     src = cStringIO.StringIO(source).readline
...     src = tokenize.generate_tokens(src)
...     return [token[1][1:-1] for token in src if token[0] == tokenize.STRING]
...
>>> def parse2(source):
...     source = source.splitlines()
...     original, rest = source[0], "\n".join(source[1:])
...     return original, get_tokens2(rest)
...

This matches your main function for the three tests where main works...

>>> for source in sources[:3]: #matches your main function where it works
...     assert parse2(source) == main(source)
...
Original someFunction
Orig someFunction Results ['test', 'foo']
Original someFunction
Orig someFunction Results ['test foo']
Original someFunction
Orig someFunction Results ['test', 'test1', 'foo aasdfasdf', 'newline', 'test2']

....and handles the case where main fails (I think correctly, although I'm not
entirely sure what your desired output is in this case):

>>> parse2(sources[3])
('getVersion()', ['@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $'])

If you really do need nested parens, then you'd need the slightly longer version
I posted earlier
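
For readers on Python 3, where cStringIO no longer exists, the nested-paren idea can be sketched with a small regex tokenizer instead of the tokenize module. This is an illustration under stated assumptions rather than a port of the code above: `TOKEN`, `build` and `parse_nested` are hypothetical names, and quoted values are assumed to contain no embedded double quotes.

```python
import re

# A token is either a whole double-quoted string or a single parenthesis.
# Quoted strings are tried first, so parens inside quotes stay part of the value.
TOKEN = re.compile(r'"[^"]*"|[()]')

def build(tokens):
    """Consume tokens from a shared iterator, opening a new list at each '('."""
    output = []
    for token in tokens:
        if token == "(":
            output.append(build(tokens))   # recurse on the same iterator
        elif token == ")":
            return output                  # close the current list
        else:
            output.append(token[1:-1])     # drop the surrounding quotes
    return output

def parse_nested(source):
    lines = source.splitlines()
    original, rest = lines[0], "\n".join(lines[1:])
    return original, build(iter(TOKEN.findall(rest)))

nested = 'someFunction\r\n("a" ("b" "c (d)")\r\n"e")\r\n'
print(parse_nested(nested))
```

As in the earlier version, a parenthesised reply comes back as a list nested inside the result list, so the example prints ('someFunction', [['a', ['b', 'c (d)'], 'e']]).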

Cheers

Michael

Jan 11 '06 #9
