Bytes IT Community

string parsing / regexp question

I need to parse the following string:

$$\pmatrix{{\it x_2}\cr 0\cr 1\cr }=\pmatrix{\left({{{\it m_2}\,s^2
}\over{k}}+1\right)\,{\it x_1}-{{F}\over{k}}\cr -{{{\it m_2}\,s^2\,F
}\over{k}}-F+\left({\it m_2}\,s^2\,\left({{{\it m_2}\,s^2}\over{k}}+1
\right)+{\it m_2}\,s^2\right)\,{\it x_1}\cr 1\cr }$$

The first thing I need to do is extract the arguments to \pmatrix{ }
on both the left and right hand sides of the equal sign, so that the
first argument is extracted as

{\it x_2}\cr 0\cr 1\cr

and the second is

\left({{{\it m_2}\,s^2
}\over{k}}+1\right)\,{\it x_1}-{{F}\over{k}}\cr -{{{\it m_2}\,s^2\,F
}\over{k}}-F+\left({\it m_2}\,s^2\,\left({{{\it m_2}\,s^2}\over{k}}+1
\right)+{\it m_2}\,s^2\right)\,{\it x_1}\cr 1\cr

The trick is that there are extra curly braces inside the \pmatrix{ }
strings and I don't know how to write a regexp that would count the
number of open and close curly braces and make sure they match, so
that it can find the correct ending curly brace.

Any suggestions?

I would prefer a regexp solution, but am open to other approaches.

Thanks,

Ryan
Nov 28 '07 #1
5 Replies


On Nov 28, 11:32 am, "Ryan Krauss" <ryanli...@gmail.com> wrote:
> The trick is that there are extra curly braces inside the \pmatrix{ }
> strings and I don't know how to write a regexp that would count the
> number of open and close curly braces and make sure they match, so
> that it can find the correct ending curly brace.

As Tim Grove points out, writing a grammar for this expression is
really pretty simple, especially using the latest version of
pyparsing, which includes a new helper method, nestedExpr. Here is
the whole program to parse your example:

from pyparsing import *

data = r"""$$\pmatrix{{\it x_2}\cr 0\cr 1\cr }=
\pmatrix{\left({{{\it m_2}\,s^2
}\over{k}}+1\right)\,{\it x_1}-{{F}\over{k}}\cr -{{{\it m_2}\,s^2\,F
}\over{k}}-F+\left({\it m_2}\,s^2\,\left({{{\it m_2}\,s^2}\over{k}}+1
\right)+{\it m_2}\,s^2\right)\,{\it x_1}\cr 1\cr }$$"""

PMATRIX = Literal(r"\pmatrix")
# nestedExpr matches a balanced {...} group, including any groups nested inside it
nestedBraces = nestedExpr("{", "}")
grammar = "$$" + PMATRIX + nestedBraces + "=" + \
          PMATRIX + nestedBraces + \
          "$$"
res = grammar.parseString(data)
print(res)

This prints the following:

['$$', '\\pmatrix', [['\\it', 'x_2'], '\\cr', '0\\cr', '1\\cr'], '=',
'\\pmatrix', ['\\left(', [[['\\it', 'm_2'], '\\,s^2'], '\\over',
['k']], '+1\\right)\\,', ['\\it', 'x_1'], '-', [['F'], '\\over',
['k']], '\\cr', '-', [[['\\it', 'm_2'], '\\,s^2\\,F'], '\\over',
['k']], '-F+\\left(', ['\\it', 'm_2'], '\\,s^2\\,\\left(', [[['\\it',
'm_2'], '\\,s^2'], '\\over', ['k']], '+1', '\\right)+', ['\\it',
'm_2'], '\\,s^2\\right)\\,', ['\\it', 'x_1'], '\\cr', '1\\cr'], '$$']

Okay, maybe this looks a bit messy. But believe it or not, the
returned results give you access to each grammar element as:

['$$', '\\pmatrix', [nestedArgList], '=', '\\pmatrix',
[nestedArgList], '$$']

Not only has the parser handled the {} nesting levels, but it has
structured the returned tokens according to that nesting. (The '{}'s
are gone now, since their delimiting function has been replaced by the
nesting hierarchy in the results.)
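
To make the nesting idea concrete without pyparsing, here is a minimal sketch that turns brace-delimited text into nested lists using only the standard library. It is not how nestedExpr is implemented; `parse_nested` is a hypothetical helper written for illustration, and it assumes tokens are separated by whitespace or braces and that braces are never escaped.

```python
import re


def parse_nested(text):
    """Turn brace-delimited text into nested lists of whitespace-separated tokens."""
    # A token is either a single brace or a run of non-brace, non-space characters.
    tokens = re.findall(r"[{}]|[^{}\s]+", text)
    stack = [[]]  # stack[-1] is the list currently being filled
    for tok in tokens:
        if tok == "{":
            stack.append([])  # open a new nesting level
        elif tok == "}":
            if len(stack) == 1:
                raise ValueError("unmatched closing brace")
            finished = stack.pop()
            stack[-1].append(finished)  # attach the completed level to its parent
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unmatched opening brace")
    return stack[0]


print(parse_nested(r"{{\it x_2}\cr 0\cr 1\cr }"))
```

Each opening brace pushes a new list and each closing brace pops it back into its parent, which is why the braces themselves never appear in the result.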

You could use tuple assignment to get at the individual fields:
dummy,dummy,lhs_args,dummy,dummy,rhs_args,dummy = res

Or you could access the fields in res using list indexing:
lhs_args, rhs_args = res[2],res[5]

But both of these methods will break if you decide to extend the
grammar with additional or optional fields.

A safer approach is to give the grammar elements results names, as in
this slightly modified version of grammar:

grammar = "$$" + PMATRIX + nestedBraces("lhs_args") + "=" + \
PMATRIX + nestedBraces("rhs_args") + \
"$$"

Now you can access the parsed fields as if the results were a dict
with keys "lhs_args" and "rhs_args", or as an object with attributes
named "lhs_args" and "rhs_args":

res = grammar.parseString(data)
print(res["lhs_args"])
print(res["rhs_args"])
print(res.lhs_args)
print(res.rhs_args)

Note that the default behavior of nestedExpr is to give back a nested
list of the elements according to how the original text was nested
within braces.

If you just want the original text, add a parse action to nestedBraces
to do this for you (keepOriginalText is another pyparsing builtin).
The parse action is executed at parse time so that there is no post-
processing needed after the parsed results are returned:

nestedBraces.setParseAction(keepOriginalText)
grammar = "$$" + PMATRIX + nestedBraces("lhs_args") + "=" + \
PMATRIX + nestedBraces("rhs_args") + \
"$$"

res = grammar.parseString(data)
print(res)
print(res.lhs_args)
print(res.rhs_args)

Now this program returns the original text for the nested brace
expressions:

['$$', '\\pmatrix', '{{\\it x_2}\\cr 0\\cr 1\\cr }', '=', '\\pmatrix',
'{\\left({{{\\it m_2}\\,s^2 \n }\\over{k}}+1\\right)\\,{\\it x_1}-{{F}\
\over{k}}\\cr -{{{\\it m_2}\\,s^2\\,F \n }\\over{k}}-F+\\left({\\it
m_2}\\,s^2\\,\\left({{{\\it m_2}\\,s^2}\\over{k}}+1 \n \\right)+{\\it
m_2}\\,s^2\\right)\\,{\\it x_1}\\cr 1\\cr }', '$$']
['{{\\it x_2}\\cr 0\\cr 1\\cr }']
['{\\left({{{\\it m_2}\\,s^2 \n }\\over{k}}+1\\right)\\,{\\it x_1}-{{F}
\\over{k}}\\cr -{{{\\it m_2}\\,s^2\\,F \n }\\over{k}}-F+\\left({\\it
m_2}\\,s^2\\,\\left({{{\\it m_2}\\,s^2}\\over{k}}+1 \n \\right)+{\\it
m_2}\\,s^2\\right)\\,{\\it x_1}\\cr 1\\cr }']
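
For comparison, the raw-text extraction asked for in the original question can also be done by counting braces by hand, which a plain `re` pattern cannot do for arbitrary nesting depth. The following is a rough sketch using only the standard library. The function name `extract_brace_args` is an assumption made for illustration, the input is a shortened version of the original string, and escaped braces such as `\{` are not handled.

```python
def extract_brace_args(text, command):
    """Return the raw text inside the braces that follow each occurrence of command."""
    args = []
    pos = 0
    while True:
        start = text.find(command + "{", pos)
        if start == -1:
            return args
        body_start = start + len(command) + 1  # first character after the opening brace
        depth = 1
        i = body_start
        # Walk forward, tracking nesting depth, until the opening brace is closed.
        while i < len(text) and depth > 0:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            raise ValueError("unbalanced braces after %s" % command)
        args.append(text[body_start:i - 1])  # exclude the matching closing brace
        pos = i


# Shortened example input, not the full expression from the question.
data = r"$$\pmatrix{{\it x_2}\cr 0\cr 1\cr }=\pmatrix{{{F}\over{k}}\cr 1\cr }$$"
lhs, rhs = extract_brace_args(data, r"\pmatrix")
print(lhs)
print(rhs)
```

Unlike the pyparsing output above, this returns the argument text without the outermost braces, which matches the form requested in the question.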

You can find more info on pyparsing at http://pyparsing.wikispaces.com.

Cheers!
-- Paul
Nov 28 '07 #2

On Nov 28, 1:23 pm, Paul McGuire <pt...@austin.rr.com> wrote:
> As Tim Grove points out, ...

s/Grove/Chase/

Sorry, Tim!

-- Paul
Nov 28 '07 #3

Paul McGuire wrote:
> On Nov 28, 1:23 pm, Paul McGuire <pt...@austin.rr.com> wrote:
>> As Tim Grove points out, ...
>
> s/Grove/Chase/
>
> Sorry, Tim!

No problem... it's not like there aren't enough Tims on the list
as it is. :)

-tkc


Nov 28 '07 #4

Interesting. Thanks Paul and Tim. This looks very promising.

Ryan

Nov 28 '07 #5

On Nov 28, 2007 1:23 PM, Paul McGuire <pt***@austin.rr.com> wrote:
> You can find more info on pyparsing at http://pyparsing.wikispaces.com.

I can't seem to access pyparsing on wikispaces. Is there something
wrong with the website right now?
Nov 28 '07 #6
