regexp for sequence of quoted strings

gry
I have a string like:
{'the','dog\'s','bite'}
or maybe:
{'the'}
or sometimes:
{}

[FYI: this is postgresql database "array" field output format]

which I'm trying to parse with the re module.
A single quoted string would, I think, be:
r"\{'([^']|\\')*'\}"

but how do I represent a *sequence* of these separated
by commas? I guess I can artificially tack a comma on the
end of the input string and do:

r"\{('([^']|\\')*',)*\}"

but that seems like an ugly hack...

I want to end up with a python array of strings like:

['the', "dog's", 'bite']

Any simple clear way of parsing this in python would be
great; I just assume that "re" is the appropriate technique.
Performance is not an issue.

-- George

Jul 19 '05 #1
gr*@ll.mit.edu wrote:
I have a string like:
{'the','dog\'s','bite'}
or maybe:
{'the'}
or sometimes:
{}
[snip]
I want to end up with a python array of strings like:

['the', "dog's", 'bite']

Any simple clear way of parsing this in python would be
great; I just assume that "re" is the appropriate technique.
Performance is not an issue.

py> s = "{'the','dog\'s','bite'}"
py> s
"{'the','dog's','bite'}"
py> s[1:-1]
"'the','dog's','bite'"
py> s[1:-1].split(',')
["'the'", "'dog's'", "'bite'"]
py> [item[1:-1] for item in s[1:-1].split(',')]
['the', "dog's", 'bite']

py> s = "{'the'}"
py> [item[1:-1] for item in s[1:-1].split(',')]
['the']

py> s = "{}"
py> [item[1:-1] for item in s[1:-1].split(',')]
['']

Not sure what you want in the last case, but if you want an empty list,
you can probably add a simple if-statement to check if s[1:-1] is non-empty.
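The if-statement mentioned above might look like the following minimal sketch. The helper name `parse_simple_array` is made up for illustration, and the approach still assumes that no element contains a comma and that every element is wrapped in exactly one pair of quotes.

```python
def parse_simple_array(s):
    """Hypothetical helper: split a brace-delimited list of quoted strings.

    Assumes no element contains a comma; backslash escapes are NOT decoded.
    """
    inner = s[1:-1]      # drop the surrounding braces
    if not inner:        # "{}" should give an empty list, not ['']
        return []
    # strip one quote character from each end of every element
    return [item[1:-1] for item in inner.split(",")]


print(parse_simple_array("{'the','dog's','bite'}"))
print(parse_simple_array("{}"))
```

An element such as 'a,b' would be split in two by this approach, so it only suits input known to be free of embedded commas.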

HTH,

STeVe
Jul 19 '05 #2
gr*@ll.mit.edu writes:
I have a string like:
{'the','dog\'s','bite'}
or maybe:
{'the'}
or sometimes:
{}

[FYI: this is postgresql database "array" field output format]

which I'm trying to parse with the re module.
A single quoted string would, I think, be:
r"\{'([^']|\\')*'\}"


what about {'dog \\', ...} ?

If you don't need to validate anything, you can just forget about the commas
etc. and extract all the 'strings' with findall.

The regexp below is a bit too complicated (adapted from something else) but I
think will work:

In [90]:rex = re.compile(r"'(?:[^\n]|(?<!\\)(?:\\)(?:\\\\)*\n)*?(?<!\\)(?:\\\\)*?'")

In [91]:rex.findall(r"{'the','dog\'s','bite'}")
Out[91]:["'the'", "'dog\\'s'", "'bite'"]

Otherwise just add something like ",|}$" to deal with the final } instead of a
comma.

Alternatively, you could also write a regexp to split on the "','" bit and trim
the first and the last split.
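As a rough illustration of the findall idea, here is a shorter pattern than the one above. It is a sketch, not the poster's regexp: the names `_QUOTED` and `extract_strings` are invented for the example, and it assumes a backslash always escapes the single character that follows it. Like the original suggestion, it does no validation of the braces or commas.

```python
import re

# A quoted string: opening quote, then any run of "ordinary characters"
# or "backslash + any character", then the closing quote.
_QUOTED = re.compile(r"'((?:[^'\\]|\\.)*)'")


def extract_strings(s):
    """Hypothetical helper: return the unescaped body of each quoted string."""
    # findall returns only the captured group, i.e. the text between quotes;
    # the substitution then drops the escaping backslashes.
    return [re.sub(r"\\(.)", r"\1", body) for body in _QUOTED.findall(s)]


print(extract_strings(r"{'the','dog\'s','bite'}"))
print(extract_strings(r"{'dog \\','x'}"))
print(extract_strings("{}"))
```

Because an escaped backslash is consumed as a pair, the {'dog \\', ...} case is handled without lookbehind assertions.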

'as


Jul 19 '05 #3
Pyparsing includes some built-in quoted string support that might
simplify this problem. Of course, if you prefer regexp's, I'm by no
means offended!

Check out my Python console session below. (You may need to expand the
unquote method to do more handling of backslash escapes.)

-- Paul
(Download pyparsing at http://pyparsing.sourceforge.net.)
>>> from pyparsing import delimitedList, sglQuotedString
>>> text = r"'the','dog\'s','bite'"
>>> def unquote(s,l,t):
...     t2 = t[0][1:-1]
...     return t2.replace("\\'","'")
...
>>> sglQuotedString.setParseAction(unquote)
>>> g = delimitedList( sglQuotedString )
>>> g.parseString(text).asList()
['the', "dog's", 'bite']

Jul 19 '05 #4
Paul McGuire wrote:
text = r"'the','dog\'s','bite'"
def unquote(s,l,t):
...     t2 = t[0][1:-1]
...     return t2.replace("\\'","'")
...


Note also, that the codec 'string-escape' can be used to do what's done
with str.replace in this example:

py> s
"'the','dog\\'s','bite'"
py> s.replace("\\'", "'")
"'the','dog's','bite'"
py> s.decode('string-escape')
"'the','dog's','bite'"

Using str.decode() is a little more general as it will also decode other
escaped characters. This may be good or bad depending on your needs.
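The session above relies on Python 2: in Python 3, str has no decode() method and the 'string-escape' codec is gone. A rough Python 3 counterpart, assuming the text is pure ASCII, is to go through bytes and the 'unicode_escape' codec. This is a sketch of one possible substitute, not an exact equivalent.

```python
s = r"'the','dog\'s','bite'"

# The narrow approach: only undo escaped single quotes.
plain = s.replace("\\'", "'")

# The more general approach: decode every backslash escape.
# Assumption: the input is ASCII; encode("ascii") raises otherwise.
decoded = s.encode("ascii").decode("unicode_escape")

print(plain)
print(decoded)

# The general approach also turns sequences such as \t into real characters,
# which may or may not be wanted.
print(repr(r"a\tb".encode("ascii").decode("unicode_escape")))
```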

STeVe
Jul 19 '05 #5
Ah, this is much better than my crude replace technique. I forgot
about str.decode().

Thanks!
-- Paul

Jul 19 '05 #6
gry
PyParsing rocks! Here's what I ended up with:

def unpack_sql_array(s):
    import pyparsing as pp
    withquotes = pp.dblQuotedString.setParseAction(pp.removeQuotes)
    withoutquotes = pp.CharsNotIn('",')
    parser = pp.StringStart() + \
             pp.Word('{').suppress() + \
             pp.delimitedList(withquotes ^ withoutquotes) + \
             pp.Word('}').suppress() + \
             pp.StringEnd()
    return parser.parseString(s).asList()

>>> unpack_sql_array('{the,dog\'s,"foo,"}')
['the', "dog's", 'foo,']

[[Yes, this input is not what I stated originally. Someday, when I
reach a higher plane of existence, I will post a *complete* and
*correct* query to usenet...]]
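For comparison, the revised input format (bare elements, with double quotes only around elements that need them) can also be handled with the re module alone. The sketch below is an assumption-laden illustration, not a replacement for the grammar above: the names `_ELEMENT` and `unpack_sql_array_re` are invented, bare elements are assumed to contain no commas, quotes or braces, and a backslash inside quotes is assumed to escape the next character.

```python
import re

# Either a double-quoted element (group 1) or a bare element (group 2).
_ELEMENT = re.compile(r'"((?:[^"\\]|\\.)*)"|([^",{}]+)')


def unpack_sql_array_re(s):
    """Hypothetical regex-only counterpart to the pyparsing version."""
    if not (s.startswith("{") and s.endswith("}")):
        raise ValueError("expected a brace-delimited array literal")
    items = []
    for quoted, bare in _ELEMENT.findall(s[1:-1]):
        if bare:
            items.append(bare)
        else:
            # quoted element: drop the escaping backslashes
            items.append(re.sub(r"\\(.)", r"\1", quoted))
    return items


print(unpack_sql_array_re('{the,dog\'s,"foo,"}'))
print(unpack_sql_array_re("{}"))
```

Unlike a full grammar, this silently skips anything it does not recognise between elements, so it suits trusted input only.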

Does the above seem fragile or questionable in any way?
Thanks all for your comments!

-- George

Jul 19 '05 #7
George -

Thanks for your enthusiastic endorsement!

Here are some quibbles about your pyparsing grammar (but really, not
bad for a first timer):
1. The Word class is used to define "words" or collective groups of
characters, by specifying what sets of characters are valid as leading
and/or body chars, as in:
    integer = Word(digitsFrom0to9)
    firstName = Word(upcaseAlphas, lowcaseAlphas)
In your parser, I think you want the Literal class instead, to match
the literal string '{'.

2. I don't think there is any chance to confuse a withQuotes with a
withoutQuotes, so you can try using the "match first" operator '|',
rather than the greedy matching "match longest" operator '^'.

3. Lastly, don't be too quick to use asList() to convert parse results
into lists - parse results already have most of the list accessors
people would need to access the returned matched tokens. asList() just
cleans up the output a bit.

Good luck, and thanks for trying pyparsing!
-- Paul

Jul 19 '05 #8
gr*@ll.mit.edu wrote:
I have a string like:
{'the','dog\'s','bite'}
or maybe:
{'the'}
or sometimes:
{}
....
I want to end up with a python array of strings like:

['the', "dog's", 'bite']


Assuming that you trust the input, you could always use eval,
but since it seems fairly easy to solve anyway, that might
not be the best (at least not safest) solution.
>>> strings = [r'''{'the','dog\'s','bite'}''', '''{'the'}''', '''{}''']
>>> for s in strings:
...     print eval('['+s[1:-1]+']')
...
['the', "dog's", 'bite']
['the']
[]
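If the safety of eval is the concern, one alternative is ast.literal_eval from the standard library, which accepts only Python literals (strings, numbers, lists and so on) and rejects arbitrary expressions. The sketch below uses Python 3 syntax; the function name `unpack_quoted_array` is made up for illustration, and it assumes each element is a valid Python string literal, as in the original single-quoted examples.

```python
import ast


def unpack_quoted_array(s):
    """Hypothetical helper: parse {'a','b'} by rewriting it as a list literal."""
    # Swap the braces for brackets, then let the literal parser do the work.
    return ast.literal_eval("[" + s[1:-1] + "]")


strings = [r"{'the','dog\'s','bite'}", "{'the'}", "{}"]
for s in strings:
    print(unpack_quoted_array(s))
```

Input that is not a plain literal, such as a function call, raises ValueError or SyntaxError instead of being executed.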
Jul 19 '05 #9
