
regexp for sequence of quoted strings

gry
I have a string like:
{'the','dog\'s','bite'}
or maybe:
{'the'}
or sometimes:
{}

[FYI: this is postgresql database "array" field output format]

which I'm trying to parse with the re module.
A single quoted string would, I think, be:
r"\{'([^']|\\')*'\}"

but how do I represent a *sequence* of these separated
by commas? I guess I can artificially tack a comma on the
end of the input string and do:

r"\{('([^']|\\')*',)\}"

but that seems like an ugly hack...

I want to end up with a python array of strings like:

['the', "dog's", 'bite']

Any simple clear way of parsing this in python would be
great; I just assume that "re" is the appropriate technique.
Performance is not an issue.

-- George

Jul 19 '05 #1
gr*@ll.mit.edu wrote:
I have a string like:
{'the','dog\'s','bite'}
or maybe:
{'the'}
or sometimes:
{}
[snip]
I want to end up with a python array of strings like:

['the', "dog's", 'bite']

Any simple clear way of parsing this in python would be
great; I just assume that "re" is the appropriate technique.
Performance is not an issue.

py> s = "{'the','dog\'s','bite'}"
py> s
"{'the','dog's','bite'}"
py> s[1:-1]
"'the','dog's','bite'"
py> s[1:-1].split(',')
["'the'", "'dog's'", "'bite'"]
py> [item[1:-1] for item in s[1:-1].split(',')]
['the', "dog's", 'bite']

py> s = "{'the'}"
py> [item[1:-1] for item in s[1:-1].split(',')]
['the']

py> s = "{}"
py> [item[1:-1] for item in s[1:-1].split(',')]
['']

Not sure what you want in the last case, but if you want an empty list,
you can probably add a simple if-statement to check if s[1:-1] is non-empty.
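
The if-statement mentioned above could look like the following minimal sketch. The function name split_simple_array is made up for illustration, and the replace() call is an added assumption to cover input that still contains a literal backslash before the quote. Like the session above, it breaks on items that themselves contain a comma.

```python
def split_simple_array(s):
    """Split "{'a','b'}" into ['a', 'b'] with plain string operations."""
    inner = s[1:-1]              # drop the surrounding braces
    if not inner:                # "{}" holds no items at all
        return []
    # Strip each item's quotes, then undo backslash-escaped quotes.
    return [item[1:-1].replace("\\'", "'") for item in inner.split(',')]
```

Called on "{}" this returns an empty list rather than [''].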

HTH,

STeVe
Jul 19 '05 #2
gr*@ll.mit.edu writes:
I have a string like:
{'the','dog\'s','bite'}
or maybe:
{'the'}
or sometimes:
{}

[FYI: this is postgresql database "array" field output format]

which I'm trying to parse with the re module.
A single quoted string would, I think, be:
r"\{'([^']|\\')*'\}"


what about {'dog \\', ...} ?

If you don't need to validate anything you can just forget about the commas
etc and extract all the 'strings' with findall,

The regexp below is a bit too complicated (adapted from something else) but I
think will work:

In [90]:rex = re.compile(r"'(?:[^\n]|(?<!\\)(?:\\)(?:\\\\)*\n)*?(?<!\\)(?:\\\\)*?'")

In [91]:rex.findall(r"{'the','dog\'s','bite'}")
Out[91]:["'the'", "'dog\\'s'", "'bite'"]

Otherwise just add something like ",|}$" to deal with the final } instead of a
comma.

Alternatively, you could also write a regexp to split on the "','" bit and trim
the first and the last split.
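
For comparison, here is a shorter findall-based sketch. It is not the regexp above but a simpler pattern swapped in for illustration: each item is a quote, then any mix of ordinary characters and backslash-escaped characters, then a quote. The names QUOTED and parse_quoted_items are hypothetical, and nothing outside the quotes (braces, commas) is validated.

```python
import re

# One single-quoted item: runs of characters that are neither a quote nor a
# backslash, or a backslash followed by any one character.
QUOTED = re.compile(r"'((?:[^'\\]|\\.)*)'")

def parse_quoted_items(s):
    """Return the unescaped contents of every single-quoted item in s."""
    # findall yields the captured bodies; re.sub drops each escaping backslash.
    return [re.sub(r"\\(.)", r"\1", body) for body in QUOTED.findall(s)]
```

Because an escaped backslash is consumed as a pair, an item such as 'dog \\' ends at the right quote.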

'as


Jul 19 '05 #3
Pyparsing includes some built-in quoted string support that might
simplify this problem. Of course, if you prefer regexp's, I'm by no
means offended!

Check out my Python console session below. (You may need to expand the
unquote method to do more handling of backslash escapes.)

-- Paul
(Download pyparsing at http://pyparsing.sourceforge.net.)
>>> from pyparsing import delimitedList, sglQuotedString
>>> text = r"'the','dog\'s','bite'"
>>> def unquote(s,l,t):
...     t2 = t[0][1:-1]
...     return t2.replace("\\'","'")
...
>>> sglQuotedString.setParseAction(unquote)
>>> g = delimitedList( sglQuotedString )
>>> g.parseString(text).asList()
['the', "dog's", 'bite']

Jul 19 '05 #4
Paul McGuire wrote:
>>> text = r"'the','dog\'s','bite'"
>>> def unquote(s,l,t):
...     t2 = t[0][1:-1]
...     return t2.replace("\\'","'")
...

Note also, that the codec 'string-escape' can be used to do what's done
with str.replace in this example:

py> s
"'the','dog\\'s','bite'"
py> s.replace("\\'", "'")
"'the','dog's','bite'"
py> s.decode('string-escape')
"'the','dog's','bite'"

Using str.decode() is a little more general as it will also decode other
escaped characters. This may be good or bad depending on your needs.
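
The session above relies on Python 2: str.decode() and the 'string-escape' codec are not available in Python 3. A rough Python 3 counterpart, as a sketch only, goes through bytes and the 'unicode_escape' codec. The helper name unescape is made up here, and the Latin-1 round trip is an assumption about the input.

```python
def unescape(s):
    """Decode backslash escapes in s, roughly like Python 2's 'string-escape'."""
    # Assumes s holds only Latin-1 characters; encode() raises otherwise.
    return s.encode('latin-1').decode('unicode_escape')
```

As with str.decode(), this also turns sequences such as \t and \n into the characters they name, which may or may not be wanted.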

STeVe
Jul 19 '05 #5
Ah, this is much better than my crude replace technique. I forgot
about str.decode().

Thanks!
-- Paul

Jul 19 '05 #6
gry
PyParsing rocks! Here's what I ended up with:

def unpack_sql_array(s):
    import pyparsing as pp
    withquotes = pp.dblQuotedString.setParseAction(pp.removeQuotes)
    withoutquotes = pp.CharsNotIn('",')
    parser = pp.StringStart() + \
             pp.Word('{').suppress() + \
             pp.delimitedList(withquotes ^ withoutquotes) + \
             pp.Word('}').suppress() + \
             pp.StringEnd()
    return parser.parseString(s).asList()

>>> unpack_sql_array('{the,dog\'s,"foo,"}')
['the', "dog's", 'foo,']

[[Yes, this input is not what I stated originally. Someday, when I
reach a higher plane of existence, I will post a *complete* and
*correct* query to usenet...]]

Does the above seem fragile or questionable in any way?
Thanks all for your comments!
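
For this revised input format (bare items, with double quotes only around items that contain a comma), a standard-library alternative is sketched below for comparison. It uses the csv module instead of pyparsing; the function name unpack_sql_array_csv is hypothetical, and the sketch assumes no backslash escapes and no nested arrays.

```python
import csv

def unpack_sql_array_csv(s):
    """Split '{a,b,"c,d"}' into ['a', 'b', 'c,d'] using the csv module."""
    inner = s[1:-1]              # drop the surrounding braces
    if not inner:                # "{}" holds no items at all
        return []
    # Treat the contents as one CSV record with double-quoted fields.
    return next(csv.reader([inner], quotechar='"'))
```

The empty-array case is handled explicitly rather than left to the reader.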

-- George

Jul 19 '05 #7
George -

Thanks for your enthusiastic endorsement!

Here are some quibbles about your pyparsing grammar (but really, not
bad for a first timer):
1. The Word class is used to define "words" or collective groups of
characters, by specifying what sets of characters are valid as leading
and/or body chars, as in:
    integer = Word(digitsFrom0to9)
    firstName = Word(upcaseAlphas, lowcaseAlphas)
In your parser, I think you want the Literal class instead, to match
the literal string '{'.

2. I don't think there is any chance to confuse a withQuotes with a
withoutQuotes, so you can try using the "match first" operator '|',
rather than the greedy matching "match longest" operator '^'.

3. Lastly, don't be too quick to use asList() to convert parse results
into lists - parse results already have most of the list accessors
people would need to access the returned matched tokens. asList() just
cleans up the output a bit.

Good luck, and thanks for trying pyparsing!
-- Paul

Jul 19 '05 #8
gr*@ll.mit.edu wrote:
I have a string like:
{'the','dog\'s','bite'}
or maybe:
{'the'}
or sometimes:
{}
....
I want to end up with a python array of strings like:

['the', "dog's", 'bite']


Assuming that you trust the input, you could always use eval,
but since it seems fairly easy to solve anyway, that might
not be the best (at least not safest) solution.
>>> strings = [r'''{'the','dog\'s','bite'}''', '''{'the'}''', '''{}''']
>>> for s in strings:
...     print eval('['+s[1:-1]+']')
...
['the', "dog's", 'bite']
['the']
[]
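
If the safety of eval is the concern, the same trick can be sketched with ast.literal_eval from the standard library, which accepts only Python literals and refuses arbitrary expressions. The function name parse_with_literal_eval is made up here, and the sketch assumes the items are valid Python string literals.

```python
import ast

def parse_with_literal_eval(s):
    """Parse "{'a','b'}" by reading its contents as a Python list literal."""
    # Swap the outer braces for brackets, then evaluate literals only.
    return ast.literal_eval('[' + s[1:-1] + ']')
```

Input that is not a plain literal raises an exception (ValueError or SyntaxError) instead of being executed.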
Jul 19 '05 #9
