regexp for sequence of quoted strings

gry

I have a string like:
{'the','dog\'s','bite'}
or maybe:
{'the'}
or sometimes:
{}

[FYI: this is postgresql database "array" field output format]

which I'm trying to parse with the re module.
A single quoted string would, I think, be:
r"\{'([^']|\\')*'\}"

but how do I represent a *sequence* of these separated
by commas? I guess I can artificially tack a comma on the
end of the input string and do:

r"\{('([^']|\\')*',)\}"

but that seems like an ugly hack...

I want to end up with a python array of strings like:

['the', "dog's", 'bite']

Any simple clear way of parsing this in python would be
great; I just assume that "re" is the appropriate technique.
Performance is not an issue.

-- George

Jul 19 '05 #1


Steven Bethard

gr*@ll.mit.edu wrote:

I have a string like:
{'the','dog\'s','bite'}
or maybe:
{'the'}
or sometimes:
{}
[snip]
I want to end up with a python array of strings like:

['the', "dog's", 'bite']

Any simple clear way of parsing this in python would be
great; I just assume that "re" is the appropriate technique.
Performance is not an issue.

py> s = "{'the','dog\'s','bite'}"
py> s
"{'the','dog's','bite'}"
py> s[1:-1]
"'the','dog's','bite'"
py> s[1:-1].split(',')
["'the'", "'dog's'", "'bite'"]
py> [item[1:-1] for item in s[1:-1].split(',')]
['the', "dog's", 'bite']

py> s = "{'the'}"
py> [item[1:-1] for item in s[1:-1].split(',')]
['the']

py> s = "{}"
py> [item[1:-1] for item in s[1:-1].split(',')]
['']

Not sure what you want in the last case, but if you want an empty list,
you can probably add a simple if-statement to check if s[1:-1] is non-empty.
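
A minimal sketch of that if-statement, wrapped in a function. The name parse_braced_list is made up for illustration, and the approach still assumes that no item contains a comma and that backslash escapes do not need decoding:

```python
def parse_braced_list(s):
    """Turn "{'a','b'}" into ['a', 'b'] using slicing and str.split only."""
    inner = s[1:-1]            # drop the surrounding braces
    if not inner:              # "{}" should give [] rather than ['']
        return []
    # Strip the quote character from each end of every comma-separated item.
    return [item[1:-1] for item in inner.split(',')]
```

An item such as 'a,b' would be cut in two by the split, so this only suits input known to be free of embedded commas.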

HTH,

STeVe

Jul 19 '05 #2

Alexander Schmolck

gr*@ll.mit.edu writes:

I have a string like:
{'the','dog\'s','bite'}
or maybe:
{'the'}
or sometimes:
{}

[FYI: this is postgresql database "array" field output format]

which I'm trying to parse with the re module.
A single quoted string would, I think, be:
r"\{'([^']|\\')*'\}"

what about {'dog \\', ...} ?

If you don't need to validate anything you can just forget about the commas
etc. and extract all the 'strings' with findall.

The regexp below is a bit too complicated (adapted from something else) but I
think will work:

In [90]:rex = re.compile(r"'(?:[^\n]|(?<!\\)(?:\\)(?:\\\\)*\n)*?(?<!\\)(?:\\\\)*?'")

In [91]:rex.findall(r"{'the','dog\'s','bite'}")
Out[91]:["'the'", "'dog\\'s'", "'bite'"]

Otherwise just add something like ",|}$" to deal with the final } instead of a
comma.

Alternatively, you could also write a regexp to split on the "','" bit and trim
the first and the last split.
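
For comparison, a shorter pattern for the same findall idea might look like the sketch below. It is not the regexp from this post: it assumes every backslash escapes exactly the next character, and the names QUOTED and extract_quoted are invented here.

```python
import re

# Body of a single-quoted item: runs of characters that are neither a quote
# nor a backslash, or a backslash followed by any one character.
QUOTED = re.compile(r"'((?:[^'\\]|\\.)*)'")

def extract_quoted(s):
    """Return the unescaped contents of every single-quoted item in s."""
    # findall returns only the captured group, i.e. the text between the quotes.
    # The substitution then drops the backslash from each escaped character.
    return [re.sub(r"\\(.)", r"\1", body) for body in QUOTED.findall(s)]
```

Because an escaped backslash is consumed as a pair, an item ending in \\ (as in {'dog \\', ...}) is closed by the quote that follows it. Like plain findall, this does no validation of the braces or commas.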

'as

Jul 19 '05 #3

Paul McGuire

Pyparsing includes some built-in quoted string support that might
simplify this problem. Of course, if you prefer regexp's, I'm by no
means offended!

Check out my Python console session below. (You may need to expand the
unquote method to do more handling of backslash escapes.)

-- Paul
(Download pyparsing at http://pyparsing.sourceforge.net.)

>>> from pyparsing import delimitedList, sglQuotedString
>>> text = r"'the','dog\'s','bite'"
>>> def unquote(s,l,t):
...     t2 = t[0][1:-1]
...     return t2.replace("\\'","'")
...
>>> sglQuotedString.setParseAction(unquote)
>>> g = delimitedList( sglQuotedString )
>>> g.parseString(text).asList()
['the', "dog's", 'bite']

Jul 19 '05 #4

Steven Bethard

Paul McGuire wrote:

>>> text = r"'the','dog\'s','bite'"
>>> def unquote(s,l,t):
...     t2 = t[0][1:-1]
...     return t2.replace("\\'","'")
...

Note also, that the codec 'string-escape' can be used to do what's done
with str.replace in this example:

py> s
"'the','dog\\'s','bite'"
py> s.replace("\\'", "'")
"'the','dog's','bite'"
py> s.decode('string-escape')
"'the','dog's','bite'"

Using str.decode() is a little more general as it will also decode other
escaped characters. This may be good or bad depending on your needs.
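
The 'string-escape' codec and str.decode() are Python 2 features. A rough Python 3 counterpart, sketched here as an assumption rather than a drop-in replacement, goes through bytes and the 'unicode_escape' codec. The helper name unescape is made up:

```python
def unescape(s):
    """Decode backslash escapes in s, roughly like Python 2's 'string-escape'."""
    # unicode_escape decodes bytes as Latin-1 plus backslash escapes, so
    # characters outside Latin-1 are first turned into \uXXXX escapes
    # (backslashreplace) and come back unchanged after decoding.
    return s.encode('latin-1', 'backslashreplace').decode('unicode_escape')
```

As with str.decode('string-escape'), this decodes every escape it recognises (\n, \t, \x41 and so on), not just \', which may or may not be what the data calls for.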

STeVe

Jul 19 '05 #5

Paul McGuire

Ah, this is much better than my crude replace technique. I forgot
about str.decode().

Thanks!
-- Paul

Jul 19 '05 #6

gry

PyParsing rocks! Here's what I ended up with:

def unpack_sql_array(s):
    import pyparsing as pp
    withquotes = pp.dblQuotedString.setParseAction(pp.removeQuotes)
    withoutquotes = pp.CharsNotIn('",')
    parser = pp.StringStart() + \
             pp.Word('{').suppress() + \
             pp.delimitedList(withquotes ^ withoutquotes) + \
             pp.Word('}').suppress() + \
             pp.StringEnd()
    return parser.parseString(s).asList()

unpack_sql_array('{the,dog\'s,"foo,"}')
['the', "dog's", 'foo,']

[[Yes, this input is not what I stated originally. Someday, when I
reach a higher plane of existence, I will post a *complete* and
*correct* query to usenet...]]
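
For readers without pyparsing, the revised input format (bare items or double-quoted items) can also be approximated with the re module. This is only a sketch under stated assumptions: it does not validate the input, it ignores parts of the PostgreSQL array syntax such as NULL and nested arrays, and the names ITEM and unpack_sql_array_re are invented for the example.

```python
import re

# One element: a double-quoted string (backslash escapes allowed inside),
# or a bare run of characters with no comma, double quote, or brace.
ITEM = re.compile(r'"((?:[^"\\]|\\.)*)"|([^",{}]+)')

def unpack_sql_array_re(s):
    """Split '{a,b,"c,d"}' into ['a', 'b', 'c,d'] without validating it."""
    if not (s.startswith('{') and s.endswith('}')):
        raise ValueError('expected a string wrapped in braces')
    items = []
    for match in ITEM.finditer(s[1:-1]):
        quoted, bare = match.groups()
        if quoted is not None:
            # Drop the backslash from each escaped character.
            items.append(re.sub(r'\\(.)', r'\1', quoted))
        else:
            items.append(bare)
    return items
```

Text that matches neither alternative is silently skipped, so malformed input is not reported the way the StringStart/StringEnd grammar would report it.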

Does the above seem fragile or questionable in any way?
Thanks all for your comments!

-- George

Jul 19 '05 #7

Paul McGuire

George -

Thanks for your enthusiastic endorsement!

Here are some quibbles about your pyparsing grammar (but really, not
bad for a first timer):
1. The Word class is used to define "words" or collective groups of
characters, by specifying what sets of characters are valid as leading
and/or body chars, as in:
    integer = Word(digitsFrom0to9)
    firstName = Word(upcaseAlphas, lowcaseAlphas)
In your parser, I think you want the Literal class instead, to match
the literal string '{'.

2. I don't think there is any chance to confuse a withQuotes with a
withoutQuotes, so you can try using the "match first" operator '|',
rather than the greedy matching "match longest" operator '^'.

3. Lastly, don't be too quick to use asList() to convert parse results
into lists - parse results already have most of the list accessors
people would need to access the returned matched tokens. asList() just
cleans up the output a bit.

Good luck, and thanks for trying pyparsing!
-- Paul

Jul 19 '05 #8

Magnus Lycka

gr*@ll.mit.edu wrote:

I have a string like:
{'the','dog\'s','bite'}
or maybe:
{'the'}
or sometimes:
{}
...
I want to end up with a python array of strings like:

['the', "dog's", 'bite']

Assuming that you trust the input, you could always use eval,
but since it seems fairly easy to solve anyway, that might
not be the best (at least not safest) solution.

>>> strings = [r'''{'the','dog\'s','bite'}''', '''{'the'}''', '''{}''']
>>> for s in strings:
...     print eval('['+s[1:-1]+']')
...
['the', "dog's", 'bite']
['the']
[]
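
If the safety of eval is the concern, the standard library's ast.literal_eval accepts only Python literals and could be swapped in for the same brace-to-bracket trick. A minimal sketch, with a made-up function name:

```python
import ast

def parse_with_literal_eval(s):
    """Parse "{'a','b'}" by rewriting it as a Python list literal."""
    # Replace the outer braces with brackets, then let literal_eval parse it.
    # Unlike eval, literal_eval rejects function calls, names, and other
    # non-literal expressions.
    return ast.literal_eval('[' + s[1:-1] + ']')
```

This still relies on the items following Python's string-literal quoting rules, which the examples in this thread happen to do.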

Jul 19 '05 #9
