
simple regular expression problem

Hello,

I am trying to extract a list of strings from a text. I have been looking
at it for hours now, and googling didn't help either.
Could you please help me?
>>> import re
>>> s = """ \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"""
>>> regex = re.compile(r'<organisatie.*</organisatie>', re.S)
>>> L = regex.findall(s)
>>> print(L)
['<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>']

I expected:
['<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>', '<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>']

I must be missing something very obvious.

Greetings Arjen

Sep 17 '07 #1
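
On the regex itself: the pattern returns a single match because `.*` is greedy, so with re.S it runs from the first opening tag to the last closing tag. A minimal sketch of the usual fix, assuming the goal is one match per <organisatie> element, is the non-greedy `.*?` (the names `greedy` and `lazy` are only illustrative):

```python
import re

s = """ \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"""

# Greedy: .* consumes as much as possible, so the match spans both elements.
greedy = re.compile(r'<organisatie.*</organisatie>', re.S)

# Non-greedy: .*? stops at the first closing tag it can reach.
lazy = re.compile(r'<organisatie>.*?</organisatie>', re.S)

greedy_matches = greedy.findall(s)
lazy_matches = lazy.findall(s)

print(len(greedy_matches))  # one long match
print(lazy_matches)         # one string per element
```

This only holds for flat, regular input like the sample above; the replies below explain why a parser is the safer choice for XML.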
4 Replies


On Sep 17, 9:00 am, duikboot <dijkstra.ar...@gmail.com> wrote:
> I must be missing something very obvious.

The less obvious thing that you're missing is that regular expressions
are not the best solution to every text-related problem. Thinking at a
higher level helps sometimes; for example, here you don't want to
extract "a list of strings from a text", you want to extract specific
elements from an XML data source. There are several standard and
non-standard Python packages for XML processing; look for them online.
Here's how to do it using the (third-party) BeautifulSoup module:
>>> from BeautifulSoup import BeautifulStoneSoup
>>> BeautifulStoneSoup(s).findAll('organisatie')
[<organisatie>
<profiel_id>28996</profiel_id>
</organisatie>, <organisatie>
<profiel_id>28997</profiel_id>
</organisatie>]
HTH,
George

Sep 17 '07 #2
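
For readers without BeautifulSoup installed, here is a rough standard-library counterpart, sketched with xml.dom.minidom. The wrapping <root> element is an assumption added here so that the fragment has a single top-level element; unlike the BeautifulStoneSoup output above, minidom keeps the original tag case.

```python
from xml.dom import minidom

s = """ \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"""

# The sample has several top-level elements, so wrap it in an
# artificial <root> element before parsing (an assumption of this sketch).
doc = minidom.parseString("<root>" + s + "</root>")

# Collect every <organisatie> element and serialize each one back to text.
elements = doc.getElementsByTagName("organisatie")
fragments = [el.toxml() for el in elements]

for fragment in fragments:
    print(fragment)
```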

duikboot wrote:
> I must be missing something very obvious.

With regard to the regexp, Jason gave you the answer. Another point is that,
when dealing with XML, it's sometimes better to use an XML parser.

Q&D:
>>> from xml.etree import ElementTree as ET
>>> s = "<root>" + s + "</root>"
>>> tree = ET.fromstring(s)
>>> tree
<Element root at b795b2ac>
>>> tree.findall("organisatie/Profiel_Id")
[<Element Profiel_Id at b795b32c>, <Element Profiel_Id at b795b3ec>]
>>> _[0].text
'28996'
>>> [it.text for it in tree.findall("organisatie/Profiel_Id")]
['28996', '28997']
HTH

Sep 17 '07 #3

duikboot wrote:
> I must be missing something very obvious.

Don't use regular expressions to process XML. They are not the right tool for
the job, and even if simple cases like yours can often be made to work initially,
the longer you work with them, the more complex and troublesome the code
gets.

Instead, use the right tool, for example lxml. It has, among other things,
XPath expressions built in, which do the job:
from lxml import etree

tree = etree.fromstring("""<root><organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie></root>""")

for feld in tree.xpath('//organisatie/Profiel_Id'):
    print(feld.text)

Diez
Sep 17 '07 #4
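
To illustrate how a regex that works on the simple case can break later, here is a small sketch. The `type="a"` attribute is a hypothetical change to the thread's data, and the standard-library ElementTree stands in for lxml so the example runs without extra packages.

```python
import re
from xml.etree import ElementTree as ET

# Hypothetical variation of the thread's data: the first element now
# carries an attribute (type="a"), which the original sample did not have.
s = ('<root><organisatie type="a">\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n'
     '<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie></root>')

# A regex written against the attribute-free form silently skips the first element.
regex_ids = re.findall(r'<organisatie>\s*<Profiel_Id>(\d+)</Profiel_Id>', s)

# A parser looks at element names, so the attribute makes no difference.
parser_ids = [el.text for el in ET.fromstring(s).findall("organisatie/Profiel_Id")]

print(regex_ids)   # only the second id
print(parser_ids)  # both ids
```

The regex could of course be patched to allow attributes, but each such patch adds to the complexity described above.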

In article <11*********************@n39g2000hsh.googlegroups.com>,
duikboot <di************@gmail.com> wrote:
>
> I am trying to extract a list of strings from a text. I have been looking
> at it for hours now, and googling didn't help either.

To emphasize the other answers you got about avoiding regexps, here's a
nice quote from my .sig database:

'Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.' --Jamie Zawinski
--
Aahz (aa**@pythoncraft.com) <*> http://www.pythoncraft.com/

The best way to get information on Usenet is not to ask a question, but
to post the wrong information.
Sep 18 '07 #5
