simple regular expression problem

Hello,

I am trying to extract a list of strings from a text. I have been looking
at it for hours now, and googling didn't help either.
Could you please help me?
>>> import re
>>> s = """ \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"""
>>> regex = re.compile(r'<organisatie.*</organisatie>', re.S)
>>> L = regex.findall(s)
>>> print L
['<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>']

I expected:
['<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>', '<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>']

I must be missing something very obvious.

Greetings, Arjen

Sep 17 '07 #1
On Sep 17, 9:00 am, duikboot <dijkstra.ar...@gmail.com> wrote:
> I must be missing something very obvious.

The less obvious thing that you're missing is that regular expressions
are not the best solution to every text-related problem. Thinking at a
higher level helps sometimes; for example, here you don't want to
extract "a list of strings from a text", you want to extract specific
elements from an XML data source. There are several standard and
non-standard Python packages for XML processing; look for them online.
Here's how to do it using the (third-party) BeautifulSoup module:
>>> from BeautifulSoup import BeautifulStoneSoup
>>> BeautifulStoneSoup(s).findAll('organisatie')
[<organisatie>
<profiel_id>28996</profiel_id>
</organisatie>, <organisatie>
<profiel_id>28997</profiel_id>
</organisatie>]
HTH,
George

Sep 17 '07 #2
duikboot wrote:
Regarding the regexp, Jason gave you the answer. Another point is that,
when dealing with XML, it's sometimes better to use an XML parser.
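
Jason's reply is not reproduced on this page. Assuming it pointed at the usual cause, the greedy ".*", a minimal Python 3 sketch of the difference between the greedy and the non-greedy (".*?") pattern could look like this. The names greedy and lazy are only illustrative:

```python
import re

# Same text as in the original post (real newlines instead of \n escapes
# inside a triple-quoted string, which is equivalent).
s = (
    " \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>"
    "\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"
)

# Greedy: ".*" runs to the LAST "</organisatie>", so everything is one match.
greedy = re.compile(r"<organisatie.*</organisatie>", re.S)

# Non-greedy: ".*?" stops at the FIRST "</organisatie>" after each start tag.
lazy = re.compile(r"<organisatie.*?</organisatie>", re.S)

greedy_matches = greedy.findall(s)
lazy_matches = lazy.findall(s)

print(len(greedy_matches), len(lazy_matches))  # prints: 1 2
```

This only addresses the regexp question as asked; the parser-based approaches in this thread remain the more robust route.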

Quick and dirty:
>>> from xml.etree import ElementTree as ET
>>> s = "<root>" + s + "</root>"
>>> tree = ET.fromstring(s)
>>> tree
<Element root at b795b2ac>
>>> tree.findall("organisatie/Profiel_Id")
[<Element Profiel_Id at b795b32c>, <Element Profiel_Id at b795b3ec>]
>>> _[0].text
'28996'
>>> [it.text for it in tree.findall("organisatie/Profiel_Id")]
['28996', '28997']
>>>
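
The "<root>" wrapper in the session above is there because an XML document must have exactly one top-level element. A small Python 3 sketch of what happens without it, using only the standard library (the variable names are illustrative, and the fragment is a shortened version of the original text):

```python
import xml.etree.ElementTree as ET

fragment = (
    "<organisatie><Profiel_Id>28996</Profiel_Id></organisatie>\n"
    "<organisatie><Profiel_Id>28997</Profiel_Id></organisatie>"
)

# Two sibling top-level elements are not a well-formed document,
# so the parser rejects the bare fragment.
try:
    ET.fromstring(fragment)
    parsed_without_root = True
except ET.ParseError:
    parsed_without_root = False

# Wrapping the fragment in a single, arbitrarily named root element fixes that.
root = ET.fromstring("<root>" + fragment + "</root>")
ids = [el.text for el in root.findall("organisatie/Profiel_Id")]

print(parsed_without_root, ids)  # prints: False ['28996', '28997']
```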
HTH

Sep 17 '07 #3
duikboot wrote:

Don't use regular expressions to process XML. They're not the right tool for
the job, and even if simple cases like yours can often be made to work
initially, the longer you work with it, the more complex and troublesome the
code gets.
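
To illustrate how a pattern that works on the simple case can go wrong, here is a sketch in Python 3 with a hypothetical input in which the wrapper tag name (organisatie_lijst, invented for this example) merely starts with "organisatie". The non-greedy pattern is an assumption too; it is not taken from the original post:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: the wrapper tag shares a prefix with the element we want.
doc = (
    "<organisatie_lijst>"
    "<organisatie><Profiel_Id>28996</Profiel_Id></organisatie>"
    "<organisatie><Profiel_Id>28997</Profiel_Id></organisatie>"
    "</organisatie_lijst>"
)

# The pattern cannot tell <organisatie_lijst> from <organisatie>,
# so the first match starts at the wrapper tag.
regex_matches = re.findall(r"<organisatie.*?</organisatie>", doc, re.S)
first_match_ok = regex_matches[0].startswith("<organisatie>")

# A parser compares whole tag names, so the wrapper is not confused
# with its children.
root = ET.fromstring(doc)
ids = [el.text for el in root.findall("organisatie/Profiel_Id")]

print(first_match_ok, ids)  # prints: False ['28996', '28997']
```

The pattern could be patched for this case, but each such patch is another step toward the complexity described above.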

Instead, use the right tool, for example lxml. It has, among other things,
XPath expressions built in, which do the job:
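from lxml import etree

# Parse the document; the <root> element wraps the two sibling elements.
tree = etree.fromstring("""<root><organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie></root>""")

# Select every Profiel_Id that is a child of an organisatie element.
for feld in tree.xpath('//organisatie/Profiel_Id'):
    print(feld.text)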

Diez
Sep 17 '07 #4
In article <11*********************@n39g2000hsh.googlegroups.com>,
duikboot <di************@gmail.com> wrote:
>
> I am trying to extract a list of strings from a text. I am looking it
> for hours now, googling didn't help either.

To emphasize the other answers you got about avoiding regexps, here's a
nice quote from my .sig database:

'Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.' --Jamie Zawinski
--
Aahz (aa**@pythoncraft.com) <*> http://www.pythoncraft.com/

The best way to get information on Usenet is not to ask a question, but
to post the wrong information.
Sep 18 '07 #5
