why does this call to re.findall() loop forever?

Hi everyone,

I am using Python's re module to extract some data from HTML. The
following code never returns, and I was wondering if someone can
explain to me why. Is this a problem with my regexp? (I tried really
hard to find it.)

The string contains three records (list items in an HTML page). Notice
that NONE of them matches the regexp: these records do not contain the
"title" element which the regexp expects inside '<span
class="date">'.

The weird thing is that removing any of the three records makes
findall() immediately return an empty list, while if I pass all three
records to findall() it never returns. Why does this happen?

This is with Python 2.6.

Thanks so much for any help

-james

import re

s="""<li class="post" key="4994199a0b80136cb3174e9e875c545e">
<h4 class="desc"><a href="http://www.sluggy.com/"
rel="nofollow">Sluggy Freelance</a>
</h4>
<div class="commands"> &nbsp;<a save href="/post?url=http%3A%2F
%2Fwww.sluggy.com%2F&amp;title=Sluggy
%20Freelance&amp;copyuser=crowebert&amp;copytags=imported%2BRSS
%2BComics%2Bhumor%2Bdaily%2Bwebcomics&amp;jump=no&amp;partner=del"
class="copy" rel="nofollow">save this</a></div> <div class="meta">to
<a class="tag" href="/crowebert/imported">imported</a> <a class="tag"
href="/crowebert/RSS">RSS</a> <a class="tag" href="/crowebert/
Comics">Comics</a> <a class="tag" href="/crowebert/humor">humor</a> <a
class="tag" href="/crowebert/daily">daily</a> <a class="tag" href="/
crowebert/webcomics">webcomics</a> ... <a class="pop" href="/url/
ac655d3fe17873b31abeb29a1043e439" style="padding: 0 0.2em; background-
color: rgb(100%, 66%, 66%);">saved by 983 other people</a> <span
class="date">1945-07-18</span> </div>
</li>

<li class="post" key="65d66f4197fc7eba5c214fe85ed77725">
<h4 class="desc"><a href="http://www.snackbar-games.com/
gbacovers.php" rel="nofollow">Snackbar-Games.com :: GBA DS Cover
Project</a>
</h4>
<div class="commands"> &nbsp;<a save href="/post?url=http%3A%2F
%2Fwww.snackbar-games.com%2Fgbacovers.php&amp;title=Snackbar-Games.com
%20%3A%3A%20GBA%20DS%20Cover
%20Project&amp;copyuser=crowebert&amp;copytags=imported%2BBookmarkMenu
%2BGameStuff%2Bart%2BGBA%2Bgames
%2Bnintendo&amp;jump=no&amp;partner=del" class="copy"
rel="nofollow">save this</a></div> <div class="meta">to <a class="tag"
href="/crowebert/imported">imported</a> <a class="tag" href="/
crowebert/BookmarkMenu">BookmarkMenu</a> <a class="tag" href="/
crowebert/GameStuff">GameStuff</a> <a class="tag" href="/crowebert/
art">art</a> <a class="tag" href="/crowebert/GBA">GBA</a> <a
class="tag" href="/crowebert/games">games</a> <a class="tag" href="/
crowebert/nintendo">nintendo</a> ... <a class="pop" href="/url/
a65a4a0ebe813ec6e9c881331e3f9583" style="padding: 0 0.2em; background-
color: rgb(100%, 84%, 84%);">saved by 26 other people</a> <span
class="date">1948-12-31</span> </div>
</li>

<li class="post" key="690ace1f465ae419dee8145ad3871024">
<h4 class="desc"><a href="http://www.megatokyo.com/"
rel="nofollow">MegaTokyo</a>
</h4>
<div class="commands"> &nbsp;<a save href="/post?url=http%3A%2F
%2Fwww.megatokyo.com
%2F&amp;title=MegaTokyo&amp;copyuser=crowebert&amp;copytags=imported
%2BBookmarkBar%2BWeekendComics%2Bcomics%2Bmanga%2Bhumor
%2Bwebcomics&amp;jump=no&amp;partner=del" class="copy"
rel="nofollow">save this</a></div> <div class="meta">to <a class="tag"
href="/crowebert/imported">imported</a> <a class="tag" href="/
crowebert/BookmarkBar">BookmarkBar</a> <a class="tag" href="/crowebert/
WeekendComics">WeekendComics</a> <a class="tag" href="/crowebert/
comics">comics</a> <a class="tag" href="/crowebert/manga">manga</a> <a
class="tag" href="/crowebert/humor">humor</a> <a class="tag" href="/
crowebert/webcomics">webcomics</a> ... <a class="pop" href="/url/
94843244f0c6d80f1c6806ed5c0abec7" style="padding: 0 0.2em; background-
color: rgb(100%, 60%, 60%);">saved by 2784 other people</a> <span
class="date">1946-01-28</span> </div>
</li>"""

regexp = re.compile("<li class=\"post\".*?<h4 class=\"desc\"><a href=
\"(.*?)\" rel=\"nofollow\">(.*?)</a>.*?</div>\s*(?:<p class=\"notes
\">(.*?)</p>)?.*?<div class=\"meta\">(?:to ((?:<a class=\"tag\".*?)
+))*.*?<span class=\"date\" title=\"(.*?)\">.*?</span>\s*</div>.*?</
li>", re.DOTALL)

re.findall(regexp, s)
Nov 9 '08 #1
My apologies, Google Groups messed up the formatting. The regexp
should read as follows (split into adjacent string literals so that it
survives line wrapping):

regexp = re.compile(
    r'<li class="post".*?<h4 class="desc"><a href="(.*?)" '
    r'rel="nofollow">(.*?)</a>.*?</div>\s*'
    r'(?:<p class="notes">(.*?)</p>)?.*?<div class="meta">'
    r'(?:to ((?:<a class="tag".*?)+))*.*?'
    r'<span class="date" title="(.*?)">.*?</span>\s*</div>.*?</li>',
    re.DOTALL)
Nov 9 '08 #2
ja***********@gmail.com wrote:
> Hi everyone,
>
> I am using Python's re module to extract some data from HTML. The
> following code never returns, and I was wondering if someone can
> explain to me why. Is this a problem with my regexp? (I tried really
> hard to find it.)

[snip] html/xml string

> regexp = re.compile("<li class=\"post\".*?<h4 class=\"desc\"><a href=
> \"(.*?)\" rel=\"nofollow\">(.*?)</a>.*?</div>\s*(?:<p class=\"notes
> \">(.*?)</p>)?.*?<div class=\"meta\">(?:to ((?:<a class=\"tag\".*?)
> +))*.*?<span class=\"date\" title=\"(.*?)\">.*?</span>\s*</div>.*?</
> li>", re.DOTALL)
>
> re.findall(regexp, s)

Python has several modules for parsing and working with XML. Do you
not know of them, or is there some reason they won't work?
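
As one illustration of the parser route, here is a minimal sketch using the standard library's html.parser. It is written for Python 3 (on Python 2.x the module is named HTMLParser). The PostParser class, its field names, and the trimmed, well-formed sample record are assumptions made for this example; they are not taken from the original page.

```python
from html.parser import HTMLParser


class PostParser(HTMLParser):
    """Collect url, title and date for each <li class="post"> record."""

    def __init__(self):
        super().__init__()
        self.posts = []        # finished records
        self._post = None      # record being built, None outside a post
        self._in_desc = False  # True while inside <h4 class="desc">
        self._field = None     # text field currently being filled, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        cls = attrs.get("class")
        if tag == "li" and cls == "post":
            self._post = {"url": None, "title": "", "date": ""}
        elif self._post is None:
            return
        elif tag == "h4" and cls == "desc":
            self._in_desc = True
        elif tag == "a" and self._in_desc:
            self._post["url"] = attrs.get("href")
            self._field = "title"
        elif tag == "span" and cls == "date":
            self._field = "date"

    def handle_endtag(self, tag):
        if self._post is None:
            return
        if tag in ("a", "span"):
            self._field = None
        elif tag == "h4":
            self._in_desc = False
        elif tag == "li":
            self.posts.append(self._post)
            self._post = None

    def handle_data(self, data):
        if self._post is not None and self._field:
            self._post[self._field] += data


# Hypothetical, simplified record modelled on the first list item.
sample = """
<li class="post" key="k1">
<h4 class="desc"><a href="http://www.sluggy.com/" rel="nofollow">Sluggy Freelance</a></h4>
<div class="meta">to <a class="tag" href="/crowebert/RSS">RSS</a>
<span class="date">1945-07-18</span></div>
</li>
"""

parser = PostParser()
parser.feed(sample)
print(parser.posts)
```

A parser like this does a single pass over the input, so a record that lacks an expected attribute simply leaves a field empty instead of triggering repeated retries.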

Nov 9 '08 #3
ja***********@gmail.com <ja***********@gmail.com> wrote:
> My apologies, given that Google Groups messes up the formatting, the
> regexp should read
>
> regexp = re.compile("""<li class=\"post\".*?<h4 class=\"desc\"><a
> href=
> \"(.*?)\" rel=\"nofollow\">(.*?)</a>.*?</div>\s*(?:<p class=\"notes
> \">(.*?)</p>)?.*?<div class=\"meta\">(?:to ((?:<a class=\"tag\".*?)
> +))*.*?<span class=\"date\" title=\"(.*?)\">.*?</span>\s*</div>.*?</
> li>""", re.DOTALL)

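Some regular expressions can't be matched in a reasonable length of
time, because a backtracking matcher may have to try exponentially
many ways of splitting up the input before it can give up. I'm not
sure whether this is your problem, but it might be! Search for
"exponential time regular expression" if you want some examples.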

Eg http://bugs.python.org/issue1515829
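
A small-scale sketch of the effect (Python 3). The pattern and the helper name time_failed_match are illustrative only and unrelated to the regexp above. The nested quantifiers let the matcher split a run of "a" characters in many different ways, and it has to try them all before reporting failure.

```python
import re
import time

# Nested quantifiers: the inner a+ and the outer (...)+ can divide a run
# of "a" characters between them in exponentially many ways.
pattern = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Return (match, seconds) for n 'a' characters followed by 'b'."""
    text = "a" * n + "b"  # the trailing "b" guarantees the match fails
    start = time.perf_counter()
    match = pattern.match(text)
    return match, time.perf_counter() - start


for n in (10, 14, 18):
    match, seconds = time_failed_match(n)
    print(n, match, round(seconds, 4))
```

The match is None every time. On CPython's backtracking engine the elapsed time should climb steeply as n grows, so keep n small when experimenting.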

I'd probably attack this problem using BeautifulSoup rather than
regexps!

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Nov 10 '08 #4
