why does this call to re.findall() loop forever?

Hi everyone,

I am using Python's re module to extract some data from HTML. The
following code never returns, and I was wondering if someone could
explain to me why. Is this a problem with my regexp? (I tried really
hard to find one.)

The string contains three records (list items in an HTML page). Notice
that NONE of them matches the regexp: these records do not contain the
"title" attribute which the regexp expects inside '<span
class="date">'.

The weird thing is that removing any of the three records makes
findall() immediately return an empty list, while if I pass all three
records to findall() it never returns. Why does this happen?

This is using Python 2.6.

Thanks so much for any help

-james

s="""<li class="post" key="4994199a0b80136cb3174e9e875c545e">
<h4 class="desc"><a href="http://www.sluggy.com/"
rel="nofollow">Sluggy Freelance</a>
</h4>
<div class="commands">&nbsp;<a save href="/post?url=http%3A%2F
%2Fwww.sluggy.com%2F&amp;title=Sluggy
%20Freelance&amp;copyuser=crowebert&amp;copytags=imported%2BRSS
%2BComics%2Bhumor%2Bdaily%2Bwebcomics&amp;jump=no&amp;partner=del"
class="copy" rel="nofollow">save this</a></div> <div class="meta">to
<a class="tag" href="/crowebert/imported">imported</a> <a class="tag"
href="/crowebert/RSS">RSS</a> <a class="tag" href="/crowebert/
Comics">Comics</a> <a class="tag" href="/crowebert/humor">humor</a> <a
class="tag" href="/crowebert/daily">daily</a> <a class="tag" href="/
crowebert/webcomics">webcomics</a> ... <a class="pop" href="/url/
ac655d3fe17873b31abeb29a1043e439" style="padding: 0 0.2em; background-
color: rgb(100%, 66%, 66%);">saved by 983 other people</a> <span
class="date">1945-07-18</span> </div>
</li>

<li class="post" key="65d66f4197fc7eba5c214fe85ed77725">
<h4 class="desc"><a href="http://www.snackbar-games.com/
gbacovers.php" rel="nofollow">Snackbar-Games.com :: GBA DS Cover
Project</a>
</h4>
<div class="commands">&nbsp;<a save href="/post?url=http%3A%2F
%2Fwww.snackbar-games.com%2Fgbacovers.php&amp;title=Snackbar-Games.com
%20%3A%3A%20GBA%20DS%20Cover
%20Project&amp;copyuser=crowebert&amp;copytags=imported%2BBookmarkMenu
%2BGameStuff%2Bart%2BGBA%2Bgames
%2Bnintendo&amp;jump=no&amp;partner=del" class="copy"
rel="nofollow">save this</a></div> <div class="meta">to <a class="tag"
href="/crowebert/imported">imported</a> <a class="tag" href="/
crowebert/BookmarkMenu">BookmarkMenu</a> <a class="tag" href="/
crowebert/GameStuff">GameStuff</a> <a class="tag" href="/crowebert/
art">art</a> <a class="tag" href="/crowebert/GBA">GBA</a> <a
class="tag" href="/crowebert/games">games</a> <a class="tag" href="/
crowebert/nintendo">nintendo</a> ... <a class="pop" href="/url/
a65a4a0ebe813ec6e9c881331e3f9583" style="padding: 0 0.2em; background-
color: rgb(100%, 84%, 84%);">saved by 26 other people</a> <span
class="date">1948-12-31</span> </div>
</li>

<li class="post" key="690ace1f465ae419dee8145ad3871024">
<h4 class="desc"><a href="http://www.megatokyo.com/"
rel="nofollow">MegaTokyo</a>
</h4>
<div class="commands">&nbsp;<a save href="/post?url=http%3A%2F
%2Fwww.megatokyo.com
%2F&amp;title=MegaTokyo&amp;copyuser=crowebert&amp;copytags=imported
%2BBookmarkBar%2BWeekendComics%2Bcomics%2Bmanga%2Bhumor
%2Bwebcomics&amp;jump=no&amp;partner=del" class="copy"
rel="nofollow">save this</a></div> <div class="meta">to <a class="tag"
href="/crowebert/imported">imported</a> <a class="tag" href="/
crowebert/BookmarkBar">BookmarkBar</a> <a class="tag" href="/crowebert/
WeekendComics">WeekendComics</a> <a class="tag" href="/crowebert/
comics">comics</a> <a class="tag" href="/crowebert/manga">manga</a> <a
class="tag" href="/crowebert/humor">humor</a> <a class="tag" href="/
crowebert/webcomics">webcomics</a> ... <a class="pop" href="/url/
94843244f0c6d80f1c6806ed5c0abec7" style="padding: 0 0.2em; background-
color: rgb(100%, 60%, 60%);">saved by 2784 other people</a> <span
class="date">1946-01-28</span> </div>
</li>"""

regexp = re.compile("<li class=\"post\".*?<h4 class=\"desc\"><a href=
\"(.*?)\" rel=\"nofollow\">(.*?)</a>.*?</div>\s*(?:<p class=\"notes
\">(.*?)</p>)?.*?<div class=\"meta\">(?:to ((?:<a class=\"tag\".*?)
+))*.*?<span class=\"date\" title=\"(.*?)\">.*?</span>\s*</div>.*?</
li>", re.DOTALL)

re.findall(regexp, s)
Nov 9 '08 #1
My apologies, given that Google Groups messes up the formatting, the
regexp should read:

import re

# Adjacent raw strings are joined into one pattern, so line wrapping
# cannot insert stray newlines into the regexp.
regexp = re.compile(
    r'<li class="post".*?<h4 class="desc"><a href="(.*?)" rel="nofollow">'
    r'(.*?)</a>.*?</div>\s*(?:<p class="notes">(.*?)</p>)?.*?'
    r'<div class="meta">(?:to ((?:<a class="tag".*?)+))*.*?'
    r'<span class="date" title="(.*?)">.*?</span>\s*</div>.*?</li>',
    re.DOTALL)
Nov 9 '08 #2
ja***********@gmail.com wrote:
> Hi everyone,
>
> I am using Python's re module to extract some data from html. The
> following code never returns, and I was wondering if someone can
> explain to me why. Is this a problem with my regexp (I tried really
> hard to find it?)?
[snip] html/xml string
> regexp = re.compile("<li class=\"post\".*?<h4 class=\"desc\"><a href=
> \"(.*?)\" rel=\"nofollow\">(.*?)</a>.*?</div>\s*(?:<p class=\"notes
> \">(.*?)</p>)?.*?<div class=\"meta\">(?:to ((?:<a class=\"tag\".*?)
> +))*.*?<span class=\"date\" title=\"(.*?)\">.*?</span>\s*</div>.*?</
> li>", re.DOTALL)
>
> re.findall(regexp, s)

Python has several modules for parsing and working with XML. Do you
not know of them, or is there some reason they won't work?
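
As one illustration of that suggestion, here is a minimal sketch using html.parser.HTMLParser from the Python 3 standard library (the thread itself is on Python 2.6, where the module was named HTMLParser). The class name PostParser, the dictionary keys, and the shortened sample string are assumptions made for this example, not code from the original poster.

```python
from html.parser import HTMLParser


class PostParser(HTMLParser):
    """Collect link, title, tags and date from each <li class="post">."""

    def __init__(self):
        super().__init__()
        self.posts = []        # one dict per <li class="post">
        self._post = None      # record currently being filled, if any
        self._in_desc = False  # True while inside <h4 class="desc">
        self._field = None     # text field that incoming data belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        cls = attrs.get("class")
        if tag == "li" and cls == "post":
            self._post = {"href": None, "title": "", "tags": [], "date": ""}
        elif self._post is None:
            return
        elif tag == "h4" and cls == "desc":
            self._in_desc = True
        elif tag == "a" and self._in_desc:
            self._post["href"] = attrs.get("href")
            self._field = "title"
        elif tag == "a" and cls == "tag":
            self._post["tags"].append("")
            self._field = "tag"
        elif tag == "span" and cls == "date":
            self._field = "date"

    def handle_data(self, data):
        if self._post is None or self._field is None:
            return
        if self._field == "tag":
            self._post["tags"][-1] += data
        else:
            self._post[self._field] += data

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self._field = None
        elif tag == "h4":
            self._in_desc = False
        elif tag == "li" and self._post is not None:
            self.posts.append(self._post)
            self._post = None


# Hypothetical, shortened record modeled on the data in the question.
sample = """
<li class="post" key="k1">
<h4 class="desc"><a href="http://www.sluggy.com/" rel="nofollow">Sluggy Freelance</a>
</h4>
<div class="meta">to <a class="tag" href="/crowebert/imported">imported</a>
<a class="tag" href="/crowebert/RSS">RSS</a>
<span class="date">1945-07-18</span> </div>
</li>
"""

parser = PostParser()
parser.feed(sample)
parser.close()
print(parser.posts)
```

A parser like this visits each tag once, so a record with a missing attribute simply leaves a field empty instead of sending the engine into a long search.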

Nov 9 '08 #3
ja***********@gmail.com <ja***********@gmail.com> wrote:
> My apologies, given that Google Groups messes up the formatting, the
> regexp should read
>
> regexp = re.compile("""<li class=\"post\".*?<h4 class=\"desc\"><a
> href=
> \"(.*?)\" rel=\"nofollow\">(.*?)</a>.*?</div>\s*(?:<p class=\"notes
> \">(.*?)</p>)?.*?<div class=\"meta\">(?:to ((?:<a class=\"tag\".*?)
> +))*.*?<span class=\"date\" title=\"(.*?)\">.*?</span>\s*</div>.*?</
> li>""", re.DOTALL)

Some regular expressions can't be searched in a reasonable length of
time. Not sure whether this is your problem but it might be! Search
for "exponential time regular expression" if you want some examples.

Eg http://bugs.python.org/issue1515829
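
To make the idea concrete, here is a small sketch (Python 3, standard library only) of the textbook case: a quantifier nested inside another quantifier, searched against text that cannot match. The pattern names and the helper timed_search are made up for this example. The poster's `(?:to ((?:<a class="tag".*?)+))*.*?` has a similar shape, a lazy `.*?` inside a `+` inside a `*`, followed by another `.*?`. Since no record contains the `title="` the pattern requires, and re.DOTALL lets `.` run across record boundaries, the engine may have to try a number of combinations that grows exponentially with the number of tag links. That would be consistent with two records finishing quickly and three appearing to hang, but it is an interpretation and was not verified in this thread.

```python
import re
import time

# Nested quantifiers: a run of a's can be split between the inner and outer
# "+" in exponentially many ways, and a failing match tries them all.
nested = re.compile(r"(a+)+b")
# Matches the same strings without nesting, so a failure stays cheap.
flat = re.compile(r"a+b")


def timed_search(pattern, text):
    """Return (match_or_None, seconds taken)."""
    start = time.perf_counter()
    match = pattern.search(text)
    return match, time.perf_counter() - start


for n in (12, 14, 16, 18):
    text = "a" * n  # no "b" anywhere, so both searches must fail
    _, nested_secs = timed_search(nested, text)
    _, flat_secs = timed_search(flat, text)
    print("n=%d  nested=%.4fs  flat=%.6fs" % (n, nested_secs, flat_secs))
```

On CPython's backtracking engine the nested timings tend to grow by a roughly constant factor for each extra character, while the flat pattern stays negligible. Exact numbers depend on the machine and Python version.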

I'd probably attack this problem using BeautifulSoup rather than
regexps!
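
For readers who would rather stay with the re module, one possible alternative is to split the page into records first and then run several small, non-nested patterns on each record. This is only a sketch in Python 3: the pattern names, the function extract_posts, the dictionary keys, and the shortened sample are assumptions for illustration. It uses negated character classes such as `[^"]*` instead of `.*?`, which leaves the engine very little to backtrack over.

```python
import re

# One simple pattern per piece of data; none of them nests quantifiers.
RECORD = re.compile(r'<li class="post".*?</li>', re.DOTALL)
LINK = re.compile(r'<h4 class="desc">\s*<a href="([^"]*)"[^>]*>([^<]*)</a>')
TAG = re.compile(r'<a class="tag"[^>]*>([^<]*)</a>')
DATE = re.compile(
    r'<span class="date"(?: title="([^"]*)")?[^>]*>([^<]*)</span>')


def extract_posts(html):
    """Return one dict per <li class="post"> record found in html."""
    posts = []
    for record in RECORD.findall(html):
        link = LINK.search(record)
        date = DATE.search(record)
        posts.append({
            "href": link.group(1) if link else None,
            "title": link.group(2) if link else None,
            "tags": TAG.findall(record),
            # The title attribute is optional here, so records without it
            # still produce a result instead of a failed match.
            "date_title": date.group(1) if date else None,
            "date": date.group(2) if date else None,
        })
    return posts


# Hypothetical, shortened record modeled on the data in the question.
sample = (
    '<li class="post" key="k1">\n'
    '<h4 class="desc"><a href="http://www.sluggy.com/" rel="nofollow">'
    'Sluggy Freelance</a>\n</h4>\n'
    '<div class="meta">to <a class="tag" href="/crowebert/imported">'
    'imported</a> <a class="tag" href="/crowebert/RSS">RSS</a> '
    '<span class="date">1945-07-18</span> </div>\n</li>'
)

posts = extract_posts(sample)
print(posts)
```

Because every record is handled on its own, a missing attribute in one record cannot make the search spill over into the next one.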

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Nov 10 '08 #4
