why does this call to re.findall() loop forever?

james.kirin40

Hi everyone,

I am using Python's re module to extract some data from HTML. The
following code never returns, and I was wondering if someone can
explain to me why. Is this a problem with my regexp? (I tried really
hard to find one.)

The string contains three records (list items in an HTML page). Notice
that NONE of them matches the regexp: these records do not contain the
"title" attribute which the regexp expects inside the '<span
class="date">' tag.

The weird thing is that removing any of the three records makes
findall() immediately return an empty list, while if I pass all three
records to findall() it never returns. Why does this happen?

This is using Python 2.6.

Thanks so much for any help

-james

s="""<li class="post" key="4994199a0b80136cb3174e9e875c545e">
<h4 class="desc"><a href="http://www.sluggy.com/"
rel="nofollow">Sluggy Freelance</a>
</h4>
<div class="commands" <a save href="/post?url=http%3A%2F
%2Fwww.sluggy.com%2F&title=Sluggy
%20Freelance&copyuser=crowebert&copytags=i mported%2BRSS
%2BComics%2Bhumor%2Bdaily%2Bwebcomics&jump=no& amp;partner=del"
class="copy" rel="nofollow">save this</a></div<div class="meta">to
<a class="tag" href="/crowebert/imported">imported</a<a class="tag"
href="/crowebert/RSS">RSS</a<a class="tag" href="/crowebert/
Comics">Comics</a<a class="tag" href="/crowebert/humor">humor</a<a
class="tag" href="/crowebert/daily">daily</a<a class="tag" href="/
crowebert/webcomics">webcomics</a... <a class="pop" href="/url/
ac655d3fe17873b31abeb29a1043e439" style="padding: 0 0.2em; background-
color: rgb(100%, 66%, 66%);">saved by 983 other people</a<span
class="date">1945-07-18</span</div>
</li>

<li class="post" key="65d66f4197fc7eba5c214fe85ed77725">
<h4 class="desc"><a href="http://www.snackbar-games.com/
gbacovers.php" rel="nofollow">Snackbar-Games.com :: GBA DS Cover
Project</a>
</h4>
<div class="commands" <a save href="/post?url=http%3A%2F
%2Fwww.snackbar-games.com%2Fgbacovers.php&title=Snackbar-Games.com
%20%3A%3A%20GBA%20DS%20Cover
%20Project&copyuser=crowebert&copytags=imp orted%2BBookmarkMenu
%2BGameStuff%2Bart%2BGBA%2Bgames
%2Bnintendo&jump=no&partner=del" class="copy"
rel="nofollow">save this</a></div<div class="meta">to <a class="tag"
href="/crowebert/imported">imported</a<a class="tag" href="/
crowebert/BookmarkMenu">BookmarkMenu</a<a class="tag" href="/
crowebert/GameStuff">GameStuff</a<a class="tag" href="/crowebert/
art">art</a<a class="tag" href="/crowebert/GBA">GBA</a<a
class="tag" href="/crowebert/games">games</a<a class="tag" href="/
crowebert/nintendo">nintendo</a... <a class="pop" href="/url/
a65a4a0ebe813ec6e9c881331e3f9583" style="padding: 0 0.2em; background-
color: rgb(100%, 84%, 84%);">saved by 26 other people</a<span
class="date">1948-12-31</span</div>
</li>

<li class="post" key="690ace1f465ae419dee8145ad3871024">
<h4 class="desc"><a href="http://www.megatokyo.com/"
rel="nofollow">MegaTokyo</a>
</h4>
<div class="commands" <a save href="/post?url=http%3A%2F
%2Fwww.megatokyo.com
%2F&title=MegaTokyo&copyuser=crowebert&amp ;copytags=imported
%2BBookmarkBar%2BWeekendComics%2Bcomics%2Bmanga%2B humor
%2Bwebcomics&jump=no&partner=del" class="copy"
rel="nofollow">save this</a></div<div class="meta">to <a class="tag"
href="/crowebert/imported">imported</a<a class="tag" href="/
crowebert/BookmarkBar">BookmarkBar</a<a class="tag" href="/crowebert/
WeekendComics">WeekendComics</a<a class="tag" href="/crowebert/
comics">comics</a<a class="tag" href="/crowebert/manga">manga</a<a
class="tag" href="/crowebert/humor">humor</a<a class="tag" href="/
crowebert/webcomics">webcomics</a... <a class="pop" href="/url/
94843244f0c6d80f1c6806ed5c0abec7" style="padding: 0 0.2em; background-
color: rgb(100%, 60%, 60%);">saved by 2784 other people</a<span
class="date">1946-01-28</span</div>
</li>"""

import re

# The pattern is one logical line; adjacent raw strings are joined by Python,
# so no stray newlines end up inside the regexp.
regexp = re.compile(
    r'<li class="post".*?<h4 class="desc"><a href="(.*?)" rel="nofollow">(.*?)</a>'
    r'.*?</div>\s*(?:<p class="notes">(.*?)</p>)?'
    r'.*?<div class="meta">(?:to ((?:<a class="tag".*?)+))*'
    r'.*?<span class="date" title="(.*?)">.*?</span>\s*</div>.*?</li>',
    re.DOTALL)

re.findall(regexp, s)

Nov 9 '08 #1

james.kirin40

My apologies: Google Groups messed up the formatting. The regexp is
meant to be a single line with no embedded newlines, and should read:

regexp = re.compile(
    r'<li class="post".*?<h4 class="desc"><a href="(.*?)" rel="nofollow">(.*?)</a>'
    r'.*?</div>\s*(?:<p class="notes">(.*?)</p>)?'
    r'.*?<div class="meta">(?:to ((?:<a class="tag".*?)+))*'
    r'.*?<span class="date" title="(.*?)">.*?</span>\s*</div>.*?</li>',
    re.DOTALL)

Nov 9 '08 #2

Terry Reedy

ja***********@gmail.com wrote:

Hi everyone,

I am using Python's re module to extract some data from html. The
following code never returns, and I was wondering if someone can
explain to me why. Is this a problem with my regexp (I tried really
hard to find it?)?

[snip] html/xml string

regexp = re.compile("<li class=\"post\".*?<h4 class=\"desc\"><a href=
\"(.*?)\" rel=\"nofollow\">(.*?)</a>.*?</div>\s*(?:<p class=\"notes
\">(.*?)</p>)?.*?<div class=\"meta\">(?:to ((?:<a class=\"tag\".*?)
+))*.*?<span class=\"date\" title=\"(.*?)\">.*?</span>\s*</div>.*?</
li>", re.DOTALL)

re.findall(regexp, s)

Python has several modules for parsing and working with XML. Do you
not know of them, or is there some reason they won't work?
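As a rough illustration of the parser-based approach, here is a minimal sketch using the standard library's HTML parser. It assumes Python 3 (the module is named HTMLParser in Python 2.6), and the names PostParser, extract_posts and SAMPLE are made up for this example. SAMPLE is a simplified, well-formed stand-in for one record, not the original page.

```python
from html.parser import HTMLParser


class PostParser(HTMLParser):
    """Collect href, title and date for each <li class="post"> item."""

    def __init__(self):
        super().__init__()
        self.posts = []        # finished records
        self._current = None   # record being built, or None outside a post
        self._field = None     # text field we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "post":
            self._current = {"href": None, "title": "", "date": ""}
            return
        if self._current is None:
            return
        # Only the first rel="nofollow" link is the post title; later ones
        # (such as the "save this" link) are ignored.
        first_link = self._current["href"] is None
        if tag == "a" and attrs.get("rel") == "nofollow" and first_link:
            self._current["href"] = attrs.get("href")
            self._field = "title"
        elif tag == "span" and attrs.get("class") == "date":
            self._field = "date"

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] += data

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self._field = None
        elif tag == "li" and self._current is not None:
            self.posts.append(self._current)
            self._current = None


# Hypothetical, cleaned-up record used only for this demonstration.
SAMPLE = """
<li class="post" key="k1">
<h4 class="desc"><a href="http://www.sluggy.com/" rel="nofollow">Sluggy Freelance</a></h4>
<div class="commands"><a href="/post?url=x" class="copy" rel="nofollow">save this</a></div>
<div class="meta">to <a class="tag" href="/crowebert/RSS">RSS</a>
<span class="date">1945-07-18</span></div>
</li>
"""


def extract_posts(html):
    """Parse html and return a list of dicts, one per post."""
    parser = PostParser()
    parser.feed(html)
    parser.close()
    return parser.posts


if __name__ == "__main__":
    print(extract_posts(SAMPLE))
```

A parser like this walks the input once, so a record that lacks an expected attribute simply yields an empty field instead of triggering a long search.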

Nov 9 '08 #3

Nick Craig-Wood

ja***********@gmail.com wrote:

My apologies, given that Google Groups messes up the formatting, the
regexp should read

regexp = re.compile("""<li class=\"post\".*?<h4 class=\"desc\"><a
href=
\"(.*?)\" rel=\"nofollow\">(.*?)</a>.*?</div>\s*(?:<p class=\"notes
\">(.*?)</p>)?.*?<div class=\"meta\">(?:to ((?:<a class=\"tag\".*?)
+))*.*?<span class=\"date\" title=\"(.*?)\">.*?</span>\s*</div>.*?</
li>""", re.DOTALL)

Some regular expressions can't be searched in a reasonable length of
time. Not sure whether this is your problem but it might be! Search
for "exponential time regular expression" if you want some examples.

E.g. http://bugs.python.org/issue1515829
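To make the idea concrete, here is a small sketch of a textbook exponential-time pattern. It is not the regexp from this thread, and it is written for Python 3. The names PATTERN and time_failed_match are invented for the example. The nested quantifiers in (a+)+ let the engine split a run of "a"s in exponentially many ways, and a match that is bound to fail makes a backtracking engine try them all.

```python
import re
import time

# Nested quantifiers: inner "+" and outer "+" can divide the same run of
# "a" characters between them in exponentially many ways.
PATTERN = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Return the seconds spent failing to match n "a"s followed by a "b"."""
    text = "a" * n + "b"   # the trailing "b" guarantees the match fails
    start = time.perf_counter()
    result = PATTERN.match(text)
    elapsed = time.perf_counter() - start
    assert result is None
    return elapsed


if __name__ == "__main__":
    # Keep n small: each extra "a" roughly doubles the work.
    for n in (8, 12, 16, 20):
        print(n, round(time_failed_match(n), 4))
```

A regexp with nested lazy and greedy groups, run against input that can never match, may be hitting the same kind of blow-up. That would also be consistent with a shorter input returning quickly.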

I'd probably attack this problem using BeautifulSoup rather than
regexps!

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick

Nov 10 '08 #4
