Web page from hell breaks BeautifulSoup, almost

This web page:


parses OK with BeautifulSoup, but "prettify" hits Python's
recursion limit if you try to display it. After I raised the
recursion limit to a large number, it was converted
to 5MB of text successfully, in about a minute.
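
For anyone who wants to try the same workaround, here is a minimal sketch of raising the recursion limit only around the recursive step. It uses the standard library only. The helper names (recursion_limit, render, nested) and the toy (name, children) tree are made up for illustration and stand in for a parsed document; with BeautifulSoup, the call inside the "with" block would presumably be soup.prettify().

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily raise the interpreter's recursion limit, then restore it."""
    old = sys.getrecursionlimit()
    sys.setrecursionlimit(max(old, limit))
    try:
        yield
    finally:
        sys.setrecursionlimit(old)


def render(node, indent=0):
    """Toy recursive pretty-printer: one Python call per nesting level."""
    name, children = node
    lines = [" " * indent + "<" + name + ">"]
    for child in children:
        lines.extend(render(child, indent + 1))
    return lines


def nested(depth):
    """Build a chain of `depth` nested nodes without recursing."""
    node = ("font", [])
    for _ in range(depth - 1):
        node = ("font", [node])
    return node


def overflows(tree):
    """Report whether rendering hits the current recursion limit."""
    try:
        render(tree)
        return False
    except RecursionError:
        return True


tree = nested(2000)             # deeper than the usual default limit of 1000
before = sys.getrecursionlimit()
hit_limit = overflows(tree)     # expected to be True under the default limit

with recursion_limit(10000):    # hypothetical value; size it to the page
    lines = render(tree)
```

A very large limit can still crash the interpreter on some Python builds if the C stack runs out, so it seems safer to raise it only for the one call that needs it and restore it afterwards.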

The page has real problems. 1901 errors from the W3C validator,
and that's after forcing an encoding and a doctype. "body" tags
nested 3 deep. "head" element inside two "body" tags. Tags
opened in upper case and closed in lower case.
All "font" tags unclosed. Hundreds of "li" tags outside an
"ol" or "ul". Yet Firefox is quite happy to display it.
It looks even better in IE, according to comments on the page.

The page consists of a long list of classified ads, all with
unclosed tags. Each unclosed tag ends up nested inside the
previous one, so the maximum depth is huge.
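
To make the depth problem concrete, here is a rough sketch that measures nesting depth with the standard library's html.parser. It models a naive tree builder (every start tag opens a level, only a matching end tag closes one), which is an assumption for illustration and not BeautifulSoup's actual nesting rules. The DepthMeter class, the max_depth function, and the sample markup are all made up.

```python
from html.parser import HTMLParser


class DepthMeter(HTMLParser):
    """Track nesting depth as a naive tree builder would see it."""

    # Tags that never take a closing tag, so they never open a level.
    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        self.stack.append(tag)
        self.max_depth = max(self.max_depth, len(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def max_depth(html):
    meter = DepthMeter()
    meter.feed(html)
    meter.close()
    return meter.max_depth


# 500 made-up ads, each leaving "li" and "font" open.
ads = "".join("<li><font size=2>Ad %d<br>" % i for i in range(500))
unclosed_page = "<html><body><ul>" + ads + "</ul></body></html>"

# The same list with every tag closed properly.
closed_ads = "".join("<li><font size=2>Ad %d</font></li>" % i for i in range(500))
closed_page = "<html><body><ul>" + closed_ads + "</ul></body></html>"

print(max_depth(unclosed_page))  # 3 wrapper levels + 2 per ad = 1003
print(max_depth(closed_page))    # html, body, ul, li, font = 5
```

Under this model the depth grows linearly with the number of ads, which is why a recursive walk such as "prettify" runs past the default limit on a long enough page.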

Worst HTML I've seen in a while.

(We use BeautifulSoup to parse hostile web sites in bulk,
so we tend to discover more hard cases than most users.)

John Nagle
Dec 11 '07 #1