473,395 Members | 1,679 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,395 software developers and data experts.

Web page from hell breaks BeautifulSoup, almost

This web page:

http://azultralights.com/ulclass.html

parses OK with BeautifulSoup, but "prettify" will hit the
recursion limit if you try to display it. I raised the
recursion limit to a large number, and it was converted
to 5MB of text successfully, in about a minute.

The page has real problems. 1901 errors from the W3C validator,
and that's after forcing an encoding and a doctype. "body" tags
nested 3 deep. "head" element inside two "body" tags. Tags
opened with an upper case tag and closed with a lower case tag.
All "font" tags unclosed. Hundreds of "li" tags outside a
"ol" or "ul". Yet Firefox is quite happy to display it.
It looks even better in IE, according to comments on the page.

The page consists of a long list of classified ads, all with
unclosed tags. So the maximum depth is huge.

Worst HTML I've seen in a while.

(We use BeautifulSoup to parse hostile web sites in bulk,
so we tend to discover more hard cases than most users.)

John Nagle
SiteTruth
Dec 11 '07 #1
0 1037

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

4
by: Ben | last post by:
I have a page for my company that I need assistance with: http://www.eastex.net/ben/NewETN/subPage.htm When you resize the window (NS or IE) small enough that you have to scroll horizontally...
1
by: livin | last post by:
I'm hoping someone knows of an example script I can see to help me build mine. I'm looking for an easy way to automate the below web site browsing and pull the data I'm searching for. Here's...
3
by: Tempo | last post by:
In my last post I received some advice to use urllib.read() to get a whole html page as a string, which will then allow me to use BeautifulSoup to do what I want with the string. But when I was...
4
by: William Xu | last post by:
Hi, all, This piece of code used to work well. i guess the error occurs after some upgrade. >>> import urllib >>> from BeautifulSoup import BeautifulSoup >>> url = 'http://www.google.com'...
5
by: John Nagle | last post by:
This, which is from a real web site, went into BeautifulSoup: <param name="movie" value="/images/offersBanners/sw04.swf?binfot=We offer fantastic rates for selected weeks or days!!&blinkt=Click...
1
by: Dan Stromberg - Datallegro | last post by:
Is there a method, with python, of screenscraping a web page, if that web page uses javascript? I know about BeautifulSoup, but AFAIK at this time, BeautifulSoup is for HTML that doesn't have...
5
by: Larry Bates | last post by:
Info: Python version: ActivePython 2.5.1.1 Platform: Windows I wanted to install BeautifulSoup today for a small project and decided to use easy_install. I can install other packages just...
3
by: bsagert | last post by:
I downloaded BeautifulSoup.py from http://www.crummy.com/software/BeautifulSoup/ and being a n00bie, I just placed it in my Windows c:\python25\lib\ file. When I type "import beautifulsoup" from...
2
by: academicedgar | last post by:
Hi I would appreciate some help. I am trying to learn Python and want to use BeautifulSoup to pull some data from tables. I was really psyched earlier tonight when I discovered that I could do...
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.