Web page from hell breaks BeautifulSoup, almost

This web page:

http://azultralights.com/ulclass.html

parses OK with BeautifulSoup, but "prettify" will hit the
recursion limit if you try to display it. I raised the
recursion limit to a large number, and it was converted
to 5MB of text successfully, in about a minute.
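
For anyone who wants to try the same workaround, here is a minimal sketch of raising the limit only for the duration of one call and restoring it afterwards. The names recursion_limit and render are made up for illustration; render is a stand-in recursive printer over (name, children) tuples, not BeautifulSoup's actual code, and the limit of 20000 is an arbitrary guess, not the value used above.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily raise the recursion limit, restoring the old value on exit."""
    old = sys.getrecursionlimit()
    # Never lower the limit; only raise it if the requested value is higher.
    sys.setrecursionlimit(max(old, limit))
    try:
        yield
    finally:
        sys.setrecursionlimit(old)


def render(node, indent=0):
    """Hypothetical stand-in for a recursive pretty-printer.

    A node is a (name, children) tuple. One Python stack frame is used
    per level of nesting, which is what exhausts the default limit.
    """
    name, children = node
    lines = [" " * indent + name]
    for child in children:
        lines.extend(render(child, indent + 1))
    return lines


if __name__ == "__main__":
    # Build a 3000-deep chain iteratively, like a page of unclosed tags.
    chain = ("li", [])
    for _ in range(3000):
        chain = ("font", [chain])

    with recursion_limit(20000):
        print(len(render(chain)), "lines rendered")
```

With a real parse tree the body of the "with" block would presumably be the soup.prettify() call instead of render. Setting the limit too high can crash the interpreter outright rather than raise RecursionError, so it is worth keeping the value as low as the page allows.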

The page has real problems. 1901 errors from the W3C validator,
and that's after forcing an encoding and a doctype. "body" tags
nested 3 deep. "head" element inside two "body" tags. Tags
opened in upper case and closed in lower case.
All "font" tags unclosed. Hundreds of "li" tags outside an
"ol" or "ul". Yet Firefox is quite happy to display it.
It looks even better in IE, according to comments on the page.

The page consists of a long list of classified ads, all with
unclosed tags. Each ad therefore ends up nested inside the one
before it, so the maximum depth of the parse tree is huge.
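
A rough way to check for this before handing a page to a recursive printer is to count how many tags are open at once. The sketch below uses only the standard library's html.parser and a naive open-tag stack; DepthMeter and max_open_depth are names invented here, and the stack does not reproduce BeautifulSoup's own nesting rules, so treat the number as an estimate rather than the real tree depth.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and so never add depth.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class DepthMeter(HTMLParser):
    """Track the deepest run of simultaneously open tags.

    HTMLParser lowercases tag names, so <B>...</b> pairs up correctly.
    """

    def __init__(self):
        super().__init__()
        self.stack = []
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        self.max_depth = max(self.max_depth, len(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def max_open_depth(html):
    """Return the largest number of tags open at the same time."""
    meter = DepthMeter()
    meter.feed(html)
    meter.close()
    return meter.max_depth


if __name__ == "__main__":
    # 500 ads, each opening "li" and "font" and closing neither.
    bad_page = "<li><font size=2>ad text" * 500
    print(max_open_depth(bad_page))
```

Because nothing is ever closed in the sample input, the depth grows by two with every ad, which is the pattern described above. A page that closes its tags stays at a small constant depth no matter how many ads it lists.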

Worst HTML I've seen in a while.

(We use BeautifulSoup to parse hostile web sites in bulk,
so we tend to discover more hard cases than most users.)

John Nagle
SiteTruth
Dec 11 '07 #1