HTML purifier using BeautifulSoup?


Has anyone tried to construct an HTML janitor script using BeautifulSoup?

My situation:

I'm trying to convert a series of web pages from .html to PalmDoc format
using Plucker, which is written in Python. The Plucker project suggests
passing HTML through "tidy" first, to get well-formed HTML for Plucker to
work with.

However, some of the pages I want to convert are so bad that even tidy
pukes on them.

I was thinking that BeautifulSoup might be more tolerant of really bad
HTML, which led me to the question this article started out with. :)

Thanks!
Jul 18 '05 #1
Dan Stromberg wrote:
> Has anyone tried to construct an HTML janitor script using BeautifulSoup?
>
> My situation:
>
> I'm trying to convert a series of web pages from .html to palmdoc format, using plucker, which is written in python. The plucker project suggests passing html through "tidy", to get well-formed html for plucker to work with.
>
> However, some of the pages I want to convert are so bad that even tidy pukes on them.
>
> I was thinking that BeautifulSoup might be more tolerant of really bad html... Which led me to the question this article started out with. :)
>
> Thanks!


I have used BeautifulSoup for screen scraping, pulling HTML into
structured form (using XML). Is that similar to a janitor script? I
used it because tidy was puking on some HTML. BeautifulSoup has been
excellent.
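
For anyone landing here later, a rough sketch of what such a janitor pass could look like. None of this is from the posts above: it assumes the current `bs4` package (the 2005-era import was `from BeautifulSoup import BeautifulSoup`), and `clean_html`, `_Rebalancer` and `VOID_TAGS` are made-up names. If `bs4` is not installed it falls back to a much cruder stdlib-only pass that only balances tags, drops comments and doctypes, and would mangle inline script or style content.

```python
from html import escape
from html.parser import HTMLParser

try:
    from bs4 import BeautifulSoup  # third-party package: beautifulsoup4
except ImportError:
    BeautifulSoup = None

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _Rebalancer(HTMLParser):
    """Crude fallback: re-emit the markup, closing whatever was left open."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # pieces of the cleaned document
        self.stack = []  # currently open (non-void) tags

    def handle_starttag(self, tag, attrs):
        parts = [tag]
        for name, value in attrs:
            if value is None:
                parts.append(name)
            else:
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + ">")
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close everything opened after the matching tag, then the tag itself.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        # Text arrives unescaped (convert_charrefs=True), so re-escape it.
        self.out.append(escape(data, quote=False))

    def close(self):
        super().close()  # flushes any buffered text first
        while self.stack:
            self.out.append("</%s>" % self.stack.pop())


def clean_html(dirty):
    """Return a balanced re-serialization of possibly broken HTML."""
    if BeautifulSoup is not None:
        # Parse leniently with the stdlib backend, then write the tree back out.
        return str(BeautifulSoup(dirty, "html.parser"))
    parser = _Rebalancer()
    parser.feed(dirty)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(clean_html("<p>Hello <b>world"))
```

The cleaned string could then be written to a file and handed to Plucker in place of the tidy output. Whether the result is well-formed enough for Plucker on any given page is something you would have to try.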

Jul 18 '05 #2
