473,695 Members | 2,907 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

HTML purifier using BeautifulSoup?


Has anyone tried to construct an HTML janitor script using BeautifulSoup?

My situation:

I'm trying to convert a series of web pages from .html to palmdoc format,
using plucker, which is written in python. The plucker project suggests
passing html through "tidy", to get well-formed html for plucker to work
with.

However, some of the pages I want to convert are so bad that even tidy
pukes on them.

I was thinking that BeautifulSoup might be more tolerant of really bad
html... Which led me to the question this article started out with. :)

Thanks!
Jul 18 '05 #1
1 2160
Dan Stromberg wrote:
Has anyone tried to construct an HTML janitor script using BeautifulSoup?
My situation:

I'm trying to convert a series of web pages from .html to palmdoc format, using plucker, which is written in python. The plucker project suggests passing html through "tidy", to get well-formed html for plucker to work with.

However, some of the pages I want to convert are so bad that even tidy pukes on them.

I was thinking that BeautifulSoup might be more tolerant of really bad html... Which led me to the question this article started out with. :)
Thanks!


I have used BeautifulSoup for screen scraping, pulling html into
structured form (using XML). Is that similar to a janitor script? I
used it because tidy was puking on some html. BS has been excellent.

Jul 18 '05 #2

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

8
4222
by: Anders Eriksson | last post by:
Hello! I want to extract some info from a some specific HTML pages, Microsofts International Word list (e.g. http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm). I want to take all the words, both English and the other language and create a dictionary. so that I can look up About and get Om as the answer. How is the best way to do this?
7
2570
by: Dan Stromberg | last post by:
I'm working on writing a program that will synchronize one database with another. For the source database, we can just use the python sybase API; that's nice and normal. For the target database, unfortunately, the only interface we have access to, is http+html based. There's same javascript involved too, but hopefully we won't have to interact with that. So, I've got Basic AUTH going with http, but now I'm faced with the following...
3
1577
by: Tempo | last post by:
In my last post I received some advice to use urllib.read() to get a whole html page as a string, which will then allow me to use BeautifulSoup to do what I want with the string. But when I was researching the 'urllib' module I couldn't find anything about its sub-section '.read()' ? Is that the right module to get a html page into a string? Or am I completely missing something here? I'll take this as the more likely of the two cases....
10
4839
by: pak.andrei | last post by:
Here is my script: from mechanize import * from BeautifulSoup import * import StringIO b = Browser() f = b.open("http://www.translate.ru/text.asp?lang=ru") b.select_form(nr=0) b = "hello python" html = b.submit().get_data()
15
5984
by: Francach | last post by:
Hi, I'm trying to use the Beautiful Soup package to parse through the "bookmarks.html" file which Firefox exports all your bookmarks into. I've been struggling with the documentation trying to figure out how to extract all the urls. Has anybody got a couple of longer examples using Beautiful Soup I could play around with? Thanks, Martin.
2
1846
by: s. d. rose | last post by:
Hello All. I am learning Python, and have never worked with HTML. However, I would like to write a simple script to audit my 100+ Netware servers via their web portal. I was reading Chapter 8 of Dive into Python, which deals with this topic. In the web portal of the server, there is a section similar to this: -- clients and <A href="http://eugenia.blogsome.com/?s=ipkall">clever</aservices. <--
6
10207
by: Tina I | last post by:
Hi everyone, I have a small, probably trivial even, problem. I have the following HTML: I need to make this into a dictionary like this: dictionary = {"METAR:" : "ENBR 270920Z 00000KT 9999 FEW018 02/M01 Q1004 NOSIG" , "short-TAF:" : "ENBR 270800Z 270918 VRB05KT 9999 FEW020 SCT040" , "long-Taf:" : "ENBR 271212 VRB05KT 9999 FEW020 BKN030 TEMPO 2012 2000 SNRA VV010 BECMG 2124 15012KT"}
7
5098
by: mark | last post by:
Hi All, Apologies for the newbie question but I've searched and tried all sorts for a few days and I'm pulling my hair out ; Please feel free to teach me to suck eggs because it's all new to me :) Thanks in advance, Mark.
11
3170
by: John Nagle | last post by:
The syntax that browsers understand as HTML comments is much less restrictive than what BeautifulSoup understands. I keep running into sites with formally incorrect HTML comments which are parsed happily by browsers. Here's yet another example, this one from "http://www.webdirectory.com". The page starts like this: <!Hello there! Welcome to The Environment Directory!> <!Not too much exciting HTML code here but it does the job! >...
0
8640
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
8582
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
1
8860
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
8832
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
1
6498
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
5841
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
4348
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
4587
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
3
1984
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.