Bytes IT Community

Removing certain tags from HTML files

Hi,

I'm doing a little script with the help of the BeautifulSoup HTML
parser and uTidyLib (an HTML Tidy wrapper for Python).

Essentially, what it does is fetch all the HTML files in a given
directory (and its subdirectories), clean the code with Tidy (which
removes deprecated tags and changes the output to XHTML), and then
BeautifulSoup removes a couple of things that I don't want in the files
(because I'm stripping the files to the bare bones, just keeping layout
information).

Finally, I want to remove all traces of layout tables (because the new
layout will use CSS for positioning). Now, there are tables that lay out
things on the page and tables that represent tabular data, but I think
it would be too hard to make a script that tells the difference.

My question, since I'm quite new to Python, is about what tool I
should use to remove the table, tr and td tags, but not what's
enclosed in them. I think BeautifulSoup isn't good for that because it
removes what's enclosed as well.

Is re the right module for that? Basically, if I make an iteration that
scans the text and tries to match every occurrence of a given regular
expression, would that be a good idea?

Now, I'm quite new to the concept of regular expressions, but would it
resemble something like this: re.compile("<table.*>")?

Thanks for the help.

Jul 27 '07 #1
3 Replies


On Fri, 27 Jul 2007 17:40:23 +0000, sebzzz wrote:
My question, since I'm quite new to Python, is about what tool I
should use to remove the table, tr and td tags, but not what's
enclosed in them. I think BeautifulSoup isn't good for that because it
removes what's enclosed as well.
Then take hold of the contents and add them to the parent. Something like
this should work:

from BeautifulSoup import BeautifulSoup

def remove(soup, tagname):
    for tag in soup.findAll(tagname):
        contents = tag.contents
        parent = tag.parent
        tag.extract()
        # Iterate over a copy: appending a child to `parent`
        # removes it from `contents` while we loop.
        for child in contents[:]:
            parent.append(child)

def main():
    source = '<a><b>This is a <c>Test</c></b></a>'
    soup = BeautifulSoup(source)
    print soup
    remove(soup, 'b')
    print soup

if __name__ == '__main__':
    main()
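As an aside, the newer BeautifulSoup 4 API has this operation built in: Tag.unwrap() replaces a tag with its children. A minimal sketch, assuming bs4 is installed (the sample markup is made up):

```python
from bs4 import BeautifulSoup

source = '<div><table><tr><td>Some <b>data</b></td></tr></table></div>'
soup = BeautifulSoup(source, 'html.parser')

# unwrap() drops each tag itself but keeps everything inside it.
for name in ('table', 'tr', 'td'):
    for tag in soup.find_all(name):
        tag.unwrap()

print(soup)
```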
Is re the good module for that? Basically, if I make an iteration that
scans the text and tries to match every occurrence of a given regular
expression, would it be a good idea?
No, regular expressions are not a very good idea. They get very
complicated very quickly while often still missing some corner cases.
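A tiny illustration of one of those corner cases, using the pattern from the question on a made-up snippet: the greedy .* runs to the last '>' it can find, swallowing the table's content along with the tag.

```python
import re

page = '<table width="100%"><tr><td>cell</td></tr></table>'

# Greedy .* matches as much as possible, so the whole string is consumed:
greedy = re.findall(r'<table.*>', page)
# Non-greedy .*? stops at the first '>', matching only the opening tag:
lazy = re.findall(r'<table.*?>', page)

print(greedy)
print(lazy)
```

And even the non-greedy version only finds opening tags; pairing each one with its matching </table> in nested tables is where it gets truly painful.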

Ciao,
Marc 'BlackJack' Rintsch
Jul 27 '07 #2

Then take hold of the contents and add them to the parent. Something like
this should work:

from BeautifulSoup import BeautifulSoup

def remove(soup, tagname):
    for tag in soup.findAll(tagname):
        contents = tag.contents
        parent = tag.parent
        tag.extract()
        for child in contents[:]:
            parent.append(child)

def main():
    source = '<a><b>This is a <c>Test</c></b></a>'
    soup = BeautifulSoup(source)
    print soup
    remove(soup, 'b')
    print soup
Is re the right module for that? Basically, if I make an iteration that
scans the text and tries to match every occurrence of a given regular
expression, would that be a good idea?

No, regular expressions are not a very good idea. They get very
complicated very quickly while often still missing some corner cases.
Thanks a lot for that.

It's true that regular expressions could give me headaches (especially
when it comes to finding where a tag ends).
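For the record, the tag-skipping can also be done without regular expressions or third-party parsers, using the standard library's HTMLParser (html.parser in modern Python): re-emit every token except the opening and closing table/tr/td tags. A minimal sketch with a made-up sample snippet; it ignores comments and processing instructions for brevity:

```python
from html.parser import HTMLParser

class TableStripper(HTMLParser):
    """Re-emit the document, skipping table/tr/td tags but keeping their contents."""
    SKIP = {'table', 'tr', 'td'}

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.SKIP:
            # get_starttag_text() returns the tag exactly as written, attributes included.
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.SKIP:
            self.out.append('</%s>' % tag)

    def handle_data(self, data):
        self.out.append(data)

stripper = TableStripper()
stripper.feed('<div><table><tr><td>Some <b>data</b></td></tr></table></div>')
print(''.join(stripper.out))
```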

Jul 27 '07 #3

se****@gmail.com wrote:
I'm doing a little script with the help of the BeautifulSoup HTML
parser and uTidyLib (an HTML Tidy wrapper for Python).

Essentially, what it does is fetch all the HTML files in a given
directory (and its subdirectories), clean the code with Tidy (which
removes deprecated tags and changes the output to XHTML), and then
BeautifulSoup removes a couple of things that I don't want in the files
(because I'm stripping the files to the bare bones, just keeping layout
information).

Finally, I want to remove all traces of layout tables (because the new
layout will use CSS for positioning). Now, there are tables that lay out
things on the page and tables that represent tabular data, but I think
it would be too hard to make a script that tells the difference.

My question, since I'm quite new to Python, is about what tool I
should use to remove the table, tr and td tags, but not what's
enclosed in them. I think BeautifulSoup isn't good for that because it
removes what's enclosed as well.
Use lxml.html. Honestly, HTML cleanup doesn't get any simpler than that.

It's not released yet (lxml itself is, but lxml.html is close), but you
can build it from an SVN branch:

http://codespeak.net/svn/lxml/branch/html/

Looks like you're on Linux, so that's a simple run of setup.py.

Then, use the dedicated "clean" module for your job. See the "Cleaning up
HTML" section in the docs for some examples:

http://codespeak.net/svn/lxml/branch...c/lxmlhtml.txt

and the docstring of the Cleaner class to see all the available options:

http://codespeak.net/svn/lxml/branch.../html/clean.py

In case you still prefer BeautifulSoup for parsing (just in case you're not
dealing with HTML-like pages, but just with real tag soup), you can also use
the ElementSoup parser:

http://codespeak.net/svn/lxml/branch...ElementSoup.py

but lxml is generally quite good in dealing with broken HTML already.

Have fun,
Stefan
Jul 28 '07 #4

This discussion thread is closed; replies have been disabled.