Removing certain tags from html files

sebzzz

Hi,

I'm doing a little script with the help of the BeautifulSoup HTML
parser and uTidyLib (HTML Tidy warper for python).

Essentially what it does is fetch all the html files in a given
directory (and it's subdirectories) clean the code with Tidy (removes
deprecated tags, change the output to be xhtml) and than BeautifulSoup
removes a couple of things that I don't want in the files (Because I'm
stripping the files to bare bone, just keeping layout information).

Finally, I want to remove all trace of layout tables (because the new
layout will be in css for positioning). Now, there is tables to layout
things on the page and tables to represent tabular data, but I think
it would be too hard to make a script that finds out the difference.

My question, since I'm quite new to python, is about what tool I
should use to remove the table, tr and td tags, but not what's
enclosed in it. I think BeautifulSoup isn't good for that because it
removes what's enclosed as well.

Is re the good module for that? Basically, if I make an iteration that
scans the text and tries to match every occurrence of a given regular
expression, would it be a good idea?

Now, I'm quite new to the concept of regular expressions, but would it
ressemble something like this: re.compile("<ta ble.*>")?

Thanks for the help.

Jul 27 '07 #1

Subscribe Reply

6416

Marc 'BlackJack' Rintsch

On Fri, 27 Jul 2007 17:40:23 +0000, sebzzz wrote:

My question, since I'm quite new to python, is about what tool I
should use to remove the table, tr and td tags, but not what's
enclosed in it. I think BeautifulSoup isn't good for that because it
removes what's enclosed as well.

Than take a hold on the content and add it to the parent. Somthing like
this should work:

from BeautifulSoup import BeautifulSoup
def remove(soup, tagname):
for tag in soup.findAll(ta gname):
contents = tag.contents
parent = tag.parent
tag.extract()
for tag in contents:
parent.append(t ag)
def main():
source = '<a><b>This is a <c>Test</c></b></a>'
soup = BeautifulSoup(s ource)
print soup
remove(soup, 'b')
print soup

Is re the good module for that? Basically, if I make an iteration that
scans the text and tries to match every occurrence of a given regular
expression, would it be a good idea?

No regular expressions are not a very good idea. They get very
complicated very quickly while often still miss some corner cases.

Ciao,
Marc 'BlackJack' Rintsch

Jul 27 '07 #2

sebzzz

Than take a hold on the content and add it to the parent. Somthing like
this should work:

from BeautifulSoup import BeautifulSoup

def remove(soup, tagname):
for tag in soup.findAll(ta gname):
contents = tag.contents
parent = tag.parent
tag.extract()
for tag in contents:
parent.append(t ag)

def main():
source = '<a><b>This is a <c>Test</c></b></a>'
soup = BeautifulSoup(s ource)
print soup
remove(soup, 'b')
print soup

Is re the good module for that? Basically, if I make an iteration that
scans the text and tries to match every occurrence of a given regular
expression, would it be a good idea?

No regular expressions are not a very good idea. They get very
complicated very quickly while often still miss some corner cases.

Thanks a lot for that.

It's true that regular expressions could give me headaches (especially
to find where the tag ends).

Jul 27 '07 #3

Stefan Behnel

se****@gmail.co m wrote:

I'm doing a little script with the help of the BeautifulSoup HTML
parser and uTidyLib (HTML Tidy warper for python).

Essentially what it does is fetch all the html files in a given
directory (and it's subdirectories) clean the code with Tidy (removes
deprecated tags, change the output to be xhtml) and than BeautifulSoup
removes a couple of things that I don't want in the files (Because I'm
stripping the files to bare bone, just keeping layout information).

Finally, I want to remove all trace of layout tables (because the new
layout will be in css for positioning). Now, there is tables to layout
things on the page and tables to represent tabular data, but I think
it would be too hard to make a script that finds out the difference.

My question, since I'm quite new to python, is about what tool I
should use to remove the table, tr and td tags, but not what's
enclosed in it. I think BeautifulSoup isn't good for that because it
removes what's enclosed as well.

Use lxml.html. Honestly, you can't have HTML cleanup simpler than that.

It's not released yet (lxml is, but lxml.html is just close), but you can
build it from an SVN branch:

http://codespeak.net/svn/lxml/branch/html/

Looks like you're on Linux, so that's a simple run of setup.py.

Then, use the dedicated "clean" module for your job. See the "Cleaning up
HTML" section in the docs for some examples:

http://codespeak.net/svn/lxml/branch...c/lxmlhtml.txt

and the docstring of the Cleaner class to see all the available options:

http://codespeak.net/svn/lxml/branch.../html/clean.py

In case you still prefer BeautifulSoup for parsing (just in case you're not
dealing with HTML-like pages, but just with real tag soup), you can also use
the ElementSoup parser:

http://codespeak.net/svn/lxml/branch...ElementSoup.py

but lxml is generally quite good in dealing with broken HTML already.

Have fun,
Stefan

Jul 28 '07 #4

Similar topics

3720

reg exp question: removing class and style from html

by: chotiwallah | last post by:

i have a little database driven content managment system. people can load up html-docs. some of them use ms word as their html-editor, which resultes in loads of "class" and "style" attributes - like this: <p class="MsoNormal">Some text</p> now i'd like to remove them (the attributes, not the people, that is). i know reg exp is the way, but somehow the solution avoids me. just pointing me to some more advanced tutorial (pref. in...

PHP

3795

html tags within meta tags allowed?

by: Donald Firesmith | last post by:

Are html tags allowed within meta tags? Specifically, if I have html tags within a <definition> tag within XML, can I use the definition as the content within the <meta content="description> tag? If not, is there an easy way to strip the html tags from the <definition> content before inserting the content into the meta tag?

.NET Framework

3082

removing content between specified tokens using java script

by: rajarao | last post by:

hi I want to remove the content embedded in <script> and </script> tags submitted via text box. My java script should remove the content embedded between <script> and </script> tag. my current code is function RemoveHTMLScript(strText) { var regEx = /<script\w*<\/script>/g

Javascript

2699

removing text from HTML but keeping HTML intact

by: Raja Kannan | last post by:

Is there a way to remove text portion from the HTML keeping the HTML Tags using the browser, say javascript RegEx or something ? I have seen lot of examples removing HTML tags to get the text but how the reverse of it ? any sample code or any suggestion would be appreciated.

Javascript

2060

Removing and event from the html code.

by: graham.reeds | last post by:

I am updating a website that uses a countdown script embedded on the page. When the page is served the var's are set to how long the countdown has left in minutes and seconds, but the rest of the script is left untouched. However I want to take the script out of the page and have it as a seperate file that can be cached, reducing serving costs - the page gets hit a couple of thousand times per day, sometimes as high a 5K, so any...

Javascript

2386

Filter to HTML Decode only certain HTML tags

by: cdonyi | last post by:

Hi I am looking for a clean way to scrub HTML encoded strings and display only certain tags back to the browser. I am thinking of using HttpUtility.HTMLEncode/Decode methods. My plan is to Encode any HTML input submitted via the browser. I want to display this output as HTML but want to guard against Cross Site scripting and hence only display certain tags (or not display certain tags). What is the cleanest way to do this? I was thinking...

ASP.NET

1943

Visual Studio keeps removing my closing </li> tags

by: Nathan Sokalski | last post by:

I have a section in my ASP.NET code where I have an HTML unordered list. Visual Studio keeps removing the closing list item tags, except for the last list item. In other words, Visual Studio makes my code look like the following: <ul> <li>adasf <li>asdfsa <li>asdfd <li>adfsdf</li>

ASP.NET

3103

body tags in an include?

by: Big Bill | last post by:

http://www.promcars.co.uk/pages/bonnie.php I don't believe they should be there, can I take them out without stopping the includes from functioning? I'm the (hapless) optimiser on this one... I have to correct where they've spelled my name wrong too...sigh... BB --

PHP

3117

Regex Validator - detect all but certain HTML tags

by: Barry L. Camp | last post by:

Hi all... hope someone can help out. Not a unique situation, but my search for a solution has not yielded what I need yet. I'm trying to come up with a regular expression for a RegularExpressionValidator that will allow certain HTML tags: <a>, <b>, <blockquote>, <br>, <i>, <img>, <li>, <ol>, <p>, <quote>, <ul>

Visual Basic .NET

9680

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...

General

9528

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...

Windows Server

10455

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...

C / C++

10228

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...

Online Marketing

6788

Couldn’t get equations in html when convert word .docx file to html file in C#.

by: conductexam | last post by:

I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...

C# / C Sharp

5441

Trying to create a lan-to-lan vpn between two differents networks

by: TSSRALBI | last post by:

Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...

Networking - Hardware / Configuration

5573

Windows Forms - .Net 8.0

by: adsilva | last post by:

A Windows Forms form does not have the event Unload, like VB6. What one acts like?

Visual Basic .NET

4116

transfer the data from one system to another through ip address

by: 6302768590 | last post by:

Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system

C# / C Sharp

2925

Comprehensive Guide to Website Development in Toronto: Expert Insights from BSMN Consultancy

by: bsmnconsultancy | last post by:

In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

General