Use BeautifulSoup to delete certain tag while keeping its content

Dear all,

I have the following html code:

<td valign="top" headers="col1">
<font size="2">
Center Bank
<br />
Los Angeles, CA
</font>
</td>

<td valign="top" headers="col1">
<font size="2">
Salisbury
Bank and Trust Company
<font face="arial, helvetica" size="2" color="#0000000">
<br />
Lakeville, CT
</font>
</font>
</td>

How should I delete the 'font' tags while keeping the content inside?
Ideally I want to get:

<td valign="top" headers="col1">
Center Bank
<br />
Los Angeles, CA
</td>

<td valign="top" headers="col1">
Salisbury
Bank and Trust Company
<br />
Lakeville, CT
</td>

Thank you.

Jackie
Sep 6 '08 #1
On 6 Sep, 17:11, "Jackie Wang" <jackie.pyt...@gmail.com> wrote:
> How should I delete the 'font' tags while keeping the content inside?

This sounds like an editing exercise, really. If you're comfortable
learning a new tool, I can recommend XSLT for this kind of job. Here's
the stylesheet:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="font">
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

This just describes two things: firstly, that you want to recognise
font elements and to include their contents, not each element's start
and end tags; secondly, that all other parts of the document should be
copied.

You can apply stylesheets using a number of XSL processors. The
xsltproc program is usually available where libxslt is installed, and
although I'm sure others will be along to tell you all about their
favourite libraries and tools, here's how I use mine within Python:

# XSLTools: http://www.python.org/pypi/XSLTools
# libxml2dom: http://www.python.org/pypi/libxml2dom
import XSLTools.XSLOutput
import libxml2dom
# If s is the document text...
d = libxml2dom.parseString(s)
# Save the above stylesheet to a file somewhere, then...
proc = XSLTools.XSLOutput.Processor(["/tmp/no-font.xsl"])
# Get the result document
d2 = proc.get_result(d)

Anyway, this is just one option of many to deal with this kind of
problem.
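
For anyone without XSLTools installed, here is a rough sketch of applying the same stylesheet with lxml's XSLT support instead. This swaps in lxml for the libraries used above; the function name strip_font_tags and the wrapping <tr> element are assumptions made for the example, since an XML parser needs a single root element and well-formed input.

```python
from lxml import etree

# The stylesheet from the post, with the namespace URI kept on one line.
STYLESHEET = b"""<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="font">
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
"""

# Sample input taken from the question, wrapped in a <tr> (an assumption)
# so that the document has exactly one root element.
SOURCE = b"""<tr>
<td valign="top" headers="col1">
<font size="2">
Salisbury
Bank and Trust Company
<font face="arial, helvetica" size="2" color="#0000000">
<br />
Lakeville, CT
</font>
</font>
</td>
</tr>"""


def strip_font_tags(xml_bytes):
    """Run the stylesheet over well-formed XML and return the result as text."""
    transform = etree.XSLT(etree.fromstring(STYLESHEET))
    result = transform(etree.fromstring(xml_bytes))
    return etree.tostring(result.getroot(), encoding="unicode")


print(strip_font_tags(SOURCE))
```

The font template takes precedence over the catch-all copy template because a match on an element name has a higher default priority than node(). Real-world HTML that is not well-formed XML would need an HTML parser in front of this step.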

Paul
Sep 7 '08 #2
[fixing the subject appropriately]

Jackie Wang wrote:
> How should I delete the 'font' tags while keeping the content inside?

Amongst many other goodies for working with HTML, the Elements in lxml.html
have a ".drop_tag()" method specifically for that purpose.

http://codespeak.net/lxml/
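
A minimal sketch of how that could look, assuming lxml is installed. The function name drop_font_tags and the surrounding <table><tr> wrapper are only there for the example; the wrapper gives the parser a proper table context for the <td> cells.

```python
import lxml.html

# Sample input from the question, wrapped in <table><tr> (an assumption).
HTML = """<table><tr>
<td valign="top" headers="col1">
<font size="2">
Center Bank
<br />
Los Angeles, CA
</font>
</td>
<td valign="top" headers="col1">
<font size="2">
Salisbury
Bank and Trust Company
<font face="arial, helvetica" size="2" color="#0000000">
<br />
Lakeville, CT
</font>
</font>
</td>
</tr></table>"""


def drop_font_tags(html):
    """Remove every <font> element but keep its text and children in place."""
    root = lxml.html.fromstring(html)
    # Build the list first so the tree is not modified while iterating over it.
    for font in list(root.iter("font")):
        font.drop_tag()
    return lxml.html.tostring(root, encoding="unicode")


print(drop_font_tags(HTML))
```

drop_tag() merges the element's text and tail into the parent, so nested font elements are handled one after another without losing content.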

Stefan
Sep 7 '08 #3
Jackie Wang wrote:
> How should I delete the 'font' tags while keeping the content inside?

See the BeautifulSoup documentation. Find the font tags with findAll,
make a list, then go in and use "extract" and "replaceWith" appropriately.
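
A sketch of that idea, assuming the newer bs4 package rather than the BeautifulSoup 3 API named above. In bs4, findAll is spelled find_all, and Tag.unwrap() does the "replace the tag with its own contents" step in one call, so it stands in for the manual extract/replaceWith work. The function name strip_tags is made up for the example.

```python
from bs4 import BeautifulSoup

# Sample input from the question.
HTML = """<td valign="top" headers="col1">
<font size="2">
Center Bank
<br />
Los Angeles, CA
</font>
</td>

<td valign="top" headers="col1">
<font size="2">
Salisbury
Bank and Trust Company
<font face="arial, helvetica" size="2" color="#0000000">
<br />
Lakeville, CT
</font>
</font>
</td>"""


def strip_tags(html, name="font"):
    """Remove every tag called `name` while keeping whatever is inside it."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list-like result, so the tree can be edited in the loop.
    for tag in soup.find_all(name):
        tag.unwrap()
    return str(soup)


print(strip_tags(HTML))
```

The built-in "html.parser" backend is used here so that the <td> fragments are kept as written; other parsers may add or rearrange wrapper elements.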

John Nagle

Sep 8 '08 #4
