
Stripping HTML with RE

I am currently stripping HTML from a string with the following code.
(I know it's not the best way to strip HTML but bear with me)

re.compile("<.*?>")

I wanted to allow all H1 and H2 tags, so I changed it to:

re.compile("<[^H1|^H2]*?>")

This seemed to work, but it also let the HTML tag through (basically
anything with an H, a 1, or a 2 in it). How can I get this to strip all
tags except H1 and H2? Any help you could give would be great.

Steve
Jul 18 '05 #1
Steveo <stephen_p_barrett <at> hotmail.com> writes:

I wanted to allow all H1 and H2 tags, so I changed it to:

re.compile("<[^H1|^H2]*?>")

This seemed to work, but it also let the HTML tag through (basically
anything with an H, a 1, or a 2 in it). How can I get this to strip all
tags except H1 and H2? Any help you could give would be great.
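
As background on why the bracketed pattern quoted above lets <HTML> through: square brackets define a character class, so [^H1|^H2] means "any single character other than H, 1, |, ^ or 2", not "anything other than the strings H1 or H2". A quick sketch of the effect (Python 3 syntax is assumed, and the sample string is made up for illustration):

```python
import re

# The pattern from the question. Inside [...], the characters form a set,
# so this excludes the single characters H, 1, |, ^ and 2.
broken = re.compile(r"<[^H1|^H2]*?>")

# <b> and </b> contain none of the excluded characters, so they are stripped.
# <HTML> and </HTML> contain an H, so the pattern cannot match them.
result = broken.sub("", "<HTML><b>x</b></HTML>")
print(result)  # <HTML>x</HTML>
```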


You probably want a lookahead assertion. From the docs at
http://docs.python.org/lib/re-syntax.html:

(?!...)
Matches if ... doesn't match next. This is a negative lookahead assertion.
For example, Isaac (?!Asimov) will match 'Isaac ' only if it's not followed by
'Asimov'.
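
The example from the docs can be tried directly. This is only a small sketch (Python 3 syntax assumed); the test strings other than 'Isaac Asimov' are made up:

```python
import re

# Negative lookahead: match 'Isaac ' only when 'Asimov' does not follow.
isaac = re.compile(r'Isaac (?!Asimov)')

newton = isaac.search('Isaac Newton')
asimov = isaac.search('Isaac Asimov')

# The lookahead consumes nothing, so the match is just 'Isaac '.
print(newton.group())  # Isaac 
print(asimov)          # None
```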

So I would write your example something like:
>>> re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<a>sdfsa</a>')
'sdfsa'
>>> re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<H1>sdfsa</a>')
'<H1>sdfsa'
>>> re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<H1>sdfsa</H2>')
'<H1>sdfsa</H2>'

(I was too lazy to compile the re, but of course that's what you'd normally want
to do.)
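
A compiled version might look roughly like this. The pattern is the one used in this post; the names TAG_RE and strip_tags are just placeholders for illustration, and Python 3 syntax is assumed:

```python
import re

# Compile once at import time and reuse the pattern object for every call.
TAG_RE = re.compile(r'</?(?!H1|H2|/H1|/H2)[^>]*>')

def strip_tags(text):
    """Remove every tag except opening and closing H1/H2 tags."""
    return TAG_RE.sub('', text)

print(strip_tags('<a>sdfsa</a>'))     # sdfsa
print(strip_tags('<H1>sdfsa</a>'))    # <H1>sdfsa
print(strip_tags('<H1>sdfsa</H2>'))   # <H1>sdfsa</H2>
```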

Steve

Jul 18 '05 #2
Steveo wrote:
I am currently stripping HTML from a string with the following code.
(I know it's not the best way to strip HTML but bear with me)
[...]


Instead of using REs, you might consider the StrippingParser
from the Python Cookbook:

http://aspn.activestate.com/ASPN/Coo...n/Recipe/52281

It allows you to specify explicitly which tags you want to leave
intact, so you'll be able to change your mind later without futzing
about with a complex RE...
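
To give a feel for the parser-based approach, here is a minimal sketch. It is not the Cookbook's StrippingParser; it is a stand-in built on html.parser.HTMLParser from the Python 3 standard library, and the names AllowListStripper and strip_html are made up for this example. Note that this version lowercases tag names and drops attributes on the tags it keeps.

```python
from html import escape
from html.parser import HTMLParser

class AllowListStripper(HTMLParser):
    """Keep text plus an explicit set of tags; drop every other tag."""

    def __init__(self, allowed=("h1", "h2")):
        super().__init__(convert_charrefs=True)
        self.allowed = set(allowed)   # tag names arrive lowercased
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed:
            self.parts.append("<%s>" % tag)   # attributes are discarded

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Character references were decoded by the parser, so re-escape.
        self.parts.append(escape(data, quote=False))

def strip_html(text, allowed=("h1", "h2")):
    parser = AllowListStripper(allowed)
    parser.feed(text)
    parser.close()   # flush any text still buffered by the parser
    return "".join(parser.parts)

print(strip_html('<H1>Title</H1><a href="x">link</a> tail'))
# <h1>Title</h1>link tail
```

Changing which tags survive is then a matter of passing a different allowed tuple, e.g. strip_html(text, allowed=("h1", "h2", "b")).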
Miles
Jul 18 '05 #3
I wrote:
>>> re.sub(r'</?(?!H1|H2|/H1|/H2)[^>]*>', r'', '<a>sdfsa</a>')
'sdfsa'


Maybe slightly better:
>>> re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<a>sdfsa</a>')
'sdfsa'
>>> re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<H1>sdfsa</a>')
'<H1>sdfsa'
>>> re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<H1>sdfsa</H2>')
'<H1>sdfsa</H2>'
>>> re.sub(r'<(?!/?(?:H1|H2))[^>]*>', r'', '<H2>sdfsa</H2>')
'<H2>sdfsa</H2>'

I've just grouped things a bit differently so that I only have to write H1 and
H2 once.
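
Because the tag names now appear only once, the pattern can be built from a list. The sketch below assumes Python 3, and make_stripper is a made-up helper name. It also adds two things that are not in the pattern above: a \b after the tag name, so that a tag such as <H10> is not mistaken for H1, and re.IGNORECASE, so that lowercase <h1> is kept too. Treat both as optional tweaks.

```python
import re

def make_stripper(allowed_tags):
    """Return a function that strips all tags except those listed."""
    names = "|".join(re.escape(tag) for tag in allowed_tags)
    # \b stops 'H1' from also matching the start of a longer name like 'H10'.
    pattern = re.compile(r'<(?!/?(?:%s)\b)[^>]*>' % names, re.IGNORECASE)
    return lambda text: pattern.sub('', text)

strip = make_stripper(["H1", "H2"])

print(strip('<H2>sdfsa</H2>'))        # <H2>sdfsa</H2>
print(strip('<h1>a</h1><b>b</b>'))    # <h1>a</h1>b
print(strip('<H10>x</H10>'))          # x
```

Without the \b, the pattern from this post leaves '<H10>x</H10>' untouched, since the lookahead only checks that the tag name starts with H1 or H2.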

Steve

Jul 18 '05 #4
