How to convert markup text to plain text in python?

I have some marked up text and would like to convert it to plain text,
by simply removing all the tags. Of course I can do it from first
principles but I felt that among all Python's markup tools there must
be something that would do this simply, without having to create an
XML parser etc.

I've looked around a bit but failed to find anything, any tips?

(e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")

Regards,
Geoff
Feb 1 '08 #1
> I have some marked up text and would like to convert it to plain text,
> by simply removing all the tags. Of course I can do it from first
> principles but I felt that among all Python's markup tools there must
> be something that would do this simply, without having to create an
> XML parser etc.
>
> I've looked around a bit but failed to find anything, any tips?
>
> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")

Well, if all you want to do is remove everything from a "<" to a
">", you can use

  >>> s = "<B>Today</B> is <U>Friday</U>"
  >>> import re
  >>> r = re.compile('<[^>]*>')
  >>> print r.sub('', s)
  Today is Friday

It should even work for semi-pathological cases such as

  s = """You can find my <a
  href='http://example.com'>thesis</a>
  online"""

where the tag contents are split across lines. There are more
pathological cases where tags aren't well-formed, e.g.

  s = "This <tag>has a > sign in it and <odd<ly>-nested> tags"

in which case you get what you deserve for making such
pathological conditions ;-)

-tkc

Feb 1 '08 #2
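
A minimal Python 3 sketch of the regex approach from the reply above (the snippets in the thread are Python 2). It is an illustration, not the poster's exact code: the helper name strip_tags and the <img> input are made up for this example.

```python
import re

# Same pattern as in the reply: "<", then any run of characters that
# are not ">", then ">". Since [^>] also matches newlines, tags that
# are split across lines need no extra flags.
TAG_RE = re.compile(r'<[^>]*>')


def strip_tags(text):
    """Remove everything from a "<" to the next ">" (hypothetical helper)."""
    return TAG_RE.sub('', text)


print(strip_tags("<B>Today</B> is <U>Friday</U>"))  # Today is Friday

multiline = """You can find my <a
href='http://example.com'>thesis</a>
online"""
print(strip_tags(multiline))  # same text with the <a ...> and </a> tags gone

# Made-up input: a ">" inside an attribute value ends the match early,
# so the rest of the tag leaks into the output.
print(strip_tags('<img alt="a > b">photo'))  #  b">photo
```

The last call shows why this is only a best-effort guess: the pattern knows nothing about quoting, so it stops at the first ">" it sees.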
ph
On 01-Feb-2008, geoffbache wrote:
> I have some marked up text and would like to convert it to plain text,
> by simply removing all the tags. Of course I can do it from first
> principles but I felt that among all Python's markup tools there must
> be something that would do this simply, without having to create an
> XML parser etc.
>
> I've looked around a bit but failed to find anything, any tips?
>
> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")

Quick but very dirty way:

# Python 2 API; on Python 3 use urllib.request.urlopen() and decode the bytes
import urllib
data = urllib.urlopen('http://google.com').read()
data = ''.join([x.split('>', 1)[-1] for x in data.split('<')])

Feb 1 '08 #3
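
To see why the split-based one-liner above works, here is a small sketch that applies it to a literal string instead of a downloaded page. The function name strip_tags_by_split and the second input are assumptions introduced for illustration only.

```python
def strip_tags_by_split(markup):
    """Hypothetical wrapper around the split/join one-liner."""
    # Splitting on "<" yields chunks of the form "tag ...>text".
    # split('>', 1)[-1] keeps only what follows the first ">" in each
    # chunk; a chunk with no ">" at all is kept unchanged.
    return ''.join(chunk.split('>', 1)[-1] for chunk in markup.split('<'))


print(strip_tags_by_split("<B>Today</B> is <U>Friday</U>"))  # Today is Friday

# Made-up input: a literal ">" in the text before the first tag is
# treated like the end of a tag, so the text in front of it is lost.
print(strip_tags_by_split("1 > 0 is <b>true</b>"))  #  0 is true
```

This is the "very dirty" part: the trick assumes every ">" closes a tag, which holds for the example in the question but not for arbitrary text.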
Tim Chase wrote:
>> I have some marked up text and would like to convert it to plain text,
>> by simply removing all the tags. Of course I can do it from first
>> principles but I felt that among all Python's markup tools there must
>> be something that would do this simply, without having to create an
>> XML parser etc.
>>
>> I've looked around a bit but failed to find anything, any tips?
>>
>> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")
>
> Well, if all you want to do is remove everything from a "<" to a
> ">", you can use
>
>   >>> s = "<B>Today</B> is <U>Friday</U>"
>   >>> import re
>   >>> r = re.compile('<[^>]*>')
>   >>> print r.sub('', s)
>   Today is Friday
>
> it should even work for semi-pathological cases such as
>
>   s = """You can find my <a
>   href='http://example.com'>thesis</a>
>   online"""
>
> where the tag contents are split across lines. There are more
> pathological cases where tags aren't well-formed, e.g.
>
>   s = "This <tag>has a > sign in it and <odd<ly>-nested> tags"
>
> in which case you get what you deserve for making such
> pathological conditions ;-)

The real answer to this question is "learn how to use Beautiful Soup" --
see http://www.crummy.com/software/BeautifulSoup/

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/

Feb 1 '08 #4
>> Well, if all you want to do is remove everything from a "<" to a
>> ">", you can use
>>
>>   >>> s = "<B>Today</B> is <U>Friday</U>"
>>   >>> import re
>>   >>> r = re.compile('<[^>]*>')
>>   >>> print r.sub('', s)
>>   Today is Friday
>
> [Tim's ramblings about pathological cases snipped]
>
> The real answer to this question is "learn how to use Beautiful Soup" --
> see http://www.crummy.com/software/BeautifulSoup/

Yes, for more pathological cases, BS does a great job of parsing
junk :)

However, as BS isn't batteries-included [Aside: BS and pyparsing
are two common solutions to problems that would make great
additions to the standard library], using a RE to make a
best-effort guess is a good first approximation of a solution
without needing to download extra packages--no matter how useful
those extra packages may be.

-tkc

Feb 1 '08 #5
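
For readers who want a real parser without downloading extra packages, the sketch below uses html.parser.HTMLParser from the Python 3 standard library. This is a substitute technique, not something proposed in the thread, and it is not Beautiful Soup. The names TextExtractor and html_to_text and the <img> input are assumptions made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text content only; tags and attributes never
        # reach this method.
        self.parts.append(data)


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at the end
    return ''.join(parser.parts)


print(html_to_text("<B>Today</B> is <U>Friday</U>"))  # Today is Friday

# Made-up input: the parser understands quoted attribute values, so
# a ">" inside one does not leak into the output.
print(html_to_text('<img alt="a > b">photo'))  # photo
```

The trade-off is a few more lines than the regex in exchange for correct handling of quoted attributes.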
On Feb 1, 10:54 am, Tim Chase <python.l...@tim.thechases.com> wrote:
> Well, if all you want to do is remove everything from a "<" to a
> ">", you can use
>
>   >>> s = "<B>Today</B> is <U>Friday</U>"
>   >>> import re
>   >>> r = re.compile('<[^>]*>')
>   >>> print r.sub('', s)
>   Today is Friday

[Tim's ramblings about pathological cases snipped]

pyparsing includes an example script for stripping tags from HTML
source. See it on the wiki at http://pyparsing.wikispaces.com/spac...tmlStripper.py.

-- Paul
Feb 1 '08 #6
On Feb 1, 8:07 am, geoffbache <geoff.ba...@pobox.com> wrote:
> I have some marked up text and would like to convert it to plain text,

If this is just a quick and dirty problem, you can also use one of the
lynx/elinks/links2 browsers and dump the contents to a file. On Linux
it would be

  lynx -dump http://www.etc > text.txt

Lynx is also available for MS Windows, but I am not sure about the other
two.
Feb 2 '08 #7
geoffbache wrote:
> I have some marked up text and would like to convert it to plain text,
> by simply removing all the tags. Of course I can do it from first
> principles but I felt that among all Python's markup tools there must
> be something that would do this simply, without having to create an
> XML parser etc.
>
> I've looked around a bit but failed to find anything, any tips?
>
> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")

  >>> import lxml.etree as et
  >>> doc = et.HTML("<b>Today</b> is <u>Friday</u>")
  >>> et.tostring(doc, method='text', encoding=unicode)
  u'Today is Friday'

http://codespeak.net/lxml

Stefan
Feb 3 '08 #8
geoffbache wrote:
> I have some marked up text and would like to convert it to plain text,
> by simply removing all the tags. Of course I can do it from first
> principles but I felt that among all Python's markup tools there must
> be something that would do this simply, without having to create an
> XML parser etc.
>
> I've looked around a bit but failed to find anything, any tips?
>
> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")

This might be of interest:

http://pypi.python.org/pypi/haufe.stripml

Stefan
Feb 11 '08 #9
