I have some marked up text and would like to convert it to plain text,
by simply removing all the tags. Of course I can do it from first
principles but I felt that among all Python's markup tools there must
be something that would do this simply, without having to create an
XML parser etc.
I've looked around a bit but failed to find anything, any tips?
(e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")
Regards,
Geoff
> I have some marked up text and would like to convert it to plain text,
> by simply removing all the tags. Of course I can do it from first
> principles but I felt that among all Python's markup tools there must
> be something that would do this simply, without having to create an
> XML parser etc.
> I've looked around a bit but failed to find anything, any tips?
> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")
Well, if all you want to do is remove everything from a "<" to a
">", you can use
>>> s = "<B>Today</B> is <U>Friday</U>"
>>> import re
>>> r = re.compile('<[^>]*>')
>>> print(r.sub('', s))
Today is Friday
it should even work for semi-pathological cases such as
s = """You can find my <a
href='http://example.com'>thesis</a>
online"""
where the tag contents are split across lines. There are more
pathological cases where tags aren't well-formed, e.g.
s = "This <tag>has a > sign in it and <odd<ly>-nested> tags"
in which case you get what you deserve for making such
pathological conditions ;-)
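To see how the pattern behaves on the kinds of input discussed above, here is a minimal Python 3 sketch. The helper name `strip_tags` and the constant `TAG_RE` are illustrative, not part of any library, and the malformed string is just one plausible example of badly formed markup.

```python
import re

# "<", then any run of characters other than ">" (newlines included), then ">".
TAG_RE = re.compile(r'<[^>]*>')

def strip_tags(text):
    """Remove everything from a "<" up to the next ">" (hypothetical helper)."""
    return TAG_RE.sub('', text)

simple = "<B>Today</B> is <U>Friday</U>"
multiline = "You can find my <a\nhref='http://example.com'>thesis</a>\nonline"
malformed = "This <tag>has a > sign in it and <odd<ly>-nested> tags"

print(strip_tags(simple))     # Today is Friday
print(strip_tags(multiline))  # the tag that spans two lines is removed as well
print(strip_tags(malformed))  # stray ">" characters are left in the output
```

The character class `[^>]` also matches newlines, which is why the tag split across lines needs no special flag. For the malformed input the result still contains the loose ">" characters, so this remains a best-effort approach.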
-tkc
On 01-Feb-2008, geoffbache wrote:
> I have some marked up text and would like to convert it to plain text,
> by simply removing all the tags. Of course I can do it from first
> principles but I felt that among all Python's markup tools there must
> be something that would do this simply, without having to create an
> XML parser etc.
> I've looked around a bit but failed to find anything, any tips?
> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")
Quick but very dirty way:
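import urllib.request
# In Python 3, read() returns bytes, so decode before splitting.
data = urllib.request.urlopen('http://google.com').read().decode('utf-8', 'replace')
data = ''.join([x.split('>', 1)[-1] for x in data.split('<')])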
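The split-and-join trick can be tried without a network connection. The sketch below wraps it in a function (the name `strip_by_splitting` is made up for illustration) and runs it on the example from the original question, plus one input that shows where it goes wrong.

```python
def strip_by_splitting(data):
    """Split on "<", then drop everything up to the first ">" in each chunk."""
    return ''.join([x.split('>', 1)[-1] for x in data.split('<')])

print(strip_by_splitting("<B>Today</B> is <U>Friday</U>"))  # Today is Friday

# Caveat: a ">" in ordinary text is treated as the end of a tag,
# so the text before it is silently dropped.
print(repr(strip_by_splitting("a > b <i>c</i>")))  # ' b c'
```

This is why the approach is "very dirty": it never checks that a ">" actually closes a tag.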
Tim Chase wrote:
>> I have some marked up text and would like to convert it to plain text,
>> by simply removing all the tags. Of course I can do it from first
>> principles but I felt that among all Python's markup tools there must
>> be something that would do this simply, without having to create an
>> XML parser etc.
>> I've looked around a bit but failed to find anything, any tips?
>> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")
> Well, if all you want to do is remove everything from a "<" to a
> ">", you can use
> >>> s = "<B>Today</B> is <U>Friday</U>"
> >>> import re
> >>> r = re.compile('<[^>]*>')
> >>> print(r.sub('', s))
> Today is Friday
> it should even work for semi-pathological cases such as
> s = """You can find my <a
> href='http://example.com'>thesis</a>
> online"""
> where the tag contents are split across lines. There are more
> pathological cases where tags aren't well-formed, e.g.
> s = "This <tag>has a > sign in it and <odd<ly>-nested> tags"
> in which case you get what you deserve for making such
> pathological conditions ;-)
The real answer to this question is "learn how to use Beautiful Soup" --
see http://www.crummy.com/software/BeautifulSoup/
regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
> Well, if all you want to do is remove everything from a "<" to a
> ">", you can use
> >>> s = "<B>Today</B> is <U>Friday</U>"
> >>> import re
> >>> r = re.compile('<[^>]*>')
> >>> print(r.sub('', s))
> Today is Friday
[Tim's ramblings about pathological cases snipped]
> The real answer to this question is "learn how to use Beautiful Soup" --
> see http://www.crummy.com/software/BeautifulSoup/
Yes, for more pathological cases, BS does a great job of parsing
junk :)
However, as BS isn't batteries-included [Aside: BS and pyparsing
are two common solutions to problems that would make great
additions to the standard library], using a RE to make a
best-effort guess is a good first approximation of a solution
without needing to download extra packages--no matter how useful
those extra packages may be.
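For a standard-library-only option that goes one step beyond the regular expression, a small sketch built on Python 3's `html.parser` is shown below. This was not proposed in the thread; the class name `TextCollector` and the function `strip_tags` are made up for illustration.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects only the text between tags (hypothetical helper class)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text outside of tags; tags themselves are ignored.
        self.parts.append(data)

def strip_tags(markup):
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)

print(strip_tags("<B>Today</B> is <U>Friday</U>"))  # Today is Friday
print(strip_tags("<p>Fish &amp; chips</p>"))        # Fish & chips
```

Unlike the regular expression, this also decodes character references, but it is still a sketch: it keeps the contents of `<script>` and `<style>` elements as text, which may not be wanted.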
-tkc
On Feb 1, 10:54 am, Tim Chase <python.l...@tim.thechases.com> wrote:
> Well, if all you want to do is remove everything from a "<" to a
> ">", you can use
> >>> s = "<B>Today</B> is <U>Friday</U>"
> >>> import re
> >>> r = re.compile('<[^>]*>')
> >>> print(r.sub('', s))
> Today is Friday
> [Tim's ramblings about pathological cases snipped]
pyparsing includes an example script for stripping tags from HTML
source. See it on the wiki at http://pyparsing.wikispaces.com/spac...tmlStripper.py.
-- Paul
On Feb 1, 8:07 am, geoffbache <geoff.ba...@pobox.com> wrote:
> I have some marked up text and would like to convert it to plain text,
If this is just a quick and dirty problem, you can also use one of the
lynx/elinks/links2 browsers and dump the contents to a file. On Linux
it would be
lynx -dump http://www.etc > text.txt
Lynx is also available for MS Windows, but I am not sure about the other
two.
geoffbache wrote:
> I have some marked up text and would like to convert it to plain text,
> by simply removing all the tags. Of course I can do it from first
> principles but I felt that among all Python's markup tools there must
> be something that would do this simply, without having to create an
> XML parser etc.
> I've looked around a bit but failed to find anything, any tips?
> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")
>>> import lxml.etree as et
>>> doc = et.HTML("<b>Today</b> is <u>Friday</u>")
>>> et.tostring(doc, method='text', encoding='unicode')
'Today is Friday'

http://codespeak.net/lxml
Stefan
geoffbache wrote:
> I have some marked up text and would like to convert it to plain text,
> by simply removing all the tags. Of course I can do it from first
> principles but I felt that among all Python's markup tools there must
> be something that would do this simply, without having to create an
> XML parser etc.
> I've looked around a bit but failed to find anything, any tips?
> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")
This might be of interest: http://pypi.python.org/pypi/haufe.stripml
Stefan