Bytes IT Community

How do I convert escaped HTML into a string?

I've done a Google search on this but, amazingly, I'm the first guy to
ever need this! Everyone else seems to need the reverse of this. Actually,
I did find some people who complained about this and rolled their own
solution but I refuse to believe that Python doesn't have a built-in
solution to what must be a very common problem.
So, how do I convert HTML to plaintext? Something like this:
<div>This&nbsp;is&nbsp;a&nbsp;string.</div>
...into:
This is a string.
Actually, the ideal would be a function that takes an HTML string and
converts it into the string that the HTML would correspond to. For instance,
converting:
<div>This &amp; that
or the other thing.</div>
...into:
This & that or the other thing.
...since HTML seems to convert any amount and type of whitespace into a
single space (a bizarre design choice if I've ever seen one).
Surely, Python can already do this, right?
Thank you...
Nov 24 '07 #1
5 Replies


Just Another Victim of the Ambient Morality wrote:
> I've done a Google search on this but, amazingly, I'm the first guy to
> ever need this!

You cannot infer that from a Google search.

> So, how do I convert HTML to plaintext? Something like this:
>
> <div>This&nbsp;is&nbsp;a&nbsp;string.</div>
>
> ...into:
>
> This is a string.
>
> Actually, the ideal would be a function that takes an HTML string and
> converts it into the string that the HTML would correspond to. For instance,
> converting:
>
> <div>This &amp; that
> or the other thing.</div>
>
> ...into:
>
> This & that or the other thing.
>
> ...since HTML seems to convert any amount and type of whitespace into a
> single space (a bizarre design choice if I've ever seen one).

So what you want to do is parse HTML and extract the text content. There are
quite a few ways to do that, including lxml.html:

http://codespeak.net/lxml/dev/lxmlhtml.html
>>> htmldata = """<div>This &amp; that
... or the other thing.</div>"""
>>> from lxml import html
>>> print(html.fragment_fromstring(htmldata).text_content())
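
Note that extracting the text content only strips the markup and decodes the entities. As far as I know it leaves the whitespace as it was in the source, so the line break would still be there. A minimal sketch of collapsing it, assuming Python 3; `collapse_whitespace` is a made-up helper name (not part of lxml), and the sample string is typed in by hand rather than produced by a parser:

```python
def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim both ends."""
    # str.split() with no arguments splits on any run of whitespace,
    # which in Python 3 includes newlines, tabs and non-breaking
    # spaces (U+00A0, what &nbsp; decodes to).
    return " ".join(text.split())


# Hand-written stand-in for text extracted from the HTML fragment.
extracted = "This & that\nor the other thing."
print(collapse_whitespace(extracted))  # This & that or the other thing.
```

This is roughly the whitespace rule a browser applies to ordinary inline text; content inside `<pre>` would need to be handled separately.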
Stefan
Nov 24 '07 #2

This may help:

http://effbot.org/zone/re-sub.htm#strip-html

You should take care, as there are several issues in going from HTML to text:

1) <p>What should <b>we</b>do about<br />this?</p>
You need to strip all the tags.

2) &quot;, &amp;, &lt;, &gt;... and I could keep going. We need to
convert all of those.

3) We need to collapse all whitespace: tabs, newlines, etc. (Maybe
breaks should be treated as newlines in the new text?)

The link above solves several of these issues, so it can serve as a good
starting point.
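
For what it's worth, the three steps above can also be sketched with nothing but the standard library. This is only a rough illustration, assuming Python 3; `TextExtractor` and `html_to_text` are names I made up for the example, and it makes no attempt to skip `<script>` or `<style>` content or to honor `<pre>`:

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, dropping tags and decoding entities."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode &amp;, &nbsp;, etc.
        # before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Step 1: tags are dropped; only <br> leaves a trace, as a newline.
        # (<br /> also ends up here via the default handle_startendtag.)
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Steps 2 and 3: data is already entity-decoded; collapse each run
        # of whitespace (including source newlines) into a single space.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.parts)
    # Tidy each output line: trim it and merge spaces that met at tag borders.
    lines = [" ".join(line.split()) for line in text.split("\n")]
    return "\n".join(lines).strip()


print(html_to_text("<div>This &amp; that\nor the other thing.</div>"))
```

Here a `<br>` is treated as a line break and every other tag simply disappears, which is one possible answer to the question in point 3, not the only reasonable one.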

Best,
Sergio
Nov 24 '07 #3

On Sat, 24 Nov 2007 05:42:06 +0000, Just Another Victim of the Ambient
Morality wrote:
> ...since HTML seems to convert any amount and type of whitespace into a
> single space (a bizarre design choice if I've ever seen one).

Not really. Just imagine what web pages would look like if whitespace were
preserved. What matters is the actual text in the source, not its
formatting. That's left to the browser.

Ciao,
Marc 'BlackJack' Rintsch
Nov 24 '07 #4

le**@citymutual.com wrote:
> On 24 Nov, 05:42, "Just Another Victim of the Ambient Morality"
> <ihates...@hotmail.com> wrote:
>> I did find some people who complained about this and rolled their own
>> solution but I refuse to believe that Python doesn't have a built-in
>> solution to what must be a very common problem.
>
> Replace "python" with "c++" and would that seem a reasonable belief?

That's different, as Python comes with batteries included.

Stefan
Nov 24 '07 #5

Stefan Behnel wrote:
> le**@citymutual.com wrote:
>> On 24 Nov, 05:42, "Just Another Victim of the Ambient Morality"
>> <ihates...@hotmail.com> wrote:
>>> I did find some people who complained about this and rolled their own
>>> solution but I refuse to believe that Python doesn't have a built-in
>>> solution to what must be a very common problem.
>>
>> Replace "python" with "c++" and would that seem a reasonable belief?
>
> That's different, as Python comes with batteries included.

Unfortunately, you still have to write a couple of lines of code every once
in a while !-)

Nov 24 '07 #6
