How do I convert escaped HTML into a string?

I've done a Google search on this but, amazingly, I'm the first guy to
ever need this! Everyone else seems to need the reverse of this. Actually,
I did find some people who complained about this and rolled their own
solution but I refuse to believe that Python doesn't have a built-in
solution to what must be a very common problem.
So, how do I convert HTML to plaintext? Something like this:
<div>This&nbsp;is&nbsp;a&nbsp;string.</div>
...into:
This is a string.
Actually, the ideal would be a function that takes an HTML string and
converts it into a string that the HTML would correspond to. For instance,
converting:
<div>This &amp; that
or the other thing.</div>
...into:
This & that or the other thing.
...since HTML seems to convert any amount and type of whitespace into a
single space (a bizarre design choice if I've ever seen one).
Surely, Python can already do this, right?
Thank you...
Nov 24 '07 #1
Just Another Victim of the Ambient Morality wrote:
> I've done a Google search on this but, amazingly, I'm the first guy to
> ever need this!

You cannot infer that from a Google search.

> So, how do I convert HTML to plaintext? Something like this:
>
> <div>This&nbsp;is&nbsp;a&nbsp;string.</div>
>
> ...into:
>
> This is a string.
>
> Actually, the ideal would be a function that takes an HTML string and
> converts it into a string that the HTML would correspond to. For instance,
> converting:
>
> <div>This &amp; that
> or the other thing.</div>
>
> ...into:
>
> This & that or the other thing.
>
> ...since HTML seems to convert any amount and type of whitespace into a
> single space (a bizarre design choice if I've ever seen one).

So what you want to do is parse HTML and extract the text content. There are
quite a few ways to do that, including lxml.html:

http://codespeak.net/lxml/dev/lxmlhtml.html
>>> htmldata = """<div>This &amp; that
... or the other thing.</div>"""
>>> from lxml import html
>>> print(html.fragment_fromstring(htmldata).text_content())
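
For readers who would rather stay inside the standard library, roughly the same idea can be sketched with html.parser from Python 3. This is only an illustration, not something posted in the thread: the names TextExtractor and html_to_text are made up here, and the whitespace collapsing at the end is an added assumption about what the original poster wants.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True asks the parser to decode references such
        # as &amp; and &nbsp; before handing text to handle_data().
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)


def html_to_text(fragment):
    """Return the text content of `fragment` with whitespace runs collapsed."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    text = "".join(parser.chunks)
    # str.split() with no arguments splits on any Unicode whitespace,
    # including the non-breaking space that &nbsp; decodes to.
    return " ".join(text.split())


if __name__ == "__main__":
    print(html_to_text("<div>This &amp; that\nor the other thing.</div>"))
    print(html_to_text("<div>This&nbsp;is&nbsp;a&nbsp;string.</div>"))
```

Note that this turns non-breaking spaces into ordinary spaces, which matches the output asked for in the question but is not what a browser does.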
Stefan
Nov 24 '07 #2
This may help:

http://effbot.org/zone/re-sub.htm#strip-html

You should take care, as there are several issues in going from HTML to text:

1) <p>What should <b>we</b> do about<br />this?</p>
You need to strip all tags.

2) &quot;, &amp;, &lt;, and &gt;... and I could keep going. We need to
convert all of those.

3) We need to collapse all whitespace: tabs, new lines, etc. (Maybe
breaks should be considered as new lines in the new text?)

The link above solves several of these issues; it can serve as a good
starting point.
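
The three steps listed above can be sketched in a few lines of Python 3. This is a rough illustration, not the code behind the link: strip_html, BR_RE and TAG_RE are hypothetical names, and the regular expression for tags is deliberately naive (it will trip over comments, script blocks, or a ">" inside an attribute value). It assumes that <br> should become a newline, as suggested in point 3.

```python
import html
import re

# Naive patterns, good enough for simple fragments only.
BR_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def strip_html(source):
    """Hypothetical helper: HTML fragment -> plain text, one line per <br>."""
    lines = []
    for piece in BR_RE.split(source):
        # 1) Strip every remaining tag.
        piece = TAG_RE.sub("", piece)
        # 2) Decode entities. Doing this after step 1 keeps an escaped
        #    "&lt;b&gt;" from being mistaken for a real tag.
        piece = html.unescape(piece)
        # 3) Collapse each run of whitespace into a single space.
        lines.append(" ".join(piece.split()))
    return "\n".join(lines)


if __name__ == "__main__":
    print(strip_html("<p>What should <b>we</b> do about<br />this?</p>"))
    print(strip_html("1 &lt; 2 &amp;&amp; 3 &gt; 2"))
```

For anything beyond small, well-formed fragments, a real parser is the safer choice.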

Best,
Sergio
Nov 24 '07 #3
On Sat, 24 Nov 2007 05:42:06 +0000, Just Another Victim of the Ambient
Morality wrote:
> ...since HTML seems to convert any amount and type of whitespace into a
> single space (a bizarre design choice if I've ever seen one).

Not really. Just imagine what web pages would look like if whitespace were
preserved. What matters is the actual text in the source, not the
formatting. That's left to the browser.
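
As a rough approximation of that collapsing rule, the sketch below folds runs of ASCII whitespace (space, tab, line feed, form feed, carriage return) into one space. The function name collapse_like_browser is made up for illustration, and real browsers are more nuanced (CSS white-space settings, block boundaries). A non-breaking space, which is what &nbsp; decodes to, is not collapsed by a browser, so it is left alone here.

```python
import re

# Space, tab, line feed, form feed, carriage return.
ASCII_WS_RE = re.compile(r"[ \t\n\f\r]+")


def collapse_like_browser(text):
    """Hypothetical helper: fold ASCII whitespace runs into single spaces."""
    return ASCII_WS_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(collapse_like_browser("This   &  that\n\tor the other thing."))
    # The two non-breaking spaces (U+00A0) survive unchanged.
    print(repr(collapse_like_browser("a\u00a0\u00a0b")))
```

By contrast, str.split() with no arguments also treats the non-breaking space as whitespace, so " ".join(text.split()) would flatten it to an ordinary space.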

Ciao,
Marc 'BlackJack' Rintsch
Nov 24 '07 #4
le**@citymutual.com wrote:
> On 24 Nov, 05:42, "Just Another Victim of the Ambient Morality"
> <ihates...@hotmail.com> wrote:
>> I did find some people who complained about this and rolled their own
>> solution but I refuse to believe that Python doesn't have a built-in
>> solution to what must be a very common problem.
>
> Replace "python" with "c++" and would that seem a reasonable belief?

That's different, as Python comes with batteries included.

Stefan
Nov 24 '07 #5
Stefan Behnel wrote:
> le**@citymutual.com wrote:
>> On 24 Nov, 05:42, "Just Another Victim of the Ambient Morality"
>> <ihates...@hotmail.com> wrote:
>>
>>> I did find some people who complained about this and rolled their own
>>> solution but I refuse to believe that Python doesn't have a built-in
>>> solution to what must be a very common problem.
>>
>> Replace "python" with "c++" and would that seem a reasonable belief?
>
> That's different, as Python comes with batteries included.

Unfortunately, you still have to write a couple of lines of code every once
in a while !-)

Nov 24 '07 #6
