How do you htmlentities in Python

js
Hi list.

If I'm not mistaken, in Python there's no standard library function to convert
HTML entities, like &amp; or &gt;, into their applicable characters.

htmlentitydefs provides maps that help with this conversion,
but it's not a function, so you have to write your own function
that makes use of htmlentitydefs, probably using a regex or something.

To me this seemed odd because Python is known as a
'Batteries Included' language.

So my questions are
1. Why doesn't python have/need entity encoding/decoding?
2. Is there any idiom to do entity encode/decode in python?

Thank you in advance...
Jun 4 '07 #1
As far as I know, there isn't a standard idiom to do this, but it's
still a one-liner. Untested, but I think this should work:

import re
from htmlentitydefs import name2codepoint
def htmlentitydecode(s):
    return re.sub('&(%s);' % '|'.join(name2codepoint), lambda m:
        name2codepoint[m.group(1)], s)

Jun 4 '07 #2
In article <11**********************@q75g2000hsh.googlegroups.com>,
Adam Atlas <ad**@atlas.st> wrote:
>As far as I know, there isn't a standard idiom to do this, but it's
>still a one-liner. Untested, but I think this should work:
>
>import re
>from htmlentitydefs import name2codepoint
>def htmlentitydecode(s):
>    return re.sub('&(%s);' % '|'.join(name2codepoint), lambda m:
>        name2codepoint[m.group(1)], s)

How strange that this doesn't appear in the Cookbook! I'm
curious about how others think: does such an item better
belong in the Cookbook, or the Wiki?
Jun 4 '07 #3
"Adam Atlas" <ad**@atlas.st> wrote in message
news:11**********************@q75g2000hsh.googlegroups.com...
>As far as I know, there isn't a standard idiom to do this, but it's
>still a one-liner. Untested, but I think this should work:
>
>import re
>from htmlentitydefs import name2codepoint
>def htmlentitydecode(s):
>    return re.sub('&(%s);' % '|'.join(name2codepoint), lambda m:
>        name2codepoint[m.group(1)], s)

'&(%s);' won't quite work: HTML (and, I assume, SGML, but not XHTML being
XML) allows you to skip the semicolon after the entity if it's followed by a
white space (IIRC). Should this be respected, it looks more like this:
r'&(%s)([;\s]|$)'

Also, this completely ignores non-name entities as also found in XML. (eg
&#x20; for ' ' or so) Maybe some part of the HTMLParser module is useful, I
wouldn't know. IMHO, these particular batteries aren't too commonly needed.

Regards,
Thomas Jollans
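
For readers on a current interpreter: Python 3.4 and later ship html.unescape(), which appears to cover both points raised here, numeric character references and the legacy names that HTML tolerates without a semicolon. A minimal sketch, assuming Python 3 (the thread itself predates it); the `samples` list is purely illustrative and not taken from the thread:

```python
import html

# Illustrative inputs only; these names and strings are assumptions.
samples = [
    "&amp; &lt; &gt;",   # ordinary named entities
    "&#x20;|&#8212;",    # hex and decimal numeric character references
    "tr&eacutes mal",    # legacy entity name with the semicolon missing
]

# html.unescape() follows the HTML5 rules, so it decodes all three forms.
decoded = [html.unescape(s) for s in samples]

for before, after in zip(samples, decoded):
    print(repr(before), "->", repr(after))
```

None of this is available on the Python 2 interpreters discussed in the thread, where a hand-written function is still needed.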
Jun 4 '07 #4
In article <11**********************@q75g2000hsh.googlegroups.com>,
Adam Atlas <ad**@atlas.st> wrote:
>As far as I know, there isn't a standard idiom to do this, but it's
>still a one-liner. Untested, but I think this should work:
>
>import re
>from htmlentitydefs import name2codepoint
>def htmlentitydecode(s):
>    return re.sub('&(%s);' % '|'.join(name2codepoint), lambda m:
>        name2codepoint[m.group(1)], s)

A. I *think* you meant

import re
from htmlentitydefs import name2codepoint
def htmlentitydecode(s):
    # chr() only accepts values below 256 on Python 2; unichr() is
    # needed for the many entities beyond that range (e.g. &mdash;).
    return re.sub('&(%s);' % '|'.join(name2codepoint),
                  lambda m: chr(name2codepoint[m.group(1)]), s)

We're stretching the limits of what's comfortable
for me as a one-liner.

B. How's it happen this isn't in the Cookbook? I'm
curious about what other Pythoneers think: is
this better memorialized in the Cookbook or the
Wiki?
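
The same idea can be sketched for Python 3, where htmlentitydefs lives on as html.entities and chr() covers the whole Unicode range. This is only an illustration of the regex approach discussed above, not a replacement for a real HTML parser; the precompiled `_NAMED` pattern is an assumption added here for readability:

```python
import re
from html.entities import name2codepoint  # Python 3 home of htmlentitydefs

# One alternation over every known entity name. The trailing ';' forces the
# regex to backtrack to the right name when one is a prefix of another.
_NAMED = re.compile("&(%s);" % "|".join(map(re.escape, name2codepoint)))

def htmlentitydecode(s):
    # Map each matched name to its character; anything else is left alone.
    return _NAMED.sub(lambda m: chr(name2codepoint[m.group(1)]), s)

print(htmlentitydecode("caf&eacute; &amp; cr&egrave;me, &bogus; stays put"))
```

Like the original, this only handles named entities that end in a semicolon.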
Jun 4 '07 #5
I think this is the standard idiom:
>>> import xml.sax.saxutils as saxutils
>>> saxutils.escape("&")
'&amp;'
>>> saxutils.unescape("&gt;")
'>'
>>> saxutils.unescape("A bunch of text with entities: &amp; &gt; &lt;")
'A bunch of text with entities: & > <'

Notice there is an optional parameter (a dict) that can be used to
define additional entities as well.

Matt
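
To illustrate the optional dict mentioned above: out of the box saxutils only knows &amp;, &lt; and &gt;, so anything else has to be supplied by the caller. A small sketch, written for Python 3; the two mapping dicts are hypothetical examples, not part of the library:

```python
from xml.sax import saxutils

# Hypothetical extra mappings. Keys are the exact strings to be replaced.
extra_decode = {"&quot;": '"', "&copy;": "\u00a9"}
extra_encode = {'"': "&quot;"}

# unescape() handles &lt;, &gt; and &amp; itself and applies the dict as well.
decoded = saxutils.unescape("&quot;A &amp; B&quot; &copy; 2007", extra_decode)

# escape() does the reverse: &, < and > first, then the extra mappings.
encoded = saxutils.escape('say "a < b"', extra_encode)

print(decoded)
print(encoded)
```

Entities that are neither built in nor listed in the dict pass through unchanged, so this fits XML-ish data better than arbitrary HTML.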

Jun 4 '07 #6
js
Thank you, Matimus.
That's exactly what I'm looking for!
Easy, clean and customizable.
I love python :)

Jun 4 '07 #7
"Thomas Jollans" <th****@jollans.NOSPAM.com> writes:
>"Adam Atlas" <ad**@atlas.st> wrote in message
>news:11**********************@q75g2000hsh.googlegroups.com...
>>As far as I know, there isn't a standard idiom to do this, but it's
>>still a one-liner. Untested, but I think this should work:
>>
>>import re
>>from htmlentitydefs import name2codepoint
>>def htmlentitydecode(s):
>>    return re.sub('&(%s);' % '|'.join(name2codepoint), lambda m:
>>        name2codepoint[m.group(1)], s)
>
>'&(%s);' won't quite work: HTML (and, I assume, SGML, but not XHTML being
>XML) allows you to skip the semicolon after the entity if it's followed by a
>white space (IIRC). Should this be respected, it looks more like this:
>r'&(%s)([;\s]|$)'
>
>Also, this completely ignores non-name entities as also found in XML. (eg
>&#x20; for ' ' or so) Maybe some part of the HTMLParser module is useful, I
>wouldn't know. IMHO, these particular batteries aren't too commonly needed.

Here's one that handles numeric character references, and chooses to
leave entity references that are not defined in standard library
module htmlentitydefs intact, rather than throwing an exception.

It ignores the missing semicolon issue (and note also that IE can cope
with even a missing space, like "tr&eacutes mal", so you'll see that
in the wild). Probably it could be adapted to handle that (possibly
the presumably-slower htmllib-based recipe on the python.org wiki
already does handle that, not sure).

import htmlentitydefs
import re
import unittest

def unescape_charref(ref):
    # ref looks like "&#38;" or "&#x2014;"
    name = ref[2:-1]
    base = 10
    if name.startswith("x"):
        name = name[1:]
        base = 16
    return unichr(int(name, base))

def replace_entities(match):
    ent = match.group()
    if ent[1] == "#":
        return unescape_charref(ent)

    repl = htmlentitydefs.name2codepoint.get(ent[1:-1])
    if repl is not None:
        repl = unichr(repl)
    else:
        # unknown entity name: leave it intact
        repl = ent
    return repl

def unescape(data):
    return re.sub(r"&#?[A-Za-z0-9]+?;", replace_entities, data)

class UnescapeTests(unittest.TestCase):

    def test_unescape_charref(self):
        self.assertEqual(unescape_charref(u"&#38;"), u"&")
        self.assertEqual(unescape_charref(u"&#x2014;"), u"\N{EM DASH}")
        self.assertEqual(unescape_charref(u"&#8212;"), u"\N{EM DASH}")

    def test_unescape(self):
        self.assertEqual(
            unescape(u"&amp; &lt; &mdash; &#8212; &#x2014;"),
            u"& < %s %s %s" % tuple(u"\N{EM DASH}"*3)
            )
        self.assertEqual(unescape(u"&a&amp;"), u"&a&")
        self.assertEqual(unescape(u"a&amp;"), u"a&")
        self.assertEqual(unescape(u"&nonexistent;"), u"&nonexistent;")

unittest.main()

John
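
A rough Python 3 port of the approach in this post, for readers who no longer have htmlentitydefs or unichr. It is a sketch under a few assumptions rather than a faithful copy: the module is html.entities, chr() replaces unichr(), and malformed or out-of-range numeric references are left intact instead of raising, which is a behaviour added here and not part of the posted code.

```python
import re
from html.entities import name2codepoint

def unescape_charref(ref):
    """Decode one numeric reference such as '&#38;' or '&#x2014;'."""
    name = ref[2:-1]                      # drop leading '&#' and trailing ';'
    base = 10
    if name.startswith(("x", "X")):
        name = name[1:]
        base = 16
    return chr(int(name, base))

def replace_entity(match):
    ent = match.group()
    if ent[1] == "#":
        try:
            return unescape_charref(ent)
        except (ValueError, OverflowError):
            return ent                    # not a valid number: leave as-is
    codepoint = name2codepoint.get(ent[1:-1])
    # Unknown names are returned unchanged, as in the post above.
    return chr(codepoint) if codepoint is not None else ent

def unescape(data):
    return re.sub(r"&#?[A-Za-z0-9]+?;", replace_entity, data)

print(unescape("&amp; &lt; &mdash; &#8212; &#x2014; &nonexistent;"))
```

As in the posted version, a reference with a missing semicolon is not matched at all and passes through untouched.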
Jun 6 '07 #9
