How do you htmlentities in Python

Hi list.

If I'm not mistaken, Python has no standard library function to convert
HTML entities, like &amp; or &gt;, into their corresponding characters.

htmlentitydefs provides maps that help with this conversion,
but it's not a function, so you have to write your own function
that makes use of htmlentitydefs, probably using a regex or something.

To me this seemed odd because Python is known as a
'Batteries Included' language.

So my questions are
1. Why doesn't python have/need entity encoding/decoding?
2. Is there any idiom to do entity encode/decode in python?

Thank you in advance...

Jun 4 '07 #1


Adam Atlas

As far as I know, there isn't a standard idiom to do this, but it's
still a one-liner. Untested, but I think this should work:

import re
from htmlentitydefs import name2codepoint
def htmlentitydecode(s):
    return re.sub('&(%s);' % '|'.join(name2codepoint), lambda m:
        name2codepoint[m.group(1)], s)

Jun 4 '07 #2

Cameron Laird

In article <11**********************@q75g2000hsh.googlegroups.com>,
Adam Atlas <ad**@atlas.st> wrote:

> As far as I know, there isn't a standard idiom to do this, but it's
> still a one-liner. Untested, but I think this should work:
>
> import re
> from htmlentitydefs import name2codepoint
> def htmlentitydecode(s):
>     return re.sub('&(%s);' % '|'.join(name2codepoint), lambda m:
>         name2codepoint[m.group(1)], s)

How strange that this doesn't appear in the Cookbook! I'm
curious about how others think: does such an item better
belong in the Cookbook, or the Wiki?

Jun 4 '07 #3

Thomas Jollans

"Adam Atlas" <ad**@atlas.st> wrote in message
news:11**********************@q75g2000hsh.googlegroups.com...

> As far as I know, there isn't a standard idiom to do this, but it's
> still a one-liner. Untested, but I think this should work:
>
> import re
> from htmlentitydefs import name2codepoint
> def htmlentitydecode(s):
>     return re.sub('&(%s);' % '|'.join(name2codepoint), lambda m:
>         name2codepoint[m.group(1)], s)

'&(%s);' won't quite work: HTML (and, I assume, SGML, but not XHTML, it being
XML) allows you to skip the semicolon after the entity if it's followed by
whitespace (IIRC). Should this be respected, it looks more like this:
r'&(%s)([;\s]|$)'

Also, this completely ignores numeric character references, which are also
found in XML (e.g. &#x20; for ' '). Maybe some part of the HTMLParser module
is useful, I wouldn't know. IMHO, these particular batteries aren't too
commonly needed.

Regards,
Thomas Jollans
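
Editor's sketch of the two points raised in this post, written for Python 3 (which postdates the thread; there htmlentitydefs is named html.entities). It accepts a missing semicolon when the reference is followed by whitespace or the end of the string, and it also decodes decimal and hexadecimal character references. The function name decode_entities is hypothetical, and the lookahead (which keeps the whitespace instead of consuming it as the pattern above would) is a design choice of this sketch, not something from the thread.

```python
import re
from html.entities import name2codepoint  # Python 3 name for htmlentitydefs

# Try longer names first so the alternation prefers the longest match.
_NAMES = "|".join(sorted(name2codepoint, key=len, reverse=True))

# A reference ends with ';', or with nothing at all if whitespace or the
# end of the string follows. The lookahead leaves the whitespace in place.
_ENTITY = re.compile(
    r"&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|(%s))(?:;|(?=\s)|$)" % _NAMES
)

def decode_entities(text):
    """Decode named and numeric references (hypothetical helper)."""
    def repl(match):
        dec, hexa, name = match.groups()
        if dec is not None:
            return chr(int(dec))        # out-of-range values raise ValueError
        if hexa is not None:
            return chr(int(hexa, 16))
        return chr(name2codepoint[name])
    return _ENTITY.sub(repl, text)

print(decode_entities("fish &amp chips &lt;b&gt; a&#x20;b"))
```

References that are followed directly by other text without a semicolon, such as "&ampere", are left untouched by this pattern.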

Jun 4 '07 #4

Cameron Laird

In article <11**********************@q75g2000hsh.googlegroups.com>,
Adam Atlas <ad**@atlas.st> wrote:

> As far as I know, there isn't a standard idiom to do this, but it's
> still a one-liner. Untested, but I think this should work:
>
> import re
> from htmlentitydefs import name2codepoint
> def htmlentitydecode(s):
>     return re.sub('&(%s);' % '|'.join(name2codepoint), lambda m:
>         name2codepoint[m.group(1)], s)

A. I *think* you meant
       import re
       from htmlentitydefs import name2codepoint
       def htmlentitydecode(s):
           return re.sub('&(%s);' % '|'.join(name2codepoint), lambda m: chr(name2codepoint[m.group(1)]), s)
   We're stretching the limits of what's comfortable
   for me as a one-liner.
B. How's it happen this isn't in the Cookbook? I'm
   curious about what other Pythoneers think: is
   this better memorialized in the Cookbook or the
   Wiki?
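
Editor's sketch of the corrected one-liner, assuming Python 3, where htmlentitydefs has become html.entities and chr() accepts any Unicode code point. On the Python 2 of this thread, chr() only accepts values 0 to 255, so names such as mdash would need unichr() instead. The precompiled pattern name _NAMED is an illustrative choice, not part of the original posts.

```python
import re
from html.entities import name2codepoint  # Python 3 name for htmlentitydefs

# One alternation of every known entity name, compiled once.
_NAMED = re.compile("&(%s);" % "|".join(name2codepoint))

def htmlentitydecode(s):
    # Map each matched name to its code point, then to a character.
    return _NAMED.sub(lambda m: chr(name2codepoint[m.group(1)]), s)

print(htmlentitydecode("&lt;b&gt; fish &amp; chips &bogus;"))
```

Unknown names and numeric references are left as they are, because the pattern only matches names listed in name2codepoint.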

Jun 4 '07 #5

Matimus

On Jun 4, 6:31 am, "js " <ebgs...@gmail.com> wrote:

> Hi list.
>
> If I'm not mistaken, in python, there's no standard library to convert
> html entities, like &amp; or &gt; into their applicable characters.
>
> htmlentitydefs provides maps that helps this conversion,
> but it's not a function so you have to write your own function
> make use of htmlentitydefs, probably using regex or something.
>
> To me this seemed odd because python is known as
> 'Batteries Included' language.
>
> So my questions are
> 1. Why doesn't python have/need entity encoding/decoding?
> 2. Is there any idiom to do entity encode/decode in python?
>
> Thank you in advance.

I think this is the standard idiom:

>>> import xml.sax.saxutils as saxutils
>>> saxutils.escape("&")
'&amp;'
>>> saxutils.unescape("&gt;")
'>'
>>> saxutils.unescape("A bunch of text with entities: &amp; &gt; &lt;")
'A bunch of text with entities: & > <'

Notice there is an optional parameter (a dict) that can be used to
define additional entities as well.

Matt
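
Editor's sketch of the optional dict parameter mentioned above, assuming Python 3. By default xml.sax.saxutils handles only &amp;, &lt; and &gt;; anything else has to be supplied by the caller. The names EXTRA_DECODE and EXTRA_ENCODE are illustrative.

```python
from xml.sax.saxutils import escape, unescape

# For unescape, keys are the entity strings and values their replacements.
EXTRA_DECODE = {"&quot;": '"', "&apos;": "'"}
# For escape, keys are the raw characters and values the entities.
EXTRA_ENCODE = {'"': "&quot;"}

print(unescape("&quot;x&quot; &amp; &lt;y&gt;", EXTRA_DECODE))
print(escape('a < "b" & c', EXTRA_ENCODE))

# Entities outside the built-in three and the extra dict pass through as is.
print(unescape("caf&eacute;"))
```

Because the defaults cover only the three XML-critical characters, this approach suits XML-style escaping better than decoding the full set of HTML named entities.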

Jun 4 '07 #6

Thank you, Matimus.
That's exactly what I'm looking for!
Easy, clean and customizable.
I love Python :)


Jun 4 '07 #7

Cameron Laird

In article <11**********************@q19g2000prn.googlegroups.com>,
Matimus <mc******@gmail.com> wrote:

> On Jun 4, 6:31 am, "js " <ebgs...@gmail.com> wrote:
>> Hi list.
>>
>> If I'm not mistaken, in python, there's no standard library to convert
>> html entities, like &amp; or &gt; into their applicable characters.
>>
>> htmlentitydefs provides maps that helps this conversion,
>> but it's not a function so you have to write your own function
>> make use of htmlentitydefs, probably using regex or something.
>>
>> To me this seemed odd because python is known as
>> 'Batteries Included' language.
>>
>> So my questions are
>> 1. Why doesn't python have/need entity encoding/decoding?
>> 2. Is there any idiom to do entity encode/decode in python?
>>
>> Thank you in advance.
>
> I think this is the standard idiom:
>
> >>> import xml.sax.saxutils as saxutils
> >>> saxutils.escape("&")
> '&amp;'
> >>> saxutils.unescape("&gt;")
> '>'
> >>> saxutils.unescape("A bunch of text with entities: &amp; &gt; &lt;")
> 'A bunch of text with entities: & > <'
>
> Notice there is an optional parameter (a dict) that can be used to
> define additional entities as well.

Jun 5 '07 #8

John J. Lee

"Thomas Jollans" <th****@jollans.NOSPAM.com> writes:

> "Adam Atlas" <ad**@atlas.st> wrote in message
> news:11**********************@q75g2000hsh.googlegroups.com...
>> As far as I know, there isn't a standard idiom to do this, but it's
>> still a one-liner. Untested, but I think this should work:
>>
>> import re
>> from htmlentitydefs import name2codepoint
>> def htmlentitydecode(s):
>>     return re.sub('&(%s);' % '|'.join(name2codepoint), lambda m:
>>         name2codepoint[m.group(1)], s)
>
> '&(%s);' won't quite work: HTML (and, I assume, SGML, but not XHTML being
> XML) allows you to skip the semicolon after the entity if it's followed by a
> white space (IIRC). Should this be respected, it looks more like this:
> r'&(%s)([;\s]|$)'
>
> Also, this completely ignores non-name entities as also found in XML. (eg
> &#x20; for ' ' or so) Maybe some part of the HTMLParser module is useful, I
> wouldn't know. IMHO, these particular batteries aren't too commonly needed.

Here's one that handles numeric character references, and chooses to
leave entity references that are not defined in standard library
module htmlentitydefs intact, rather than throwing an exception.

It ignores the missing semicolon issue (and note also that IE can cope
with even a missing space, like "tr&eacutes mal", so you'll see that
in the wild). Probably it could be adapted to handle that (possibly
the presumably-slower htmllib-based recipe on the python.org wiki
already does handle that, not sure).

import htmlentitydefs
import re
import unittest

def unescape_charref(ref):
    # ref looks like "&#38;" or "&#x2014;": strip "&#" and ";"
    name = ref[2:-1]
    base = 10
    if name.startswith("x"):
        name = name[1:]
        base = 16
    return unichr(int(name, base))

def replace_entities(match):
    ent = match.group()
    if ent[1] == "#":
        return unescape_charref(ent)

    # Named entity: leave it intact if htmlentitydefs doesn't know it
    repl = htmlentitydefs.name2codepoint.get(ent[1:-1])
    if repl is not None:
        repl = unichr(repl)
    else:
        repl = ent
    return repl

def unescape(data):
    return re.sub(r"&#?[A-Za-z0-9]+?;", replace_entities, data)

class UnescapeTests(unittest.TestCase):

    def test_unescape_charref(self):
        self.assertEqual(unescape_charref(u"&#38;"), u"&")
        self.assertEqual(unescape_charref(u"&#x2014;"), u"\N{EM DASH}")
        self.assertEqual(unescape_charref(u"&#8212;"), u"\N{EM DASH}")

    def test_unescape(self):
        self.assertEqual(
            unescape(u"&amp; &lt; &mdash; &#8212; &#x2014;"),
            u"& < %s %s %s" % tuple(u"\N{EM DASH}"*3)
            )
        self.assertEqual(unescape(u"&a&amp;"), u"&a&")
        self.assertEqual(unescape(u"a&amp;"), u"a&")
        self.assertEqual(unescape(u"&nonexistent;"), u"&nonexistent;")

if __name__ == "__main__":
    unittest.main()

John
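
Editor's note for readers on current Python: the standard library has since gained this battery. Python 3.4 and later provide html.unescape(), and html.escape() covers the encoding direction. The sketch below runs the same kinds of input as the tests above through those functions; it assumes Python 3.4 or newer and is not part of the original thread.

```python
import html

# Named, decimal and hexadecimal references are all decoded.
print(html.unescape("&amp; &lt; &mdash; &#8212; &#x2014;"))

# A legacy entity with a missing semicolon is decoded as well.
print(html.unescape("tr&eacutes mal"))

# Unknown entity names are left intact.
print(html.unescape("&nonexistent;"))

# Encoding direction: &, <, > and (by default) quotes are escaped.
print(html.escape('<a href="x">Tom & Jerry</a>'))
```

html.unescape() follows the HTML5 rules for character references, which is why the missing-semicolon case discussed above works for the legacy entity names.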

Jun 6 '07 #9
