Python and decimal character entities over 128.

Some web feeds use decimal character entities that seem to confuse
Python (or me). For example, the string "doesn't" may be coded as
"doesn&#8217;t", which should produce a right-leaning apostrophe.
Python hates decimal entities beyond 128, so it chokes unless you do
something like string.encode('utf-8'). Even then, what should have
been a right-leaning apostrophe ends up as "â€™". The following script
does just that. Look for the string "The Canuck iPhone: Apple doesnâ€™t
care" after running it.

# coding: UTF-8
import feedparser

s = ''
d = feedparser.parse('http://feeds.feedburner.com/Mathewingramcom/work')
title = d.feed.title
link = d.feed.link
for i in range(0,4):
    title = d.entries[i].title
    link = d.entries[i].link
    s += title +'\n' + link + '\n'

f = open('c:/x/test.txt', 'w')
f.write(s.encode('utf-8'))
f.close()

This useless script is adapted from a "useful" script. Its only
purpose is to ask the Python community how I can deal with decimal
entities > 128. Thanks in advance, Bill
Jul 9 '08 #1
On Wed, 09 Jul 2008 16:39:24 -0700, bsagert wrote:
> Some web feeds use decimal character entities that seem to confuse
> Python (or me).

I guess they confuse you. Python is fine.

> For example, the string "doesn't" may be coded as "doesn&#8217;t" which
> should produce a right leaning apostrophe. Python hates decimal entities
> beyond 128 so it chokes unless you do something like
> string.encode('utf-8').

Python neither hates nor chokes on these entities. It just refuses to
guess which encoding you want if you try to write *unicode* objects into
a file. Files contain byte values, not characters.

> Even then, what should have been a right-leaning apostrophe ends up as
> "â€™". The following script does just that. Look for the string "The
> Canuck iPhone: Apple doesnâ€™t care" after running it.

Then you didn't tell the application you used to look at the result that
the text is UTF-8 encoded. I guess you are using Windows and
the application expects cp1252-encoded text, because a UTF-8-encoded
apostrophe looks like 'â€™' in cp1252.
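
The misreading described here is easy to reproduce. The following is only a minimal sketch in Python 3 syntax (the thread predates it; in Python 2 the literal would be written u'doesn\u2019t'), and the variable names are made up for illustration.

```python
# U+2019 is the character the feed writes as the entity &#8217;.
text = "doesn\u2019t"

# Encoding to UTF-8 turns that single character into three bytes.
utf8_bytes = text.encode("utf-8")

# A viewer that assumes cp1252 maps each of those bytes to its own character,
# which is where the three-character sequence in the output file comes from.
misread = utf8_bytes.decode("cp1252")

# Decoding with the encoding that was actually used restores the text.
restored = utf8_bytes.decode("utf-8")

print(utf8_bytes)  # b'doesn\xe2\x80\x99t'
```

So the bytes in the file are correct; only the viewer's assumption about the encoding is wrong.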

Choose the encoding you want the result to have and everything is fine,
unless you stumble over a feed using characters that can't be encoded
in the encoding of your choice. That's why UTF-8 might have been a good
idea.
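
To make the point about the choice of encoding concrete, here is a small sketch (Python 3 syntax). It assumes nothing about the real feed: the helper can_encode and the temporary file are hypothetical and serve only as illustration.

```python
import os
import tempfile

text = "Apple doesn\u2019t care"

def can_encode(s, encoding):
    """Return True if every character of s exists in the given encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# ASCII and Latin-1 have no U+2019; cp1252 and UTF-8 do.
results = {enc: can_encode(text, enc)
           for enc in ("ascii", "latin-1", "cp1252", "utf-8")}

# Name the encoding explicitly when writing, so nothing has to be guessed.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "rb") as f:
        raw = f.read()

print(results)
```

If the file is meant for a Windows editor, the "utf-8-sig" codec may be worth trying, since it writes a byte order mark that some editors use to recognize UTF-8.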

Ciao,
Marc 'BlackJack' Rintsch
Jul 10 '08 #2
This is a two-fold issue: encodings/charsets and entities. Encodings are
a way to _encode_ the characters of a charset as a sequence of octets.
Entities are a way to avoid a (harder) encoding/decoding process at the
expense of readability: when you type &#8217; no one actually sees the
intended character, but entities are easily encoded in ASCII.
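
A short illustration of that difference, as a sketch in Python 3 syntax with illustrative variable names: the entity form survives a plain ASCII encode, while the real character does not unless it is first turned into an entity.

```python
char_form = "doesn\u2019t"        # contains the actual character U+2019
entity_form = "doesn&#8217;t"     # contains only ASCII characters

# The entity form is pure ASCII, so this never fails.
ascii_bytes = entity_form.encode("ascii")

# The character form needs a real encoding such as UTF-8 ...
utf8_bytes = char_form.encode("utf-8")

# ... or it can be converted to a decimal entity while encoding to ASCII.
escaped = char_form.encode("ascii", "xmlcharrefreplace")

print(escaped)  # b'doesn&#8217;t'
```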

When dealing with multiple sources of information, as your script may
be, I always include a normalization layer that converts everything to
Python's unicode type. Web sites may use whatever encoding they please.

The whole process is like this:
1. Fetch the content.
2. Use whatever clues the content offers to guess the encoding used by the
document, e.g. the Content-Type HTTP header; <meta http-equiv="content-type"
....>; <?xml version="1.0" encoding="utf-8"?>, and so on.
3. If none are present, then use chardet to guess an acceptable decoder.
4. Decode, ignoring those characters that cannot be decoded.
5. The result is further processed to find entities and "decode" them to
actual Unicode characters. (See below)
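
The steps above can be sketched roughly as follows. This is an illustration in Python 3 using only the standard library, not the code from any real project: guess_encoding, normalize, and both regular expressions are hypothetical, fetching is left out, and a fixed UTF-8 fallback stands in for chardet.

```python
import html
import re

# In-document clues; they only work for ASCII-compatible encodings.
XML_DECL = re.compile(br'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')
META_CHARSET = re.compile(br'<meta[^>]*charset=["\']?([A-Za-z0-9._-]+)', re.I)

def guess_encoding(raw, content_type=None, fallback="utf-8"):
    """Step 2: look for a declared encoding; step 3 is reduced to a fallback."""
    if content_type:
        m = re.search(r'charset=["\']?([A-Za-z0-9._-]+)', content_type, re.I)
        if m:
            return m.group(1)
    head = raw[:1024]
    for pattern in (XML_DECL, META_CHARSET):
        m = pattern.search(head)
        if m:
            return m.group(1).decode("ascii")
    return fallback  # a detector such as chardet could be tried here instead

def normalize(raw, content_type=None):
    """Steps 4 and 5: decode leniently, then turn entities into characters."""
    encoding = guess_encoding(raw, content_type)
    try:
        text = raw.decode(encoding, errors="ignore")
    except LookupError:  # the declared encoding name is unknown to Python
        text = raw.decode("utf-8", errors="ignore")
    return html.unescape(text)

sample = b'<?xml version="1.0" encoding="utf-8"?><title>Apple doesn&#8217;t care</title>'
print(ascii(normalize(sample)))
```

In practice the entity step is better applied to extracted text fields (titles, summaries) than to a whole document, because unescaping &lt; and &amp; inside markup would change its structure.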

You may find these helpful:
http://effbot.org/zone/unicode-objects.htm
http://www.mozilla.org/projects/intl...Detection.html
http://www.amk.ca/python/howto/unicode

This is the function I have used to process entities (Python 2):

    from htmlentitydefs import name2codepoint

    def __processhtmlentities__(text):
        assert type(text) is unicode, "Non-normalized text"
        html = []
        (buffer, amp, text) = text.partition('&')
        while amp:
            html.append(buffer)
            (entity, semicolon, text) = text.partition(';')
            if not entity.startswith('#'):
                # Named entity such as &rsquo; (unknown names are dropped)
                if entity in name2codepoint:
                    html.append(unichr(name2codepoint[entity]))
            else:
                # Decimal entity such as &#8217;
                html.append(unichr(int(entity[1:])))
            (buffer, amp, text) = text.partition('&')
        html.append(buffer)
        return u''.join(html)
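
For readers on Python 3, where htmlentitydefs became html.entities and unichr became chr, a comparable sketch might look like the one below. It is not a port endorsed by the original author: the name process_html_entities and the regular expression are assumptions, and unlike the function above it also accepts hexadecimal entities and leaves anything it does not recognize untouched.

```python
import re
from html.entities import name2codepoint

# Decimal (&#8217;), hexadecimal (&#x2019;) and named (&rsquo;) entities.
ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

def process_html_entities(text):
    """Replace recognized entities with the characters they stand for."""
    def replace(match):
        body = match.group(1)
        try:
            if body.startswith(("#x", "#X")):
                return chr(int(body[2:], 16))
            if body.startswith("#"):
                return chr(int(body[1:]))
        except (ValueError, OverflowError):
            return match.group(0)  # number outside the Unicode range
        if body in name2codepoint:
            return chr(name2codepoint[body])
        return match.group(0)      # unknown name: leave the text as it was
    return ENTITY.sub(replace, text)

print(ascii(process_html_entities("Apple doesn&#8217;t care")))
```

Leaving unknown sequences in place means that a stray ampersand, as in "AT&T;", passes through unchanged instead of silently disappearing.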

Best regards,
Manuel.
Jul 10 '08 #3
I don't have an answer for why Python might be mis-handling the data,
but wanted to make a factual correction:

bs*****@gmail.com writes:
> Some web feeds use decimal character entities that seem to confuse
> Python (or me). For example, the string "doesn't" may be coded as
> "doesn&#8217;t" which should produce a right leaning apostrophe.

That character isn't a "right leaning apostrophe"; it has nothing to
do with apostrophes. It is the character called "right single
quotation mark" in <URL:http://www.w3.org/TR/html4/sgml/entities.html>
and in Unicode (code point U+2019).

It's a typographical error to use a quotation mark as an apostrophe.
Use the apostrophe character (U+0027) where an apostrophe is intended,
and quotation mark characters where those are intended.

This is directed, of course, at the person generating that output.
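
The official names of the two characters can be checked with the standard unicodedata module. A minimal sketch in Python 3 syntax; the dictionary name is illustrative only.

```python
import unicodedata

# Map each character to its official Unicode name.
names = {ch: unicodedata.name(ch) for ch in ("\u2019", "\u0027")}

for ch, name in names.items():
    print("U+%04X %s" % (ord(ch), name))
# U+2019 RIGHT SINGLE QUOTATION MARK
# U+0027 APOSTROPHE
```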

Ben Finney
Jul 10 '08 #4
