unicode html

lorenzo.viscanti

Hi, I've found lots of material on the net about unicode html
conversions, but still i'm having many problems converting unicode
characters to html entities. Is there any available function to solve
this issue?
As an example I would like to do this kind of conversion:
\uc3B4 => &ocirc;
for all available html entities.

thanks,
lorenzo

Jul 17 '06 #1


Gerard Flanagan

lo**************@gmail.com wrote:

> Hi, I've found lots of material on the net about unicode html
> conversions, but still i'm having many problems converting unicode
> characters to html entities. Is there any available function to solve
> this issue?
> As an example I would like to do this kind of conversion:
> \uc3B4 => &ocirc;
> for all available html entities.
>
> thanks,
> lorenzo

no expertise with unicode issues but using 'pytextile' at the minute
which converts non-ascii to (numeric) html entities. It does something
like:

>>> s = unicode('\xe7', encoding='latin-1')
>>> s
u'\xe7'
>>> print s
ç
>>> print s.encode('ascii', 'xmlcharrefreplace')
&#231;

http://wiki.python.org/moin/PyTextile
hth

Gerard
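
For readers on Python 3, here is a minimal sketch of the same idea. It assumes Python 3, where `str` is already Unicode and the `unicode()` call is not needed; the helper name `to_ascii_html` is made up for illustration.

```python
# Python 3 sketch (assumption: the session above is Python 2).
def to_ascii_html(text):
    """Return text as ASCII, with non-ASCII characters as numeric references."""
    # encode() yields bytes; decode back to str so it can be placed in a page
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(to_ascii_html("fa\u00e7ade"))  # fa&#231;ade
```

Note that this only replaces non-ASCII characters; it does not escape `&`, `<` or `>`.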

Jul 17 '06 #2

Jim

Sybren Stuvel wrote:

> lo**************@gmail.com enlightened us with:
>> As an example I would like to do this kind of conversion:
>> \uc3B4 => &ocirc;
>> for all available html entities.
>
> Why would you want that? Just make sure you declare your document as
> UTF-8, encode it as such, and you're done. Much easier.

For example, I am programming a script that makes html pages, but I do
not have the ability to change the "Content-Type .. charset=.." line
that is sent preceding those pages.

Jim

Jul 17 '06 #3

Jim

Sybren Stuvel wrote:

> Jim enlightened us with:
>> For example, I am programming a script that makes html pages, but I
>> do not have the ability to change the "Content-Type .. charset=.."
>> line that is sent preceding those pages.
>
> "line"? Are you talking about the HTTP header? If it is wrong, it
> should be corrected. If you are in control of the content, you should
> also be in control of the Content-Type header. Otherwise, use a <meta>
> tag that describes the content.

Ah, but I cannot change it. It is not my machine and the folks who own
the machine perceive that the charset line that they use is the right
one for them. (Many people ship pages off this machine.)

Unfortunately, the <meta> tag idea also does not fly: see
http://www.w3.org/TR/html4/charset.html
in section 5.2.2 where it states that in a contest the charset
parameter wins.

My only point is that things are complicated and that there are times
when HTML entities are the answer (or anyway, an answer).

Jim

Jul 17 '06 #4

Jim

Sybren Stuvel wrote:

> Jim enlightened us with:
>> Ah, but I cannot change it. It is not my machine and the folks who
>> own the machine perceive that the charset line that they use is the
>> right one for them.
>
> Well, _you_ are the one providing the content, aren't you?

? This site has many people operating off of it (it is
sourceforge-like) and the operators (who are volunteers) are kind
enough to let us use it in the first place. I presume that they think
the charset line that they use is the one that most people want.
Probably if they changed it then someone else would complain.

> Sounds like they either don't know what they are talking about, or use
> incompetent software. With Apache, it's very easy to give every
> directory its own default character encoding header.

I am operating under constraints. Asking the operators of the site has
led to the understanding that I must work with the charset parameter
that I have. That is, I have an environment in which I must work, and
whether you or I think the people providing the service should do it
differently doesn't matter. I replied originally because I thought I
could give an example of HTML entities providing a way that I can solve
the problem that is entirely under my control.

>> Unfortunately, the <meta> tag idea also does not fly: see
>> http://www.w3.org/TR/html4/charset.html in section 5.2.2 where it
>> states that in a contest the charset parameter wins.
>
> I assume that with "the charset parameter" you mean "the HTTP header",
> as the <meta> tag also has a "charset parameter".

AIUI "charset parameter" is the language of the HTML standard that I
referred to. For the meta tag, I at least would use "charset
attribute".

>> My only point is that things are complicated
>
> Call me thick, but from my point of view they aren't.

;-)

Jim

Jul 17 '06 #5

Damjan

> Hi, I've found lots of material on the net about unicode html
> conversions, but still i'm having many problems converting unicode
> characters to html entities. Is there any available function to solve
> this issue?
> As an example I would like to do this kind of conversion:
> \uc3B4 => &ocirc;

'&#%d;' % ord(u'\u0430')

or

'&#x%x;' % ord(u'\u0430')

> for all available html entities.

--
damjan
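
The one-character formatting above can be extended to a whole string. Below is a rough sketch, assuming Python 3 and its standard `html.entities.codepoint2name` table (the Python 2 equivalent lives in `htmlentitydefs`). The function name `to_entities` is hypothetical. It emits a named entity where the HTML 4 table has one and falls back to a decimal reference otherwise.

```python
from html.entities import codepoint2name


def to_entities(text):
    """Replace each non-ASCII character with a named or numeric reference."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # ASCII passes through untouched (markup characters are NOT escaped)
            parts.append(ch)
        elif cp in codepoint2name:
            # named entity from the HTML 4 table, e.g. U+00F4 -> &ocirc;
            parts.append("&%s;" % codepoint2name[cp])
        else:
            # no named entity available: decimal numeric reference
            parts.append("&#%d;" % cp)
    return "".join(parts)


print(to_entities("\u00f4 \u0430"))  # &ocirc; &#1072;
```

The fallback matters because only a few hundred characters have HTML 4 names; Cyrillic letters such as U+0430, for instance, have none.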

Jul 17 '06 #6

Stefan Behnel

lo**************@gmail.com wrote:

> Hi, I've found lots of material on the net about unicode html
> conversions, but still i'm having many problems converting unicode
> characters to html entities. Is there any available function to solve
> this issue?
> As an example I would like to do this kind of conversion:
> \uc3B4 => &ocirc;
> for all available html entities.

I don't know how you generate your HTML, but ElementTree and lxml both have
good HTML parsers, so that you can let them write out the result with an
"US-ASCII" encoding and they will generate numeric entities for everything
that's not ASCII.

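>>> from lxml import etree
>>> root = etree.HTML(my_html_data)  # my_html_data: your HTML source string
>>> # current lxml releases take the encoding as a keyword argument
>>> html_7_bit = etree.tostring(root, encoding="us-ascii")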

Stefan

Jul 18 '06 #7

Duncan Booth

wrote:

> As an example I would like to do this kind of conversion:
> \uc3B4 => &ocirc;
> for all available html entities.

>>> u"\u3cB4".encode('ascii', 'xmlcharrefreplace')
'&#15540;'

Don't bother using named entities. If you encode your unicode as ascii
replacing all non-ascii characters with the xml entity reference then your
pages will display fine whatever encoding is specified in the HTTP headers.

Jul 18 '06 #8

Duncan Booth

Sybren Stuvel wrote:

> Duncan Booth enlightened us with:
>> Don't bother using named entities. If you encode your unicode as
>> ascii replacing all non-ascii characters with the xml entity
>> reference then your pages will display fine whatever encoding is
>> specified in the HTTP headers.
>
> Which means OP can't use Unicode/UTF-8 entity references, since that's
> not specified in the HTTP header.

That doesn't matter, character references are not affected by the network
encoding.

From http://www.w3.org/TR/html4/charset.html#h-5.3.1

> 5.3.1 Numeric character references
>
> Numeric character references specify the code position of a character
> in the document character set.

The character references use the *document character set*, which is
independent of the character encoding used for network transmission. This
is defined for HTML as ISO10646, and (section 5.1) "The character set
defined in [ISO10646] is character-by-character equivalent to Unicode
([UNICODE])".
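
A small check of that claim, as a sketch assuming Python 3: once the text is pure ASCII with numeric references, the bytes decode to the same string under any ASCII-compatible charset, and the references still resolve to the original Unicode characters. Here `html.unescape` stands in for the browser's reference handling, and the charset list is just a sample.

```python
import html

text = "\u00f4 \u3cb4"
payload = text.encode("ascii", "xmlcharrefreplace")  # b'&#244; &#15540;'

# The same ASCII bytes decode identically under ASCII-compatible charsets,
# so a mismatched HTTP charset parameter cannot corrupt them.
for charset in ("ascii", "latin-1", "utf-8", "cp1252"):
    assert payload.decode(charset) == "&#244; &#15540;"

# The references name Unicode code points regardless of the transport charset.
assert html.unescape(payload.decode("latin-1")) == text
```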

Jul 18 '06 #9
