unicode html

Hi, I've found lots of material on the net about unicode html
conversions, but I'm still having many problems converting unicode
characters to html entities. Is there any available function to solve
this issue?
As an example I would like to do this kind of conversion:
\uc3B4 = &ocirc;
for all available html entities.

thanks,
lorenzo

Jul 17 '06 #1

lo**************@gmail.com wrote:
Hi, I've found lots of material on the net about unicode html
conversions, but still i'm having many problems converting unicode
characters to html entities. Is there any available function to solve
this issue?
As an example I would like to do this kind of conversion:
\uc3B4 = &ocirc;
for all available html entities.

thanks,
lorenzo
no expertise with unicode issues but using 'pytextile' at the minute
which converts non-ascii to (numeric) html entities. It does something
like:
>>> s = unicode('\xe7', encoding='latin-1')
>>> s
u'\xe7'
>>> print s
ç
>>> print s.encode('ascii', 'xmlcharrefreplace')
&#231;
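For readers on Python 3, where `unicode()` and the `print` statement are gone, the same codec error handler can be sketched roughly as below. The helper name `to_numeric_entities` is made up for illustration; only `str.encode` with the `xmlcharrefreplace` handler comes from the standard library.

```python
def to_numeric_entities(text):
    """Replace every non-ASCII character with a decimal character reference."""
    # Encoding to ASCII with 'xmlcharrefreplace' turns each character that
    # ASCII cannot represent into '&#NNN;'. Decoding gives back a str.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_numeric_entities("gar\u00e7on"))  # prints gar&#231;on
```

ASCII characters, including any markup already in the string, pass through unchanged.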
http://wiki.python.org/moin/PyTextile
hth

Gerard

Jul 17 '06 #2
Jim

Sybren Stuvel wrote:
lo**************@gmail.com enlightened us with:
As an example I would like to do this kind of conversion:
\uc3B4 = &ocirc;
for all available html entities.

Why would you want that? Just make sure you declare your document as
UTF-8, encode it as such, and you're done. Much easier.
For example, I am programming a script that makes html pages, but I do
not have the ability to change the "Content-Type .. charset=.." line
that is sent preceding those pages.

Jim

Jul 17 '06 #3
Jim
Sybren Stuvel wrote:
Jim enlightened us with:
For example, I am programming a script that makes html pages, but I
do not have the ability to change the "Content-Type .. charset=.."
line that is sent preceding those pages.

"line"? Are you talking about the HTTP header? If it is wrong, it
should be corrected. If you are in control of the content, you should
also be in control of the Content-Type header. Otherwise, use a <meta>
tag that describes the content.
Ah, but I cannot change it. It is not my machine and the folks who own
the machine perceive that the charset line that they use is the right
one for them. (Many people ship pages off this machine.)

Unfortunately, the <meta> tag idea also does not fly: see
http://www.w3.org/TR/html4/charset.html
in section 5.2.2 where it states that in a contest the charset
parameter wins.

My only point is that things are complicated and that there are times
when HTML entities are the answer (or anyway, an answer).

Jim

Jul 17 '06 #4
Jim
Sybren Stuvel wrote:
Jim enlightened us with:
Ah, but I cannot change it. It is not my machine and the folks who
own the machine perceive that the charset line that they use is the
right one for them.

Well, _you_ are the one providing the content, aren't you?
? This site has many people operating off of it (it is
sourceforge-like) and the operators (who are volunteers) are kind
enough to let us use it in the first place. I presume that they think
the charset line that they use is the one that most people want.
Probably if they changed it then someone else would complain.
Sounds like they either don't know what they are talking about, or use
incompetent software. With Apache, it's very easy to give every
directory its own default character encoding header.
I am operating under constraints. Asking the operators of the site has
led to the understanding that I must work with the charset parameter
that I have. That is, I have an environment in which I must work, and
whether you or I think the people providing the service should do it
differently doesn't matter. I replied originally because I thought I
could give an example of HTML entities providing a way that I can solve
the problem that is entirely under my control.
Unfortunately, the <meta> tag idea also does not fly: see
http://www.w3.org/TR/html4/charset.html in section 5.2.2 where it
states that in a contest the charset parameter wins.

I assume that with "the charset parameter" you mean "the HTTP header",
as the <meta> tag also has a "charset parameter".
AIUI "charset parameter" is the language of the HTML standard that I
referred to. For the meta tag, I at least would use "charset
attribute".
My only point is that things are complicated

Call me thick, but from my point of view they aren't.
;-)

Jim

Jul 17 '06 #5
Hi, I've found lots of material on the net about unicode html
conversions, but still i'm having many problems converting unicode
characters to html entities. Is there any available function to solve
this issue?
As an example I would like to do this kind of conversion:
\uc3B4 =&ocirc;
'&#%d;' % ord(u'\u0430')

or

'&#x%x;' % ord(u'\u0430')
for all available html entities.
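The two format strings above handle one character at a time. A minimal sketch of applying them across a whole string, in Python 3 syntax, might look like this. The names `NON_ASCII` and `numeric_refs` are hypothetical and not from any library.

```python
import re

# Matches any single character outside the ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def numeric_refs(text, hexadecimal=False):
    """Replace non-ASCII characters with decimal or hexadecimal references."""
    template = "&#x%x;" if hexadecimal else "&#%d;"
    # ord() gives the code point; the template formats it as a reference.
    return NON_ASCII.sub(lambda m: template % ord(m.group()), text)


print(numeric_refs("\u0430"))                    # prints &#1072;
print(numeric_refs("\u0430", hexadecimal=True))  # prints &#x430;
```

Decimal and hexadecimal references name the same code point, so either form should do.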

--
damjan
Jul 17 '06 #6
lo**************@gmail.com wrote:
Hi, I've found lots of material on the net about unicode html
conversions, but still i'm having many problems converting unicode
characters to html entities. Is there any available function to solve
this issue?
As an example I would like to do this kind of conversion:
\uc3B4 =&ocirc;
for all available html entities.
I don't know how you generate your HTML, but ElementTree and lxml both have
good HTML parsers, so that you can let them write out the result with an
"US-ASCII" encoding and they will generate numeric entities for everything
that's not ASCII.
>>> from lxml import etree
>>> root = etree.HTML(my_html_data)
>>> html_7_bit = etree.tostring(root, "us-ascii")
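For anyone without lxml installed, a similar effect can be sketched with the standard library's `xml.etree.ElementTree` in Python 3. Note the assumption: unlike `etree.HTML`, this parser only accepts well-formed XML/XHTML, so it is not a drop-in replacement for real-world HTML. The function name `to_ascii_markup` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def to_ascii_markup(xhtml):
    """Parse well-formed markup and serialize it as 7-bit ASCII."""
    root = ET.fromstring(xhtml)
    # Serializing as US-ASCII makes ElementTree write every non-ASCII
    # character as a numeric character reference.
    return ET.tostring(root, encoding="us-ascii").decode("ascii")


print(to_ascii_markup("<p>caf\u00e9</p>"))  # prints <p>caf&#233;</p>
```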
Stefan
Jul 18 '06 #7
wrote:
As an example I would like to do this kind of conversion:
\uc3B4 =&ocirc;
for all available html entities.
>>> u"\u3cB4".encode('ascii', 'xmlcharrefreplace')
'&#15540;'

Don't bother using named entities. If you encode your unicode as ascii
replacing all non-ascii characters with the xml entity reference then your
pages will display fine whatever encoding is specified in the HTTP headers.
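If named entities are wanted anyway, as the original question asked, a rough Python 3 sketch could use the standard `html.entities.codepoint2name` table (the module was called `htmlentitydefs` in Python 2). Two assumptions here: the function name `named_entities` is hypothetical, and the guess that the `\uc3B4` in the question came from the UTF-8 bytes C3 B4 of ô (U+00F4) is only a guess.

```python
from html.entities import codepoint2name


def named_entities(text):
    """Use a named entity where HTML 4 defines one, else a numeric reference."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)  # leave ASCII, including markup, untouched
        elif cp in codepoint2name:
            out.append("&%s;" % codepoint2name[cp])
        else:
            out.append("&#%d;" % cp)
    return "".join(out)


# The bytes C3 B4 are the UTF-8 encoding of U+00F4, so decode them first.
o_circumflex = b"\xc3\xb4".decode("utf-8")
print(named_entities(o_circumflex))  # prints &ocirc;
print(named_entities("\u0430"))      # prints &#1072; (no HTML 4 name exists)
```

The numeric fallback is needed because the HTML 4 table only names a couple of hundred characters.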
Jul 18 '06 #8
Sybren Stuvel wrote:
Duncan Booth enlightened us with:
Don't bother using named entities. If you encode your unicode as
ascii replacing all non-ascii characters with the xml entity
reference then your pages will display fine whatever encoding is
specified in the HTTP headers.

Which means OP can't use Unicode/UTF-8 entity references, since that's
not specified in the HTTP header.
That doesn't matter, character references are not affected by the network
encoding.

From http://www.w3.org/TR/html4/charset.html#h-5.3.1
5.3.1 Numeric character references

Numeric character references specify the code position of a character
in the document character set.
The character references use the *document character set*, which is
independent of the character encoding used for network transmission. This
is defined for HTML as ISO10646, and (section 5.1) "The character set
defined in [ISO10646] is character-by-character equivalent to Unicode
([UNICODE])".
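A small Python 3 sketch can illustrate the point, under the assumption that the server declares some ASCII-compatible charset. The sample text and the list of charsets are arbitrary choices for the demonstration.

```python
import html

original = "\u00f4 and \u0430"

# Only ASCII bytes go over the wire; everything else becomes a reference.
wire_bytes = original.encode("ascii", "xmlcharrefreplace")

# Any ASCII-compatible charset decodes those bytes to the same markup.
decoded = {
    wire_bytes.decode(charset)
    for charset in ("ascii", "iso-8859-1", "utf-8", "cp1252")
}
assert len(decoded) == 1

markup = decoded.pop()
print(markup)                             # prints &#244; and &#1072;
# Resolving the references against Unicode recovers the original text.
print(html.unescape(markup) == original)  # prints True
```

The references are resolved after decoding, against the document character set, which is why the declared transport encoding does not change the result.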
Jul 18 '06 #9
