encodeURI and unicode

Csaba Gabor

If I do alert(encodeURI(String.fromCharCode(250)));
(in FF 1.5+ or IE6 on my winXP Pro) then I get: %C3%BA

Now I was sort of expecting something like %u... (and a single (4
digit?) unicode hex character num). Is that something for the future,
or am I guaranteed that all % encodings (from encodeURI) will have
exactly two hex digits following?

Perhaps someone could shed some light on this or point me to quality
site. Be gentle, I know almost nothing about unicode.

Thanks,
Csaba Gabor from Vienna
alert(encodeURI(String.fromCharCode(2500))) => %E0%A7%84
alert(encodeURI(String.fromCharCode(25000))) => %E6%86%A8
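
For comparison, the %u... form asked about above is what the older global escape() function produces; encodeURI percent-escapes UTF-8 bytes instead, so each of its escapes is "%" plus exactly two hex digits. A minimal sketch, assuming an engine that still ships the legacy escape() (the variable names are only for illustration):

```javascript
// escape() is a legacy function (ECMAScript Annex B); shown only for comparison.
var ch = String.fromCharCode(2500);   // U+09C4

var legacy = escape(ch);              // uses the %uXXXX form for code units above 0xFF
var uri = encodeURI(ch);              // percent-escapes each UTF-8 byte

// Check that the encodeURI result consists only of two-digit escapes.
var onlyTwoDigitEscapes = /^(?:%[0-9A-F]{2})+$/.test(uri);
```

Here legacy holds the single %u escape and uri holds the three-byte form quoted above.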

Mar 17 '06 #1


Csaba Gabor

Csaba Gabor wrote:

If I do alert(encodeURI(String.fromCharCode(250)));
(in FF 1.5+ or IE6 on my winXP Pro) then I get: %C3%BA

Now I was sort of expecting something like %u... (and a single (4
digit?) unicode hex character num). Is that something for the future,

OK, I think I have most of it now. I was confusing encodeURI with what
I had earlier read at this site:
http://html.megalink.com/programmer/...sTabChars.html

but that is covering how to specify javascript (1.3) strings and not
what happens with encodeURI. I presume this is a reflection of the
spec that browsers must follow in transmitting information to servers.
Still, I was a little surprised.

Here is another interesting point:
var a=String.fromCharCode(131071);
alert(a.charCodeAt(0)+"\n"+a);

That code shows a char code of 65535, and if I use 131072 then the char
code goes to 0. In other words, it wraps.
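
The wrapping can be checked directly: String.fromCharCode keeps only the low 16 bits of each argument. The sketch below also shows String.fromCodePoint, which is an ES2015 addition (newer than this thread) and does not wrap; the variable names are only for illustration.

```javascript
// String.fromCharCode keeps only the low 16 bits of each argument.
var a = String.fromCharCode(131071);   // 0x1FFFF
var b = String.fromCharCode(131072);   // 0x20000

var codeA = a.charCodeAt(0);           // 0x1FFFF & 0xFFFF
var codeB = b.charCodeAt(0);           // 0x20000 & 0xFFFF

// String.fromCodePoint does not wrap; above 0xFFFF it returns
// a surrogate pair, i.e. a string of two 16-bit code units.
var c = String.fromCodePoint(131071);
var unitsInC = c.length;
var codePointC = c.codePointAt(0);
```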

I just have one question at this point. As I mentioned in my original
post,
String.fromCharCode(2500) == "\u09C4" => %E0%A7%84
The first equivalence is easy since 9C4 is the hex representation of
(decimal) 2500. But how do we get to the encodeURI output on the
right?

Csaba

Mar 17 '06 #2

Thomas 'PointedEars' Lahn

Csaba Gabor wrote:

I just have one question at this point. As I mentioned in my original
post, String.fromCharCode(2500) == "\u09C4" => %E0%A7%84
The first equivalence is easy since 9C4 is the hex representation of
(decimal) 2500. But how do we get to the encodeURI output on the
right?

Those are percent-escaped representations of the three UTF-8 code
units that are required to encode the Unicode character at code
point U+09C4. See also ECMAScript 3 Final, subsection 15.1.3, and
<URL:http://people.w3.org/rishida/scripts/uniview/conversion>.
PointedEars
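
To make the "percent-escaped UTF-8 code units" point concrete, here is a minimal sketch that escapes every UTF-8 byte of a string by hand. It assumes an environment with the TextEncoder API (current browsers and Node.js); percentEscapeUtf8 is a hypothetical helper name, not something from the spec or from the linked page.

```javascript
// Percent-escape every UTF-8 byte of a string. Unlike encodeURI, this
// escapes ASCII characters too, so the two only agree for characters
// that encodeURI actually escapes.
function percentEscapeUtf8(str) {
  var bytes = new TextEncoder().encode(str);   // Uint8Array of UTF-8 bytes
  var out = "";
  for (var i = 0; i < bytes.length; i++) {
    var hex = bytes[i].toString(16).toUpperCase();
    out += "%" + (hex.length < 2 ? "0" + hex : hex);
  }
  return out;
}

var byHand = percentEscapeUtf8("\u09C4");
var builtIn = encodeURI("\u09C4");
```

For U+09C4 both byHand and builtIn come out as the same three escapes.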

Mar 18 '06 #3

Csaba Gabor

Thomas 'PointedEars' Lahn wrote:

Csaba Gabor wrote:
I just have one question at this point. As I mentioned in my original
post, String.fromCharCode(2500) == "\u09C4" => %E0%A7%84

Those are percent-escaped representations of the three UTF-8 code ...
<URL:http://people.w3.org/rishida/scripts/uniview/conversion>.

Thanks Thomas, I like links.
It let me figure out the unicode / UTF8 mapping.
He's got a function, convertCP2UTF8(spaceSeparatedHexValues) that does
essentially:

n = ...unicodeValue ...
if (n <= 0x7F) return dec2hex2(n);
else if (n <= 0x7FF) return
dec2hex2(0xC0 | ((n>>6) & 0x1F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else if (n <= 0xFFFF) return
dec2hex2(0xE0 | ((n>>12) & 0x0F)) + ' ' +
dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else if (n <= 0x10FFFF) return
dec2hex2(0xF0 | ((n>>18) & 0x07)) + ' ' +
dec2hex2(0x80 | ((n>>12) & 0x3F)) + ' ' +
dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else return '!erreur ' + dec2hex(n);
In words: If your positive integer (the char code) is not less than
17*16^4, report an error,
and if it is 7 bits or less (in the range [0,2^7), that is), just
return the two digit hex representation.

Otherwise, let k be the number of bits in your number. That is to say,
k is the smallest integer such that 2^k is greater than your number
(e.g. [2^(k-1),2^k)->k; [128,256)->8; [8,16)->4; [4,8)->3; [2,4)->2;
1->1; 0->0). Now, starting at the low end, section the number into
m=ceiling((k-1)/5) groups of 6 bits, with any leftovers in the final
(high) group. Prefix all but the high groups with (bits) 10 (that is
to say, OR them with (hex) 80). Prefix the high group with the m+1
bits corresponding to 2^(m+1)-2. That is to say, prefix the first
group of 2 with (bits) 110, the first group of 3 with 1110, or the
first group of 4 with 11110.

Thus, if your number has 7 bits or less, it takes two hex digits to
represent. From 8 to 11 (inclusive) it takes four hex digits, from 12
to 16 (inclusive) it takes six, and from 17 to 21 (inclusive) bits it
takes eight hex digits to represent.

Example: 2500 -> 0x9C4 ->
1001 1100 0100 so k=12 and m=3 ->
(0000) 100111 000100 (that first group got no bits so it is implied) ->
(1110)0000 (10)100111 (10)000100 ->
E0 A7 84
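
The bit-splitting described above can be sketched as runnable JavaScript. This is an illustration under assumptions, not the code from the linked page: hex2 stands in for its dec2hex2, codePointToUtf8Hex is a made-up name, and surrogate values (0xD800 to 0xDFFF) are not rejected. Each return keeps the start of its expression on the same line, since JavaScript would otherwise end the statement at the line break.

```javascript
// Two-digit uppercase hex for one byte (stand-in for dec2hex2).
function hex2(byte) {
  var s = byte.toString(16).toUpperCase();
  return s.length < 2 ? "0" + s : s;
}

// Encode one code point as space-separated UTF-8 bytes in hex.
function codePointToUtf8Hex(n) {
  if (n < 0 || n > 0x10FFFF) {
    throw new RangeError("not a Unicode code point: " + n);
  }
  if (n <= 0x7F) {                       // up to 7 bits: one byte
    return hex2(n);
  }
  if (n <= 0x7FF) {                      // 8 to 11 bits: 110xxxxx 10xxxxxx
    return hex2(0xC0 | (n >> 6)) + " " +
           hex2(0x80 | (n & 0x3F));
  }
  if (n <= 0xFFFF) {                     // 12 to 16 bits: 1110xxxx + two 10xxxxxx
    return hex2(0xE0 | (n >> 12)) + " " +
           hex2(0x80 | ((n >> 6) & 0x3F)) + " " +
           hex2(0x80 | (n & 0x3F));
  }
  return hex2(0xF0 | (n >> 18)) + " " +  // 17 to 21 bits: 11110xxx + three 10xxxxxx
         hex2(0x80 | ((n >> 12) & 0x3F)) + " " +
         hex2(0x80 | ((n >> 6) & 0x3F)) + " " +
         hex2(0x80 | (n & 0x3F));
}

var example = codePointToUtf8Hex(2500);  // the worked example above
```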

With this it's also easy to see how to work from UTF-8 to unicode.
Given a byte, scan from the high (left) side for the first 0 bit.
If the high bit is 0, you are done and you have a "normal" character.
Otherwise, the character is specified by the next m bytes (including
the one the scan started with), where m is the number of 1s
encountered before finding that first 0 bit. Knock out all the bits
up to and including the first 0 bit, and the top 2 bits of all the
rest, and concatenate the remaining bits to get the char code.
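
The reverse direction can be sketched the same way. In UTF-8 the number of leading 1 bits in the first byte gives the total length of the sequence. The function below is a hypothetical helper for illustration; it checks the byte prefixes but does not reject overlong forms or surrogate values, which a strict decoder would.

```javascript
// Decode one UTF-8 sequence starting at bytes[start].
// Returns the code point and the number of bytes consumed.
function decodeUtf8Sequence(bytes, start) {
  var lead = bytes[start];
  // Count the 1 bits before the first 0 bit of the lead byte.
  var ones = 0;
  while (ones < 8 && (lead & (0x80 >> ones)) !== 0) {
    ones++;
  }
  if (ones === 0) {
    return { codePoint: lead, length: 1 };   // plain 7-bit character
  }
  if (ones === 1 || ones > 4) {
    throw new Error("invalid lead byte: " + lead);
  }
  // Drop the prefix bits (the 1s and the 0 after them) from the lead byte.
  var cp = lead & (0xFF >> (ones + 1));
  for (var i = 1; i < ones; i++) {
    var next = bytes[start + i];
    if ((next & 0xC0) !== 0x80) {
      throw new Error("invalid continuation byte at index " + (start + i));
    }
    cp = (cp << 6) | (next & 0x3F);          // drop the top two bits, keep six
  }
  return { codePoint: cp, length: ones };
}

var decoded = decodeUtf8Sequence([0xE0, 0xA7, 0x84], 0);
```

For the bytes E0 A7 84 this gives back code point 2500 with a length of 3.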

Thus, we see the correspondence between UTF8 and unicode
Csaba
I found the following sites useful for seeing mappings and glyphs:
http://www.unicode.org/charts/About.html and
http://www.macchiato.com/unicode/chart/

Mar 18 '06 #4

Thomas 'PointedEars' Lahn

Csaba Gabor wrote:

Thomas 'PointedEars' Lahn wrote:
Csaba Gabor wrote:
> I just have one question at this point. As I mentioned in my original
> post, String.fromCharCode(2500) == "\u09C4" => %E0%A7%84
Those are percent-escaped representations of the three UTF-8 code ...
Would you please at least try to retain context in quotations?

<URL:http://learn.to/quote>
<URL:http://people.w3.org/rishida/scripts/uniview/conversion>.

Thanks Thomas, I like links.
It let me figure out the unicode / UTF8 mapping.
He's got a function, convertCP2UTF8(spaceSeparatedHexValues) that does
essentially:

n = ...unicodeValue ...
if (n <= 0x7F) return dec2hex2(n);
else if (n <= 0x7FF) return
dec2hex2(0xC0 | ((n>>6) & 0x1F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else if (n <= 0xFFFF) return
dec2hex2(0xE0 | ((n>>12) & 0x0F)) + ' ' +
dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else if (n <= 0x10FFFF) return
dec2hex2(0xF0 | ((n>>18) & 0x07)) + ' ' +
dec2hex2(0x80 | ((n>>12) & 0x3F)) + ' ' +
dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else return '!erreur ' + dec2hex(n);

In words: If your positive integer (the char code) is not less
than 17*16^4, report an error,

Yes. The error is reported if the value is greater than or equal to
0x110000, because The Unicode Standard, version 4.0, does not provide
for more than 1114112 code points, starting with code point U+0000.
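
The arithmetic behind that limit can be checked with a short sketch. It uses String.fromCodePoint, an ES2015 addition that postdates this thread, and the variable names are only for illustration.

```javascript
var limit = 17 * Math.pow(16, 4);               // 17 planes of 65536 code points
var highest = String.fromCodePoint(limit - 1);  // U+10FFFF is still valid

var rejected = false;
try {
  String.fromCodePoint(limit);                  // one past the last code point
} catch (e) {
  rejected = e instanceof RangeError;
}
```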

(BTW: You have mis-wrapped your abstraction of the original source
code; a trailing `return' statement would only return `undefined',
not the evaluated value of the following lines.)
and if it is 7 bits or less (in the range [0,2^7), that is), just
return the two digit hex representation.
Yes. One (8-bit) UTF-8 code unit suffices to encode Unicode characters
at these code points.
[...]
With this it's also easy to see how to work from UTF-8 to unicode.
[...]
Thus, we see the correspondence between UTF8 and unicode
You are not making any sense. `n' is assigned the code point (CP)
number, which is then converted into UTF-8 code units, according to
the algorithms specified in The Unicode Standard, version 4.0.

<URL:http://en.wikipedia.org/wiki/Unicode>
[...]
I found the following sites useful for seeing mappings and glyphs:
http://www.unicode.org/charts/About.html and
http://www.macchiato.com/unicode/chart/

But obviously you have not found <URL:http://unicode.org/faq/> yet.
Please make it so.
PointedEars

Mar 18 '06 #5

Csaba Gabor

Thomas 'PointedEars' Lahn wrote:

Would you please at least try to retain context in quotations?

I did.

You are not making any sense. `n' is assigned the code point (CP)
number, which is then converted into UTF-8 code units, according to
the algorithms specified in The Unicode Standard, version 4.0.

Sorry you didn't get it. It seems I was spot on in showing how to go
from CP number to the UTF-8 code units and back, as can be verified at
the nice
http://en.wikipedia.org/wiki/UTF-8

Csaba

Mar 19 '06 #6

Thomas 'PointedEars' Lahn

Csaba Gabor wrote:

Thomas 'PointedEars' Lahn wrote:
Would you please at least try to retain context in quotations?
I did.

You did not. I wrote (at least):

| Those are percent-escaped representations of the three UTF-8 code
| units that are required to encode the Unicode character at code
| point U+09C4. [...]
| <URL:http://people.w3.org/rishida/scripts/uniview/conversion>.

You quoted me:

| > Those are percent-escaped representations of the three UTF-8 code ...
| > <URL:http://people.w3.org/rishida/scripts/uniview/conversion>.

You call that /retaining/ context? You even removed the "units" word.

You are not making any sense. `n' is assigned the code point (CP)
number, which is then converted into UTF-8 code units, according to
the algorithms specified in The Unicode Standard, version 4.0.

Thank you for destroying the context again.
Sorry you didn't get it.
YMMD.
It seems I was spot on in showing how to go from CP number to
the UTF-8 code units and back, as can be verified at the nice
http://en.wikipedia.org/wiki/UTF-8

What you think it seemed, and what you actually meant, is not relevant
regarding the question whether you have been making sense or not. You
said this shows the relation between Unicode and UTF-8, which is nonsense,
because the relation has always been there. UTF-8 is one possible encoding
to encode Unicode characters.

Better express yourself next time, this way you can avoid misunderstandings.
Score adjusted

PointedEars

Mar 19 '06 #7

Csaba Gabor

Thomas 'PointedEars' Lahn wrote:

Csaba Gabor wrote:
Thomas 'PointedEars' Lahn wrote:
Would you please at least try to retain context in quotations?
I did.

You did not. I wrote (at least):

In fact, I did try. You are not an authority on me, so I would
appreciate it if you would refrain from making assertions about
things you cannot know.
| Those are percent-escaped representations of the three UTF-8 code
| units that are required to encode the Unicode character at code
| point U+09C4. [...]
| <URL:http://people.w3.org/rishida/scripts/uniview/conversion>.

You quoted me:

| > Those are percent-escaped representations of the three UTF-8 code ...
| > <URL:http://people.w3.org/rishida/scripts/uniview/conversion>.

You call that /retaining/ context? You even removed the ....

Yes.
Upon review, I find that I quoted exactly what I wanted to quote.

You are not making any sense. `n' is assigned the code point (CP)
number, which is then converted into UTF-8 code units, according to
the algorithms specified in The Unicode Standard, version 4.0.

Thank you for destroying the context again.
Sorry you didn't get it.

YMMD.
It seems I was spot on in showing how to go from CP number to
the UTF-8 code units and back, as can be verified at the nice
http://en.wikipedia.org/wiki/UTF-8

What you think it seemed, and what you actually meant, is not relevant
regarding the question whether you have been making sense or not.

In fact it is, since making sense is always subjective.
You said this shows the relation between Unicode and UTF-8, which is nonsense,
Really? Care to offer a quote for your assertion about what I said?
I never even used the word relationship in this thread.
because the relation has always been there. UTF-8 is one possible encoding
to encode Unicode characters.

Better express yourself next time, this way you can avoid misunderstandings.

Now that I have expressed myself, you might consider
expressing yourself better next time.
In particular, ordering and making demands on people is neither polite,
nor very effective on newsgroups where there is no means of enforcement.
If there is something that you would like to see done differently, then
it might be more expedient to point out what bothers you about it, and
suggest what would make you happier. Just saying "Don't" or "That was
nonsense" is not very constructive in forestalling future occurrences.

Csaba Gabor from Vienna

Apr 20 '06 #8
