encodeURI and unicode

If I do alert(encodeURI(String.fromCharCode(250)));
(in FF 1.5+ or IE6 on my winXP Pro) then I get: %C3%BA

Now I was sort of expecting something like %u... (and a single (4
digit?) unicode hex character num). Is that something for the future,
or am I guaranteed that all % encodings (from encodeURI) will have
exactly two hex digits following?

Perhaps someone could shed some light on this or point me to quality
site. Be gentle, I know almost nothing about unicode.

Thanks,
Csaba Gabor from Vienna
alert(encodeURI(String.fromCharCode(2500))) => %E0%A7%84
alert(encodeURI(String.fromCharCode(25000))) => %E6%86%A8
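
For what it's worth, the %u.... form expected here is what the older, now-deprecated escape() function produces. encodeURI instead emits one %XX escape per UTF-8 byte, so each % is followed by exactly two hex digits. A small check along those lines, where isPercentEncodedBytes is a helper name made up for this illustration:

```javascript
// Hypothetical helper: true if s consists solely of %XX triplets,
// i.e. every "%" is followed by exactly two uppercase hex digits.
function isPercentEncodedBytes(s) {
  return /^(?:%[0-9A-F]{2})+$/.test(s);
}

var samples = [250, 2500, 25000];
for (var i = 0; i < samples.length; i++) {
  var ch = String.fromCharCode(samples[i]);
  // encodeURI: one %XX per UTF-8 byte.
  // escape (legacy): %XX below 256, %uXXXX from 256 upward.
  console.log(samples[i], encodeURI(ch), escape(ch),
              isPercentEncodedBytes(encodeURI(ch)));
}
```

The three encodeURI results match the ones quoted in this post, while escape gives %FA, %u09C4 and %u61A8 for the same inputs.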

Mar 17 '06 #1
Csaba Gabor wrote:
If I do alert(encodeURI(String.fromCharCode(250)));
(in FF 1.5+ or IE6 on my winXP Pro) then I get: %C3%BA

Now I was sort of expecting something like %u... (and a single (4
digit?) unicode hex character num). Is that something for the future,


OK, I think I have most of it now. I was confusing encodeURI with what
I had earlier read at this site:
http://html.megalink.com/programmer/...sTabChars.html

but that is covering how to specify javascript (1.3) strings and not
what happens with encodeURI. I presume this is a reflection of the
spec that browsers must follow in transmitting information to servers.
Still, I was a little surprised.

Here is another interesting point:
var a=String.fromCharCode(131071);
alert(a.charCodeAt(0)+"\n"+a);

That code shows a char code of 65535, and if I use 131072 then the char
code goes to 0. In other words, it wraps modulo 65536 (2^16).
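
The wrapping is what the language specifies: String.fromCharCode keeps only the low 16 bits of each argument. A short sketch of that, which also uses String.fromCodePoint for a value above 0xFFFF (that function arrived with ES2015, so it is an assumption about the environment and is not available in the browsers mentioned in this thread):

```javascript
// fromCharCode reduces its argument modulo 2^16.
var a = String.fromCharCode(131071);        // 131071 = 0x1FFFF
var b = String.fromCharCode(131072);        // 131072 = 0x20000
console.log(a.charCodeAt(0));               // 65535
console.log(b.charCodeAt(0));               // 0

// fromCodePoint (ES2015+) builds a surrogate pair instead of wrapping.
var c = String.fromCodePoint(0x1F600);
console.log(c.length);                      // 2 UTF-16 code units
console.log(c.codePointAt(0).toString(16)); // 1f600
console.log(encodeURI(c));                  // %F0%9F%98%80
```

With a code point above 0xFFFF, encodeURI produces four escapes, still two hex digits each.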

I just have one question at this point. As I mentioned in my original
post,
String.fromCharCode(2500) == "\u09C4" => %E0%A7%84
The first equivalence is easy since 9C4 is the hex representation of
(decimal) 2500. But how do we get to the encodeURI output on the
right?

Csaba

Mar 17 '06 #2
Csaba Gabor wrote:
I just have one question at this point. As I mentioned in my original
post, String.fromCharCode(2500) == "\u09C4" => %E0%A7%84
The first equivalence is easy since 9C4 is the hex representation of
(decimal) 2500. But how do we get to the encodeURI output on the
right?


Those are percent-escaped representations of the three UTF-8 code
units that are required to encode the Unicode character at code
point U+09C4. See also ECMAScript 3 Final, subsection 15.1.3, and
<URL:http://people.w3.org/rishida/scripts/uniview/conversion>.
PointedEars
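
One way to see this is to produce the UTF-8 bytes directly and percent-escape them by hand. The sketch below assumes an environment that provides TextEncoder (current browsers and Node.js, not the 2006 browsers discussed here); percentEncodeUtf8 is a name invented for the example:

```javascript
// Hypothetical helper: percent-escape every UTF-8 byte of a string.
// Unlike encodeURI, it escapes ASCII characters too.
function percentEncodeUtf8(str) {
  var bytes = new TextEncoder().encode(str); // Uint8Array of UTF-8 bytes
  var out = "";
  for (var i = 0; i < bytes.length; i++) {
    var hex = bytes[i].toString(16).toUpperCase();
    out += "%" + (hex.length < 2 ? "0" + hex : hex);
  }
  return out;
}

console.log(percentEncodeUtf8("\u09C4")); // %E0%A7%84
console.log(encodeURI("\u09C4"));         // %E0%A7%84
```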
Mar 18 '06 #3
Thomas 'PointedEars' Lahn wrote:
Csaba Gabor wrote:
I just have one question at this point. As I mentioned in my original
post, String.fromCharCode(2500) == "\u09C4" => %E0%A7%84


Those are percent-escaped representations of the three UTF-8 code ...
<URL:http://people.w3.org/rishida/scripts/uniview/conversion>.


Thanks Thomas, I like links.
It let me figure out the unicode / UTF8 mapping.
He's got a function, convertCP2UTF8 (spaceSeparatedHexValues) that does
essentially:

n = ...unicodeValue...
if (n <= 0x7F) return dec2hex2(n);
else if (n <= 0x7FF) return
dec2hex2(0xC0 | ((n>>6) & 0x1F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else if (n <= 0xFFFF) return
dec2hex2(0xE0 | ((n>>12) & 0x0F)) + ' ' +
dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else if (n <= 0x10FFFF) return
dec2hex2(0xF0 | ((n>>18) & 0x07)) + ' ' +
dec2hex2(0x80 | ((n>>12) & 0x3F)) + ' ' +
dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else return '!erreur ' + dec2hex(n);
In words: If your positive integer (the char code) is not less than
17*16^4, report an error,
and if it is 7 bits or less (in the range [0,2^7), that is), just
return the two digit hex representation.

Otherwise, let k be the number of bits in your number. That is to say,
k is the smallest integer such that 2^k is greater than your number
(e.g. [2^(k-1),2^k)->k; [128,256)->8; [8,16)->4; [4,8)->3; [2,4)->2;
1->1; 0->0). Now, starting at the low end, section the number into
m=ceiling((k-1)/5) groups: m-1 low groups of 6 bits each, with any
leftovers in the final (high) group, zero-padded on the left to 7-m
bits. Prefix all but the high group with (bits) 10 (that is to say, OR
them with (hex) 80). Prefix the high group with the m+1 bits
corresponding to 2^(m+1)-2. That is to say, prefix the high group with
(bits) 110 when there are 2 groups, with 1110 when there are 3, or
with 11110 when there are 4.

Thus, if your number has 7 bits or less, it takes two hex digits to
represent. From 8 to 11 (inclusive) it takes four hex digits, from 12
to 16 (inclusive) it takes six, and from 17 to 21 (inclusive) bits it
takes eight hex digits to represent.
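
The bit-count rule can be checked mechanically. In this sketch bitLength and utf8ByteCount are names chosen for illustration; the formula is the m=ceiling((k-1)/5) one from the paragraph above, with the 7-bit case handled separately:

```javascript
// Number of significant bits in a non-negative integer (0 -> 0).
function bitLength(n) {
  var k = 0;
  while (n > 0) { k++; n = Math.floor(n / 2); }
  return k;
}

// Bytes needed for code point n: 1 for up to 7 bits,
// otherwise m = ceiling((k - 1) / 5).
function utf8ByteCount(n) {
  var k = bitLength(n);
  return k <= 7 ? 1 : Math.ceil((k - 1) / 5);
}

// Probe both sides of each boundary: 7/8, 11/12, 16/17 bits, and the maximum.
var probes = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF];
for (var i = 0; i < probes.length; i++) {
  var count = utf8ByteCount(probes[i]);
  console.log("0x" + probes[i].toString(16), bitLength(probes[i]) + " bits",
              count + " bytes", 2 * count + " hex digits");
}
```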

Example: 2500 -> 0x9C4 ->
1001 1100 0100 so k=12 and m=3 ->
(0000) 100111 000100 (that first group got no bits so it is implied) ->
(1110)0000 (10)100111 (10)000100 ->
E0 A7 84
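
The same worked example can be reproduced by printing the bits. utf8BitGroups below is an illustrative helper, hard-wired to the three-byte case (12 to 16 bits) to keep it short:

```javascript
// Left-pad a string of digits with zeros to the given width.
function pad(digits, width) {
  while (digits.length < width) digits = "0" + digits;
  return digits;
}

// Illustrative only: handles code points that need three bytes.
function utf8BitGroups(n) {
  var bits = pad(n.toString(2), 16);   // 4 + 6 + 6 payload bits
  var groups = [bits.slice(0, 4), bits.slice(4, 10), bits.slice(10)];
  var bytes = ["1110" + groups[0], "10" + groups[1], "10" + groups[2]];
  var hex = [];
  for (var i = 0; i < bytes.length; i++) {
    hex.push(pad(parseInt(bytes[i], 2).toString(16).toUpperCase(), 2));
  }
  return { groups: groups, bytes: bytes, hex: hex };
}

var r = utf8BitGroups(2500);
console.log(r.groups.join(" ")); // 0000 100111 000100
console.log(r.bytes.join(" "));  // 11100000 10100111 10000100
console.log(r.hex.join(" "));    // E0 A7 84
```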

With this it's also easy to see how to work from UTF-8 to unicode.
Given a byte, scan from the high (left) side for the first 0 bit.
If the high bit is 0, you are done and you have a "normal" character.
Otherwise, the character is specified by the next m bytes (including
the one the scan started with), where m is the number of 1s
encountered before finding that first 0 bit. Knock out all the bits
up to and including the first 0 bit of that first byte, and the top 2
bits of all the rest, and concatenate the remaining bits to get the
char code.
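
A decoding sketch for a single character follows. Here the run of 1 bits at the top of the first byte is taken as the total length of the sequence in bytes, which matches the E0 A7 84 example. decodeUtf8Char is a made-up name, and the function does no validation (it assumes well-formed input), so treat it as an illustration rather than a robust decoder:

```javascript
// Decode one UTF-8 sequence starting at bytes[0]; assumes valid input.
function decodeUtf8Char(bytes) {
  var lead = bytes[0];
  // Count the 1 bits before the first 0 bit of the lead byte.
  var ones = 0;
  while (ones < 8 && (lead & (0x80 >> ones))) ones++;
  if (ones === 0) return lead;          // plain 7-bit character
  // Keep only the lead byte's bits below its first 0 bit.
  var cp = lead & (0x7F >> ones);
  // Each following byte contributes its low 6 bits.
  for (var i = 1; i < ones; i++) {
    cp = cp * 64 + (bytes[i] & 0x3F);
  }
  return cp;
}

console.log(decodeUtf8Char([0xE0, 0xA7, 0x84])); // 2500
console.log(decodeUtf8Char([0xC3, 0xBA]));       // 250
console.log(decodeUtf8Char([0x41]));             // 65
```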

Thus, we see the correspondence between UTF8 and unicode
Csaba
I found the following sites useful for seeing mappings and glyphs:
http://www.unicode.org/charts/About.html and
http://www.macchiato.com/unicode/chart/

Mar 18 '06 #4
Csaba Gabor wrote:
Thomas 'PointedEars' Lahn wrote:
Csaba Gabor wrote:
> I just have one question at this point. As I mentioned in my original
> post, String.fromCharCode(2500) == "\u09C4" => %E0%A7%84
Those are percent-escaped representations of the three UTF-8 code ...
Would you please at least try to retain context in quotations?

<URL:http://learn.to/quote>
<URL:http://people.w3.org/rishida/scripts/uniview/conversion>.


Thanks Thomas, I like links.
It let me figure out the unicode / UTF8 mapping.
He's got a function, convertCP2UTF8 (spaceSeparatedHexValues) that does
essentially:

n = ...unicodeValue...
if (n <= 0x7F) return dec2hex2(n);
else if (n <= 0x7FF) return
dec2hex2(0xC0 | ((n>>6) & 0x1F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else if (n <= 0xFFFF) return
dec2hex2(0xE0 | ((n>>12) & 0x0F)) + ' ' +
dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else if (n <= 0x10FFFF) return
dec2hex2(0xF0 | ((n>>18) & 0x07)) + ' ' +
dec2hex2(0x80 | ((n>>12) & 0x3F)) + ' ' +
dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else return '!erreur ' + dec2hex(n);

In words: If your positive integer (the char code) is not less
than 17*16^4, report an error,


Yes. The error is reported if the value is greater than or equal to
0x110000, because The Unicode Standard, version 4.0, does not provide
for more than 1114112 code points, starting with code point U+0000.
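
In engines that implement ES2015 (an assumption; the browsers discussed in this thread predate it), the same upper bound shows up in String.fromCodePoint, which accepts 0x10FFFF but throws a RangeError for 0x110000. A quick check:

```javascript
// 0x10FFFF is the last valid code point; 0x110000 is out of range.
console.log(0x110000);                              // 1114112
console.log(String.fromCodePoint(0x10FFFF).length); // 2 (surrogate pair)

var threw = false;
try {
  String.fromCodePoint(0x110000);
} catch (e) {
  threw = e instanceof RangeError;
}
console.log(threw);                                 // true
```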

(BTW: You have mis-wrapped your abstraction of the original source
code; a trailing `return' statement would only return `undefined',
not the evaluated value of the following lines.)
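
To illustrate the point about `return': JavaScript inserts a semicolon after a `return' that is followed by a line break, so the expression on the following lines is never evaluated. A runnable rearrangement of the abstraction keeps each expression attached to its `return' by opening a parenthesis on the same line. dec2hex2 here is a stand-in written for this sketch, not the original page's helper, and cpToUtf8Hex is likewise an invented name:

```javascript
// Stand-in helper: two-digit uppercase hex of a byte value.
function dec2hex2(b) {
  var h = b.toString(16).toUpperCase();
  return h.length < 2 ? "0" + h : h;
}

// Code point -> space-separated UTF-8 bytes in hex.
function cpToUtf8Hex(n) {
  if (n <= 0x7F) return dec2hex2(n);
  if (n <= 0x7FF) return (
    dec2hex2(0xC0 | ((n >> 6) & 0x1F)) + " " +
    dec2hex2(0x80 | (n & 0x3F)));
  if (n <= 0xFFFF) return (
    dec2hex2(0xE0 | ((n >> 12) & 0x0F)) + " " +
    dec2hex2(0x80 | ((n >> 6) & 0x3F)) + " " +
    dec2hex2(0x80 | (n & 0x3F)));
  if (n <= 0x10FFFF) return (
    dec2hex2(0xF0 | ((n >> 18) & 0x07)) + " " +
    dec2hex2(0x80 | ((n >> 12) & 0x3F)) + " " +
    dec2hex2(0x80 | ((n >> 6) & 0x3F)) + " " +
    dec2hex2(0x80 | (n & 0x3F)));
  return "!erreur " + n.toString(16);
}

console.log(cpToUtf8Hex(250));   // C3 BA
console.log(cpToUtf8Hex(2500));  // E0 A7 84
console.log(cpToUtf8Hex(25000)); // E6 86 A8
```

Like the quoted abstraction, this sketch does not reject surrogate code points (0xD800 to 0xDFFF).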
and if it is 7 bits or less (in the range [0,2^7), that is), just
return the two digit hex representation.
Yes. One (8-bit) UTF-8 code unit suffices to encode Unicode characters
at these code points.
[...]
With this it's also easy to see how to work from UTF-8 to unicode.
[...]
Thus, we see the correspondence between UTF8 and unicode
You are not making any sense. `n' is assigned the code point (CP)
number, which is then converted into UTF-8 code units, according to
the algorithms specified in The Unicode Standard, version 4.0.

<URL:http://en.wikipedia.org/wiki/Unicode>
[...]
I found the following sites useful for seeing mappings and glyphs:
http://www.unicode.org/charts/About.html and
http://www.macchiato.com/unicode/chart/


But obviously you have not found <URL:http://unicode.org/faq/> yet.
Please make it so.
PointedEars
Mar 18 '06 #5
Thomas 'PointedEars' Lahn wrote:
Would you please at least try to retain context in quotations?
I did.
You are not making any sense. `n' is assigned the code point (CP)
number, which is then converted into UTF-8 code units, according to
the algorithms specified in The Unicode Standard, version 4.0.


Sorry you didn't get it. It seems I was spot on in showing how to go
from CP number to the UTF-8 code units and back, as can be verified at
the nice
http://en.wikipedia.org/wiki/UTF-8

Csaba

Mar 19 '06 #6
Csaba Gabor wrote:
Thomas 'PointedEars' Lahn wrote:
Would you please at least try to retain context in quotations? I did.


You did not. I wrote (at least):

| Those are percent-escaped representations of the three UTF-8 code
| units that are required to encode the Unicode character at code
| point U+09C4. [...]
| <URL:http://people.w3.org/rishida/scripts/uniview/conversion>.

You quoted me:

| > Those are percent-escaped representations of the three UTF-8 code ...
| > <URL:http://people.w3.org/rishida/scripts/uniview/conversion>.

You call that /retaining/ context? You even removed the "units" word.
You are not making any sense. `n' is assigned the code point (CP)
number, which is then converted into UTF-8 code units, according to
the algorithms specified in The Unicode Standard, version 4.0.


Thank you for destroying the context again.
Sorry you didn't get it.
YMMD.
It seems I was spot on in showing how to go from CP number to
the UTF-8 code units and back, as can be verified at the nice
http://en.wikipedia.org/wiki/UTF-8


What you think it seemed, and what you actually meant, is not relevant
regarding the question whether you have been making sense or not. You
said this shows the relation between Unicode and UTF-8, which is nonsense,
because the relation has always been there. UTF-8 is one possible encoding
to encode Unicode characters.

Better express yourself next time, this way you can avoid misunderstandings.
Score adjusted

PointedEars
Mar 19 '06 #7
Thomas 'PointedEars' Lahn wrote:
Csaba Gabor wrote:
Thomas 'PointedEars' Lahn wrote:
Would you please at least try to retain context in quotations? I did.


You did not. I wrote (at least):


In fact, I did try. You are not an authority on me so I will
appreciate it if you will refrain from making assertions on
things you can not know.
| Those are percent-escaped representations of the three UTF-8 code
| units that are required to encode the Unicode character at code
| point U+09C4. [...]
| <URL:http://people.w3.org/rishida/scripts/uniview/conversion>.

You quoted me:

| > Those are percent-escaped representations of the three UTF-8 code ...
| > <URL:http://people.w3.org/rishida/scripts/uniview/conversion>.

You call that /retaining/ context? You even removed the ....

Yes.
Upon review, I find that I quoted exactly what I wanted to quote.
You are not making any sense. `n' is assigned the code point (CP)
number, which is then converted into UTF-8 code units, according to
the algorithms specified in The Unicode Standard, version 4.0.


Thank you for destroying the context again.
Sorry you didn't get it.


YMMD.
It seems I was spot on in showing how to go from CP number to
the UTF-8 code units and back, as can be verified at the nice
http://en.wikipedia.org/wiki/UTF-8


What you think it seemed, and what you actually meant, is not relevant
regarding the question whether you have been making sense or not.


In fact it is, since making sense is always subjective.
You said this shows the relation between Unicode and UTF-8, which is nonsense,
Really? Care to offer a quote for your assertion about what I said?
I never even used the word relationship in this thread.
because the relation has always been there. UTF-8 is one possible encoding
to encode Unicode characters.

Better express yourself next time, this way you can avoid misunderstandings.


Now that I have expressed myself, you might consider
expressing yourself better next time.
In particular, ordering and making demands on people is neither polite,
nor very effective on newsgroups where there is no means of enforcement.
If there is something that you would like to see done differently, then
it might be more expedient to point out what bothers you about it, and
suggest what would make you happier. Just saying "Don't" or "That was
nonsense" is not very constructive in forestalling future occurrences.

Csaba Gabor from Vienna
Apr 20 '06 #8
