473,796 Members | 2,916 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

encodeURI and unicode

If I do alert(encodeURI (String.fromCha rCode(250)));
(in FF 1.5+ or IE6 on my winXP Pro) then I get: %C3%BA

Now I was sort of expecting something like %u... (and a single (4
digit?) unicode hex character num). Is that something for the future,
or am I guaranteed that all % encodings (from encodeURI) will have
exactly two hex digits following?

Perhaps someone could shed some light on this or point me to quality
site. Be gentle, I know almost nothing about unicode.

Thanks,
Csaba Gabor from Vienna
alert(encodeURI (String.fromCha rCode(2500))) => %E0%A7%84
alert(encodeURI (String.fromCha rCode(25000))) => %E6%86%A8

Mar 17 '06 #1
7 5085
Csaba Gabor wrote:
If I do alert(encodeURI (String.fromCha rCode(250)));
(in FF 1.5+ or IE6 on my winXP Pro) then I get: %C3%BA

Now I was sort of expecting something like %u... (and a single (4
digit?) unicode hex character num). Is that something for the future,


OK, I think I have most it it now. I was confusing encodeURI with what
I had earlier read at this site:
http://html.megalink.com/programmer/...sTabChars.html

but that is covering how to specify javascript (1.3) strings and not
what happens with encodeURI. I presume this is a reflection of the
spec that browsers must follow in transmitting information to servers.
Still, I was a little surprised.

Here is another interesting point:
var a=String.fromCh arCode(131071);
alert(a.charCod eAt(0)+"\n"+a);

That code shows a char code of 65535, and if use 131072 then the char
code goes to 0. In other words, it wraps.

I just have one question at this point. As I mentioned in my original
post,
String.fromChar Code(2500) == "\u09C4" => %E0%A7%84
The first equivalence is easy since 9C4 is the hex representation of
(decimal) 2500. But how do we get to the encodeURI output on the
right?

Csaba

Mar 17 '06 #2
Csaba Gabor wrote:
I just have one question at this point. As I mentioned in my original
post, String.fromChar Code(2500) == "\u09C4" => %E0%A7%84
The first equivalence is easy since 9C4 is the hex representation of
(decimal) 2500. But how do we get to the encodeURI output on the
right?


Those are percent-escaped representations of the three UTF-8 code
units that are required to encode the Unicode character at code
point U+09C4. See also ECMAScript 3 Final, subsection 15.1.3, and
<URL:http://people.w3.org/rishida/scripts/uniview/conversion>.
PointedEars
Mar 18 '06 #3
Thomas 'PointedEars' Lahn wrote:
Csaba Gabor wrote:
I just have one question at this point. As I mentioned in my original
post, String.fromChar Code(2500) == "\u09C4" => %E0%A7%84


Those are percent-escaped representations of the three UTF-8 code ...
<URL:http://people.w3.org/rishida/scripts/uniview/conversion>.


Thanks Thomas, I like links.
It let me figure out the unicode / UTF8 mapping.
He's got a function, convertCP2UTF8 (spaceSeparated HexValues) that does
essentially:

n = ...unicodeValue ...
if (n <= 0x7F) return dec2hex2(n);
else if (n <= 0x7FF) return
dec2hex2(0xC0 | ((n>>6) & 0x1F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else if (n <= 0xFFFF) return
dec2hex2(0xE0 | ((n>>12) & 0x0F)) + ' ' +
dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else if (n <= 0x10FFFF) return
dec2hex2(0xF0 | ((n>>18) & 0x07)) + ' ' +
dec2hex2(0x80 | ((n>>12) & 0x3F)) + ' ' +
dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else return '!erreur ' + dec2hex(n);
In words: If your positive integer (the char code) is not less than
17*16^4, report an error,
and If it is 7 bits or less (in the range (2^7,0], that is), just
return the two digit hex representation.

Otherwise, let k be the number of bits in your number. That is to say,
k is the smallest integer such that 2^k is greater than your number -
e.g. [2^(k-1),2^k)->k; [128,256)->8; [8,16)->4; [4,8)->3; [2,4)->2;
1->1; 0->0). Now, starting at the low end, section the number into
m=ceiling((k-1)/5) groups of 6 bits, with any leftovers in the final
(high) group. Prefix all but the high groups with (bits) 10 (that is
to say, OR them with (hex) 80). Prefix the high group with the m+1
bits corresponding to 2^(m+1)-2. That is to say, prefix the first
group of 2 with (bits) 110, the first group of 3 with 1110, or the
first group of 4 with 11110.

Thus, if your number has 7 bits or less, it takes two hex digits to
represent. From 8 to 11 (inclusive) it takes four hex digits, from 12
to 16 (inclusive) it takes six, and from 17 to 21 (inclusive) bits it
takes eight hex digits to represent.

Example: 2500 -> 0x9C4 ->
1001 1100 0100 so k=12 and m=3 ->
(0000) 100111 000100 (that first group got no bits so it is implied) ->
(1110)0000 (10)100111 (10)000100 ->
E0 A7 84

With this it's also easy to see how to work from UTF-8 to unicode.
Given a byte, scan for (from the high (left) side, the first 0 bit).
If the high bit is 0, you are done and you have a "normal" character.
Otherwise, the character is specified by the next m bytes (including
the one the scan started with), where m is one less than the number of
1s encountered before finding that first 0 bit. Knock out all the bits
up to the first 0 bit, and the top 2 bits of all the rest, and
concatenate the remaining bits to get the char code.

Thus, we see the correspondence between UTF8 and unicode
Csaba
I found the following sites useful for seeing mappings and glyphs:
http://www.unicode.org/charts/About.html and
http://www.macchiato.com/unicode/chart/

Mar 18 '06 #4
Csaba Gabor wrote:
Thomas 'PointedEars' Lahn wrote:
Csaba Gabor wrote:
> I just have one question at this point. As I mentioned in my original
> post, String.fromChar Code(2500) == "\u09C4" => %E0%A7%84
Those are percent-escaped representations of the three UTF-8 code ...
Would you please at least try to retain context in quotations?

<URL:http://learn.to/quote>
<URL:http://people.w3.org/rishida/scripts/uniview/conversion>.


Thanks Thomas, I like links.
It let me figure out the unicode / UTF8 mapping.
He's got a function, convertCP2UTF8 (spaceSeparated HexValues) that does
essentially:

n = ...unicodeValue ...
if (n <= 0x7F) return dec2hex2(n);
else if (n <= 0x7FF) return
dec2hex2(0xC0 | ((n>>6) & 0x1F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else if (n <= 0xFFFF) return
dec2hex2(0xE0 | ((n>>12) & 0x0F)) + ' ' +
dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else if (n <= 0x10FFFF) return
dec2hex2(0xF0 | ((n>>18) & 0x07)) + ' ' +
dec2hex2(0x80 | ((n>>12) & 0x3F)) + ' ' +
dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else return '!erreur ' + dec2hex(n);

In words: If your positive integer (the char code) is not less
than 17*16^4, report an error,


Yes. The error is reported if the value is greater than or equal to
0x110000, because The Unicode Standard, version 4.0, does not provide
for more than 1114112 code points, starting with code point U+0000.

(BTW: You have mis-wrapped your abstraction of the original source
code; a trailing `return' statement would only return `undefined',
not the evaluated value of the following lines.)
and If it is 7 bits or less (in the range (2^7,0], that is), just
return the two digit hex representation.
Yes. One (8-bit) UTF-8 code unit suffices to encode Unicode characters
at these code points.
[...]
With this it's also easy to see how to work from UTF-8 to unicode.
[...]
Thus, we see the correspondence between UTF8 and unicode
You are not making any sense. `n' is assigned the code point (CP)
number, which is then converted into UTF-8 code units, according to
the algorithms specified in The Unicode Standard, version 4.0.

<URL:http://en.wikipedia.or g/wiki/Unicode>
[...]
I found the following sites useful for seeing mappings and glyphs:
http://www.unicode.org/charts/About.html and
http://www.macchiato.com/unicode/chart/


But obviously you have not found <URL:http://unicode.org/faq/> yet.
Please make it so.
PointedEars
Mar 18 '06 #5
Thomas 'PointedEars' Lahn wrote:
Would you please at least try to retain context in quotations? I did.
You are not making any sense. `n' is assigned the code point (CP)
number, which is then converted into UTF-8 code units, according to
the algorithms specified in The Unicode Standard, version 4.0.


Sorry you didn't get it. It seems I was spot on in showing how to go
from CP number to the UTF-8 code units and back, as can be verified at
the nice
http://en.wikipedia.org/wiki/UTF-8

Csaba

Mar 19 '06 #6
Csaba Gabor wrote:
Thomas 'PointedEars' Lahn wrote:
Would you please at least try to retain context in quotations? I did.


You did not. I wrote (at least):

| Those are percent-escaped representations of the three UTF-8 code
| units that are required to encode the Unicode character at code
| point U+09C4. [...]
| <URL:http://people.w3.org/rishida/scripts/uniview/conversion>.

You quoted me:

| > Those are percent-escaped representations of the three UTF-8 code ...
| > <URL:http://people.w3.org/rishida/scripts/uniview/conversion>.

You call that /retaining/ context? You even removed the "units" word.
You are not making any sense. `n' is assigned the code point (CP)
number, which is then converted into UTF-8 code units, according to
the algorithms specified in The Unicode Standard, version 4.0.


Thank you for destroying the context again.
Sorry you didn't get it.
YMMD.
It seems I was spot on in showing how to go from CP number to
the UTF-8 code units and back, as can be verified at the nice
http://en.wikipedia.org/wiki/UTF-8


What you think it seemed, and what you actually meant, is not relevant
regarding the question whether you have been making sense or not. You
said this shows the relation between Unicode and UTF-8, which is nonsense,
because the relation has always been there. UTF-8 is one possible encoding
to encode Unicode characters.

Better express yourself next time, this way you can avoid misunderstandin gs.
Score adjusted

PointedEars
Mar 19 '06 #7
Thomas 'PointedEars' Lahn wrote:
Csaba Gabor wrote:
Thomas 'PointedEars' Lahn wrote:
Would you please at least try to retain context in quotations? I did.


You did not. I wrote (at least):


In fact, I did try. You are not an authority on me so I will
appreciate it if you will refrain from making assertions on
things you can not know.
| Those are percent-escaped representations of the three UTF-8 code
| units that are required to encode the Unicode character at code
| point U+09C4. [...]
| <URL:http://people.w3.org/rishida/scripts/uniview/conversion>.

You quoted me:

| > Those are percent-escaped representations of the three UTF-8 code ...
| > <URL:http://people.w3.org/rishida/scripts/uniview/conversion>.

You call that /retaining/ context? You even removed the ....

Yes.
Upon review, I find that I quoted exactly what I wanted to quote.
You are not making any sense. `n' is assigned the code point (CP)
number, which is then converted into UTF-8 code units, according to
the algorithms specified in The Unicode Standard, version 4.0.


Thank you for destroying the context again.
Sorry you didn't get it.


YMMD.
It seems I was spot on in showing how to go from CP number to
the UTF-8 code units and back, as can be verified at the nice
http://en.wikipedia.org/wiki/UTF-8


What you think it seemed, and what you actually meant, is not relevant
regarding the question whether you have been making sense or not.


In fact it is, since making sense is always subjective.
You said this shows the relation between Unicode and UTF-8, which is nonsense,
Really? Care to offer a quote for your assertion about what I said?
I never even used the word relationship in this thread.
because the relation has always been there. UTF-8 is one possible encoding
to encode Unicode characters.

Better express yourself next time, this way you can avoid misunderstandin gs.


Now that I have expressed myself, you might consider
expressing yourself better next time.
In particular, ordering and making demands on people is neither polite,
nor very effective on newsgroups where there is no means of enforcement.
If there is something that you would like to see done differently, then
it might be more expedient to point out what bothers you about it, and
suggest what would make you happier. Just saying "Don't" or "That was
nonesense" is not very constructive in forestalling future occurrences.

Csaba Gabor from Vienna
Apr 20 '06 #8

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
17626
by: Michael Weir | last post by:
I'm sure this is a very simple thing to do, once you know how to do it, but I am having no fun at all trying to write utf-8 strings to a unicode file. Does anyone have a couple of lines of code that - opens a file appropriately for output - writes to this file Thanks very much. Michael Weir
8
5278
by: Bill Eldridge | last post by:
I'm trying to grab a document off the Web and toss it into a MySQL database, but I keep running into the various encoding problems with Unicode (that aren't a problem for me with GB2312, BIG 5, etc.) What I'd like is something as simple as: CREATE TABLE junk (junklet VARCHAR(2500) CHARACTER SET UTF8)); import MySQLdb, re,urllib
8
3668
by: Francis Girard | last post by:
Hi, For the first time in my programmer life, I have to take care of character encoding. I have a question about the BOM marks. If I understand well, into the UTF-8 unicode binary representation, some systems add at the beginning of the file a BOM mark (Windows?), some don't. (Linux?). Therefore, the exact same text encoded in the same UTF-8 will result in two different binary files, and of a slightly different length. Right ?
48
4647
by: Zenobia | last post by:
Recently I was editing a document in GoLive 6. I like GoLive because it has some nice features such as: * rewrite source code * check syntax * global search & replace (through several files at once) * regular expression search & replace. Normally my documents are encoded with the ISO setting. Recently I was writing an XHTML document. After changing the encoding to UTF-8 I used the
4
6072
by: webdev | last post by:
lo all, some of the questions i'll ask below have most certainly been discussed already, i just hope someone's kind enough to answer them again to help me out.. so i started a python 2.3 script that grabs some web pages from the web, regex parse the data and stores it localy to xml file for further use.. at first i had no problem using python minidom and everything concerning
2
2634
by: Neil Schemenauer | last post by:
python-dev@python.org.] The PEP has been rewritten based on a suggestion by Guido to change str() rather than adding a new built-in function. Based on my testing, I believe the idea is feasible. It would be helpful if people could test the patched Python with their own applications and report any incompatibilities. PEP: 349
24
9074
by: ChaosKCW | last post by:
Hi I am reading from an oracle database using cx_Oracle. I am writing to a SQLite database using apsw. The oracle database is returning utf-8 characters for euopean item names, ie special charcaters from an ASCII perspective. I get the following error: > SQLiteCur.execute(sql, row)
2
8035
by: polilop | last post by:
I'm having problems encoding URI. I have a page in which I use XMLHttpRequest to send a request. In my request it is possible to have Central European characters. When i send the request through mozilla, the URL is encoded (the character Z in to %AE) my problem is that IE dose not encode the character, it tourns it into ?, and if I use the javascript encodeURI for the same Character (Z) i get %C5%BD, i have set the header to...
0
9535
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
10467
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
1
10201
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
10021
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
9061
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
0
6802
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
5582
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
4130
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
2
3744
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.