On Sat, 15 Nov 2003 19:57:14 GMT, Martin Goldman <www@nowhere.foo> wrote:
[color=blue]
>Daniel Tryba <news_comp.lang.php@canopus.nl> wrote in news:bp5nhq$d0e$1
>@news.tue.nl:
>[color=green]
>> That might mean that there is no chr(147) in the string, although you
>> _see_ a character that might be represented as the character you know
>> as 147 in cp1252! Another fine example is the euro symbol: IIRC it's 128 in
>> cp1252 and 164 in iso-8859-15; in iso-8859-1, 164 is the generic currency
>> symbol, and the euro symbol is missing entirely. That's why, if you want to
>> display the euro symbol, you are encouraged to use the HTML entity &euro;,
>> which can be rendered in any font and any character set (with a fallback
>> to EUR).
>>
>> So your job is to figure out how your quote is encoded (just step through
>> the string and print the ord() value of each character)...[/color]
>
>Interesting you should suggest this, because I just did that. And indeed,
>it's not coming out as 147. It's coming out as 226, followed by 128,
>followed by 156. I suppose I could do a str_replace for these 3
>characters and replace it with 147. Although, then I'd have to do that
>for every character I want to support. What a drag.[/color]
Your text is encoded in UTF-8. Going back to the characters again:
cp1252 hex  cp1252 dec  Unicode (dec)  Unicode name
91          145         8216           LEFT SINGLE QUOTATION MARK
92          146         8217           RIGHT SINGLE QUOTATION MARK
93          147         8220           LEFT DOUBLE QUOTATION MARK
94          148         8221           RIGHT DOUBLE QUOTATION MARK
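
To reproduce the byte values you found, a dump along these lines should do. It is only a sketch: dump_bytes() is a name I made up, not a PHP built-in, and the quote is written as explicit byte escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// Return the decimal value of every byte in $s, in order.
// (dump_bytes is a hypothetical helper name, not part of PHP.)
function dump_bytes($s)
{
    $values = array();
    for ($i = 0; $i < strlen($s); $i++) {
        // ord() gives the numeric value of a single byte.
        $values[] = ord($s[$i]);
    }
    return $values;
}

// The opening double quote, spelled out as its three UTF-8 bytes.
$quote = "\xE2\x80\x9C";
echo implode(' ', dump_bytes($quote)), "\n";   // prints: 226 128 156
```

Note that strlen() and ord() work on bytes, not characters, which is exactly why one visible quote shows up as three values.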
226,128,156 in binary is:
11100010
10000000
10011100
'1110' in the first four bits of the first byte indicates that it is the lead
byte of a three-byte character. The remaining two are trail bytes, as they
start with 10. Separating the marker bits from the data bits gives:
1110 0010
10 000000
10 011100
=> 0010000000011100 (binary)
= 8220 (decimal)
Which is LEFT DOUBLE QUOTATION MARK.
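
The same bit manipulation can be sketched in PHP. Treat this as an illustration under a few assumptions rather than a complete decoder: utf8_decode3() and curly_quotes_to_cp1252() are names I invented, the decoder only handles the three-byte form and does not reject overlong or surrogate encodings, and the mapping function covers just the four quote characters listed above.

```php
<?php
// Decode one three-byte UTF-8 sequence into its Unicode code point.
// Returns null unless the bytes look like 1110xxxx 10xxxxxx 10xxxxxx.
// (utf8_decode3 is a hypothetical name, not a PHP built-in.)
function utf8_decode3($bytes)
{
    if (strlen($bytes) !== 3) {
        return null;
    }
    $b1 = ord($bytes[0]);
    $b2 = ord($bytes[1]);
    $b3 = ord($bytes[2]);

    // Check the marker bits: 1110 on the lead byte, 10 on each trail byte.
    if (($b1 & 0xF0) !== 0xE0 || ($b2 & 0xC0) !== 0x80 || ($b3 & 0xC0) !== 0x80) {
        return null;
    }

    // Keep 4 data bits from the lead byte and 6 from each trail byte.
    return (($b1 & 0x0F) << 12) | (($b2 & 0x3F) << 6) | ($b3 & 0x3F);
}

// Replace the four UTF-8 curly quotes with their single-byte cp1252 codes.
// (curly_quotes_to_cp1252 is also a hypothetical name.)
function curly_quotes_to_cp1252($text)
{
    return strtr($text, array(
        "\xE2\x80\x98" => "\x91", // U+2018 LEFT SINGLE QUOTATION MARK
        "\xE2\x80\x99" => "\x92", // U+2019 RIGHT SINGLE QUOTATION MARK
        "\xE2\x80\x9C" => "\x93", // U+201C LEFT DOUBLE QUOTATION MARK
        "\xE2\x80\x9D" => "\x94", // U+201D RIGHT DOUBLE QUOTATION MARK
    ));
}

echo utf8_decode3("\xE2\x80\x9C"), "\n";                  // prints: 8220
echo ord(curly_quotes_to_cp1252("\xE2\x80\x9C")), "\n";   // prints: 147
```

A single strtr() call with an array handles all four quotes in one pass, so you do not need a separate str_replace per character. Any other non-ASCII characters in the text stay in UTF-8, though, so the result is only clean cp1252 if the quotes are the only such characters. If your PHP build has the iconv extension, iconv('UTF-8', 'CP1252', $text) is probably the more general route.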
--
Andy Hassall (andy@andyh.co.uk) icq(5747695) (http://www.andyh.co.uk)
Space: disk usage analysis tool (http://www.andyhsoftware.co.uk/space)