lawrence <lkrubner@geocities.com> wrote:
> Simon Stienen <simon.stienen@news.slashlife.de> wrote in message news:<1wi5p87hn70gq$.dlg@news.dangerouscat.net>...
>> lawrence <lkrubner@geocities.com> wrote:
>>> Someone on www.php.net suggested using a seems_utf8() method to test
>>> text for UTF-8 character encoding but didn't specify how to write such
>>> a method. Can anyone suggest a test that might work? Something that
>>> maybe gives 90% confidence that a given block of text is or is not
>>> UTF-8 encoded?
>>
>> You may be able to decide, that a given string is *not* UTF-8, but there is
>> no way to clearly decide that the string *is* UTF-8. Therefore,
>> "seems_utf8" is a good name for such a function.
>
> This is very good information. Thanks. It certainly points the right
> way. But how does one get the value of the characters? Using ord()???
ord() will give you the value of the given single byte character, that is
0..255. In UTF-8, every character which has a higher value than 127 (0x7f)
is represented using at least two bytes:
<http://en.wikipedia.org/wiki/Utf-8>
| Code range (hex) | UTF-8 (binary)
| 000000 - 00007F | 0xxxxxxx
| 000080 - 0007FF | 110xxxxx 10xxxxxx
| 000800 - 00FFFF | 1110xxxx 10xxxxxx 10xxxxxx
| 010000 - 10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
It also states:
| [...] number of unused bytes in a UTF-8 stream increased to 13 bytes:
| 0xC0, 0xC1, 0xF5-0xFF
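To make the table a bit more concrete, here is a small sketch that prints what ord() returns for each byte of a two-byte character. The string is written with explicit byte escapes (U+00E9 encoded as UTF-8), so the result does not depend on how the script file itself is saved; the variable name is just for illustration.

```php
<?php
// U+00E9 ("e" with acute accent) as its two UTF-8 bytes.
$s = "\xC3\xA9";

// strlen() counts bytes, not characters, so this loop visits both bytes.
for ($i = 0; $i < strlen($s); $i++) {
    $byte = ord($s[$i]);
    printf("byte %d: 0x%02X = %08b\n", $i, $byte, $byte);
}
// byte 0: 0xC3 = 11000011   (lead byte, pattern 110xxxxx)
// byte 1: 0xA9 = 10101001   (following byte, pattern 10xxxxxx)
```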
Therefore, you have to find the first byte with a value of 0x80 or greater.
Either checking against ord():
1) if (ord($string[$i]) >= 0x80) ...
2) if (ord($string[$i]) & 0x80) ...
Or using a regular expression:
3) /[\x80-\xFF]/ (Get the offset when using preg_match)
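A rough sketch of variant 3, assuming PHP 7 or later for the type declarations. The function name first_high_byte() is made up for this example; preg_match() with the PREG_OFFSET_CAPTURE flag reports the byte offset of the match. The pattern has no /u modifier, so PCRE works on raw bytes.

```php
<?php
// Hypothetical helper: offset of the first byte >= 0x80,
// or -1 if the string is pure 7-bit ASCII.
function first_high_byte(string $s): int
{
    if (preg_match('/[\x80-\xFF]/', $s, $m, PREG_OFFSET_CAPTURE)) {
        // $m[0] is array(matched text, byte offset).
        return $m[0][1];
    }
    return -1;
}

var_dump(first_high_byte("abc"));          // int(-1)
var_dump(first_high_byte("abc\xC3\xA9"));  // int(3)
```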
Then check whether the byte may occur in UTF-8 encoded text at all. If it
is none of 0xC0, 0xC1, 0xF5-0xFF, it may occur. (You might want to do this
check for the whole string before finding the first byte >=0x80, using a
regexp or repeated substr_count.)
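The up-front check could look roughly like this with a regexp. The name has_forbidden_byte() is only an illustration; the character class is just the list of 13 unused bytes quoted above.

```php
<?php
// Hypothetical helper: true if the string contains one of the bytes
// that never appear in UTF-8 (0xC0, 0xC1, 0xF5-0xFF).
function has_forbidden_byte(string $s): bool
{
    // No /u modifier: the pattern is matched against raw bytes.
    return preg_match('/[\xC0\xC1\xF5-\xFF]/', $s) === 1;
}

var_dump(has_forbidden_byte("abc\xC3\xA9")); // bool(false)
var_dump(has_forbidden_byte("abc\xFF"));     // bool(true)
```

If this returns true you can stop right away; if it returns false you still have to do the position checks.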
If it may occur in a UTF-8 encoded string, this does not imply that it may
occur at *this* position. If (ord($byte) & 0xC0) (the two uppermost bits) is
0x80, i.e. the byte looks like 10xxxxxx, it is a byte which may only follow
the first byte of a multi-byte Unicode character sequence. Therefore, if we
find such a byte here, at the start of a character, the string is not
valid UTF-8.
Otherwise, count how many of the most significant bits are set before the
first 0 bit. Subtract one. This is the number of bytes following in this
UTF-8 character. Each of the following bytes has to validate:
(ord($byte) & 0xC0) == 0x80. (The parentheses are needed, because == binds
tighter than & in PHP.) If so, this is a structurally valid UTF-8 encoded
character.
Find the next byte >=0x80 and continue checking until you either find an
invalid value (seems_utf8 -> false) or reach the end of the string
(seems_utf8 -> true).
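
Put together, the steps above might look like the following sketch. It assumes PHP 7 or later and simply walks the string byte by byte instead of jumping between high bytes with a regexp; it is one possible reading of the procedure, not a reference implementation.

```php
<?php
function seems_utf8(string $s): bool
{
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($s[$i]);
        if ($byte < 0x80) {
            continue; // plain 7-bit ASCII, nothing to check
        }
        // Bytes that never occur in UTF-8.
        if ($byte === 0xC0 || $byte === 0xC1 || $byte >= 0xF5) {
            return false;
        }
        // 10xxxxxx cannot start a character.
        if (($byte & 0xC0) === 0x80) {
            return false;
        }
        // Leading 1 bits minus one = number of following bytes.
        if (($byte & 0xE0) === 0xC0) {
            $following = 1;      // 110xxxxx
        } elseif (($byte & 0xF0) === 0xE0) {
            $following = 2;      // 1110xxxx
        } else {
            $following = 3;      // 11110xxx (0xF5-0xFF rejected above)
        }
        // Every following byte must look like 10xxxxxx.
        for ($j = 0; $j < $following; $j++) {
            $i++;
            if ($i >= $len || (ord($s[$i]) & 0xC0) !== 0x80) {
                return false;
            }
        }
    }
    return true;
}

var_dump(seems_utf8("caf\xC3\xA9")); // bool(true)  UTF-8
var_dump(seems_utf8("caf\xE9"));     // bool(false) Latin-1
```

Note that this only checks the byte structure described above. It does not reject everything a strict validator would (for example overlong three-byte forms or encoded surrogates), so "seems" is still the right word. If the mbstring extension is available, mb_check_encoding($s, 'UTF-8') is a stricter built-in alternative.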

> I have the impression that UTF-16 or 32 is a bad idea in a web
> context. [...]
As I explicitly mentioned:
| (Therefore, I prefer UTF-16 and/or UTF-32 over UTF-8... at least for local
| files, for transmission UTF-8 is just fine [...])
--
Simon Stienen <http://dangerouscat.net> <http://slashlife.de>
»What you do in this world is a matter of no consequence,
The question is, what can you make people believe that you have done.«
-- Sherlock Holmes in "A Study in Scarlet" by Sir Arthur Conan Doyle