Someone on www.php.net suggested using a seems_utf8() method to test
text for UTF-8 character encoding but didn't specify how to write such
a method. Can anyone suggest a test that might work? Something that
maybe gives 90% confidence that a given block of text is or is not
UTF-8 encoded?
You may be able to decide that a given string is *not* UTF-8, but there is
no way to decide with certainty that the string *is* UTF-8. Therefore,
"seems_utf8" is a good name for such a function.
How validation is done:
Take the string. If it contains no byte in the range 0x80 to 0xFF, it doesn't
matter whether you treat the text as UTF-8 or as any ISO-8859 encoding, since
the first 128 characters have the same bit sequence in all of these encodings.
However, if there actually *are* bytes with a value of 128 or higher,
check whether each such sequence is a valid UTF-8 sequence (see
UTF-8 in Wikipedia for this). If every one of these sequences is valid
UTF-8, the string itself *might* be UTF-8. Of course, it could be a sequence
of extended ASCII/ANSI characters, too. It's impossible to be sure about
that.
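To make the "might be" concrete, here is a minimal sketch. The helper names are made up for illustration, and the second helper swaps in a different technique from the one described above: it leans on PCRE's built-in UTF-8 check (with the /u modifier, preg_match() returns false for a malformed subject) rather than checking sequences by hand.

```php
<?php
// Hypothetical helper names, not from the thread.

// True when the string has no byte in 0x80..0xFF, so the question
// "UTF-8 or ISO-8859-x?" makes no difference.
function isPureAscii(string $s): bool
{
    return preg_match('/[\x80-\xFF]/', $s) === 0;
}

// Relies on PCRE's own check: with /u, preg_match() returns false
// (not 0) when the subject is not well-formed UTF-8.
function passesPcreUtf8Check(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// These five bytes are "café" in UTF-8, but they are also a perfectly
// legal ISO-8859-1 string ("cafÃ©").
$bytes = "caf\xC3\xA9";

var_dump(isPureAscii("cafe"));             // bool(true)
var_dump(isPureAscii($bytes));             // bool(false)
var_dump(passesPcreUtf8Check($bytes));     // bool(true): valid, yet still ambiguous
var_dump(passesPcreUtf8Check("caf\xE9"));  // bool(false): ISO-8859-1 "café"
```

The last call shows the direction that can be decided: a lone 0xE9 is never valid UTF-8, so that string is definitely something else.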
HTH
Simon
(Therefore, I prefer UTF-16 and/or UTF-32 over UTF-8, at least for local
files. For transmission, UTF-8 is just fine because most characters won't
need an extra byte. In most UTF-16 encoded documents, you can be pretty
sure about the encoding due to the enormous percentage of bytes in the range
0x00 to 0x0F. In almost every text, at least 33% of the bytes fall into this
range, since every character in US-ASCII and Latin-1 has a high byte of 0x00,
every character in Latin Extended-A and -B has a high byte of 0x01 or 0x02,
and so on.
0x0000 to 0x0fff contains:
Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B, IPA
Extensions, Spacing Modifier Letters, Combining Diacritical Marks, Greek
and Coptic, Cyrillic, Cyrillic Supplement, Armenian, Hebrew, Arabic,
Syriac, Thaana, Devanagari and Bengali, Gurmukhi, Gujarati, Oriya, Tamil,
Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan
I guess this list should cover most documents.)
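The heuristic in this aside can be sketched as a simple byte count. The function name and the sample strings are assumptions for illustration. Note that tabs and line breaks (0x09, 0x0A, 0x0D) also fall into 0x00..0x0F, so ordinary single-byte text will not always score exactly zero.

```php
<?php
// Hypothetical helper: fraction of bytes whose value is in 0x00..0x0F.
function lowByteRatio(string $bytes): float
{
    $len = strlen($bytes);
    if ($len === 0) {
        return 0.0;
    }
    $low = 0;
    for ($i = 0; $i < $len; $i++) {
        if (ord($bytes[$i]) <= 0x0F) {
            $low++;
        }
    }
    return $low / $len;
}

$utf16 = "\x00H\x00e\x00l\x00l\x00o"; // "Hello" as UTF-16BE, no byte order mark
$utf8  = "Hello";                     // the same text as UTF-8

var_dump(lowByteRatio($utf16)); // float(0.5)
var_dump(lowByteRatio($utf8));  // float(0)
```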
--
Simon Stienen <http://dangerouscat.net> <http://slashlife.de>
»What you do in this world is a matter of no consequence,
The question is, what can you make people believe that you have done.«
-- Sherlock Holmes in "A Study in Scarlet" by Sir Arthur Conan Doyle
This is very good information. Thanks. It certainly points the right
way. But how does one get the value of the characters? Using ord()???
Take the string and move it through one character at a time, perhaps
in a for() loop, and get the byte value of each character using ord()?
The page for ord() says ord() "Return ASCII value of character" so if
a character is non-ASCII, perhaps it doesn't work? What PHP function
do I use to get the hex or dec value for a character?
I have the impression that UTF-16 or 32 is a bad idea in a web
context. Some good reasons were posted here: http://groups.google.com/groups?hl=e...40news.free.fr
The whole thread was informative.
ord() will give you the value of a single byte, that is
0..255. In UTF-8, every character with a code point higher than 127 (0x7F)
is represented using at least two bytes:
<http://en.wikipedia.org/wiki/Utf-8>
| Code range (hex) | UTF-8 (binary)
| 000000 - 00007F | 0xxxxxxx
| 000080 - 0007FF | 110xxxxx 10xxxxxx
| 000800 - 00FFFF | 1110xxxx 10xxxxxx 10xxxxxx
| 010000 - 10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
It also states:
| [...] number of unused bytes in a UTF-8 stream increased to 13 bytes:
| 0xC0, 0xC1, 0xF5-0xFF
Therefore, you have to find the first byte with a value of 0x80 or greater.
Either checking against ord():
1) if (ord($string[$i]) >= 0x80) ...
2) if (ord($string[$i]) & 0x80) ...
Or using a regular expression:
3) /[\x80-\xFF]/ (Get the offset when using preg_match)
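A minimal sketch of option 3, using the PREG_OFFSET_CAPTURE flag of preg_match() to get the position of the first high byte. The function name is hypothetical.

```php
<?php
// Returns the byte offset of the first byte >= 0x80, or -1 if there is none.
function firstHighByteOffset(string $s): int
{
    if (preg_match('/[\x80-\xFF]/', $s, $m, PREG_OFFSET_CAPTURE) === 1) {
        return $m[0][1]; // $m[0] is [matched text, byte offset]
    }
    return -1;
}

var_dump(firstHighByteOffset("plain"));            // int(-1)
var_dump(firstHighByteOffset("abc\xE2\x82\xAC"));  // int(3)
```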
Then check whether the byte may occur in UTF-8 encoded text. If it
doesn't match any in the list 0xC0, 0xC1, 0xF5-0xFF, it may occur. (You
might want to do this check before finding the first byte >=0x80, using a
regexp or repeated substr_count.)
If it may occur in a UTF-8 encoded string, this does not imply that it may
occur at *this* position. If ord($byte)&0xC0 (the two uppermost bits) is
0x80, it is a byte which has to be in the middle of a Unicode character
sequence. Therefore, if we find such a byte here, the string is not
valid UTF-8.
Otherwise, count the consecutive 1 bits at the top of the byte.
Subtract one. This is the number of bytes following in this UTF-8
character. Each of the following bytes has to validate: ($byte&0xC0)==0x80.
If so, this is a valid UTF-8 encoded character.
Find the next byte >=0x80 and continue checking until you either find an
invalid value (seems_utf8 -> false) or reach the end of the string
(seems_utf8 -> true).
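The steps above can be sketched as follows. The function name comes from the thread, but the body is only one possible reading of the description, not the function posted on www.php.net. Like the description, it does not reject overlong forms such as E0 80 80 or encoded surrogates.

```php
<?php
// A sketch of the procedure described above; the body is an assumption.
function seems_utf8(string $s): bool
{
    $len = strlen($s);
    $i = 0;
    while ($i < $len) {
        $b = ord($s[$i]);
        if ($b < 0x80) {                 // plain ASCII byte, nothing to check
            $i++;
            continue;
        }
        // Bytes that never occur anywhere in UTF-8.
        if ($b === 0xC0 || $b === 0xC1 || $b >= 0xF5) {
            return false;
        }
        // 10xxxxxx is a continuation byte and cannot start a character.
        if (($b & 0xC0) === 0x80) {
            return false;
        }
        // Number of leading 1 bits minus one = continuation bytes expected.
        if (($b & 0xE0) === 0xC0) {
            $follow = 1;                 // 110xxxxx
        } elseif (($b & 0xF0) === 0xE0) {
            $follow = 2;                 // 1110xxxx
        } else {
            $follow = 3;                 // 11110xxx, only 0xF0..0xF4 is left
        }
        for ($k = 1; $k <= $follow; $k++) {
            // A missing byte or a byte that is not 10xxxxxx is an error.
            if ($i + $k >= $len || (ord($s[$i + $k]) & 0xC0) !== 0x80) {
                return false;
            }
        }
        $i += $follow + 1;
    }
    return true;
}

var_dump(seems_utf8("this is plain ASCII")); // bool(true)
var_dump(seems_utf8("\xE2\x82\xAC"));        // bool(true), the Euro sign
var_dump(seems_utf8("xxxx\xFE"));            // bool(false)
```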
I have the impression that UTF-16 or 32 is a bad idea in a web context. [...]
As I explicitly mentioned:
| (Therefore, I prefer UTF-16 and/or UTF-32 over UTF-8... at least for local
| files, for transmission UTF-8 is just fine [...])
Is there a way to figure out how many bytes a character has and the
value of each of those bytes?
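As a rough illustration of what is being asked for: assuming the input is already known to be valid UTF-8, the lead byte alone tells you how many bytes the character occupies, and ord() gives the value of each byte. The function name is made up for this sketch.

```php
<?php
// Splits a string that is assumed to be valid UTF-8 into characters and
// returns the byte values of each character.
function utf8CharBytes(string $s): array
{
    $out = [];
    $len = strlen($s);
    for ($i = 0; $i < $len; ) {
        $b = ord($s[$i]);
        if ($b < 0x80) {
            $n = 1;                      // 0xxxxxxx
        } elseif (($b & 0xE0) === 0xC0) {
            $n = 2;                      // 110xxxxx
        } elseif (($b & 0xF0) === 0xE0) {
            $n = 3;                      // 1110xxxx
        } else {
            $n = 4;                      // 11110xxx
        }
        $out[] = array_map('ord', str_split(substr($s, $i, $n)));
        $i += $n;
    }
    return $out;
}

// "A" followed by the Euro sign: one 1-byte and one 3-byte character.
print_r(utf8CharBytes("A\xE2\x82\xAC"));
```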
Here's my attempt at a function to determine if something is /not/ UTF-8.
<?php
// Note: overlong forms and the lead bytes 0xC0, 0xC1 and 0xF5-0xF7 are
// not rejected by this check.
function invalidUTF8($str)
{
    $charSize = 0; // trail bytes still expected for the current character
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++)
    {
        $o = ord($str[$i]);
        if ($charSize == 0)
        { // must be a lead byte or a single byte character
            if ($o <= 127) // single byte character
                continue;
            elseif (($o & 0xe0) == 0xc0) // lead byte for 2 byte char
                $charSize = 1;
            elseif (($o & 0xf0) == 0xe0) // lead byte for 3 byte char
                $charSize = 2;
            elseif (($o & 0xf8) == 0xf0) // lead byte for 4 byte char
                $charSize = 3;
            else
            {
                trigger_error(
                    sprintf("Malformed lead byte %08b at position %d",
                            $o, $i)
                );
                return true;
            }
        }
        elseif (($o & 0xC0) == 0x80) // trail byte
        {
            $charSize--;
        }
        else
        {
            trigger_error(
                sprintf("Malformed trail byte %08b at position %d",
                        $o, $i)
            );
            return true;
        }
    }
    // a character cut off at the end of the string is invalid, too
    return $charSize != 0;
}
var_dump(invalidUTF8("this is plain ASCII"));
print "<hr>";
// UTF-8 encoding of the Euro currency symbol
var_dump(invalidUTF8(chr(226).chr(130).chr(172)));
print "<hr>";
// invalid UTF-8
var_dump(invalidUTF8("xxxx" . chr(254)));
?>
--
Andy Hassall / <an**@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool
http://www.google.com/search?q=seems...%3Awww.php.net
--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com
Ehh, there's, like, this thing called regular expressions :-)
function IsUTF8($s) {
$s = "$s ";
return !preg_match('/[\xF0-\xFF]/', $s) &&
!preg_match('/[\xC0-\xDF][^\x80-\xBF]/', $s) &&
!preg_match('/[\xE0-\xEF][^\x80-\xBF][^\x80-\xBF]/', $s);
}
How about:
E0 00 80?
E0 <END>?
C0 <END>?
80 81 82?
Invalid UTF-8, but your function would return true for them.
If you want a RegExp:
$isinvalidutf8 = preg_match('=([\xF0-\xFF]|'.
'[\xC0-\xDF]([^\x80-\xBF]|$)|'.
'[\xE0-\xEF][\x00-\xFF]($|[^\x80-\xBF])|'.
'[\xE0-\xEF]($|[^\x80-\xBF][\x00-\xFF])|'.
'(^|[^\x80-\xBF])[\x80-\xBF])=', $string);
(untested!)
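For comparison, here is a sketch of a stricter pattern that spells out every well-formed UTF-8 byte sequence, so that overlong forms and surrogates fail as well. This is a different approach from the two patterns in this thread (which reject every byte from 0xF0 upward and therefore all 4-byte characters), and the function name is hypothetical.

```php
<?php
// Hypothetical helper: true only if the whole string consists of
// well-formed UTF-8 sequences. Returns true for the empty string.
function matchesUtf8Grammar(string $s): bool
{
    $pattern = '/\A(?:
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # 2 bytes, no overlong forms
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3 bytes, no overlong forms
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3 bytes
        | \xED[\x80-\x9F][\x80-\xBF]         # 3 bytes, no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4 bytes, no overlong forms
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4 bytes
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4 bytes, up to U+10FFFF
    )*+\z/x';
    return preg_match($pattern, $s) === 1;
}

var_dump(matchesUtf8Grammar("\xE2\x82\xAC"));  // bool(true), the Euro sign
var_dump(matchesUtf8Grammar("\xE0\x00\x80"));  // bool(false)
var_dump(matchesUtf8Grammar("\x80\x81\x82"));  // bool(false)
```

A true result still only means "might be UTF-8", for the reasons given earlier in the thread.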
Btw.: The opposite of "All birds are able to fly." is "There is at least
one bird which can't fly.", *not* "No bird is able to fly."
Also, the opposite of "isInvalidUTF8" is "mightBeValidUTF8", not
"isValidUTF8". Therefore the name you chose for your function is wrong.
Thanks much for the code. I followed the other link to www.php.net
where someone had posted their seems_UTF8() function. Your function
and theirs combined should offer a high level of confidence about
whether something is UTF-8.