How to test whether text might be UTF-8?

lawrence

Someone on www.php.net suggested using a seems_utf8() method to test
text for UTF-8 character encoding but didn't specify how to write such
a method. Can anyone suggest a test that might work? Something that
maybe gives 90% confidence that a given block of text is or is not
UTF-8 encoded?

Jul 17 '05 #1


Simon Stienen

lawrence <lk******@geocities.com> wrote:

Someone on www.php.net suggested using a seems_utf8() method to test
text for UTF-8 character encoding but didn't specify how to write such
a method. Can anyone suggest a test that might work? Something that
maybe gives 90% confidence that a given block of text is or is not
UTF-8 encoded?

You may be able to determine that a given string is *not* UTF-8, but there is
no way to determine for certain that the string *is* UTF-8. Therefore,
"seems_utf8" is a good name for such a function.

How validation is done:
Take the string. If it contains no byte in the range 0x80 to 0xFF, it doesn't
matter whether you treat the text as UTF-8 or as any ISO 8859 encoding, since
the first 128 characters have the same bit sequence in all of these encodings.
However, if there actually *are* bytes with a value of 128 or higher, check
whether each such sequence would be a valid UTF-8 sequence (see UTF-8 in
Wikipedia for this). If every such sequence is valid UTF-8, the string itself
*might* be UTF-8. Of course, it could also be a sequence of extended
ASCII/ANSI characters. It's impossible to be sure about that.
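
The first step described above (no byte from 0x80 to 0xFF means the encoding question does not arise) could be sketched like this. The function name isPlainAscii() is made up for illustration and is not from php.net.

```php
<?php
// Hypothetical helper: true when no byte has its high bit set,
// i.e. the text reads the same as ASCII, ISO 8859-x or UTF-8.
function isPlainAscii(string $bytes): bool
{
    return preg_match('/[\x80-\xFF]/', $bytes) === 0;
}

var_dump(isPlainAscii("this is plain ASCII")); // bool(true)
var_dump(isPlainAscii("caf\xC3\xA9"));         // bool(false)
```

Only when this returns false is there any need to look at the individual byte sequences.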

HTH
Simon

(Therefore, I prefer UTF-16 and/or UTF-32 over UTF-8... at least for local
files. For transmission, UTF-8 is just fine because most characters won't
need an extra byte. In most UTF-16 encoded documents, you can be pretty
sure about the encoding due to the enormous percentage of bytes in the range
0x00 to 0x0F. In almost every text, at least 33% of the bytes fall in this
range, since every character in US-ASCII and Latin-1 has a preceding 0x00,
every character in Latin Extended-A and -B is preceded by 0x01 or 0x02, and
so on.

0x0000 to 0x0fff contains:
Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B, IPA
Extensions, Spacing Modifier Letters, Combining Diacritical Marks, Greek
and Coptic, Cyrillic, Cyrillic Supplement, Armenian, Hebrew, Arabic,
Syriac, Thaana, Devanagari and Bengali, Gurmukhi, Gujarati, Oriya, Tamil,
Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan
I guess this list should cover most documents.)
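
A rough sketch of that UTF-16 heuristic, assuming the text is available as a raw byte string. The helper name lowByteRatio() is hypothetical, and the 33% threshold is only the figure suggested above, not a tested constant. Note that tabs and line breaks (0x09, 0x0A, 0x0D) also fall in the counted range.

```php
<?php
// Hypothetical helper: fraction of bytes whose value is 0x00..0x0F.
function lowByteRatio(string $bytes): float
{
    $len = strlen($bytes);
    if ($len === 0) {
        return 0.0;
    }
    $count = preg_match_all('/[\x00-\x0F]/', $bytes);
    return (float) $count / $len;
}

$utf8    = "Hello";
$utf16be = "\x00H\x00e\x00l\x00l\x00o"; // "Hello" written out as UTF-16BE bytes

printf("%.2f\n", lowByteRatio($utf8));    // 0.00
printf("%.2f\n", lowByteRatio($utf16be)); // 0.50
```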
--
Simon Stienen <http://dangerouscat.net> <http://slashlife.de>
»What you do in this world is a matter of no consequence,
The question is, what can you make people believe that you have done.«
-- Sherlock Holmes in "A Study in Scarlet" by Sir Arthur Conan Doyle

Jul 17 '05 #2

lawrence

Simon Stienen <si***********@news.slashlife.de> wrote in message news:<1w****************@news.dangerouscat.net>...

You may be able to decide, that a given string is *not* UTF-8, but there is
no way to clearly decide that the string *is* UTF-8. Therefore,
"seems_utf8" is a good name for such a function.

This is very good information. Thanks. It certainly points the right
way. But how does one get the value of the characters? Using ord()???

Take the string. If there is no character 0x80 to 0xFF, it doesn't matter,
whether you define this text as UTF-8 or any ISO encoding, since the first
128 characters all have the same bit sequence in these encodings.
However, if there actually *are* characters with a value of 128 or higher,
check, whether the given sequence would be a valid UTF-8 sequence (see
UTF-8 in Wikipedia for this). If this and every other sequence is valid
UTF-8, the string itself *might* be UTF-8. Of course it could be a sequence
of extended ASCII/ANSI characters, too. It's impossible to be sure about
that.

So I take the string and step through it one byte at a time, perhaps
in a for() loop, and get the value of each byte using ord()?

The page for ord() says ord() "Return ASCII value of character" so if
a character is non-ASCII, perhaps it doesn't work? What PHP function
do I use to get the hex or dec value for a character?
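
For what it's worth, ord() works on a single byte and returns 0..255 whether or not that byte is ASCII, so a loop like the following minimal sketch prints the value of every byte. The variable $euro is just sample data: the three bytes of the UTF-8 encoded Euro sign.

```php
<?php
// The Euro sign encoded as UTF-8 occupies three bytes.
$euro = "\xE2\x82\xAC";

for ($i = 0; $i < strlen($euro); $i++) {
    $value = ord($euro[$i]);
    printf("byte %d: dec %d, hex %02X\n", $i, $value, $value);
}
// byte 0: dec 226, hex E2
// byte 1: dec 130, hex 82
// byte 2: dec 172, hex AC
```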

(Therefore, I prefer UTF-16 and/or UTF-32 over UTF-8... at least for local
files, for transmission UTF-8 is just fine because most characters won't
have an extra byte. In most UTF-16 encoded documents, you can be pretty
sure about the encoding due to the enormous percentage of 0x00 to 0x0F. In
almost every text you get a percentage of at least 33% of these characters,
since every character in US-ASCII and Latin 1 has a preceding 0x00, every
character in Latin Extended A and B is preceded by 0x01, and so on.

I have the impression that UTF-16 or 32 is a bad idea in a web
context. Some good reasons were posted here:

http://groups.google.com/groups?hl=e...40news.free.fr

The whole thread was informative.

Jul 17 '05 #3

Simon Stienen

lawrence <lk******@geocities.com> wrote:


This is very good information. Thanks. It certainly points the right
way. But how does one get the value of the characters? Using ord()???

ord() will give you the value of the given single byte character, that is
0..255. In UTF-8, every character which has a higher value than 127 (0x7f)
is represented using at least two bytes:

<http://en.wikipedia.org/wiki/Utf-8>
| Code range (hex) | UTF-8 (binary)
| 000000 - 00007F | 0xxxxxxx
| 000080 - 0007FF | 110xxxxx 10xxxxxx
| 000800 - 00FFFF | 1110xxxx 10xxxxxx 10xxxxxx
| 010000 - 10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

It also states:
| [...] number of unused bytes in a UTF-8 stream increased to 13 bytes:
| 0xC0, 0xC1, 0xF5-0xFF
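
Read as code, the table and the list of unused bytes might translate into something like the sketch below. The function name utf8SequenceLength() is invented for this example; it only looks at the lead byte and says nothing about whether the bytes that follow are valid.

```php
<?php
// Hypothetical helper: total length in bytes of the sequence that $lead
// would start, or 0 if $lead cannot start a sequence.
function utf8SequenceLength(int $lead): int
{
    if ($lead <= 0x7F) {
        return 1; // 0xxxxxxx
    }
    if ($lead >= 0xC2 && $lead <= 0xDF) {
        return 2; // 110xxxxx, without the unused 0xC0 and 0xC1
    }
    if ($lead >= 0xE0 && $lead <= 0xEF) {
        return 3; // 1110xxxx
    }
    if ($lead >= 0xF0 && $lead <= 0xF4) {
        return 4; // 11110xxx, without the unused 0xF5-0xF7
    }
    return 0;     // continuation byte (10xxxxxx) or unused value
}

var_dump(utf8SequenceLength(0xE2)); // int(3), lead byte of the Euro sign
var_dump(utf8SequenceLength(0x80)); // int(0), cannot start a sequence
```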

Therefore, you have to find the first byte with a value of 0x80 or greater.
Either checking against ord():
1) if (ord($string[$i])>=0x80) ...
2) if (ord($string[$i])&0x80) ...
Or using a regular expression:
3) /[\x80-\xFF]/ (Get the offset when using preg_match)
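
For option 3, getting the offset could look roughly like this, using the PREG_OFFSET_CAPTURE flag of preg_match(). The sample string is made up for illustration.

```php
<?php
$string = "abc\xE2\x82\xACdef";

if (preg_match('/[\x80-\xFF]/', $string, $m, PREG_OFFSET_CAPTURE)) {
    $offset = $m[0][1]; // byte offset of the first byte >= 0x80
    printf("first byte >= 0x80 at offset %d (0x%02X)\n", $offset, ord($string[$offset]));
} else {
    echo "no byte >= 0x80, plain ASCII\n";
}
// first byte >= 0x80 at offset 3 (0xE2)
```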

Then check whether the byte may occur in UTF-8 encoded text. If it
doesn't match any value in the list 0xC0, 0xC1, 0xF5-0xFF, it may occur. (You
might want to do this check before finding the first byte >=0x80, using a
regexp or repeated substr_count.)
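
As a regexp, that pre-check might be written as follows. The helper name hasForbiddenUtf8Byte() is hypothetical.

```php
<?php
// Hypothetical helper: true if any of the 13 byte values that never
// occur in UTF-8 (0xC0, 0xC1, 0xF5-0xFF) is present.
function hasForbiddenUtf8Byte(string $bytes): bool
{
    return preg_match('/[\xC0\xC1\xF5-\xFF]/', $bytes) === 1;
}

var_dump(hasForbiddenUtf8Byte("caf\xC3\xA9")); // bool(false)
var_dump(hasForbiddenUtf8Byte("xxxx\xFE"));    // bool(true)
```

A false result here only rules out those 13 byte values; the remaining bytes still have to appear in a valid order.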

If it may occur in a UTF-8 encoded string, this does not imply that it may
occur at *this* position. If (ord($byte) & 0xC0) (the two uppermost bits) is
0x80, it is a continuation byte, which can only appear after the lead byte
of a multi-byte character sequence. Therefore, if we find such a byte here,
the string is not valid UTF-8.
Otherwise, count how many of the most significant bits are set before the
first zero bit. Subtract one. This is the number of bytes following in this
UTF-8 character. Each of the following bytes has to validate:
(ord($byte) & 0xC0) == 0x80. (The parentheses matter, because == binds more
tightly than & in PHP.) If so, this is a well-formed UTF-8 character.

Find the next byte >=0x80 and continue checking until you either find an
invalid value (seems_utf8 -> false) or reach the end of the string
(seems_utf8 -> true).
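
Put together, the byte-walking check could be sketched as below. This is one possible reading of the steps, not the seems_utf8() from php.net. It treats a byte as a continuation byte when (byte & 0xC0) == 0x80, and it does not try to catch every overlong form or surrogate code point, so "true" still only means "no invalid sequence was found".

```php
<?php
function seems_utf8(string $str): bool
{
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($str[$i]);
        if ($byte < 0x80) {
            continue;                 // plain ASCII
        }
        if (($byte & 0xE0) === 0xC0) {
            $following = 1;           // 110xxxxx
        } elseif (($byte & 0xF0) === 0xE0) {
            $following = 2;           // 1110xxxx
        } elseif (($byte & 0xF8) === 0xF0) {
            $following = 3;           // 11110xxx
        } else {
            return false;             // stray continuation byte or 0xF8-0xFF
        }
        if ($byte === 0xC0 || $byte === 0xC1 || $byte >= 0xF5) {
            return false;             // byte values unused in UTF-8
        }
        for ($n = 0; $n < $following; $n++) {
            $i++;
            if ($i >= $len || (ord($str[$i]) & 0xC0) !== 0x80) {
                return false;         // missing or malformed continuation byte
            }
        }
    }
    return true;
}

var_dump(seems_utf8("this is plain ASCII")); // bool(true)
var_dump(seems_utf8("\xE2\x82\xAC"));        // bool(true)
var_dump(seems_utf8("xxxx\xFE"));            // bool(false)
```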

I have the impression that UTF-16 or 32 is a bad idea in a web
context. [...]

As I explicitly mentioned:
| (Therefore, I prefer UTF-16 and/or UTF-32 over UTF-8... at least for local
| files, for transmission UTF-8 is just fine [...])

Jul 17 '05 #4

lawrence

Simon Stienen <si***********@news.slashlife.de> wrote in message news:<1w****************@news.dangerouscat.net>...

How validation is done:
Take the string. If there is no character 0x80 to 0xFF, it doesn't matter,
whether you define this text as UTF-8 or any ISO encoding, since the first
128 characters all have the same bit sequence in these encodings.
However, if there actually *are* characters with a value of 128 or higher,
check, whether the given sequence would be a valid UTF-8 sequence (see
UTF-8 in Wikipedia for this). If this and every other sequence is valid
UTF-8, the string itself *might* be UTF-8. Of course it could be a sequence
of extended ASCII/ANSI characters, too. It's impossible to be sure about
that.

is there a way to figure out how many bytes a character has and the
value of each of those bytes?

Jul 17 '05 #5

Andy Hassall

On 1 Oct 2004 01:12:35 -0700, lk******@geocities.com (lawrence) wrote:


is there a way to figure out how many bytes a character has and the
value of each of those bytes?

Here's my attempt at a function to determine if something is /not/ UTF-8.

<?php
function invalidUTF8($str)
{
    $charSize = 0; // number of trail bytes still expected
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++)
    {
        $o = ord($str[$i]);
        if ($charSize == 0)
        { // must be a lead byte or a single byte character
            if ($o <= 127) // single byte character
                continue;
            elseif (($o & 0xe0) == 0xc0) // lead byte for 2 byte char
                $charSize = 1;
            elseif (($o & 0xf0) == 0xe0) // lead byte for 3 byte char
                $charSize = 2;
            elseif (($o & 0xf8) == 0xf0) // lead byte for 4 byte char
                $charSize = 3;
            else
            {
                trigger_error(
                    sprintf("Malformed lead byte %08b at position %d",
                            $o, $i)
                );
                return true;
            }
        }
        elseif (($o & 0xC0) == 0x80) // trail byte
        {
            $charSize--;
        }
        else
        {
            trigger_error(
                sprintf("Malformed trail byte %08b at position %d",
                        $o, $i)
            );
            return true;
        }
    }
    // a character cut off at the end of the string is invalid too
    return $charSize != 0;
}

var_dump(invalidUTF8("this is plain ASCII"));
print "<hr>";

// UTF-8 encoding of the Euro currency symbol
var_dump(invalidUTF8(chr(226).chr(130).chr(172)));
print "<hr>";

// invalid UTF-8
var_dump(invalidUTF8("xxxx" . chr(254)));
?>

--
Andy Hassall / <an**@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool

Jul 17 '05 #6

R. Rajesh Jeba Anbiah

Andy Hassall <an**@andyh.co.uk> wrote in message news:<6l********************************@4ax.com>. ..
<snip>

Here's my attempt at a function to determine if something is /not/ UTF-8.


http://www.google.com/search?q=seems...%3Awww.php.net

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com

Jul 17 '05 #7

Chung Leong

"Andy Hassall" <an**@andyh.co.uk> wrote in message
news:6l********************************@4ax.com...


Here's my attempt at a function to determine if something is /not/ UTF-8.

<?php
function invalidUTF8($str)
{
...
}

Ehh, there's, like, this thing called regular expressions :-)

function IsUTF8($s) {
    $s = "$s ";
    return !preg_match('/[\xF0-\xFF]/', $s) &&
        !preg_match('/[\xC0-\xDF][^\x80-\xBF]/', $s) &&
        !preg_match('/[\xE0-\xEF][^\x80-\xBF][^\x80-\xBF]/', $s);
}

Jul 17 '05 #8

Simon Stienen

Chung Leong <ch***********@hotmail.com> wrote:

Ehh, there's, like, this thing call regular expression :-)

function IsUTF8($s) {
    $s = "$s ";
    return !preg_match('/[\xF0-\xFF]/', $s) &&
        !preg_match('/[\xC0-\xDF][^\x80-\xBF]/', $s) &&
        !preg_match('/[\xE0-\xEF][^\x80-\xBF][^\x80-\xBF]/', $s);
}

How about:
E0 00 80?
E0 <END>?
C0 80?
80 81 82?
Invalid UTF-8, but your function would return true for them.

If you want a RegExp:
$isinvalidutf8 = preg_match('=([\xF0-\xFF]|'.
'[\xC0-\xDF]([^\x80-\xBF]|$)|'.
'[\xE0-\xEF][\x00-\xFF]($|[^\x80-\xBF])|'.
'[\xE0-\xEF]($|[^\x80-\xBF][\x00-\xFF])|'.
'(^|[^\x80-\xBF])[\x80-\xBF])=', $string);
(untested!)
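
As an alternative to a hand-written pattern, PCRE itself can do the validation: preg_match() with the /u modifier returns false when the subject is not valid UTF-8. The sketch below runs a few invalid sequences of the kind listed above through it; the sample labels and the $results array are made up for the example. If the mbstring extension is available, mb_check_encoding($bytes, 'UTF-8') should serve the same purpose.

```php
<?php
$samples = [
    'Euro sign' => "\xE2\x82\xAC",
    'E0 00 80'  => "\xE0\x00\x80",
    'E0 <END>'  => "\xE0",
    'C0 80'     => "\xC0\x80",
    '80 81 82'  => "\x80\x81\x82",
];

$results = [];
foreach ($samples as $label => $bytes) {
    // An empty pattern always matches, so anything other than 1
    // means PCRE rejected the subject as invalid UTF-8.
    $results[$label] = preg_match('//u', $bytes) === 1;
    printf("%-10s %s\n", $label, $results[$label] ? 'might be UTF-8' : 'not UTF-8');
}
```

Only the Euro sign should come out as "might be UTF-8" here.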

Btw.: The opposite of "All birds are able to fly." is "There is at least
one bird which can't fly.", *not* "No bird is able to fly."
Also, the opposite of "isInvalidUTF8" is "mightBeValidUTF8", not
"isValidUTF8". Therefore the name you chose for your function is wrong.

Jul 17 '05 #9

lawrence

Andy Hassall <an**@andyh.co.uk> wrote in message news:<6l********************************@4ax.com>. ..


Here's my attempt at a function to determine if something is /not/ UTF-8.

Thanks much for the code. I followed the other link to www.php.net
where someone had posted their seems_UTF8() function. Your function
and theirs combined should offer a high level of confidence about
whether something is UTF-8.

Jul 17 '05 #10
