checking to see if a character is UTF8


This is a function that someone has posted on www.php.net:
function seemsUTF8($Str) {
    // bmorel at ssi dot fr
    // 17-Feb-2004 01:22
    // Here is an improved version of that function, compatible with 31-bit
    // encoding scheme of Unicode 3.x:
    for ($i = 0; $i < strlen($Str); $i++) {
        if (ord($Str[$i]) < 0x80) continue;              # 0bbbbbbb
        elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n = 1;  # 110bbbbb
        elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n = 2;  # 1110bbbb
        elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n = 3;  # 11110bbb
        elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n = 4;  # 111110bb
        elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n = 5;  # 1111110b
        else return false;                               # Does not match any model
        for ($j = 0; $j < $n; $j++) {
            # n bytes matching 10bbbbbb follow ?
            if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80))
                return false;
        }
    }
    return true;
}

What is achieved by the variable $n? I don't know enough about
character codes to understand what that final inner for loop is trying
to do.

Nov 22 '05 #1
lk******@geocities.com wrote:

: this is a function that someone has up on www.php.net:
: function seemsUTF8($Str) {
: // bmorel at ssi dot fr
: //17-Feb-2004 01:22
: // Here is an improved version of that function, compatible with 31-bit
: // encoding scheme of Unicode 3.x:
: for ($i=0; $i < strlen($Str); $i++) {
: if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
: elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
: elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
: elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
: elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
: elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
: else return false; # Does not match any model
: for ($j=0; $j < $n; $j++) {
: # n bytes matching 10bbbbbb follow ?
: if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80))
: return false;
: }
: }
: return true;
: }

: What is achieved by the variable $n? I don't know enough about
: character codes to understand what that final inner for loop is trying
: to do.

A UTF-8 character can take more than one byte. Characters whose numeric
value is larger than 127 require more than one byte. The first byte of a
multibyte character indicates how many bytes are in the character.

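In the scheme this function checks, a multibyte character has from two to
six bytes in total (the first byte followed by 1 to 5 more bytes). Current
UTF-8 (RFC 3629) stops at four bytes, so the 5- and 6-byte patterns are no
longer valid.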
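
As a rough sketch of how the first byte encodes that count: the helper name continuationCount is made up for illustration and is not part of the php.net function, and the nullable return type assumes PHP 7.1 or later. It uses the same bit masks as the function quoted above.

```php
<?php
// Hypothetical helper (not from php.net): given the value of a lead byte,
// return how many continuation bytes should follow it, or null if the byte
// cannot start a character under the scheme described above.
function continuationCount(int $byte): ?int
{
    if ($byte < 0x80)            return 0; // 0bbbbbbb: single-byte character
    if (($byte & 0xE0) === 0xC0) return 1; // 110bbbbb
    if (($byte & 0xF0) === 0xE0) return 2; // 1110bbbb
    if (($byte & 0xF8) === 0xF0) return 3; // 11110bbb
    if (($byte & 0xFC) === 0xF8) return 4; // 111110bb
    if (($byte & 0xFE) === 0xFC) return 5; // 1111110b
    return null;                           // 10bbbbbb, 0xFE or 0xFF
}

var_dump(continuationCount(ord('A'))); // int(0)
var_dump(continuationCount(0xC3));     // int(1), first byte of "é" (C3 A9)
var_dump(continuationCount(0xE2));     // int(2), first byte of "€" (E2 82 AC)
var_dump(continuationCount(0xA9));     // NULL, this is a continuation byte
```

The value returned here plays the role of $n in the original function.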

The outer loop is looking for the first byte of a multibyte character.
When it finds one, it examines the bit pattern to see how many more bytes
should follow and stores that count in $n.

The inner loop examines those additional bytes. It checks that $n
continuation bytes, each matching 10bbbbbb, follow the first byte, and
that the string does not end before all of them have been seen.
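
Here is a small sketch of what that inner loop does, pulled out on its own. The helper name hasContinuations and its parameters are hypothetical and chosen only for this example; the scalar type declarations assume PHP 7.0 or later.

```php
<?php
// Hypothetical helper mirroring the inner loop: starting just after the lead
// byte at offset $i, check that the next $n bytes all look like 10bbbbbb.
function hasContinuations(string $str, int $i, int $n): bool
{
    for ($j = 0; $j < $n; $j++) {
        $i++;
        if ($i >= strlen($str)) {
            return false; // string ended before all $n bytes were seen
        }
        if ((ord($str[$i]) & 0xC0) !== 0x80) {
            return false; // top two bits are not 10, so not a continuation byte
        }
    }
    return true;
}

var_dump(hasContinuations("\xC3\xA9", 0, 1));    // bool(true): complete "é"
var_dump(hasContinuations("\xC3", 0, 1));        // bool(false): truncated
var_dump(hasContinuations("\xC3" . "A", 0, 1));  // bool(false): "A" is 0x41
```

The original function does the same thing inline, and it advances $i itself so the outer loop resumes at the byte after the character.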

The outer loop skips over bytes that represent single-byte characters
(values below 0x80).

--

This programmer available for rent.
Nov 22 '05 #2
