Replace special characters by non-special characters

Pikkel

I'm looking for a way to replace special characters with characters
without accents, cedillas, etc.

Jul 17 '05 #1

Michael Fesser

.oO(Pikkel)

I'm looking for a way to replace special characters with characters
without accents, cedillas, etc.

Maybe strtr()?

Micha

Jul 17 '05 #2

CJ Llewellyn

"Pikkel" <pi****@de.wop> wrote in message
news:41***********************@news.xs4all.nl...

I'm looking for a way to replace special characters with characters
without accents, cedillas, etc.

http://uk.php.net/manual/en/function...ecialchars.php

Jul 17 '05 #3

Pikkel

CJ Llewellyn wrote:

"Pikkel" <pi****@de.wop> wrote in message
news:41***********************@news.xs4all.nl...
I'm looking for a way to replace special characters with characters
without accents, cedillas, etc.

http://uk.php.net/manual/en/function...ecialchars.php

Thanks for your tip, but I'm not looking for HTML entity replacement.
I'm looking for character replacement: á --> a

Jul 17 '05 #4

Pikkel

Michael Fesser wrote:

.oO(Pikkel)

I'm looking for a way to replace special characters with characters
without accents, cedillas, etc.

Maybe strtr()?

Micha

With this function I would have to list every replacement myself.
I was looking for a complete strip function (accents, cedillas, umlauts, etc.).

Jul 17 '05 #5

Andy Hassall

On Fri, 05 Nov 2004 22:08:03 +0100, Pikkel <pi****@de.wop> wrote:

I'm looking for a way to replace special characters with characters
without accents, cedillas, etc.

In what character set encoding? If it's a small one, e.g. iso-8859-15, just
list all the accented/non-accented pairs and run it through strtr.

If it's a Unicode variant, it's a bit more of a challenge...
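
For the Unicode case, a minimal sketch, assuming the text is UTF-8: the two-string form of strtr() works byte by byte and would tear multi-byte characters apart, but the array form replaces whole byte sequences. The function name and the deliberately short map below are made up for illustration; a real list would be much longer.

```php
<?php
// Hypothetical helper: the name and the (deliberately short) map are assumptions.
// Keys are written as UTF-8 byte sequences so the example does not depend on
// the encoding this source file is saved in.
function strip_accents_utf8(string $text): string
{
    $map = [
        "\xC3\xA1" => 'a',  // a with acute
        "\xC3\xA9" => 'e',  // e with acute
        "\xC3\xA7" => 'c',  // c with cedilla
        "\xC3\xBC" => 'u',  // u with umlaut
        "\xC3\x9F" => 'ss', // sharp s: one character may map to two
    ];
    // The array form of strtr() matches whole keys, longest first.
    return strtr($text, $map);
}

echo strip_accents_utf8("gar\xC3\xA7on, caf\xC3\xA9, f\xC3\xBCr"), "\n"; // garcon, cafe, fur
```

Characters that are not in the map pass through unchanged, so the result is only as complete as the list.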

--
Andy Hassall / <an**@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool

Jul 17 '05 #6

lawrence

Andy Hassall <an**@andyh.co.uk> wrote in message news:<tc********************************@4ax.com>. ..

On Fri, 05 Nov 2004 22:08:03 +0100, Pikkel <pi****@de.wop> wrote:
I'm looking for a way to replace special characters with characters
without accents, cedillas, etc.

In what character set encoding? If it's a small one, e.g. iso-8859-15, just
list all the accented/non-accented pairs and run it through strtr.

If it's a Unicode variant, it's a bit more of a challenge...

I'm possibly beating this subject to death, but I've yet to think of
an answer to the problem. If a user copies text from an ISO-8859-15
page, pastes it into the textarea of a form and submits it to a CMS
which then sends it out as UTF-8, one gets garbage characters, as one
can see on this page:

http://www.krubner.com/index.php?pageId=33396

So I'm wondering if there is a way to cycle through the text and find
quote marks and such that are unique to ISO-8859-15?

Jul 17 '05 #7

Andy Hassall

On 6 Nov 2004 01:19:52 -0800, lk******@geocities.com (lawrence) wrote:

I'm possibly beating this subject to death, but I've yet to think of
an answer to the problem. If a user copies text from an ISO-8859-15
page, pastes it into the textarea of a form and submits it to a CMS
which then sends it out as UTF-8, one gets garbage characters, as one
can see on this page:

http://www.krubner.com/index.php?pageId=33396

There's probably a bit more to it than that, such as the encoding of the page
containing the form in the first place. If you just dump out ISO-8859-15
encoded data and pretend it's UTF-8, of course it won't work, except for the
shared ASCII (top bit not set, i.e. <= 127) representations between the two
encodings. I can't remember quite where you got to from the previous threads on
this subject though.

So I'm wondering if there is a way to cycle through the text and find
quote marks and such that are unique to ISO-8859-15?

If it's between ISO-8859-15 and UTF-8, there are no characters unique to
ISO-8859-15, since UTF-8 encodes all those characters and more. Every character
above 127 in ISO-8859-15 is encoded differently in UTF-8, but that's a
different question. The Euro, for example, is the same character in both, but
is the single byte 0xA4 in ISO-8859-15 and the three bytes 0xE2 0x82 0xAC in
UTF-8.

But anyway, it seems to me that the simple approach is just:

(1) Present the form in UTF-8 in the first place.
(2) The user copies content from one site, in whatever encoding. Their browser
places it on the clipboard in some OS-native encoding which is hopefully
irrelevant.
(3) The user pastes it into the UTF-8 form. The browser converts the characters
into the appropriate encoding.
(4) Post the data; since the source form is UTF-8, the data is sent in UTF-8,
and you're done.
(5) You can then just reject anything that comes in as malformed UTF-8 from the
previous step.
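
Step (5) could look something like the sketch below. It assumes the mbstring extension is available; is_valid_utf8 is a made-up helper name, not something from this thread.

```php
<?php
// Hypothetical helper; assumes the mbstring extension is loaded.
function is_valid_utf8(string $bytes): bool
{
    // True only if the whole string is well-formed UTF-8.
    return mb_check_encoding($bytes, 'UTF-8');
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // e-acute as two UTF-8 bytes: bool(true)
var_dump(is_valid_utf8("caf\xE9"));     // e-acute as a lone ISO-8859-1 byte: bool(false)
```

In the form handler one would call it on $_POST['text'] and refuse the submission when it returns false.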

Consider:

Two scripts, one to output iso-8859-15 and the other Codepage 1252 (with the
dread Smart Quotes and all):

<?php header('Content-Type: text/html; charset=ISO-8859-15'); ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Characters to copy</title>
</head>
<body>
<pre>
<?php
// Print the printable ISO-8859-15 characters, 16 per line.
$n = 0;
for ($i = 32; $i <= 255; $i++) // <= so that byte 255 is included
{
    // 127-159 are control characters in the ISO-8859 encodings.
    if ($i >= 127 && $i <= 159)
        continue;

    print htmlspecialchars(chr($i), ENT_COMPAT, 'ISO-8859-15');
    if ($n++ % 16 == 15) print "\n";
}
?>
</pre>
</body>
</html>

<?php header('Content-Type: text/html; charset=Windows-1252'); ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Characters to copy</title>
</head>
<body>
<pre>
<?php
// Same listing for Codepage 1252. Bytes 128-159 are kept here because most
// of them are printable (smart quotes, dashes, the Euro).
$n = 0;
for ($i = 32; $i <= 255; $i++)
{
    print htmlspecialchars(chr($i), ENT_COMPAT, 'cp1252');
    if ($n++ % 16 == 15) print "\n";
}
?>
</pre>
</body>
</html>

Then utf8form.php, put text in, print back out encoded as utf-8:

<?php header('Content-Type: text/html; charset=utf-8'); ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Outputting</title>
</head>
<body>
<pre>
<?php
// Echo the submitted text back, escaped for HTML and treated as UTF-8.
if (isset($_POST['text']))
{
    print htmlspecialchars($_POST['text'], ENT_COMPAT, 'UTF-8');
}
?>
</pre>

<form method="post" action="utf8form.php" accept-charset="utf-8">
<textarea name="text"></textarea>
<input type="submit">
</form>

</body>
</html>

In Firefox and IE6 this appears to work for me: I copied all of the output
from the first two pages (iso-8859-15 and Codepage 1252), pasted it into
the form on the third page and submitted it. The output is the same set of
characters, but UTF-8 encoded.

It also worked from other character set encodings: I found a page encoded in
Shift-JIS and repeated the steps. The output looked the same to me (although I
can't read Japanese).

OK - that's the purist approach, when all the tools in the chain are
apparently handling encodings properly.

But are you after some more pragmatic approach, something like:

"The data my users send is probably iso8859-1, iso8859-15, codepage 1252, or
maybe utf-8, but it's likely been copied and mangled between applications so I
can't reliably tell which. How do I clean this data up in a reasonable way so
it can be converted to UTF8 for presentation on a UTF8 encoded page?"

If all the data has values <=127 then it's easy - that's all plain ASCII which
is a common subset of all four character sets.

You can at least rule out UTF-8 by using the functions posted in previous
threads that look for malformed UTF-8: data that fails that check is not UTF-8.
If the data does validate and contains bytes >127, the odds of it really being
UTF-8 increase the more bytes above 127 there are, but it's still not certain.

Suppose the check fails, so you've narrowed it down to one of the three
single-byte character sets.

Then the major differences are:

Codepage 1252 has printable characters in the range 128-159 (with a few
gaps) whereas the iso8859 encodings only have non-printable characters there. So
if there's data in this range, odds are it's Codepage 1252 - so you can convert
it to UTF-8 from there.
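
The heuristic described so far might be sketched like this. It assumes the mbstring extension; to_utf8_guess is a hypothetical name, and the final fallback to ISO-8859-1 rather than ISO-8859-15 is an arbitrary choice made for the example. It is a guess, not a reliable detector.

```php
<?php
// Hypothetical helper; assumes mbstring is loaded. A heuristic, not a detector.
function to_utf8_guess(string $bytes): string
{
    // Pure ASCII is identical in all four encodings.
    if (!preg_match('/[\x80-\xFF]/', $bytes)) {
        return $bytes;
    }
    // Well-formed UTF-8 is most likely already UTF-8.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    // Bytes 128-159 are printable only in Codepage 1252.
    if (preg_match('/[\x80-\x9F]/', $bytes)) {
        return mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
    }
    // Otherwise fall back to ISO-8859-1 (an assumption; it could be -15).
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

echo bin2hex(to_utf8_guess("\x93hi\x94")), "\n"; // e2809c6869e2809d
```

The order of the checks matters: the UTF-8 test has to come before the 128-159 test, because UTF-8 continuation bytes also fall in that range.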

This range holds the angled "smart" quotes, and the em-dash, which are the
characters that cause the most trouble. So alternatively, you could convert
them to plain quotes and dashes if you wanted.
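
A small sketch of that alternative, assuming the data has already been identified as Codepage 1252. The function name is invented for the example, and mapping both dashes to a single hyphen is just one possible choice.

```php
<?php
// Hypothetical helper. Assumes $bytes is Codepage 1252 (a single-byte encoding).
function plain_punctuation_cp1252(string $bytes): string
{
    return strtr($bytes, [
        "\x91" => "'", // left single quote
        "\x92" => "'", // right single quote / apostrophe
        "\x93" => '"', // left double quote
        "\x94" => '"', // right double quote
        "\x96" => '-', // en dash
        "\x97" => '-', // em dash
    ]);
}

echo plain_punctuation_cp1252("\x93Hi\x94 \x97 it\x92s"), "\n"; // "Hi" - it's
```

This should not be run on UTF-8 data: the same byte values occur there as continuation bytes of multi-byte characters, and replacing them would corrupt the text.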

If there are no characters in that range, then you haven't ruled out 1252, but
the rest of the encoding is pretty similar between 1252, iso8859-1 and
iso8859-15.

See http://en.wikipedia.org/wiki/ISO_8859-15 for the differences between -1
and -15. The character most worth worrying about is the Euro: it is byte 164
(0xA4) in -15, where -1 has the generic currency sign, and it is somewhere else
again in 1252, at byte 128 (0x80), i.e. in the 128-159 range.
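
To see the Euro difference concretely, the following sketch converts the relevant byte from each of the three encodings to UTF-8 and prints the result in hex. It assumes the mbstring extension is loaded.

```php
<?php
// Assumes the mbstring extension; prints the UTF-8 bytes each conversion yields.
$cases = [
    ['ISO-8859-1',   "\xA4"], // generic currency sign
    ['ISO-8859-15',  "\xA4"], // Euro sign
    ['Windows-1252', "\x80"], // Euro sign
];
foreach ($cases as [$charset, $byte]) {
    $utf8 = mb_convert_encoding($byte, 'UTF-8', $charset);
    printf("%-12s 0x%s -> %s\n", $charset, strtoupper(bin2hex($byte)), strtoupper(bin2hex($utf8)));
}
// ISO-8859-1   0xA4 -> C2A4
// ISO-8859-15  0xA4 -> E282AC
// Windows-1252 0x80 -> E282AC
```

The same input byte 0xA4 ends up as two different characters depending on which source encoding is declared, which is why guessing wrong between -1 and -15 turns a Euro into a currency sign.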

Is this any help?

Jul 17 '05 #8

Pikkel

Andy Hassall wrote:

Is this any help?

It's useful information and I'll remember it. Thank you.
But it doesn't answer my question of whether there is a function that
converts characters with accents, umlauts and so on to characters without.

Jul 17 '05 #9

Andy Hassall

On Sat, 06 Nov 2004 22:54:00 +0100, Pikkel <pi****@de.wop> wrote:

It's useful information and I'll remember it. Thank you.
But it doesn't answer my question of whether there is a function that
converts characters with accents, umlauts and so on to characters without.

True, it's drifted a bit to answer lawrence's questions.

As far as your question goes - no, there isn't a built in function, you'd have
to write one. In order to do so, you have to be a lot more specific about the
character encodings you're using, which characters you want to convert to what,
and exactly what "and so on" means in your last sentence.

Jul 17 '05 #10

JAS

Andy Hassall wrote:

On Sat, 06 Nov 2004 22:54:00 +0100, Pikkel <pi****@de.wop> wrote:

It's useful information and I'll remember it. Thank you.
But it doesn't answer my question of whether there is a function that
converts characters with accents, umlauts and so on to characters without.

True, it's drifted a bit to answer lawrence's questions.

As far as your question goes - no, there isn't a built in function, you'd have
to write one. In order to do so, you have to be a lot more specific about the
character encodings you're using, which characters you want to convert to what,
and exactly what "and so on" means in your last sentence.

I saw a few examples of how to do just this in the user comments on the
PHP site. I'm not quite sure where, but I'd bet it's on the str_replace
page or something like that.

Jul 17 '05 #11

Pikkel

Andy Hassall wrote:

On Sat, 06 Nov 2004 22:54:00 +0100, Pikkel <pi****@de.wop> wrote:

It's useful information and I'll remember it. Thank you.
But it doesn't answer my question of whether there is a function that
converts characters with accents, umlauts and so on to characters without.

True, it's drifted a bit to answer lawrence's questions.

As far as your question goes - no, there isn't a built in function, you'd have
to write one. In order to do so, you have to be a lot more specific about the
character encodings you're using, which characters you want to convert to what,
and exactly what "and so on" means in your last sentence.

The pages themselves use ISO-8859-1, but I can't be sure what encoding
the users' input is in. This input will be used to name and create
pages, menus, pictures and so on.

Jul 17 '05 #12

Andy Hassall

On Sun, 07 Nov 2004 11:36:31 +0100, Pikkel <pi****@de.wop> wrote:

Andy Hassall wrote:

On Sat, 06 Nov 2004 22:54:00 +0100, Pikkel <pi****@de.wop> wrote:

It's useful information and I'll remember it. Thank you.
But it doesn't answer my question of whether there is a function that
converts characters with accents, umlauts and so on to characters without.

True, it's drifted a bit to answer lawrence's questions.

As far as your question goes - no, there isn't a built in function, you'd have
to write one. In order to do so, you have to be a lot more specific about the
character encodings you're using, which characters you want to convert to what,
and exactly what "and so on" means in your last sentence.

The pages themselves use ISO-8859-1, but I can't be sure what encoding
the users' input is in. This input will be used to name and create
pages, menus, pictures and so on.

Right, well strtr()'s already been pointed out a couple of days ago by Michael
Fesser in this thread, so just write an array of characters you want replaced
and run it through that - ISO-8859-1 isn't big, so you can just spend a couple
of minutes writing out a list of accented characters and what you want them
transformed into.

Looking at the manual page for the function, there's an example of a function
to do this already in the user notes. http://uk.php.net/strtr
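
A rough sketch of that approach, assuming the input really is ISO-8859-1. The function name is hypothetical and the list covers only the common accented Latin letters; this is not the function from the manual's user notes, just an illustration of the two-string form of strtr(), which maps the n-th byte of the second argument to the n-th byte of the third.

```php
<?php
// Hypothetical helper. Assumes $bytes is ISO-8859-1, where every character is
// exactly one byte, so the byte-for-byte form of strtr() is safe.
function strip_accents_latin1(string $bytes): string
{
    // Accented capitals 0xC0-0xDD, then accented lowercase 0xE0-0xFF.
    $from = "\xC0\xC1\xC2\xC3\xC4\xC5\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF"
          . "\xD1\xD2\xD3\xD4\xD5\xD6\xD9\xDA\xDB\xDC\xDD"
          . "\xE0\xE1\xE2\xE3\xE4\xE5\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF"
          . "\xF1\xF2\xF3\xF4\xF5\xF6\xF9\xFA\xFB\xFC\xFD\xFF";
    $to   = "AAAAAACEEEEIIII"
          . "NOOOOOUUUUY"
          . "aaaaaaceeeeiiii"
          . "nooooouuuuyy";
    // Both lists must stay the same length, or the pairs shift silently.
    return strtr($bytes, $from, $to);
}

echo strip_accents_latin1("cr\xE8me br\xFBl\xE9e"), "\n"; // creme brulee
```

Letters that expand to two characters (such as the sharp s or the ae ligature) cannot be handled by the two-string form and would need the array form instead.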

Jul 17 '05 #13

lawrence

Andy Hassall <an**@andyh.co.uk> wrote in message news:<b0********************************@4ax.com>. ..

But are you after some more pragmatic approach, something like:

"The data my users send is probably iso8859-1, iso8859-15, codepage 1252, or
maybe utf-8, but it's likely been copied and mangled between applications so I
can't reliably tell which. How do I clean this data up in a reasonable way so
it can be converted to UTF8 for presentation on a UTF8 encoded page?"

If all the data has values <=127 then it's easy - that's all plain ASCII which
is a common subset of all four character sets.

You can at least rule out UTF-8 by using the functions posted in previous
threads looking for malformed UTF-8. If there's a significant number of
characters >127 and it all validates as UTF-8, then the odds of it probably
being UTF-8 increase the more characters above 127 there are, but it's still
not certain.

So you've narrowed it down to one of the three single-byte character sets.

Then the major differences are:

Codepage 1252 has printable characters in the range 128-159 (with a few
gaps) whereas the iso8859 encodings only have non-printable characters there. So
if there's data in this range, odds are it's Codepage 1252 - so you can convert
it to UTF-8 from there.

This range holds the angled "smart" quotes, and the em-dash, which are the
characters that cause the most trouble. So alternatively, you could convert
them to plain quotes and dashes if you wanted.

If there are no characters in that range, then you haven't ruled out 1252, but
the rest of the encoding is pretty similar between 1252, iso8859-1 and
iso8859-15.

See http://en.wikipedia.org/wiki/ISO_8859-15 for the differences between -1
and -15, the main character worth worrying about most is the Euro (which is
somewhere else again in 1252 - in the 128-159 range I believe).

Brilliant stuff. Really educational. Still, I think I'm missing
something basic about how computers read the byte stream and figure
out how many bytes each character will be. Basically, I'm wondering
what a character is. Can you point to a basic comp sci tutorial on the
subject?

And does PHP have any function other than ord() for figuring out what
set of bytes one is dealing with?

Jul 17 '05 #14

lawrence

Andy Hassall <an**@andyh.co.uk> wrote in message news:<b0********************************@4ax.com>. ..

On 6 Nov 2004 01:19:52 -0800, lk******@geocities.com (lawrence) wrote:
But are you after some more pragmatic approach, something like:

"The data my users send is probably iso8859-1, iso8859-15, codepage 1252, or
maybe utf-8, but it's likely been copied and mangled between applications so I
can't reliably tell which. How do I clean this data up in a reasonable way so
it can be converted to UTF8 for presentation on a UTF8 encoded page?"

If all the data has values <=127 then it's easy - that's all plain ASCII which
is a common subset of all four character sets.

You can at least rule out UTF-8 by using the functions posted in previous
threads looking for malformed UTF-8. If there's a significant number of
characters >127 and it all validates as UTF-8, then the odds of it probably
being UTF-8 increase the more characters above 127 there are, but it's still
not certain.

Thinking about the pragmatics, and since I'm under considerable
pressure, I'm thinking that I might try something quick and simple and
then come back to this problem next year and deal with it more
gracefully. As near as I can see, just 6 characters are causing me
trouble:

smart quotes - both left and right
single quotes, still smart
hyphens, especially em dashes and en dashes formatted in word processors

I've looked at the wikipedia page here:

http://en.wikipedia.org/wiki/Windows-1251

It says that Windows-1251 encodes a smart quote as 9xx3. Not sure what
the x's are for. But couldn't I just loop through submitted text using
ord() to find this byte order, and then when I find it, replace it
with something ASCII?

6 or 7 or 8 tricky items in the top 3 or 4 encodings in use on the web
- a function to find them using ord() and replace them with ASCII -
that sounds like something I can do within the time constraints I
face. As much as I hope to educate myself further on the subject of
character encodings, I'm not going to be able to learn as much as I
like within the time limits I face.

Jul 17 '05 #15

Andy Hassall

On 21 Nov 2004 11:09:43 -0800, lk******@geocities.com (lawrence) wrote:

Andy Hassall <an**@andyh.co.uk> wrote in message news:<b0********************************@4ax.com>. ..

Brilliant stuff. Really educational. Still, I think I'm missing
something basic about how computers read the byte stream and figure
out how many bytes each character will be. Basically, I'm wondering
what a character is. Can you point to a basic comp sci tutorial on the
subject?

Haven't got a particular source handy, I'm afraid. What I know of multiple
character sets came from learning about it to deal with multibyte-enablement
(specifically UTF8) of the product from my day job, which was on Oracle
databases. And then the final block with regards to HTML fell into place thanks
to a post [1] from John Dunlop, a regular poster here, who pointed out that
HTML's document character set is Unicode, and it finally clicked for me what
that really implies.

And does PHP have any function other than ord() for figuring out what
set of bytes one is dealing with?

Given that PHP assumes all strings are single-byte, and doesn't even pretend
to know about character set encodings, you don't need another function. ord()
knows only about bytes; it knows nothing of characters.

The documentation for ord() is wrong. It claims it "Returns the ASCII value of
the first character of string". Yet it works for byte values past 127; none of
these are ASCII. If I get a chance I may submit a doc bug; the PHP maintainers
responded impressively quickly to one I raised about imagettftext a few days
ago.
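
The byte-versus-character point can be seen in a few lines. This is only an illustration; the last line assumes the mbstring extension, which is the one place in the example where PHP is told which encoding to use.

```php
<?php
$s = "\xC3\xA9t\xC3\xA9";            // the word "ete" with two e-acutes, as UTF-8

echo strlen($s), "\n";               // 5: strlen() counts bytes
echo ord($s), "\n";                  // 195: value of the first byte only (0xC3)
echo mb_strlen($s, 'UTF-8'), "\n";   // 3: characters, once the encoding is named
```

195 is not an ASCII value and not the code point of the e-acute either; it is just the first of the two bytes that UTF-8 uses for that character.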

[1]
http://groups.google.com/groups?selm...&output=gplain

Jul 17 '05 #16

lawrence

Andy Hassall <an**@andyh.co.uk> wrote in message

And does PHP have any function other than ord() for figuring out what
set of bytes one is dealing with?

Given that PHP assumes all strings are single-byte, and doesn't even pretend
to know about character set encodings, you don't need another function. ord()
knows only about bytes; it knows nothing of characters.

The documentation for ord() is wrong. It claims it "Returns the ASCII value of
the first character of string". Yet it works for byte values past 127; none of
these are ASCII. If I get a chance I may submit a doc bug; the PHP maintainers
responded impressively quickly to one I raised about imagettftext a few days
ago.

Does that mean that ord() steps through a string one byte at a time,
and it is up to the programmer (me) to figure out if the byte is a
character by itself, or part of a multi-byte character?

I may use ord() then to look for the multi-byte characters that are
causing me grief, and remove them.

I've found another likely cause of my grief. I've been hitting all
input with this:

http://in2.php.net/manual/en/function.utf8-encode.php

"utf8_encode -- Encodes an ISO-8859-1 string to UTF-8

This function encodes the string data to UTF-8, and returns the
encoded version. UTF-8 is a standard mechanism used by Unicode for
encoding wide character values into a byte stream. UTF-8 is
transparent to plain ASCII characters, is self-synchronized (meaning
it is possible for a program to figure out where in the bytestream
characters start) and can be used with normal string comparison
functions for sorting and such. PHP encodes UTF-8 characters in up to
four bytes."

But if I copy and paste a string from another site, and then input
it, and that string is not ISO-8859-1, then I'll get garbage
characters?
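
A sketch of what goes wrong in that case, assuming the mbstring extension. utf8_encode() is deprecated as of PHP 8.2, so the example uses mb_convert_encoding() to perform the same ISO-8859-1 to UTF-8 conversion; the helper names are made up for illustration.

```php
<?php
// Assumes mbstring. This is the same ISO-8859-1 -> UTF-8 conversion that
// utf8_encode() performs.
function latin1_to_utf8(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

// Hypothetical guard: convert only if the input is not already valid UTF-8.
function to_utf8_once(string $bytes): string
{
    return mb_check_encoding($bytes, 'UTF-8') ? $bytes : latin1_to_utf8($bytes);
}

$already_utf8 = "caf\xC3\xA9";                        // "cafe" with e-acute, UTF-8
echo bin2hex(latin1_to_utf8($already_utf8)), "\n";    // 636166c383c2a9: double-encoded
echo bin2hex(to_utf8_once($already_utf8)), "\n";      // 636166c3a9: left alone
```

In the first result each of the two bytes of the e-acute has been treated as a separate ISO-8859-1 character and encoded again, which is what shows up on a page as two garbage characters.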

Jul 17 '05 #17

lkrubner

Andy Hassall wrote:

On 21 Nov 2004 11:09:43 -0800, lk******@geocities.com (lawrence) wrote:

And does PHP have any function other than ord() for figuring out what
set of bytes one is dealing with?

Given that PHP assumes all strings are single-byte, and doesn't even pretend
to know about character set encodings, you don't need another function. ord()
knows only about bytes; it knows nothing of characters.

The documentation for ord() is wrong. It claims it "Returns the ASCII value of
the first character of string". Yet it works for byte values past 127; none of
these are ASCII.

Actually, this remark of yours was very useful to me. I feel like I'm
getting bytes and character encoding for the first time. Essentially,
walking through a big string when you don't know the character encoding
is like feeling your way through a pitch black tunnel - you've no idea
what you're running into. Using ord() is like going down that tunnel
with a very weak flashlight - you get to see one item at a time, but
you don't know if that item is actually connected to larger items (you
don't know if this byte you've got in your hand is a single-byte
character or part of a multi-byte character). Like an archeologist,
you've got to read the thing in your hands for clues to see if maybe it
is really part of something larger - so you look to see if it starts
with a 0 or has a top bit set, or see if the numbers on it are in a
certain range. This info gives you some clues about whether what you've
got is a standalone object (a single byte character) or part of
something larger.

So if I wanted to do something like track down Microsoft Word smart
quotes, I'd go through a string one byte at a time, looking for a
particular sequence of bytes that would be tell-tale.
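
The "reading the byte for clues" idea could be sketched as follows, assuming the string is meant to be UTF-8. The function name is hypothetical, and the classification is deliberately rough: it looks only at the top bits and does not reject lead bytes that can never occur in valid UTF-8, such as 0xC0 or 0xF5.

```php
<?php
// Hypothetical helper: labels one byte by the role it would play in UTF-8.
function utf8_byte_kind(int $byte): string
{
    if ($byte < 0x80)             return 'ascii';        // 0xxxxxxx
    if (($byte & 0xC0) === 0x80)  return 'continuation'; // 10xxxxxx
    if (($byte & 0xE0) === 0xC0)  return 'lead-2';       // 110xxxxx
    if (($byte & 0xF0) === 0xE0)  return 'lead-3';       // 1110xxxx
    if (($byte & 0xF8) === 0xF0)  return 'lead-4';       // 11110xxx
    return 'invalid';
}

$s = "a\xE2\x80\x9Cb"; // a, left double quote (U+201C) as UTF-8, b
for ($i = 0; $i < strlen($s); $i++) {
    $b = ord($s[$i]);
    printf("%d: 0x%02X %s\n", $i, $b, utf8_byte_kind($b));
}
// 0: 0x61 ascii
// 1: 0xE2 lead-3
// 2: 0x80 continuation
// 3: 0x9C continuation
// 4: 0x62 ascii
```

In UTF-8 the left double smart quote is therefore the three-byte sequence E2 80 9C, whereas in Codepage 1252 the same character is the single byte 0x93, so the tell-tale sequence to look for depends on which encoding the string is in.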

Jul 17 '05 #18
