What I need to do is find out which characters in a string are not
valid UTF-8. The problem arises when someone logs in and uses my PHP
script to create a weblog post. They are presented with a form that has
a textarea. If they type their words in directly and then hit submit,
all is fine. But if they write their entry in WordPerfect or Microsoft
Word or some such, and copy and paste it, they might bring strange
characters into their post.
HTML is forgiving: browsers accept the wrongly encoded characters and
show them on the screen as garbage. I've decided that I don't care
about this issue. I don't mind garbage characters showing on HTML
pages.
XML is less forgiving, and because of that I cannot get my RSS output
to work. Again, I don't mind garbage characters, but XML is strict: if
a parser runs into a byte sequence that is not valid in the encoding
declared at the top, it dies.
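
To make the failure concrete, here is a minimal sketch. It is written in
Python rather than the PHP my script uses, purely for illustration. The
feed fragments and the helper name parses_ok are made up for this
example, and the stray byte 0x92 is an assumption about what a paste
from Word typically contains (a Windows-1252 curly apostrophe).

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragment: declares UTF-8 but contains the single byte
# 0x92, which is how Windows-1252 encodes a curly apostrophe.
bad_feed = b'<?xml version="1.0" encoding="UTF-8"?><title>It\x92s broken</title>'

# The same text with the apostrophe properly encoded as UTF-8 (E2 80 99).
good_feed = b'<?xml version="1.0" encoding="UTF-8"?><title>It\xe2\x80\x99s fine</title>'


def parses_ok(xml_bytes: bytes) -> bool:
    """Return True if a strict XML parser accepts the document."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        # The parser gives up at the first byte sequence that is not
        # valid in the declared encoding.
        return False


if __name__ == "__main__":
    print(parses_ok(good_feed))
    print(parses_ok(bad_feed))
```

The point is only that one bad byte is enough to make the whole
document unparseable, which matches what the RSS readers do to my feed.
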
So, given a string, I have to go through it and find everything that is
not valid UTF-8. Then I need to turn those characters into something
harmless, maybe an ASCII question mark, or anything else that is valid
UTF-8.
But how is this done? Given a string, how does one go through it and
find all the characters that are not UTF-8? Clearly, the RSS readers do
this easily enough, since they reject my RSS feeds on that ground, but
how do I do it too?
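
One way this could be approached, again sketched in Python rather than
PHP, and assuming the submitted text is available as raw bytes: try to
decode the bytes as UTF-8 and note where the decoder complains. The
function names find_invalid_utf8 and sanitize are hypothetical, not
part of any library.

```python
from typing import List


def find_invalid_utf8(data: bytes) -> List[int]:
    """Return the offsets of bytes that cannot be decoded as UTF-8."""
    offsets: List[int] = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder is clean
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we decoded.
            offsets.extend(range(pos + err.start, pos + err.end))
            pos += err.end  # resume just past the offending bytes
    return offsets


def sanitize(data: bytes) -> str:
    """Decode as UTF-8, turning every undecodable sequence into '?'."""
    # errors="replace" substitutes U+FFFD for each bad sequence; swapping
    # that for '?' also hits any U+FFFD already present in the input,
    # which is accepted here as a simplification.
    return data.decode("utf-8", errors="replace").replace("\ufffd", "?")


if __name__ == "__main__":
    pasted = b"It\x92s a \x93quote\x94"  # Windows-1252 bytes, not UTF-8
    print(find_invalid_utf8(pasted))
    print(sanitize(pasted))
```

Since I don't mind garbage characters, the sanitize step would be
enough for the RSS output; the offsets are only useful for seeing which
bytes caused the trouble.
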
I had to give up on the character encoding issue for a few months, but
I'm back at it now. I think I understand the problem I face a little
more clearly now.
This was a good essay:
http://www.joelonsoftware.com/articles/Unicode.html
This was also good:
http://ppewww.ph.gla.ac.uk/~flavell/...form-i18n.html
This page has some interesting demos:
http://www1.tip.nl/~t876506/UnicodeDisplay.html
Doing what is suggested here sounds nice:
http://ppewww.ph.gla.ac.uk/~flavell/...cklist.html#s6
That is the part that speaks of "More than one 8-bit repertoire, but
predominantly Latin text". But how does one find out what a character
is when the encoding is unknown?
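
As far as I can tell, the encoding cannot be known for certain from the
bytes alone, so any answer is a guess. A common heuristic, sketched
below in Python (not my PHP), is to accept the input if it is already
valid UTF-8 and otherwise assume it is Windows-1252. That second
assumption is mine, based on the text coming from Word on Windows; the
function name to_utf8 is hypothetical.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw re-encoded as UTF-8, guessing the source encoding."""
    try:
        # Text that is not UTF-8 rarely happens to be valid UTF-8,
        # so a successful decode is a strong hint.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fallback guess: Windows-1252. A few byte values are undefined
        # there, so errors="replace" keeps the decode from failing.
        text = raw.decode("cp1252", errors="replace")
    return text.encode("utf-8")


if __name__ == "__main__":
    print(to_utf8(b"It\x92s"))           # Windows-1252 input
    print(to_utf8(b"It\xe2\x80\x99s"))   # already UTF-8, left alone
```

If the guess is wrong the result is still garbage, but it is
well-formed UTF-8 garbage, which is all the XML parser asks for.
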