
how to take a string and weed out characters that are not UTF-8?


What I need to do is find out what characters in a string are not
supported by the UTF-8 encoding. The problem arises when someone logs
in and uses my PHP script to create a weblog post. They are presented
with a form that has a textarea. If they type in words and then hit
submit, then all is fine. But if they write their entry in WordPerfect
or Microsoft Word or some such, and copy and paste it, then they might
be bringing strange characters into their post.

HTML is forgiving and sends out the wrongly encoded characters, which
show up on the screen as garbage characters. I've decided that I don't
care about this issue. I don't mind garbage characters showing up in HTML.

XML is less forgiving, and because of that, I cannot get my RSS output
to work. Again, I don't mind garbage characters, but XML is strict: if
the parser runs into a byte sequence that is not valid in the encoding
declared at the top, it dies.
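
To make the failure concrete, here is a minimal sketch in Python (not the PHP from my script, and the feed fragment is made up for illustration). It assumes the stray bytes are Windows-1252 curly quotes of the kind Word produces, and shows a strict XML parser refusing a document that declares UTF-8 but contains them.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragment: declares UTF-8, but \x93 and \x94 are
# Windows-1252 curly quotes, which are not valid UTF-8 on their own.
bad_feed = b'<?xml version="1.0" encoding="UTF-8"?><title>\x93Hello\x94</title>'

try:
    ET.fromstring(bad_feed)
    parsed = True
except ET.ParseError as err:
    # The parser stops at the first invalid byte instead of guessing.
    parsed = False
    print("rejected:", err)
```

An HTML renderer would have shown garbage and carried on; the XML parser gives up on the whole document.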

So what I have to do is, given a string, I have to go through that
string and find everything that is not in the UTF-8 encoding. Then I
need to turn those characters into something harmless - maybe an ASCII
question mark, or something, something in the UTF-8 encoding.

But how is this done? Given a string, how does one go through it and
find all the characters that are not UTF-8? Clearly, the RSS readers do
this easily enough, since they reject my RSS feeds on that ground, but
how do I do it too?
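
One way to do this, sketched in Python rather than PHP: decode the bytes as UTF-8 and let the decoder substitute a marker for anything invalid, then turn that marker into a question mark. The function name `scrub_to_utf8` and the sample input are made up for illustration.

```python
def scrub_to_utf8(raw: bytes, placeholder: str = "?") -> str:
    """Decode raw as UTF-8, replacing undecodable bytes with placeholder."""
    # errors="replace" makes the decoder emit U+FFFD for invalid input
    # instead of raising an exception.
    text = raw.decode("utf-8", errors="replace")
    # Caveat: a genuine U+FFFD already present in the input is replaced too.
    return text.replace("\ufffd", placeholder)


# \x93 and \x94 are Windows-1252 curly quotes; \xc3\xa9 is a valid UTF-8 "é".
pasted = b"He said \x93hello\x94 caf\xc3\xa9"
print(scrub_to_utf8(pasted))  # He said ?hello? café
```

The result can be re-encoded as UTF-8 safely, so it will not trip an XML parser on encoding grounds. PHP has comparable facilities, but check its manual for the exact functions and behavior.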

I had to give up on the character encoding issue for a few months, but
I'm back at it now. I think I understand the problem I face a little
more clearly now.
This was a good essay:
This was also good:
This page has some interesting demos:

Doing what is suggested here sounds nice:

It speaks of "More than one 8-bit repertoire, but predominantly
Latin text", but how does one find out what a character is when you
don't know the encoding?
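
You cannot know for certain, but you can make an educated guess. A common heuristic, sketched here in Python, is to try strict UTF-8 first and fall back to one assumed 8-bit encoding if that fails. The choice of Windows-1252 as the fallback is my assumption (it fits text pasted from Word); the function name is hypothetical.

```python
def decode_pasted_text(raw: bytes) -> str:
    """Guess the encoding: strict UTF-8 first, then an assumed 8-bit fallback."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is Windows-1252.
        # The few bytes cp1252 leaves undefined become U+FFFD.
        return raw.decode("cp1252", errors="replace")


word_paste = b"\x93quoted\x94"
text = decode_pasted_text(word_paste)
print(text.encode("utf-8"))  # the curly quotes survive as real UTF-8
```

Unlike replacing bad bytes with question marks, this keeps the curly quotes and accented letters, at the cost of guessing wrong when the input was really some other 8-bit encoding.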

Jul 17 '05 #1
2 Replies

Simon Stienen had some great advice in the following post. Yet even
after doing as he said and looking at Wikipedia, I'm still unclear on how
to determine that something is certainly not UTF-8.

Simon Stienen Sep 29 2004, 7:37 pm
How validation is done:
Take the string. If there is no character 0x80 to 0xFF, it doesn't matter
whether you define this text as UTF-8 or any ISO encoding, since the first
128 characters all have the same bit sequence in these encodings.
However, if there actually *are* characters with a value of 128 or above,
check whether the given sequence would be a valid UTF-8 sequence (see
UTF-8 in Wikipedia for this). If this and every other sequence is valid
UTF-8, the string itself *might* be UTF-8. Of course it could be a sequence
of extended ASCII/ANSI characters, too. It's impossible to be sure.
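
The procedure above can be sketched roughly as follows in Python. Instead of hand-checking each multi-byte sequence against the Wikipedia tables, this leans on the language's strict UTF-8 decoder. The function name `classify` and its three labels are made up for illustration.

```python
def classify(raw: bytes) -> str:
    """Return 'ascii', 'maybe-utf8', or 'not-utf8' for a byte string."""
    # No byte >= 0x80: identical in UTF-8 and the ISO 8859 encodings.
    if all(b < 0x80 for b in raw):
        return "ascii"
    try:
        # Strict decoding fails on the first invalid UTF-8 sequence.
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf8"
    # Every sequence is well formed, so it *might* be UTF-8.
    return "maybe-utf8"


print(classify(b"hello"))     # ascii
print(classify(b"\xc3\xa9"))  # maybe-utf8
print(classify(b"\xe9"))      # not-utf8
```

The "maybe" matters: the bytes `\xc3\xa9` are a valid UTF-8 "é", but they are equally valid Latin-1 for the two characters "Ã©". A failed check proves the string is not UTF-8; a passed check proves nothing either way.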

Jul 17 '05 #2

Never mind. This seems to have solved my problems:

Jul 17 '05 #3
