how to take a string and weed out characters that are not UTF-8?

lkrubner

What I need to do is find out what characters in a string are not
supported by the UTF-8 encoding. The problem arises when someone logs
in and uses my PHP script to create a weblog post. They are presented
with a form that has a textarea. If they type in words and then hit
submit, then all is fine. But if they write their entry in WordPerfect
or Microsoft Word or some such, and copy and paste it, then they might
be bringing strange characters into their post.

HTML is forgiving and sends out the wrongly encoded characters, which
show up on the screen as garbage characters. I've decided that I don't
care about this issue. I don't mind garbage characters showing on HTML
pages.

XML is less forgiving, and because of that I cannot get my RSS output
to work. Again, I don't mind garbage characters, but XML is strict: if
a parser runs into a byte sequence that is not valid in the encoding
declared at the top, it dies.

So what I have to do is, given a string, I have to go through that
string and find everything that is not in the UTF-8 encoding. Then I
need to turn those characters into something harmless: maybe an ASCII
question mark, or anything else that is valid UTF-8.

But how is this done? Given a string, how does one go through it and
find all the characters that are not UTF-8? Clearly, the RSS readers do
this easily enough, since they reject my RSS feeds on that ground, but
how do I do it too?

I had to give up on the character encoding issue for a few months, but
I'm back at it now. I think I understand the problem I face a little
more clearly now.

This was a good essay:

http://www.joelonsoftware.com/articles/Unicode.html

This was also good:

http://ppewww.ph.gla.ac.uk/~flavell/...form-i18n.html

This page has some interesting demos:

http://www1.tip.nl/~t876506/UnicodeDisplay.html

Doing what is suggested here sounds nice:

http://ppewww.ph.gla.ac.uk/~flavell/...cklist.html#s6

That section covers "More than one 8-bit repertoire, but predominantly
Latin text". But how does one find out what a character is when the
encoding is unknown?

Jul 17 '05 #1

lkrubner

Simon Stienen had some great advice in the following post. Yet even
after doing as he said and looking in Wikipedia, I'm still unclear on
how to determine that something is certainly not UTF-8.

http://groups-beta.google.com/group/...8b9bef7877408d

Simon Stienen Sep 29 2004, 7:37 pm
How validation is done:
Take the string. If it contains no bytes in the range 0x80 to 0xFF, it
doesn't matter whether you declare the text as UTF-8 or any ISO
encoding, since the first 128 characters all have the same bit sequence
in these encodings. However, if there actually *are* bytes with a value
of 128 or higher, check whether each such sequence would be a valid
UTF-8 sequence (see UTF-8 in Wikipedia for this). If every sequence is
valid UTF-8, the string itself *might* be UTF-8. Of course, it could be
a sequence of extended ASCII/ANSI characters, too. It's impossible to
be sure about that.
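
The check described above can be sketched in plain PHP with no
extensions. This is a minimal sketch, not Simon's code: the function
name is_well_formed_utf8() is made up for illustration, and the byte
ranges follow the UTF-8 definition (lead byte, then continuation bytes
of the form 10xxxxxx, with overlong forms, surrogates, and values above
U+10FFFF rejected).

```php
<?php
// Hypothetical helper: byte-level check for well-formed UTF-8.
function is_well_formed_utf8(string $bytes): bool
{
    $len = strlen($bytes);
    $i = 0;
    while ($i < $len) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {
            // ASCII: one byte, identical in UTF-8 and the ISO encodings.
            $i++;
            continue;
        }
        // The lead byte fixes how many continuation bytes must follow and
        // the smallest code point that sequence length may encode.
        if ($b >= 0xC2 && $b <= 0xDF) {
            $extra = 1; $cp = $b & 0x1F; $min = 0x80;
        } elseif ($b >= 0xE0 && $b <= 0xEF) {
            $extra = 2; $cp = $b & 0x0F; $min = 0x800;
        } elseif ($b >= 0xF0 && $b <= 0xF4) {
            $extra = 3; $cp = $b & 0x07; $min = 0x10000;
        } else {
            return false; // stray continuation byte or impossible lead byte
        }
        if ($i + $extra >= $len) {
            return false; // sequence is cut off by the end of the string
        }
        for ($k = 1; $k <= $extra; $k++) {
            $c = ord($bytes[$i + $k]);
            if (($c & 0xC0) !== 0x80) {
                return false; // continuation bytes must look like 10xxxxxx
            }
            $cp = ($cp << 6) | ($c & 0x3F);
        }
        // Reject overlong forms, UTF-16 surrogates, and out-of-range values.
        if ($cp < $min || ($cp >= 0xD800 && $cp <= 0xDFFF) || $cp > 0x10FFFF) {
            return false;
        }
        $i += $extra + 1;
    }
    return true;
}

var_dump(is_well_formed_utf8("caf\xC3\xA9")); // "é" as two UTF-8 bytes
var_dump(is_well_formed_utf8("caf\xE9"));     // "é" as one Latin-1 byte
```

A false result means the string is certainly not UTF-8. A true result
only means it might be, as the quoted post says: the two Latin-1
characters "Ã©" are the bytes 0xC3 0xA9, which also happen to form a
valid UTF-8 "é".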

Jul 17 '05 #2

lkrubner

Never mind. This seems to have solved my problems:

http://uk.php.net/manual/en/function...t-encoding.php
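
The link above is truncated, so it is not certain which function it
points to. As a rough sketch of what PHP's mbstring extension can do
for this problem, assuming the extension is loaded: the helper below
(to_valid_utf8() and its $assumedSource parameter are hypothetical
names, not from the thread) leaves valid UTF-8 alone, and otherwise
either transcodes from an encoding the caller guesses or replaces the
offending bytes. Treating pasted text as Windows-1252 is only a guess
that often fits text copied from Word on Windows.

```php
<?php
// Hypothetical helper, not from the thread. Requires the mbstring extension.
function to_valid_utf8(string $text, ?string $assumedSource = null): string
{
    // Well-formed UTF-8 (plain ASCII included) passes through untouched.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // If the caller is willing to guess the real encoding, transcode from it.
    if ($assumedSource !== null) {
        return mb_convert_encoding($text, 'UTF-8', $assumedSource);
    }
    // Otherwise convert UTF-8 to UTF-8: mbstring re-validates the input and
    // replaces each invalid sequence with its substitute character, which
    // is "?" unless mb_substitute_character() has changed it.
    return mb_convert_encoding($text, 'UTF-8', 'UTF-8');
}

$pasted = "\x93quoted\x94"; // curly quotes as Windows-1252 bytes

// Guessing Windows-1252 recovers the curly quotes as U+201C and U+201D.
$recovered = to_valid_utf8($pasted, 'Windows-1252');

// Without a guess, the offending bytes are replaced rather than recovered.
$scrubbed = to_valid_utf8($pasted);

var_dump(mb_check_encoding($recovered, 'UTF-8'));
var_dump(mb_check_encoding($scrubbed, 'UTF-8'));
```

Either way the result is valid UTF-8. Note that an XML parser can still
reject a feed for other reasons, such as unescaped markup characters.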

Jul 17 '05 #3
