How to take a string and weed out characters that are not UTF-8?


What I need to do is find out what characters in a string are not
supported by the UTF-8 encoding. The problem arises when someone logs
in and uses my PHP script to create a weblog post. They are presented
with a form that has a textarea. If they type in words and then hit
submit, then all is fine. But if they write their entry in WordPerfect
or Microsoft Word or some such, and copy and paste it, then they might
be bringing strange characters into their post.

HTML is forgiving and sends out the wrongly encoded characters, which
show up on the screen as garbage characters. I've decided that I don't
care about this issue. I don't mind garbage characters showing on HTML
pages.

XML is less forgiving, and because of it, I cannot get my RSS output
to work. Again, I don't mind garbage characters, but XML is strict: if
a parser runs into a byte sequence that is not valid in the encoding
declared at the top, it dies.

So what I have to do is, given a string, go through that
string and find every byte sequence that is not valid UTF-8. Then I
need to turn those bytes into something harmless - maybe an ASCII
question mark, or anything else that is valid UTF-8.

But how is this done? Given a string, how does one go through it and
find all the characters that are not UTF-8? Clearly, the RSS readers do
this easily enough, since they reject my RSS feeds on that ground, but
how do I do it too?
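
For illustration only, here is a minimal sketch of that "replace whatever is not UTF-8 with a question mark" idea. It is written in Python rather than PHP, so it shows the logic and not the script from this thread. The function name `scrub_to_utf8` and the `placeholder` parameter are made up for the example.

```python
def scrub_to_utf8(raw: bytes, placeholder: str = "?") -> bytes:
    """Return valid UTF-8; every malformed byte sequence becomes `placeholder`.

    Hypothetical helper for illustration, not part of any library.
    """
    # errors="replace" makes the decoder emit U+FFFD (the Unicode
    # replacement character) wherever the input is not valid UTF-8.
    text = raw.decode("utf-8", errors="replace")
    # Swap U+FFFD for the harmless placeholder and re-encode as UTF-8.
    return text.replace("\ufffd", placeholder).encode("utf-8")


if __name__ == "__main__":
    pasted = b"\x93quoted\x94"      # assumed Word-style "smart quote" bytes
    print(scrub_to_utf8(pasted))    # the two stray bytes are replaced
```

One caveat with this approach: a genuine U+FFFD already present in valid input would also be turned into the placeholder.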

I had to give up on the character encoding issue for a few months, but
I'm back at it now. I think I understand the problem I face a little
more clearly now.

This was a good essay:

http://www.joelonsoftware.com/articles/Unicode.html

This was also good:

http://ppewww.ph.gla.ac.uk/~flavell/...form-i18n.html

This page has some interesting demos:

http://www1.tip.nl/~t876506/UnicodeDisplay.html


Doing what is suggested here sounds nice:

http://ppewww.ph.gla.ac.uk/~flavell/...cklist.html#s6

I mean the part where it speaks of "More than one 8-bit repertoire, but
predominantly Latin text". But how does one find out what a character
is when you don't know the encoding?
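
There is no sure way to know, so one possible approach is to guess: a rough Python sketch (not PHP, and not necessarily what the page above recommends) that keeps the input if it is already valid UTF-8 and otherwise assumes a single 8-bit encoding. The choice of Windows-1252 (`cp1252`) as the fallback is purely an assumption, based on the text being pasted from Word; `to_utf8` is a made-up name.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Best-effort conversion to UTF-8 (hypothetical helper).

    Assumption: anything that is not valid UTF-8 is in `fallback`.
    """
    try:
        # Strict decode: raises if any byte sequence is malformed.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Guess the 8-bit encoding; bytes it does not define become U+FFFD.
        text = raw.decode(fallback, errors="replace")
    return text.encode("utf-8")


if __name__ == "__main__":
    print(to_utf8(b"\x93hi\x94").decode("utf-8"))  # curly quotes around "hi"
```

If the guess is wrong the output is still valid UTF-8, which is all the XML parser needs, but the characters shown may not be the ones the author typed.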

Jul 17 '05 #1
Simon Stienen had some great advice in the following post. Yet even
after I did as he said and looked in Wikipedia, I'm still unclear on
how to determine that something is certainly not UTF-8.

http://groups-beta.google.com/group/...8b9bef7877408d

Simon Stienen Sep 29 2004, 7:37 pm
How validation is done:
Take the string. If it contains no byte in the range 0x80 to 0xFF, it
doesn't matter whether you define this text as UTF-8 or any ISO
encoding, since the first 128 characters all have the same bit sequence
in these encodings.
However, if there actually *are* bytes with a value of 128 or higher,
check whether each such sequence would be a valid UTF-8 sequence (see
UTF-8 in Wikipedia for this). If every one of them is valid UTF-8, the
string itself *might* be UTF-8. Of course it could be a sequence of
extended ASCII/ANSI characters, too. It's impossible to be sure about
that.
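
The check described above can be sketched roughly as follows. This is Python, used only to illustrate the logic; `classify` and its three return labels are invented for the example.

```python
def classify(raw: bytes) -> str:
    """Classify bytes along the lines described above (hypothetical helper)."""
    # No byte at 0x80 or above: the text reads the same as UTF-8 or as
    # any ISO-8859 encoding, so the question does not matter.
    if all(b < 0x80 for b in raw):
        return "ascii"
    try:
        # A strict decode fails on the first malformed sequence.
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf8"      # certainly not UTF-8
    # Every high-byte sequence was well formed, which proves nothing more
    # than that the string *might* be UTF-8.
    return "maybe-utf8"


if __name__ == "__main__":
    for sample in (b"hello", "h\u00e9llo".encode("utf-8"), b"h\xe9llo"):
        print(sample, classify(sample))
```

So the only certain answer is the negative one: a failed strict decode means the string is definitely not UTF-8, while success only means it could be.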

Jul 17 '05 #2
Never mind. This seems to have solved my problems:

http://uk.php.net/manual/en/function...t-encoding.php

Jul 17 '05 #3
