I'm looking for a way to replace special characters with characters
without accents, cedillas, etc.
Jul 17 '05
Andy Hassall wrote: On Sat, 06 Nov 2004 22:54:00 +0100, Pikkel <pi****@de.wo p> wrote:
It's useful information and I'll remember this. Thank you. It's not the answer to my question, which was whether there is a function that converts characters with accents, umlauts and so on, to characters without.
True, it's drifted a bit to answer lawrence's questions.
As far as your question goes - no, there isn't a built in function, you'd have to write one. In order to do so, you have to be a lot more specific about the character encodings you're using, which characters you want to convert to what, and exactly what "and so on" means in your last sentence.
I saw a few examples of how to do just this on the PHP site in the user
comments. I'm not quite sure, but you can bet it's on the str_replace page
or something like that.
Andy Hassall wrote:
As far as your question goes - no, there isn't a built in function, you'd have to write one. In order to do so, you have to be a lot more specific about the character encodings you're using, which characters you want to convert to what, and exactly what "and so on" means in your last sentence.
The pages themselves use ISO-8859-1.
But I can't be sure what encoding the users' input will be in. This input
will be used to name and create pages, menus, pictures and so on.
On Sun, 07 Nov 2004 11:36:31 +0100, Pikkel <pi****@de.wo p> wrote:
The pages themselves use ISO-8859-1. But I can't be sure what encoding the users' input will be in. This input will be used to name and create pages, menus, pictures and so on.
Right, well strtr()'s already been pointed out a couple of days ago by Michael
Fesser in this thread, so just write an array of characters you want replaced
and run it through that - ISO-8859-1 isn't big, so you can just spend a couple
of minutes writing out a list of accented characters and what you want them
transformed into.
Looking at the manual page for the function, there's an example of a function
to do this already in the user notes. http://uk.php.net/strtr
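For what it's worth, here is a minimal sketch of the strtr() approach described above. It assumes the input really is ISO-8859-1; the function name strip_accents_latin1 and the choice of replacements are only illustrative (they are not taken from the manual's user notes), and the map is deliberately incomplete.

```php
<?php
// Hypothetical helper, not from the thread: replaces a handful of
// ISO-8859-1 accented bytes with unaccented ASCII. The map is partial;
// extend it with whatever characters you want transformed.
function strip_accents_latin1($text)
{
    // Keys are byte escapes, so the result does not depend on the
    // encoding this source file happens to be saved in.
    $map = array(
        "\xC0" => 'A', "\xC1" => 'A', "\xC2" => 'A', "\xC4" => 'A',
        "\xC7" => 'C',
        "\xC8" => 'E', "\xC9" => 'E', "\xCA" => 'E', "\xCB" => 'E',
        "\xD6" => 'O', "\xDC" => 'U',
        "\xDF" => 'ss',
        "\xE0" => 'a', "\xE1" => 'a', "\xE2" => 'a', "\xE4" => 'a',
        "\xE7" => 'c',
        "\xE8" => 'e', "\xE9" => 'e', "\xEA" => 'e', "\xEB" => 'e',
        "\xEF" => 'i', "\xF1" => 'n',
        "\xF6" => 'o', "\xFC" => 'u',
    );
    // With an array argument, strtr() replaces each key it finds and
    // never re-scans text it has already replaced.
    return strtr($text, $map);
}

echo strip_accents_latin1("Fran\xE7ois caf\xE9"), "\n"; // Francois cafe
```

The array form of strtr() is used rather than the two-string form because it allows one byte to become two characters (0xDF to "ss").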
--
Andy Hassall / <an**@andyh.co. uk> / <http://www.andyh.co.uk >
<http://www.andyhsoftwa re.co.uk/space> Space: disk usage analysis tool
Andy Hassall <an**@andyh.co. uk> wrote in message news:<b0******* *************** **********@4ax. com>... But are you after some more pragmatic approach, something like:
"The data my users send is probably iso8859-1, iso8859-15, codepage 1252, or maybe utf-8, but it's likely been copied and mangled between applications so I can't reliably tell which. How do I clean this data up in a reasonable way so it can be converted to UTF8 for presentation on a UTF8 encoded page?"
If all the data has values <=127 then it's easy - that's all plain ASCII which is a common subset of all four character sets.
You can at least rule out UTF-8 by using the functions posted in previous threads looking for malformed UTF-8. If there's a significant number of characters >127 and it all validates as UTF-8, then the odds of it being UTF-8 increase the more characters above 127 there are, but it's still not certain.
So you've narrowed it down to one of the three single-byte character sets.
Then the major differences are:
Codepage 1252 has printable characters in the range 128-159 (with a couple of gaps) whereas the iso8859 encodings only have non-printable characters there. So if there's data in this range, odds are it's Codepage 1252 - so you can convert it to UTF-8 from there.
This range holds the angled "smart" quotes, and the em-dash, which are the characters that cause the most trouble. So alternatively, you could convert them to plain quotes and dashes if you wanted.
If there are no characters in that range, then you haven't ruled out 1252, but the rest of the encoding is pretty similar between 1252, iso8859-1 and iso8859-15.
See http://en.wikipedia.org/wiki/ISO_8859-15 for the differences between -1 and -15. The main character worth worrying about is the Euro (which is somewhere else again in 1252 - in the 128-159 range I believe).
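The elimination steps quoted above can be sketched roughly as follows. This is only a guess-maker under the same assumptions as the quoted text (the input is one of those four encodings); the function name guess_encoding and its return labels are made up for illustration. It uses preg_match() with the /u modifier as the UTF-8 validity check, which is a stand-in for the functions from the earlier threads, not the same code.

```php
<?php
// Hypothetical helper: a rough guess at the encoding of $bytes.
// It returns a label, not a guarantee.
function guess_encoding($bytes)
{
    // Step 1: nothing above 127 means plain ASCII, which is a common
    // subset of all four candidate encodings.
    if (!preg_match('/[\x80-\xFF]/', $bytes)) {
        return 'ASCII';
    }
    // Step 2: with the /u modifier, preg_match() returns false (not 1)
    // when the subject is malformed UTF-8.
    if (preg_match('//u', $bytes) === 1) {
        return 'UTF-8 (probably)';
    }
    // Step 3: bytes 128-159 are printable only in codepage 1252.
    if (preg_match('/[\x80-\x9F]/', $bytes)) {
        return 'Windows-1252 (probably)';
    }
    // Otherwise the three single-byte sets cannot be told apart here.
    return 'ISO-8859-1, ISO-8859-15 or Windows-1252';
}

echo guess_encoding("\x93quoted\x94"), "\n"; // Windows-1252 (probably)
```

Short inputs with only one or two high bytes can still be misclassified, which is why the labels say "probably".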
Brilliant stuff. Really educational. Still, I think I'm missing
something basic about how computers read the byte stream and figure
out how many bytes each character will be. Basically, I'm wondering
what a character is. Can you point to a basic comp sci tutorial on the
subject?
And does PHP have any function other than ord() for figuring out what
set of bytes one is dealing with?
Thinking about the pragmatics, and since I'm under considerable
pressure, I'm thinking that I might try something quick and simple and
then come back to this problem next year and deal with it more
gracefully. As near as I can see, just 6 characters are causing me
trouble:
smart quotes - both left and right
single quotes, still smart
hyphens, especially em dashes and en dashes formatted in word processors
I've looked at the wikipedia page here: http://en.wikipedia.org/wiki/Windows-1251
It says that Windows-1251 encodes a smart quote as 9xx3. Not sure what
the x's are for. But couldn't I just loop through submitted text using
ord() to find this byte order, and then when I find it, replace it
with something ASCII?
6 or 7 or 8 tricky items in the top 3 or 4 encodings in use on the web
- a function to find them using ord() and replace them with ASCII -
that sounds like something I can do within the time constraints I
face. As much as I hope to educate myself further on the subject of
character encodings, I'm not going to be able to learn as much as I'd
like within the time limits I face.
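A quick-and-dirty version of that idea might look like the sketch below. Note that it swaps the hand-written ord() loop for strtr() with a byte map, and that the function name plain_punctuation is invented for this example. The byte values are the codepage 1252 positions of the curly quotes and dashes (0x91-0x94, 0x96, 0x97) and the UTF-8 sequences for the same six characters. The UTF-8 check comes first because a byte such as 0x93 can also appear inside a valid multi-byte UTF-8 character, and replacing it blindly would corrupt that character.

```php
<?php
// Hypothetical helper: turns "smart" quotes and dashes into plain ASCII.
function plain_punctuation($text)
{
    // Codepage 1252 single-byte forms.
    $cp1252 = array(
        "\x91" => "'", "\x92" => "'",   // single curly quotes
        "\x93" => '"', "\x94" => '"',   // double curly quotes
        "\x96" => '-', "\x97" => '-',   // en dash, em dash
    );
    // The same six characters as UTF-8 byte sequences.
    $utf8 = array(
        "\xE2\x80\x98" => "'", "\xE2\x80\x99" => "'",
        "\xE2\x80\x9C" => '"', "\xE2\x80\x9D" => '"',
        "\xE2\x80\x93" => '-', "\xE2\x80\x94" => '-',
    );
    // preg_match() with /u returns 1 only for well-formed UTF-8.
    $isUtf8 = preg_match('//u', $text) === 1;
    return strtr($text, $isUtf8 ? $utf8 : $cp1252);
}

echo plain_punctuation("\x93hello\x94 \x97 ok"), "\n"; // "hello" - ok
```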
On 21 Nov 2004 11:09:43 -0800, lk******@geocit ies.com (lawrence) wrote:
Brilliant stuff. Really educational. Still, I think I'm missing something basic about how computers read the byte stream and figure out how many bytes each character will be. Basically, I'm wondering what a character is. Can you point to a basic comp sci tutorial on the subject?
Haven't got a particular source handy, I'm afraid. What I know of multiple
character sets came from learning about it to deal with multibyte-enablement
(specifically UTF8) of the product from my day job, which was on Oracle
databases. And then the final block with regards to HTML fell into place thanks
to a post [1] from John Dunlop, a regular poster here, who pointed out that
HTML's document character set is Unicode, and it finally clicked for me what
that really implies.
And does PHP have any function other than ord() for figuring out what set of bytes one is dealing with?
Given that PHP assumes all strings are single-byte, and doesn't even pretend
to know about character set encodings, you don't need another function. ord()
knows only about bytes; it knows nothing of characters.
The documentation for ord() is wrong. It claims it "Returns the ASCII value of
the first character of string". Yet it works for byte values past 127; none of
these are ASCII. If I get a chance I may submit a doc bug; the PHP maintainers
responded impressively quickly to one I raised about imagettftext a few days
ago.
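To make the bytes-versus-characters point concrete, here is a small sketch. The helper name byte_values is made up; it simply calls ord() on each byte in turn, since ord() on its own only ever looks at the first byte of the string it is given.

```php
<?php
// Hypothetical helper: returns the numeric value of every byte in $s.
function byte_values($s)
{
    $values = array();
    // strlen() counts bytes, not characters.
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $values[] = ord($s[$i]);
    }
    return $values;
}

// "e" with an acute accent in ISO-8859-1: one byte, value 233.
echo implode(' ', byte_values("\xE9")), "\n";      // 233
// The same character in UTF-8: two bytes.
echo implode(' ', byte_values("\xC3\xA9")), "\n";  // 195 169
// ord() alone reports only the first of those two bytes.
echo ord("\xC3\xA9"), "\n";                        // 195
```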
[1] http://groups.google.com/groups?selm...&output=gplain
Andy Hassall <an**@andyh.co. uk> wrote:
Given that PHP assumes all strings are single-byte, and doesn't even pretend to know about character set encodings, you don't need another function. ord() knows only about bytes; it knows nothing of characters.
Does that mean that ord() steps through a string one byte at a time,
and it is up to the programmer (me) to figure out if the byte is a
character by itself, or part of a multi-byte character?
I may use ord() then to look for the multi-byte characters that are
causing me grief, and remove them.
I've found another likely cause of my grief. I've been hitting all
input with this:

http://in2.php.net/manual/en/function.utf8-encode.php

utf8_encode -- Encodes an ISO-8859-1 string to UTF-8

"This function encodes the string data to UTF-8, and returns the
encoded version. UTF-8 is a standard mechanism used by Unicode for
encoding wide character values into a byte stream. UTF-8 is
transparent to plain ASCII characters, is self-synchronized (meaning
it is possible for a program to figure out where in the bytestream
characters start) and can be used with normal string comparison
functions for sorting and such. PHP encodes UTF-8 characters in up to
four bytes, like this:"

But if I copy and paste a string from another site, and then input
it, and that string is not ISO-8859-1, then I'll get garbage
characters?
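A sketch of why that would happen, assuming utf8_encode() behaves as the manual text quoted above describes: it treats every input byte as an ISO-8859-1 character. The function latin1_to_utf8 below is a hand-written stand-in for illustration, not PHP's own implementation.

```php
<?php
// Hypothetical re-implementation of the idea behind utf8_encode():
// treat every byte as an ISO-8859-1 character and emit its UTF-8 form.
function latin1_to_utf8($bytes)
{
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {
            $out .= chr($b);                   // ASCII passes through
        } else {
            $out .= chr(0xC0 | ($b >> 6))      // lead byte 110000xx
                  . chr(0x80 | ($b & 0x3F));   // continuation 10xxxxxx
        }
    }
    return $out;
}

// Genuine ISO-8859-1 input (0xE9) becomes the correct UTF-8 pair.
echo bin2hex(latin1_to_utf8("\xE9")), "\n";      // c3a9
// Input that was already UTF-8 gets encoded a second time.
echo bin2hex(latin1_to_utf8("\xC3\xA9")), "\n";  // c383c2a9
// A codepage 1252 smart quote (0x93) turns into a control character.
echo bin2hex(latin1_to_utf8("\x93")), "\n";      // c293
```

So the answer looks like yes: anything that is not ISO-8859-1 going in comes out as valid UTF-8 bytes that spell the wrong characters.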
Andy Hassall wrote:
On 21 Nov 2004 11:09:43 -0800, lk******@geocit ies.com (lawrence) wrote:
And does PHP have any function other than ord() for figuring out what
set of bytes one is dealing with?
Given that PHP assumes all strings are single-byte, and doesn't even
pretend to know about character set encodings, you don't need another
function. ord() knows only about bytes; it knows nothing of characters.
The documentation for ord() is wrong. It claims it "Returns the
ASCII value of the first character of string". Yet it works for byte
values past 127; none of these are ASCII.
Actually, this remark of yours was very useful to me. I feel like I'm
getting bytes and character encoding for the first time. Essentially,
walking through a big string when you don't know the character encoding
is like feeling your way through a pitch black tunnel - you've no idea
what you're running into. Using ord() is like going down that tunnel
with a very weak flashlight - you get to see one item at a time, but
you don't know if that item is actually connected to larger items (you
don't know if this byte you've got in your hand is a single-byte
character or part of a multi-byte character). Like an archaeologist,
you've got to read the thing in your hands for clues to see if maybe it
is really part of something larger - so you look to see if it starts
with a 0 or has a top bit set, or see if the numbers on it are in a
certain range. This info gives you some clues about whether what you've
got is a standalone object (a single byte character) or part of
something larger.
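For UTF-8 specifically, those clues are the top bits of each byte. A minimal sketch, assuming the text is known (or suspected) to be UTF-8; the function name utf8_byte_role and its labels are invented for illustration. In a single-byte encoding such as ISO-8859-1 none of this applies, because every byte is a whole character.

```php
<?php
// Hypothetical helper: labels one byte according to UTF-8's bit patterns.
function utf8_byte_role($byte)
{
    $b = ord($byte);
    if ($b < 0x80)            return 'single';        // 0xxxxxxx
    if (($b & 0xC0) === 0x80) return 'continuation';  // 10xxxxxx
    if (($b & 0xE0) === 0xC0) return 'lead-2';        // 110xxxxx
    if (($b & 0xF0) === 0xE0) return 'lead-3';        // 1110xxxx
    if (($b & 0xF8) === 0xF0) return 'lead-4';        // 11110xxx
    return 'invalid';                                 // never valid in UTF-8
}

// "a", then e-acute (2 bytes), then a left curly quote (3 bytes).
$text = "a\xC3\xA9\xE2\x80\x9C";
for ($i = 0, $n = strlen($text); $i < $n; $i++) {
    printf("%02X %s\n", ord($text[$i]), utf8_byte_role($text[$i]));
}
```

A lead byte tells you how many continuation bytes should follow it, which is what makes UTF-8 self-synchronizing.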
So if I wanted to do something like track down Microsoft Word smart
quotes, I'd go through a string one byte at a time, looking for a
particular sequence of bytes that would be tell-tale.