RegExp to strip accents while ignoring case

Jon Maz

Hi All,

I want to strip the accents off characters in a string so that, for example,
the (Spanish) word "práctico" comes out as "practico" - but ignoring case,
so that "PRÁCTICO" comes out as "PRACTICO".

What's the best way to do this?

TIA,

JON

--------------------------------------------------

PS First posted to aspmessageboard -
http://www.aspmessageboard.com/forum...05936&F=34&P=1 -
no answers yet

PPS The Javascript function that I'm porting to C# looks like this:

function quitaAcentos(a) {
re=new RegExp("á", "gi")
a=a.replace(re, "A")
re=new RegExp("é", "gi")
a=a.replace(re, "E")
re=new RegExp("í", "gi")
a=a.replace(re, "I")
re=new RegExp("ó", "gi")
a=a.replace(re, "O")
re=new RegExp("ú", "gi")
a=a.replace(re, "U")
re=new RegExp("à", "gi")
a=a.replace(re, "A")
re=new RegExp("è", "gi")
a=a.replace(re, "E")
re=new RegExp("é", "gi")
a=a.replace(re, "E")
re=new RegExp("ì", "gi")
a=a.replace(re, "I")
re=new RegExp("ò", "gi")
a=a.replace(re, "O")
re=new RegExp("ó", "gi")
a=a.replace(re, "O")
re=new RegExp("ù", "gi")
a=a.replace(re, "U")
re=new RegExp("â", "gi")
a=a.replace(re, "A")
re=new RegExp("ê", "gi")
a=a.replace(re, "E")
re=new RegExp("î", "gi")
a=a.replace(re, "I")
re=new RegExp("ô", "gi")
a=a.replace(re, "O")
re=new RegExp("û", "gi")
a=a.replace(re, "U")
re=new RegExp("ä", "gi")
a=a.replace(re, "A")
re=new RegExp("ë", "gi")
a=a.replace(re, "E")
re=new RegExp("ï", "gi")
a=a.replace(re, "I")
re=new RegExp("ö", "gi")
a=a.replace(re, "O")
re=new RegExp("ü", "gi")
a=a.replace(re, "U")
re=new RegExp(" ", "gi")
a=a.replace(re, "")
re=new RegExp("_", "gi")
a=a.replace(re, "")
re=new RegExp("ñ", "gi")
a=a.replace(re, "N")

return a
}

Nov 16 '05 #1
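
As written, the function above replaces every match with an uppercase letter, so "práctico" would come back as "prActico" rather than "practico". Here is a minimal case-preserving sketch of the same lookup idea. The names `ACCENT_MAP`, `PLAIN_FOR` and `stripAccentsPreservingCase` are made up for illustration and are not part of the original function, and the sketch deliberately leaves out the removal of spaces and underscores:

```javascript
// Hypothetical lookup table: each key is a plain lowercase letter,
// each value lists the accented lowercase forms that map to it.
const ACCENT_MAP = {
  a: "áàâä",
  e: "éèêë",
  i: "íìîï",
  o: "óòôö",
  u: "úùûü",
  n: "ñ",
};

// Build a reverse index once: accented character -> plain character,
// registering both the lowercase and the uppercase form.
const PLAIN_FOR = new Map();
for (const [plain, accented] of Object.entries(ACCENT_MAP)) {
  // NFC guarantees one precomposed code point per accented letter.
  for (const ch of accented.normalize("NFC")) {
    PLAIN_FOR.set(ch, plain);
    PLAIN_FOR.set(ch.toUpperCase(), plain.toUpperCase());
  }
}

function stripAccentsPreservingCase(text) {
  // Characters that are not in the table pass through unchanged.
  return Array.from(text.normalize("NFC"), (ch) =>
    PLAIN_FOR.has(ch) ? PLAIN_FOR.get(ch) : ch
  ).join("");
}

console.log(stripAccentsPreservingCase("práctico"));
console.log(stripAccentsPreservingCase("PRÁCTICO"));
```

Because the table is data rather than a chain of `replace` calls, covering another letter is a one-line change. Anything missing from the table is silently left as it is.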

Morten Wennevik

Hi Jon,

I have no idea if this works for all your cases, but what you essentially want is the basic ASCII characters from a string. I believe that accented characters are all in the extended ASCII set, and that just stripping away the most significant bit will leave you with the unaccented basic character. However, this varies with different code pages. For all the characters I have tested, code page 1251 converts correctly, except for æ and Æ.

string s = "áàäãâåéèëêíìïîóòöõôøúùüûýÿ";
byte[] b = Encoding.GetEncoding(1251).GetBytes(s); // 8 bit characters
string t = Encoding.ASCII.GetString(b); // 7 bit characters

t == aaaaaaeeeeiiiioooooouuuuyy

--
Happy coding!
Morten Wennevik [C# MVP]

Nov 16 '05 #2

Hans Kesting

"Morten Wennevik" <Mo************@hotmail.com> wrote in message news:opr9k4y6o1klbvpo@morten_x.edunord...

Hi Jon,

I have no idea if this works for all your cases, but, what you essentially want is the basic ASCII characters from a string. I believe that accented characters are all in the extended ascii set and just stripping away the most significant bit will leave you with the unaccented basic character. However, this varies with different code pages. But for all characters I have tested, codepage 1251 will convert correctly except æ Æ

string s = "áàäãâåéèëêíìïîóòöõôøúùüûýÿ";
byte[] b = Encoding.GetEncoding(1251).GetBytes(s); // 8 bit characters
string t = Encoding.ASCII.GetString(b); // 7 bit characters

t == aaaaaaeeeeiiiioooooouuuuyy

--
Happy coding!
Morten Wennevik [C# MVP]

Morten,

I have not tried your code, so it may well work. But if it does, the reason will be the
conversion within GetBytes/GetString, not your explanation!

If it were just a matter of "stripping the most significant bit", then that bit could be
thought of as meaning "use an accent". But that would mean that there is just one accented "a"
(and clearly there are more).
Or to put it another way: stripping that bit equals subtracting 128 from the character
code. If you start out with different codes (for the various accents), then you can't
end up with just one "a".

Hans Kesting

Nov 16 '05 #3
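
The subtraction argument above can be checked directly. The sketch below assumes Latin-1 (ISO 8859-1) character codes, which coincide with the values JavaScript reports through `charCodeAt`. The helper name `stripHighBit` is made up for illustration. Clearing the top bit of each accented "a" gives a different 7-bit character each time:

```javascript
// Clear the most significant bit of an 8-bit character code (the same as
// subtracting 128 when that bit is set) and return the resulting character.
function stripHighBit(ch) {
  return String.fromCharCode(ch.charCodeAt(0) & 0x7f);
}

// Written as escapes so the code points are explicit:
// U+00E0 is à, U+00E1 is á, U+00E2 is â, U+00E4 is ä.
const accentedAs = ["\u00e0", "\u00e1", "\u00e2", "\u00e4"];

for (const ch of accentedAs) {
  const code = ch.charCodeAt(0);
  console.log(ch, code, "->", code & 0x7f, JSON.stringify(stripHighBit(ch)));
}
```

Under that assumption only á (225) lands on "a" (97). The others, à (224), â (226) and ä (228), come out as "`" (96), "b" (98) and "d" (100), so bit stripping alone cannot be what produces the unaccented letters.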

Morten Wennevik

You are correct. In fact, the conversion to 7-bit is entirely irrelevant, as the byte array already contains the unaccented characters. This strikes me as slightly odd, as I would expect the byte array to contain the characters in 8-bit form, using the 1251 character set.

--
Happy coding!
Morten Wennevik [C# MVP]

Nov 16 '05 #4

Jon Maz

Hi,

But for all characters I have tested,
codepage 1251 will convert correctly
except æ Æ

Any reason to think there might be some other characters not covered by
Morten's method?

Thanks to all for the help!

JON

Nov 16 '05 #5
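
One way to probe for characters that an accent-stripping method might miss is to compare it against Unicode canonical decomposition. The sketch below is not the code page method discussed above. It swaps in a different technique: `String.prototype.normalize("NFD")`, followed by removal of the combining marks in the range U+0300 to U+036F. The function names `stripCombiningMarks` and `unchangedNonAscii` are assumptions for illustration.

```javascript
// Decompose each character into base letter + combining marks (NFD),
// then drop the combining diacritical marks in U+0300..U+036F.
function stripCombiningMarks(text) {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// List the characters that are still outside 7-bit ASCII after stripping,
// i.e. the ones this approach does not cover.
function unchangedNonAscii(text) {
  return Array.from(stripCombiningMarks(text)).filter(
    (ch) => ch.charCodeAt(0) > 0x7f
  );
}

console.log(stripCombiningMarks("práctico PRÁCTICO"));
console.log(unchangedNonAscii("áàäãâåéèëêíìïîóòöõôøúùüûýÿæÆ"));
```

Letters such as ø, æ and Æ have no canonical decomposition, so this approach leaves them untouched and they would need an explicit mapping. The .NET counterpart of the first step is `String.Normalize(NormalizationForm.FormD)`.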
