Removing non-ascii characters from a string

Eps
Hi there,

I believe all strings in .NET are Unicode by default. I am looking for a
way to remove all non-ASCII characters from a string (or optionally
replace them).

There is an article on Code Project which looks like it does
what I want, but I can't help thinking it makes it more complex than it
needs to be.

I have looked at the MSDN pages on Encodings, but I am not very
familiar with this topic.

If I can get a list of ASCII characters then it should be easy to write
a method that checks each char against the list and performs the replace
or remove operation if necessary. Yet I can't find anything exactly
like this with trusty old Google. Is there something I am missing?

If it helps, the reason I need this is that I am writing a front end
for the lame command-line MP3 encoder, which doesn't like being passed, or
asked to output to, file paths containing Unicode characters.

--
Eps
Aug 29 '08 #1
"Eps" <ms**********@epscylonb.com> wrote in message
news:er*************@TK2MSFTNGP05.phx.gbl...
> Hi there,
>
> I believe all strings in .NET are Unicode by default. I am looking for a
> way to remove all non-ASCII characters from a string (or optionally
> replace them).
>
> There is an article on Code Project which looks like it does
> what I want, but I can't help thinking it makes it more complex than it
> needs to be.
>
> I have looked at the MSDN pages on Encodings, but I am not very
> familiar with this topic.
>
> If I can get a list of ASCII characters then it should be easy to write
> a method that checks each char against the list and performs the replace
> or remove operation if necessary. Yet I can't find anything exactly
> like this with trusty old Google. Is there something I am missing?
>
> If it helps, the reason I need this is that I am writing a front end
> for the lame command-line MP3 encoder, which doesn't like being passed, or
> asked to output to, file paths containing Unicode characters.

Perhaps I'm missing something, but this code:-

byte[] asciiChars = Encoding.ASCII.GetBytes("AB £ CD");
string result = Encoding.ASCII.GetString(asciiChars);
Console.WriteLine(result);

creates the string:-

AB ? CD
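
As a minimal sketch of the round trip above: Encoding.ASCII substitutes '?' for anything it cannot encode, and an encoding built with a custom EncoderReplacementFallback can substitute something else, or nothing at all. The class and method names here are hypothetical, and the pound sign in the usage note is only an assumed example of a non-ASCII character.

```csharp
using System;
using System.Text;

public static class AsciiRoundTrip
{
    // Round-trips a string through ASCII bytes. Characters that ASCII
    // cannot represent are replaced by 'replacement'; passing an empty
    // string removes them instead.
    public static string ToAscii(string s, string replacement)
    {
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(replacement),
            new DecoderReplacementFallback("?"));

        byte[] bytes = ascii.GetBytes(s);   // Unicode string -> ASCII bytes
        return ascii.GetString(bytes);      // ASCII bytes -> Unicode string
    }
}
```

With this helper, AsciiRoundTrip.ToAscii("AB \u00A3 CD", "?") behaves like the plain Encoding.ASCII round trip, while passing "" as the replacement drops the character and leaves the two surrounding spaces.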

--
Anthony Jones - MVP ASP/ASP.NET
Aug 29 '08 #2
Eps
Anthony Jones wrote:
> Perhaps I'm missing something, but this code:-
>
> byte[] asciiChars = Encoding.ASCII.GetBytes("AB £ CD");
> string result = Encoding.ASCII.GetString(asciiChars);
> Console.WriteLine(result);
>
> creates the string:-
>
> AB ? CD

I have seen this code before. Can anyone explain why the
Encoding.ASCII.GetString() method does not accept a string as a parameter?

--
Eps
Aug 29 '08 #3
On Aug 29, 1:12 pm, Eps <msnewsgro...@epscylonb.com> wrote:
> > byte[] asciiChars = Encoding.ASCII.GetBytes("AB £ CD");
> > string result = Encoding.ASCII.GetString(asciiChars);
> > Console.WriteLine(result);
> > creates the string:-
> > AB ? CD
>
> I have seen this code before. Can anyone explain why the
> Encoding.ASCII.GetString() method does not accept a string as a parameter?

Because Encoding classes encode and decode CLR strings (which are
_always_ Unicode) to/from byte arrays in a specified encoding, typically
for serialization or interop purposes. There's no such thing as a
non-Unicode System.String (well, you could treat a string as a plain array
of char, but any .NET function will still treat a string as UTF-16).

What you ask is still possible, because ASCII is a pure subset of
Unicode. With LINQ, you could use this one-liner:

string ascii = new string(s.Where(c => (int)c >= 0 && (int)c <= 127).ToArray());

Note however that "ascii" would still be a Unicode string - it just
wouldn't contain any non-ASCII characters.
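
To cover the "optionally replace" half of the question as well, the one-liner can be sketched as a pair of helpers. The class and method names are hypothetical. Since char is an unsigned 16-bit value, the >= 0 check is always true and is omitted here. Both helpers work per UTF-16 code unit, so a character outside the Basic Multilingual Plane (a surrogate pair) counts as two units.

```csharp
using System;
using System.Linq;

public static class AsciiFilter
{
    // Keeps only UTF-16 code units in the ASCII range (0..127).
    public static string RemoveNonAscii(string s)
    {
        return new string(s.Where(c => c <= 127).ToArray());
    }

    // Replaces every non-ASCII code unit with 'substitute'.
    public static string ReplaceNonAscii(string s, char substitute)
    {
        return new string(s.Select(c => c <= 127 ? c : substitute).ToArray());
    }
}
```

For example, AsciiFilter.RemoveNonAscii("Sigur R\u00F3s") drops the accented letter entirely, whereas AsciiFilter.ReplaceNonAscii("Sigur R\u00F3s", '_') keeps a placeholder in its position.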
Aug 29 '08 #4
"Eps" <ms**********@epscylonb.com> wrote in message
news:ef**************@TK2MSFTNGP02.phx.gbl...
> Anthony Jones wrote:
> > Perhaps I'm missing something, but this code:-
> >
> > byte[] asciiChars = Encoding.ASCII.GetBytes("AB £ CD");
> > string result = Encoding.ASCII.GetString(asciiChars);
> > Console.WriteLine(result);
> >
> > creates the string:-
> >
> > AB ? CD
>
> I have seen this code before. Can anyone explain why the
> Encoding.ASCII.GetString() method does not accept a string as a parameter?

As you have already identified, all strings in .NET are Unicode. Hence you'd
be asking GetString to take a Unicode string and return a Unicode string.

I understand what you are saying: you would like it to take a Unicode string
that contains any characters and convert it to a Unicode string that
contains only characters found in the ASCII range.

It would reduce this:-

string sOut = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));

to this:-

string sOut = Encoding.ASCII.GetString(s);

However if the .NET framework reduced every possible scenario of that sort
the framework would become huge and unwieldy.

If you have C# 3 you can do this for yourself:-

public static class Exts
{
    public static string GetString(this Encoding enc, string s)
    {
        return enc.GetString(enc.GetBytes(s));
    }
}

When a code file has a using directive for the namespace to which the
above class belongs, instances of Encoding types will have the overload
you desire.
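
A sketch of how that looks from the calling side, with the extension class and the caller in separate namespaces. The namespace names EncodingHelpers and Caller and the Demo class are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Text;

namespace EncodingHelpers   // hypothetical namespace holding the extension
{
    public static class Exts
    {
        // Round-trips the string through the encoding's byte representation.
        public static string GetString(this Encoding enc, string s)
        {
            return enc.GetString(enc.GetBytes(s));
        }
    }
}

namespace Caller            // hypothetical calling code
{
    using EncodingHelpers;  // brings the extension method into scope

    public static class Demo
    {
        public static string Clean(string s)
        {
            // No built-in GetString overload takes a string, so this call
            // binds to the extension method defined above.
            return Encoding.ASCII.GetString(s);
        }
    }
}
```

Without the using directive the call in Clean would not compile, since none of the built-in GetString overloads accept a string argument.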
--
Anthony Jones - MVP ASP/ASP.NET
Aug 29 '08 #5
"Pavel Minaev" <in****@gmail.com> wrote in message
news:46**********************************@x41g2000hsb.googlegroups.com...
> On Aug 29, 1:12 pm, Eps <msnewsgro...@epscylonb.com> wrote:
> string ascii = new string(s.Where(c => (int)c >= 0 && (int)c <= 127).ToArray());

The problem with a new shiny tool is that we start to search for excuses to
use it, even when it's not the appropriate tool for the job. ;)

--
Anthony Jones - MVP ASP/ASP.NET
Aug 29 '08 #6
"Eps" <ms**********@epscylonb.com> wrote in message
news:ef**************@TK2MSFTNGP02.phx.gbl...
> I have seen this code before. Can anyone explain why the
> Encoding.ASCII.GetString() method does not accept a string as a parameter?

The Encoding classes have a pair of functions that convert between
arrays of bytes and strings.

GetBytes takes as input a Unicode String and returns an array of bytes
that represents that String converted to the chosen encoding.

GetString is provided to perform the opposite conversion (from the byte
array into a Unicode String), so that's why it takes a byte array instead of
a string.

Note that Unicode Strings are the only kind of strings that .NET
supports; any other kind is treated as a byte array, so it wouldn't make
sense to write a GetString taking a String and returning a String, because
it would do nothing. That is, if by String we mean System.String, which
is meant to contain Unicode. Nothing stops you from writing
MyNamespace.MyString and using that class to encapsulate anything that you
want.

Aug 29 '08 #7
Eps
Pavel Minaev wrote:
> Because Encoding classes encode and decode CLR strings (which are
> _always_ Unicode) to/from byte arrays in a specified encoding, typically
> for serialization or interop purposes. There's no such thing as a
> non-Unicode System.String (well, you could treat a string as a plain array
> of char, but any .NET function will still treat a string as UTF-16).
>
> What you ask is still possible, because ASCII is a pure subset of
> Unicode. With LINQ, you could use this one-liner:
>
> string ascii = new string(s.Where(c => (int)c >= 0 && (int)c <= 127).ToArray());
>
> Note however that "ascii" would still be a Unicode string - it just
> wouldn't contain any non-ASCII characters.

Thanks for your replies, I think I have a better understanding of it
now. The code above is pretty much exactly what I am looking
for. All I need to do is make sure the strings that I pass to lame
contain only ASCII characters. I can't test it now, but it should work.

--
Eps
Aug 29 '08 #8
Eps wrote:
> I believe all strings in .NET are Unicode by default. I am looking for a
> way to remove all non-ASCII characters from a string (or optionally
> replace them).

I would use:

s = Regex.Replace(s, @"[^\u0000-\u007F]", "");
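
A slightly fuller sketch of the same idea, with the pattern compiled once and a variant that replaces each run of non-ASCII characters with a single placeholder. The class and method names are hypothetical. Note that in Regex.Replace the replacement argument is a replacement pattern, so a literal '$' in it would need to be written as "$$".

```csharp
using System;
using System.Text.RegularExpressions;

public static class AsciiRegex
{
    // Matches any single UTF-16 code unit outside U+0000..U+007F.
    private static readonly Regex NonAscii = new Regex(@"[^\u0000-\u007F]");

    // Matches one or more consecutive non-ASCII code units.
    private static readonly Regex NonAsciiRun = new Regex(@"[^\u0000-\u007F]+");

    // Removes every non-ASCII code unit.
    public static string Remove(string s)
    {
        return NonAscii.Replace(s, "");
    }

    // Replaces each run of non-ASCII code units with one copy of 'replacement'.
    public static string ReplaceRuns(string s, string replacement)
    {
        return NonAsciiRun.Replace(s, replacement);
    }
}
```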

Arne
Aug 30 '08 #9
> I believe all strings in .NET are Unicode by default. I am looking for a
> way to remove all non-ASCII characters from a string (or optionally
> replace them).

You have some solutions, but be careful what you do with them.
If you have to validate some input, then check whether there are non-ASCII
characters and show an error.
Just removing what you don't like is guaranteed to result in junk.
(What is the meaning of a French word with missing characters?)

A bit like this:
- you ask for a number.
- I give you 0x1A3C7
- you want a decimal number

Option 1: validate and complain that this is not a decimal number
Option 2: remove x, A and C, then interpret 0137 as a decimal number
This is the equivalent of what you do now with non-ASCII stuff :-)
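
Option 1 could be sketched roughly like this for strings: find the first offending character and report it, instead of silently dropping it. The class and method names and the wording of the error message are hypothetical.

```csharp
using System;

public static class AsciiValidator
{
    // Returns the index of the first non-ASCII UTF-16 code unit,
    // or -1 when the whole string is ASCII.
    public static int IndexOfNonAscii(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            if (s[i] > 127)
            {
                return i;
            }
        }
        return -1;
    }

    // Throws instead of silently altering the input.
    public static void RequireAscii(string s)
    {
        int i = IndexOfNonAscii(s);
        if (i >= 0)
        {
            throw new ArgumentException(string.Format(
                "Non-ASCII character U+{0:X4} at index {1}.", (int)s[i], i));
        }
    }
}
```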
--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
Aug 30 '08 #10
Eps
Mihai N. wrote:
> > I believe all strings in .NET are Unicode by default. I am looking for a
> > way to remove all non-ASCII characters from a string (or optionally
> > replace them).
>
> You have some solutions, but be careful what you do with them.
> If you have to validate some input, then check whether there are non-ASCII
> characters and show an error.
> Just removing what you don't like is guaranteed to result in junk.
> (What is the meaning of a French word with missing characters?)
>
> A bit like this:
> - you ask for a number.
> - I give you 0x1A3C7
> - you want a decimal number
>
> Option 1: validate and complain that this is not a decimal number
> Option 2: remove x, A and C, then interpret 0137 as a decimal number
> This is the equivalent of what you do now with non-ASCII stuff :-)

I do agree; if this were a business app I would definitely be
complaining about bad input data.

It's a tool for transcoding MP3s for my own personal use; I doubt anyone
else will ever use it. But you are correct, certain Unicode strings,
for example...

Góðan daginn by Sigur Rós from Með suð í eyrum við spilum endalaust

come out badly mangled after the Unicode-to-ASCII conversion. I don't
see how I can avoid this, unfortunately, and it's something I am personally
willing to put up with.

--
Eps
Aug 30 '08 #11
Eps
Eps wrote:
> I do agree; if this were a business app I would definitely be
> complaining about bad input data.
>
> It's a tool for transcoding MP3s for my own personal use; I doubt anyone
> else will ever use it. But you are correct, certain Unicode strings,
> for example...
>
> Góðan daginn by Sigur Rós from Með suð í eyrum við spilum endalaust
>
> come out badly mangled after the Unicode-to-ASCII conversion. I don't
> see how I can avoid this, unfortunately, and it's something I am personally
> willing to put up with.

Actually, I could have lame output to a temporary file and then use .NET,
which handles Unicode paths, to copy the file to the correct output path.

I will look into that.
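
A rough sketch of that idea, under some assumptions: the encodeTo delegate is a hypothetical stand-in for launching the encoder with the given output path, the class and method names are made up, and the temp directory itself is assumed to have an ASCII-only path (which may not hold if, say, the user name contains non-ASCII characters).

```csharp
using System;
using System.IO;

public static class TempOutput
{
    // Lets the encoder write to an ASCII-only temporary file name, then
    // moves the result to the real (possibly Unicode) destination.
    public static void EncodeViaTempFile(Action<string> encodeTo, string finalPath)
    {
        // A GUID in "N" format contains only hex digits, so the file name is ASCII.
        string tempPath = Path.Combine(
            Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".mp3");
        try
        {
            encodeTo(tempPath);              // e.g. run the encoder with tempPath as output

            if (File.Exists(finalPath))
            {
                File.Delete(finalPath);      // overwrite any previous output
            }
            File.Move(tempPath, finalPath);  // .NET accepts the Unicode destination
        }
        finally
        {
            // Clean up if the encoder failed after creating the temp file.
            if (File.Exists(tempPath))
            {
                File.Delete(tempPath);
            }
        }
    }
}
```

The same trick would presumably be needed in the other direction for input files whose paths contain non-ASCII characters: copy them to a temporary ASCII-named file before handing them to the encoder.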

--
Eps
Aug 30 '08 #12
> Actually, I could have lame output to a temporary file and then use .NET,
> which handles Unicode paths, to copy the file to the correct output path.

I am not sure I understand, but putting together the mp3 part with the path
part, I think that you are trying to automatically organize/rename mp3 files
based on the ID tags inside the mp3 (or something similar).

Although you can use Unicode file names from .NET, you might have
some trouble with many of the mp3 players out there that are still
in the 1990s and have never heard of Unicode :-)

If this is what you need, you might be able to do better:
1. Use full Unicode file names and choose a good MP3 player
(Windows Media Player can do that, no problem, and I am sure
there must be others)
2. Don't convert to ASCII, but to 1252 (Western European code page).
If you are on a system that uses code page 1252 as its ANSI code page,
this will preserve a lot of the characters and non-Unicode players
will still work
3. Don't remove the characters with accents, but try to remove the
accents and keep "the base" character. Here is how:
http://blogs.msdn.com/michkap/archiv...19/376617.aspx
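
For point 3, a commonly used approach (a sketch only, not necessarily identical to what the linked post shows) is to decompose the string with Unicode normalization form D and drop the combining marks. The class and method names are hypothetical. Letters that have no decomposition, such as ð, pass through unchanged and would still need separate handling.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class Diacritics
{
    // Strips accents by decomposing each character into a base character
    // plus combining marks (form D) and discarding the marks.
    public static string RemoveAccents(string s)
    {
        string decomposed = s.Normalize(NormalizationForm.FormD);
        StringBuilder sb = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            // Combining accents fall in the NonSpacingMark category.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }

        // Recompose whatever is left into the usual precomposed form.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

Applied to "Sigur Rós" this keeps the o and drops only the accent, so a later ASCII filter has much less to remove.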
--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
Aug 30 '08 #13
Eps
Mihai N. wrote:
> I am not sure I understand, but putting together the mp3 part with the path
> part, I think that you are trying to automatically organize/rename mp3 files
> based on the ID tags inside the mp3 (or something similar).
>
> Although you can use Unicode file names from .NET, you might have
> some trouble with many of the mp3 players out there that are still
> in the 1990s and have never heard of Unicode :-)
>
> If this is what you need, you might be able to do better:
> 1. Use full Unicode file names and choose a good MP3 player
> (Windows Media Player can do that, no problem, and I am sure
> there must be others)
> 2. Don't convert to ASCII, but to 1252 (Western European code page).
> If you are on a system that uses code page 1252 as its ANSI code page,
> this will preserve a lot of the characters and non-Unicode players
> will still work
> 3. Don't remove the characters with accents, but try to remove the
> accents and keep "the base" character. Here is how:
> http://blogs.msdn.com/michkap/archiv...19/376617.aspx

Good points.

I don't write any tags using lame; I use TagLibSharp to read the tags
from the original file and copy them to the output file. I am not aware
of any encoding issues, but I will look into it.

--
Eps

Aug 30 '08 #14
