Bytes IT Community

way to remove all non-ascii characters from a file?

I have a text file which contains the occasional non-ascii character.
What is the best way to remove all of these in Python?
Jul 18 '05 #1
5 Replies


Something simple like following will work for files
that fit in memory:

def onlyascii(char):
    # Keep only characters in the 7-bit ASCII range (0-127).
    return ord(char) < 128

f = open('filename.ext', 'r')
data = f.read()
f.close()
filtered_data = ''.join(filter(onlyascii, data))

For larger files you will need to loop and read
the data in chunks.
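
A minimal sketch of the chunked approach, assuming Python 3 and a file opened in binary mode. The function name and the 64 KiB chunk size are illustrative choices, not anything from the original post:

```python
def strip_non_ascii_file(src_path, dst_path, chunk_size=64 * 1024):
    """Copy src_path to dst_path, dropping every byte with value >= 128."""
    # The bytes to delete; passing None as the table means "no remapping".
    non_ascii = bytes(range(128, 256))
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break  # end of file
            dst.write(chunk.translate(None, non_ascii))
```

Working on raw bytes means a chunk boundary that falls inside a multi-byte UTF-8 sequence does no harm, since every byte of such a sequence is >= 128 and gets dropped anyway.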

-Larry Bates

Jul 18 '05 #2

omission9 wrote:
I have a text file which contains the occasional non-ascii character.
What is the best way to remove all of these in Python?


import string
open("file2", "w").write("".join(
    [ch for ch in open("file1", "r").read()
     if ch in string.ascii_letters]))

but this will also strip digits, punctuation, line breaks and whatnot :)

(n.b. I didn't actually test the above code, and wrote it for
amusement value :) )
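
One way to keep digits, punctuation and line breaks while still dropping non-ASCII characters is to test against string.printable instead of string.ascii_letters. This is only a sketch, assuming Python 3; the helper name is made up for illustration:

```python
import string

# string.printable covers ASCII digits, letters, punctuation and whitespace.
ALLOWED = set(string.printable)

def keep_printable_ascii(text):
    """Return text with every character outside string.printable removed."""
    return "".join(ch for ch in text if ch in ALLOWED)
```

Note that this also drops ASCII control characters that are not whitespace (NUL, for example), which may or may not be what you want.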

Jul 18 '05 #3

omission9 wrote:
I have a text file which contains the occasional non-ascii character.
What is the best way to remove all of these in Python?


Read it in chunks, then remove the non-ascii characters like so:
>>> t = "".join(map(chr, range(256)))
>>> d = "".join(map(chr, range(128,256)))
>>> "Törichte Logik böser Kobold".translate(t,d)
'Trichte Logik bser Kobold'

and finally write the maimed chunks to a file. However, it's not clear to
me how removing characters could be a good idea in the first place.
Replacing them at least gives some minimal hint that something is missing:
>>> t = "".join(map(chr, range(128))) + "?" * 128
>>> "Törichte Logik böser Kobold".translate(t)
'T?richte Logik b?ser Kobold'
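
The translate calls above rely on Python 2 byte strings. On Python 3, where str holds Unicode text, roughly the same remove-versus-replace choice can be sketched with the codec error handlers. This assumes the text is already decoded and that "ö" is a single precomposed character:

```python
text = "Törichte Logik böser Kobold"

# errors="ignore" silently drops anything that is not ASCII.
removed = text.encode("ascii", errors="ignore").decode("ascii")

# errors="replace" puts "?" in place of each character that cannot be encoded.
replaced = text.encode("ascii", errors="replace").decode("ascii")

print(removed)   # Trichte Logik bser Kobold
print(replaced)  # T?richte Logik b?ser Kobold
```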

Peter
Jul 18 '05 #4

omission9 wrote:
I have a text file which contains the occasional non-ascii character.
What is the best way to remove all of these in Python?


Here's a simple example that does what you want:
>>> orig = "Häring"
>>> "".join([x for x in orig if ord(x) < 128])
'Hring'

-- Gerhard

Jul 18 '05 #5

Gerhard Häring wrote:

omission9 wrote:
I have a text file which contains the occasional non-ascii character.
What is the best way to remove all of these in Python?


Here's a simple example that does what you want:
>>> orig = "Häring"
>>> "".join([x for x in orig if ord(x) < 128])
'Hring'

Or, if performance is critical, something like this might be faster
(a regex might be even better, since it avoids the redundant identity
transformation step):
>>> from string import maketrans, translate
>>> table = maketrans('', '')
>>> translate(orig, table, table[128:])
'Hring'
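
For comparison, the regex idea might look something like the sketch below, assuming Python 3 and already-decoded text. The names are illustrative, and whether it is actually faster has not been measured here:

```python
import re

# Matches any run of characters outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]+")

def strip_non_ascii(text):
    """Return text with all non-ASCII characters removed."""
    return NON_ASCII.sub("", text)
```

For example, strip_non_ascii("Häring") gives 'Hring', the same result as the translate version.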
-Peter
Jul 18 '05 #6
