way to remove all non-ascii characters from a file?

I have a text file which contains the occasional non-ascii character.
What is the best way to remove all of these in python?
Jul 18 '05 #1
Something simple like the following will work for files
that fit in memory:

def onlyascii(char):
    # Keep 7-bit ASCII (0-127), including spaces and newlines.
    if ord(char) > 127: return ''
    else: return char

f=open('filename.ext','r')
data=f.read()
f.close()
filtered_data=''.join(filter(onlyascii, data))

For larger files you will need to loop and read
the data in chunks.
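
A minimal sketch of that chunked loop, assuming Python 3 and binary-mode I/O, so that each item in a chunk is a byte value from 0 to 255. The function name `strip_non_ascii_file`, its parameters, and the 64 KiB chunk size are illustrative choices, not something taken from this thread.

```python
def strip_non_ascii_file(src_path, dst_path, chunk_size=64 * 1024):
    """Copy src_path to dst_path, dropping every byte above 127."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # an empty bytes object means end of file
                break
            # Iterating over bytes yields ints; keep only 7-bit values.
            dst.write(bytes(b for b in chunk if b < 128))
```

Working on raw bytes suits ASCII-compatible encodings such as Latin-1 and UTF-8, where every byte of a non-ASCII character is 128 or higher, so a chunk boundary that falls inside a multi-byte character does no harm. It would not be appropriate for UTF-16 or similar encodings.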

-Larry Bates
----------------------------
"omission9" <ru******@salemstate.edu> wrote in message
news:de**************************@posting.google.com...
I have a text file which contains the occasional non-ascii character.
What is the best way to remove all of these in python?

Jul 18 '05 #2

import string

# Keeps only ASCII letters; everything else is dropped.
open("file2","w").write("".join(
    [ch for ch in open("file1", "r").read()
     if ch in string.ascii_letters]))

but this keeps only ASCII letters, so it will also strip digits, punctuation, line breaks and whatnot :)

(n.b. I didn't actually test the above code, and wrote it for its
amusement value :) )

Jul 18 '05 #3

Read it in chunks, then remove the non-ascii characters like so:
>>> t = "".join(map(chr, range(256)))
>>> d = "".join(map(chr, range(128,256)))
>>> "Törichte Logik böser Kobold".translate(t,d)
'Trichte Logik bser Kobold'

and finally write the maimed chunks to a file. However, it's not clear to
me how removing characters could be a good idea in the first place.
Replacing them at least gives some minimal hint that something is missing:
>>> t = "".join(map(chr, range(128))) + "?" * 128
>>> "Törichte Logik böser Kobold".translate(t)
'T?richte Logik b?ser Kobold'
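
The `translate` calls above rely on Python 2 byte strings. A rough Python 3 equivalent, assuming the text is Latin-1 encoded so that each character is exactly one byte, can use `bytes.translate`. The variable names are only illustrative.

```python
text = "Törichte Logik böser Kobold".encode("latin-1")

# Deleting: passing None as the table leaves the remaining bytes
# unchanged, and the second argument lists the bytes to delete.
non_ascii = bytes(range(128, 256))
removed = text.translate(None, non_ascii)

# Replacing: map bytes 0-127 to themselves and 128-255 to "?".
table = bytes(range(128)) + b"?" * 128
replaced = text.translate(table)

print(removed.decode("ascii"))   # Trichte Logik bser Kobold
print(replaced.decode("ascii"))  # T?richte Logik b?ser Kobold
```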

Peter
Jul 18 '05 #4

Here's a simple example that does what you want:
>>> orig = "Häring"
>>> "".join([x for x in orig if ord(x) < 128])
'Hring'
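
On Python 3 strings, a different technique gives the same result as the comprehension above: encoding to ASCII with an error handler. This is a sketch using the standard `'ignore'` and `'replace'` handlers of `str.encode`, not the method shown in this post.

```python
orig = "Häring"

# 'ignore' silently drops characters that have no ASCII encoding.
dropped = orig.encode("ascii", "ignore").decode("ascii")

# 'replace' substitutes "?" so a reader can see something was removed.
marked = orig.encode("ascii", "replace").decode("ascii")

print(dropped)  # Hring
print(marked)   # H?ring
```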

-- Gerhard

Jul 18 '05 #5
Gerhard Häring wrote:

Here's a simple example that does what you want:
>>> orig = "Häring"
>>> "".join([x for x in orig if ord(x) < 128])
'Hring'

Or, if performance is critical, something like this might
be faster (a regex might be even better, since it avoids the redundant
identity transformation step):
>>> from string import maketrans, translate
>>> table = maketrans('', '')
>>> translate(orig, table, table[128:])
'Hring'
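
The regex variant mentioned above could look like this on Python 3. It is an untimed sketch, so whether it is actually faster is left open, and the name `NON_ASCII` is illustrative.

```python
import re

# Match any run of characters outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]+")

orig = "Häring"
print(NON_ASCII.sub("", orig))  # Hring
```

Passing `"?"` instead of `""` as the replacement would substitute one question mark per run of removed characters.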
-Peter
Jul 18 '05 #6
