Hello list members,
I am using the encoding function to convert Unicode to ASCII. At one point
this code was working just fine; however, it has now broken.
I am reading a text file that is in Unicode (I am unsure of which
flavour or bit depth). As I read in the file one line at a time
(readlines()), it converts to ASCII. Simple enough. At the same time I am
compressing to bz2 with the bz2 module, but that works just fine. The code
and the error reported appear below. I am unsure what to do.
I assume that, because it is reporting that the ordinal is not in range, this
has something to do with the character width that I am reading?
Peter W.

    def encode_file(file_path, encode_type, compress='N'):
        """
        Changes encoding of file
        """
        new_encode = encode_type
        old_file_path = file_path + '.old'
        new_file_path = file_path
        os.rename(file_path, old_file_path)
        file_in = file(old_file_path, 'r')

        if compress == 'Y' or compress == 'y':
            bz_file_path = file_path + '.bz2'
            bz_file_out = bz2.BZ2File(bz_file_path, 'w')
            for line in file_in.readlines():
                bz_file_out.write(line.encode(new_encode))
            bz_file_out.close()

        else:
            file_out = file(file_path, 'w')
            for line in file_in.readlines():
                file_out.write(line.encode(new_encode))
            file_out.close()

        file_in.close()
        os.remove(old_file_path)

ERROR Reported:

    Parsing X:\GenomeQuebec_repository\microarray\HIS\M15K\Step_1_repository\HISH0224.txt
    Traceback (most recent call last):
      File "C:\Program Files\ActiveState Komodo 2.5\callkomodo\kdb.py", line 433, in _do_start
        self.kdb.run(code_ob, locals, locals)
      File "C:\Python23\lib\bdb.py", line 350, in run
        exec cmd in globals, locals
      File "C:\Python23\Lib\site-packages\xBio\Scripts\unicodeToAscii.py", line 158, in ?
        main()
      File "C:\Python23\Lib\site-packages\xBio\Scripts\unicodeToAscii.py", line 75, in main
        encode_file(fileToProcess, options.encode, 'Y')
      File "C:\Python23\Lib\site-packages\xBio\Scripts\unicodeToAscii.py", line 144, in encode_file
        bz_file_out.write(line.encode(new_encode))
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

I've encountered this problem before, and I've come up with a fix that
works but is probably not the best:

    def is_ord(strng):
        # Keep only the characters whose ordinal fits in 7-bit ASCII.
        new_text = ''
        for i in strng:
            if ord(i) > 127:
                new_text = new_text + ''
            else:
                new_text = new_text + i
        return new_text

    #Then just,
    text_from_file = is_ord(text_from_file)

Tom
Thanks Tom B.,
I will try that for now ....
It would be good to find out _why_ this happens in the first place. I will
keep doing a little searching on this for a few days.
Peter W.
(In reply to Tom B.'s message of 02:04 PM 8/6/2004.)
Peter Wilkinson <pw********@videotron.ca> writes:
> It would be good to find out _why_ this happens in the first place. I will
> keep doing a little searching on this for a few days.
Most likely because you have characters in that file that are not in the
ASCII character set. ASCII is, after all, only a very small subset of
Unicode. E.g.

    >>> u"ä".encode("ascii")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)

If it's OK to lose information, you could use the errors argument to
.encode, like

    >>> u"ä".encode("ascii", "ignore")
    ''

or

    >>> u"ä".encode("ascii", "replace")
    '?'

Bernhard
--
Intevation GmbH http://intevation.de/
Skencil http://sketch.sourceforge.net/
Thuban http://thuban.intevation.org/
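
For readers on Python 3, where every str is already Unicode and the u prefix is optional, the errors argument Bernhard describes can be sketched as follows. This is only an illustration; the sample string is made up and is not taken from the file discussed in this thread.

```python
# Python 3 sketch of the errors argument to str.encode().
text = "Gr\u00e4fin"  # hypothetical sample; \u00e4 is the letter a-umlaut

try:
    text.encode("ascii")  # the default, errors="strict", raises
except UnicodeEncodeError as exc:
    print("strict:", exc.reason)

dropped = text.encode("ascii", "ignore")    # silently drops the umlaut
replaced = text.encode("ascii", "replace")  # substitutes "?" for it
print(dropped, replaced)
```

Both non-strict handlers lose information, so they are only appropriate when that loss is acceptable.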
I tried the function, but it does not seem to work as I expected.
What happens is that the character encoding seems to change in the
following way: the equivalent of some return character is placed
after each character ... or, when I view the file in Excel, there is a blank
row between each pair of rows.
It's very strange.
Back to the drawing board.
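
One possible explanation for the blank rows, assuming the input is UTF-16 (which a later reply in this thread suspects, but which is not confirmed): dropping every byte above 127 removes the byte order mark but leaves the zero byte that accompanies each ASCII character. A minimal Python 3 sketch with made-up sample data:

```python
import codecs

sample = "id\tvalue\r\n1\t42\r\n"  # hypothetical two-row table
raw = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")

# Same filter as is_ord(), applied to bytes: keep only values <= 127.
filtered = bytes(b for b in raw if b <= 127)

print(raw[:6])       # b'\xff\xfei\x00d\x00'
print(filtered[:4])  # b'i\x00d\x00'
```

The leftover NUL bytes, including the ones that follow each \r and \n, would be consistent with stray characters and extra rows in a spreadsheet, though that is an inference rather than something established in the thread.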
(In reply to Peter Wilkinson's message of 02:17 PM 8/6/2004.)
Well, this is interestingly annoying:

    u"ä".encode("ascii", "ignore") -> ''

works just fine, but as I have written it:

    aa = "ä"
    aa.encode("ascii", "ignore") ->
    Traceback (most recent call last):
      File "<interactive input>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

So I am guessing that I don't understand something about the syntax.
Peter
(In reply to Bernhard Herzog's message of 02:31 PM 8/6/2004.)
Peter Wilkinson wrote:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0:
> ordinal not in range(128)
0xff in position 0? If there is a 0xfe in position 1, I would suspect
you're dealing with the Byte Order Mark for a UTF-16 encoded file (UTF-16
LE to be precise). What happens if you skip the first 2 bytes of the file?
--
Vincent Wehren
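
Vincent's check can be sketched in Python 3 as below. The helper name sniff_utf16 and the sample bytes are hypothetical; the BOM constants come from the standard codecs module. The sketch deliberately ignores UTF-32, whose little-endian BOM also begins with ff fe.

```python
import codecs

def sniff_utf16(head):
    """Return a codec name if head starts with a UTF-16 BOM, else None."""
    if head.startswith(codecs.BOM_UTF16_LE):  # b'\xff\xfe'
        return "utf-16-le"
    if head.startswith(codecs.BOM_UTF16_BE):  # b'\xfe\xff'
        return "utf-16-be"
    return None

# Made-up file contents: a little-endian BOM followed by four characters.
data = codecs.BOM_UTF16_LE + "HISH".encode("utf-16-le")

codec = sniff_utf16(data[:2])
text = data[2:].decode(codec)  # skip the two BOM bytes, then decode
```

Decoding with the plain "utf-16" codec would also consume the BOM automatically; skipping it by hand just mirrors the experiment suggested above.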
Peter Wilkinson wrote:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0:
> ordinal not in range(128)
That error actually says what happened: You have the byte with the
numeric value 0xff in the input, and the ASCII (American Standard
Code for Information Interchange) converter cannot convert that
into a Unicode character. This is because ASCII is a 7-bit character
set, i.e. it goes from 0..127. 0xFF is 255, so it is out of range.
Now, the line triggering this is

    bz_file_out.write(line.encode(new_encode))

and it invokes *encode*, not *decode*. Why would it give a decode error
then?
Because:

    decode: take a byte string, return a Unicode string
    encode: take a Unicode string, return a byte string

So line should be a Unicode string for .encode to be a meaningful thing
to do. Unfortunately, Python supports .encode also for byte strings.
If new_encode defines a character encoding, this does

    class str:
        def encode(self, encoding):
            unistr = unicode(self)
            return unistr.encode(encoding)

So it first tries to convert the current string into Unicode, which
uses the system default encoding, which is us-ascii. Hence the error.
HTH,
Martin
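
In Python 3, bytes objects no longer have an .encode method, so the hidden step Martin describes cannot happen silently. The sketch below spells out the two steps by hand to reproduce the same error; py2_style_encode is a hypothetical helper name, not a real API, and the sample bytes are assumed to resemble the start of a UTF-16 LE file.

```python
def py2_style_encode(raw, encoding, default="ascii"):
    """Mimic Python 2's str.encode: an implicit decode, then the encode."""
    unistr = raw.decode(default)  # the hidden step that fails
    return unistr.encode(encoding)

line = b"\xff\xfeH\x00"  # assumed sample: BOM plus the letter H
error = None

try:
    py2_style_encode(line, "ascii")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Decoding explicitly with a suitable codec avoids the problem.
fixed = line.decode("utf-16").encode("ascii")
```

The exception comes from the decode step, which is why the traceback names a decode error even though the calling code only mentions encode.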
thanks for the clear explanation.
I modified my code and now this works :)
Peter
(In reply to Martin v. Löwis's message of 03:46 PM 8/6/2004.)
Hi!
Try:

    aa = u"ä"
    aa.encode("ascii","ignore")

Sorry!
The COMPLETE script is:

    # -*- coding: cp1252 -*-
    aa = u"ä"
    aa.encode("ascii","ignore")

Thanks for the help,
I have got it working. The problem was that I was not reading the file into
a Unicode string.
Peter
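
Peter did not post his corrected code, so the following is only a guess at what such a fix might look like, written for Python 3. It assumes the source file is UTF-16 (suggested earlier in the thread, never confirmed); the source, errors and boolean compress parameters are additions for this sketch and are not part of the original function.

```python
import bz2
import os

def encode_file(file_path, target="ascii", source="utf-16",
                compress=False, errors="strict"):
    """Re-encode file_path from source to target, optionally as .bz2."""
    old_path = file_path + ".old"
    os.rename(file_path, old_path)
    # Decode while reading, so each line is real text rather than raw bytes.
    # newline="" leaves the original line endings untouched.
    with open(old_path, "r", encoding=source, newline="") as file_in:
        if compress:
            out = bz2.open(file_path + ".bz2", "wb")
        else:
            out = open(file_path, "wb")
        with out:
            for line in file_in:
                out.write(line.encode(target, errors))
    os.remove(old_path)
```

With errors="strict" any character outside the target encoding still raises, but now as a UnicodeEncodeError that points at the real offending character; passing "ignore" or "replace" trades that for data loss.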
(In reply to Michel Claveau's message of 04:22 AM 8/7/2004.)
Michel> # -*- coding: cp1252 -*-
Michel> aa = u"ä"
Michel> aa.encode("ascii","ignore")
A somewhat less destructive solution might be to try my latscii codec: http://manatee.mojam.com/~skip/python/latscii.py
(assuming your input is encoded as latin-1).
Skip