unicode to ascii converting

Hello list members,

I am using the encode function to convert Unicode to ASCII. At one point
this code was working just fine; however, it has now broken.

I am reading a text file that is in Unicode (I am unsure of which
flavour or bit depth). As I read in the file one line at a time
(readlines()), each line is converted to ASCII. Simple enough. At the same
time I am compressing to bz2 with the bz2 module, but that works just fine.
The code and the error reported appear below. I am unsure what to do.

I assume that, because it is reporting that the ordinal is not in range,
this has something to do with the character width that I am reading?

Peter W.

def encode_file(file_path, encode_type, compress='N'):
    """
    Changes encoding of file
    """
    new_encode = encode_type
    old_file_path = file_path + '.old'
    new_file_path = file_path
    os.rename(file_path, old_file_path)
    file_in = file(old_file_path, 'r')

    if compress == 'Y' or compress == 'y':
        bz_file_path = file_path + '.bz2'
        bz_file_out = bz2.BZ2File(bz_file_path, 'w')
        for line in file_in.readlines():
            bz_file_out.write(line.encode(new_encode))
        bz_file_out.close()

    else:
        file_out = file(file_path, 'w')
        for line in file_in.readlines():
            file_out.write(line.encode(new_encode))
        file_out.close()

    file_in.close()
    os.remove(old_file_path)

ERROR Reported:

Parsing
X:\GenomeQuebec_repository\microarray\HIS\M15K\Step_1_repository\HISH0224.txt
Traceback (most recent call last):
  File "C:\Program Files\ActiveState Komodo 2.5\callkomodo\kdb.py", line 433, in _do_start
    self.kdb.run(code_ob, locals, locals)
  File "C:\Python23\lib\bdb.py", line 350, in run
    exec cmd in globals, locals
  File "C:\Python23\Lib\site-packages\xBio\Scripts\unicodeToAscii.py", line 158, in ?
    main()
  File "C:\Python23\Lib\site-packages\xBio\Scripts\unicodeToAscii.py", line 75, in main
    encode_file(fileToProcess, options.encode, 'Y')
  File "C:\Python23\Lib\site-packages\xBio\Scripts\unicodeToAscii.py", line 144, in encode_file
    bz_file_out.write(line.encode(new_encode))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

Jul 18 '05 #1
12 Replies


I've encountered this problem before, and I've come up with a
fix that works but is probably not the best:

def is_ord(strng):
    # Keep only the characters whose ordinal fits in 7-bit ASCII.
    new_text = ''
    for i in strng:
        if ord(i) > 127:
            new_text = new_text + ''  # drop the non-ASCII character
        else:
            new_text = new_text + i
    return new_text

# Then just:

text_from_file = is_ord(text_from_file)

Tom
Jul 18 '05 #2

Thanks Tom B.,

I will try that for now ....

It would be good to find out _why_ this happens in the first place. I will
keep searching on this for a few days.
Peter W.
Jul 18 '05 #3

Peter Wilkinson <pw********@videotron.ca> writes:
It would be good to find out _why_ this happens in the first place. I
will keep searching on this for a few days.


Most likely because you have characters in that file that are not in the
ASCII character set. ASCII is, after all, only a very small subset of
Unicode. E.g.

>>> u"ä".encode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)

If it's OK to lose information, you could use the errors argument to
.encode, like

>>> u"ä".encode("ascii", "ignore")
''

or

>>> u"ä".encode("ascii", "replace")
'?'
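
For comparison, here is a small sketch of the same idea in Python 3 syntax (where every str is already Unicode and .encode returns bytes). Besides "ignore" and "replace", the standard error handlers "xmlcharrefreplace" and "backslashreplace" keep the information in an escaped form rather than discarding it. The sample word is only an illustration, not data from the file discussed here.

```python
# Python 3 sketch: compare the standard error handlers of str.encode.
text = "Gr\u00e4fin"  # contains U+00E4 (a-umlaut), which ASCII cannot represent

handlers = ["ignore", "replace", "xmlcharrefreplace", "backslashreplace"]
results = {name: text.encode("ascii", errors=name) for name in handlers}

for name in handlers:
    print(name, results[name])
```

Which handler is appropriate depends on whether the consumer of the output can cope with escapes or simply needs plain 7-bit text.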
Bernhard

--
Intevation GmbH http://intevation.de/
Skencil http://sketch.sourceforge.net/
Thuban http://thuban.intevation.org/
Jul 18 '05 #4

I tried the function, but it does not seem to work as I expected.

What happens is that the character encoding seems to change in the
following way: the equivalent of some return character is placed
after each character ... or, when I view the file in Excel, there is a blank
row between each pair of rows.

It's very strange.

Back to the drawing board.
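
One possible explanation, assuming the file is UTF-16 (which the 0xff byte at position 0 in the traceback hints at, but which is not confirmed here): dropping only the bytes above 127 removes the byte order mark but leaves a NUL byte after every character, so the \r and \n of each line ending are no longer adjacent. A minimal Python 3 sketch with a made-up sample line:

```python
import codecs

# Assumed sample: one short line, encoded as UTF-16 LE with a BOM.
raw = codecs.BOM_UTF16_LE + "AB\r\n".encode("utf-16-le")

# Same idea as is_ord(), applied to bytes: keep only values below 128.
stripped = bytes(b for b in raw if b <= 127)

print(stripped)               # NUL bytes are still interleaved
print(raw.decode("utf-16"))   # decoding with the right codec gives clean text
```

If that is what is happening, the filtered file is still not ASCII text in any useful sense, which would be consistent with the odd rows seen in Excel.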
Jul 18 '05 #5

Well this is interestingly annoying:

u"ä".encode("ascii", "ignore") -> ''   # works just fine, but as I have written it:

aa = "ä"
aa.encode("ascii", "ignore") ->

Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

So I am guessing that I don't understand something about the syntax.

Peter

Jul 18 '05 #6

Peter Wilkinson wrote:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0:
ordinal not in range(128)


0xff in position 0? If there is a 0xfe in position 1, I would suspect
you're dealing with the Byte Order Mark of a UTF-16 encoded file (UTF-16
LE to be precise). What happens if you skip the first 2 bytes of the file?
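
A quick way to check for a byte order mark is to compare the first few bytes of the file against the BOM constants in the standard codecs module. The sketch below is in Python 3 syntax; the helper name sniff_bom is made up for illustration, and the sample bytes are not taken from the actual file.

```python
import codecs

# Known BOMs, longest first so UTF-32 LE is not mistaken for UTF-16 LE
# (both start with 0xff 0xfe).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(head):
    """Return (encoding, bom_length) for the leading bytes, or (None, 0)."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return None, 0

print(sniff_bom(b"\xff\xfeH\x00i\x00"))
```

In practice one would pass in the first four bytes read from the file in binary mode. A file without a BOM returns (None, 0), which says nothing about its encoding either way.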

--
Vincent Wehren
Jul 18 '05 #7

Peter Wilkinson wrote:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0:
ordinal not in range(128)


That error actually says what happened: You have the byte with the
numeric value 0xff in the input, and the ASCII (American Standard
Code for Information Interchange) converter cannot convert that
into a Unicode character. This is because ASCII is a 7-bit character
set, i.e. it goes from 0..127. 0xFF is 255, so it is out of range.

Now, the line triggering this is

bz_file_out.write(line.encode(new_encode))

and it invokes *encode*, not *decode*. Why would it give a decode error
then?

Because:

decode: take a byte string, return a Unicode string
encode: take a Unicode string, return a byte string

So line should be a Unicode string for .encode to be a meaningful thing
to do. Unfortunately, Python also supports .encode for byte strings.
If new_encode names a character encoding, this effectively does

class str:
    def encode(self, encoding):
        unistr = unicode(self)
        return unistr.encode(encoding)

So it first tries to convert the current string into Unicode, which
uses the system default encoding, which is us-ascii. Hence the error.
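
The hidden decode step can be reproduced in Python 3 syntax, where bytes no longer have an .encode method at all. This is only a sketch: py2_style_encode is a made-up helper that imitates the Python 2 behaviour described above, and the sample bytes assume a UTF-16 LE file with a BOM.

```python
def py2_style_encode(byte_string, encoding, default_encoding="ascii"):
    """Imitate Python 2's str.encode: implicit decode first, then encode."""
    unistr = byte_string.decode(default_encoding)  # the hidden step that fails
    return unistr.encode(encoding)

line = b"\xff\xfeH\x00i\x00"  # assumed sample: UTF-16 LE bytes with a BOM

try:
    py2_style_encode(line, "ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)

# Decoding explicitly with the real source encoding avoids the problem.
fixed = line.decode("utf-16").encode("ascii")
print(fixed)
```

The point is that the source encoding has to be stated somewhere. If it is not, the default (ASCII) is used, and the first non-ASCII byte raises the decode error.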

HTH,
Martin
Jul 18 '05 #8

Thanks for the clear explanation.

I modified my code and now this works :)
Peter
Jul 18 '05 #9

Hi!

Try:

aa = u"ä"
aa.encode("ascii", "ignore")

Jul 18 '05 #10

Sorry!

The COMPLETE script is:

# -*- coding: cp1252 -*-
aa = u"ä"
aa.encode("ascii", "ignore")


Jul 18 '05 #11

Thanks for the help,

I have got it working. The problem was that I was not reading the file
into the string as Unicode.
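
The final code is not shown in this thread, so the following is only a sketch of what "reading as Unicode" could look like, in Python 3 syntax. The function name recode_to_bz2, the default source encoding of UTF-16 and the "replace" error policy are all assumptions. io.open decodes each line while reading, so .encode is always called on a real Unicode string.

```python
import bz2
import io

def recode_to_bz2(src_path, dst_path, src_encoding="utf-16",
                  dst_encoding="ascii", errors="replace"):
    """Decode src_path as src_encoding, write it to dst_path as bz2."""
    # newline="" keeps the original line endings untouched.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as file_in:
        with bz2.BZ2File(dst_path, "w") as bz_file_out:
            for line in file_in:
                # line is a Unicode string here, so no implicit decode occurs.
                bz_file_out.write(line.encode(dst_encoding, errors))
```

A call would look like recode_to_bz2("HISH0224.txt", "HISH0224.txt.bz2"). Unlike the original encode_file, this sketch does not rename or delete the source file.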

Peter

Jul 18 '05 #12

Michel> # -*- coding: cp1252 -*-
Michel> aa = u"ä"
Michel> aa.encode("ascii","ignore")

A somewhat less destructive solution might be to try my latscii codec:

http://manatee.mojam.com/~skip/python/latscii.py

(assuming your input is encoded as latin-1).
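
The latscii codec itself is not reproduced in this thread, so the following is not that code. It is a different, standard-library-only technique with a similar goal, shown in Python 3 syntax: decompose accented characters with unicodedata.normalize and then drop the combining marks, so that "ä" becomes "a" instead of disappearing. The function name to_ascii_approx is made up for illustration.

```python
import unicodedata

def to_ascii_approx(text):
    """Strip accents by decomposing, then dropping the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii_approx("Gr\u00e4fin caf\u00e9"))
```

Characters with no decomposition, such as "ß", are still dropped entirely by this approach, so a hand-written mapping table may be needed for them.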

Skip
Jul 18 '05 #13
