
Python 2.1 / 2.3: xreadlines not working with codecs.open

Hi all,

I just found a problem with the xreadlines method/module when used with codecs.open: xreadlines does not seem to take the codec specified in the open call into account, and it returns byte-strings instead of unicode strings.

For example, if a file foo.txt contains some text encoded in latin1:
import codecs
f = codecs.open('foo.txt', 'r', 'utf-8', 'replace')
[l for l in f.xreadlines()]

['\xe9\xe0\xe7\xf9\n']

But:
import codecs
f = codecs.open('foo.txt', 'r', 'utf-8', 'replace')
f.readlines()

[u'\ufffd\ufffd']

With readlines, the latin1 characters are correctly "dumped" (replaced during decoding), but with xreadlines they come back as byte-strings still encoded in latin1.

I tested with Python 2.1 and 2.3 on Linux and Windows, with the same result (I don't have Python 2.4 installed here).

Can anybody confirm the problem? Is this a bug? I searched this newsgroup and the known Python bugs, but the problem does not seem to have been reported yet.

TIA
--
python -c "print ''.join([chr(154 - ord(c)) for c in 'U(17zX(%,5.zmz5(17;8(%,5.Z65\'*9--56l7+-'])"
Jul 19 '05 #1
3 Replies


On Thu, 23 Jun 2005 14:23:34 +0200, Eric Brunel <er*********@despammed.com> wrote:
Hi all,

I just found a problem in the xreadlines method/module when used with codecs.open: the codec specified in the open does not seem to be taken into account by xreadlines which also returns byte-strings instead of unicode strings.

For example, if a file foo.txt contains some text encoded in latin1:
import codecs
f = codecs.open('foo.txt', 'r', 'utf-8', 'replace')
[l for l in f.xreadlines()]

['\xe9\xe0\xe7\xf9\n']

But:
import codecs
f = codecs.open('foo.txt', 'r', 'utf-8', 'replace')
f.readlines()

[u'\ufffd\ufffd']

The characters in latin1 are correctly "dumped" with readlines, but are still in latin1 encoding in byte-strings with xreadlines.


Replying to myself. One more funny thing:
import codecs, xreadlines
f = codecs.open('foo.txt', 'r', 'utf-8', 'replace')
[l for l in xreadlines.xreadlines(f)]

[u'\ufffd\ufffd']

So f.xreadlines does not work, but xreadlines.xreadlines(f) does. And this happens in Python 2.3, but also in Python 2.1, where the implementation for f.xreadlines() calls xreadlines.xreadlines(f) (?!?). Something's escaping me here... Reading the source didn't help.

At least, it does provide a workaround...
--
python -c "print ''.join([chr(154 - ord(c)) for c in 'U(17zX(%,5.zmz5(17;8(%,5.Z65\'*9--56l7+-'])"
Jul 19 '05 #2

Eric Brunel wrote:
I just found a problem in the xreadlines method/module when used with
codecs.open: the codec specified in the open does not seem to be taken
into account by xreadlines which also returns byte-strings instead of
unicode strings.

So f.xreadlines does not work, but xreadlines.xreadlines(f) does. And this
happens in Python 2.3, but also in Python 2.1, where the implementation
for f.xreadlines() calls xreadlines.xreadlines(f) (?!?). Something's
escaping me here... Reading the source didn't help.

codecs.StreamReaderWriter seems to delegate everything it doesn't implement
itself to the underlying file instance which is ignorant of the encoding.
The culprit:

    def __getattr__(self, name,
                    getattr=getattr):

        """ Inherit all other methods from the underlying stream.
        """
        return getattr(self.stream, name)

At least, it does provide a workaround...


Note that the xreadlines module hasn't made it into Python 2.4.
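
The delegation described above can be reproduced with a small toy wrapper. This is only a sketch in modern Python 3 syntax, not the codecs source: DecodingReader and iter_lines are hypothetical names, and iter_lines merely assumes that a module-level helper such as xreadlines.xreadlines(f) calls read methods on the object it is handed. It shows why a method the wrapper does not define returns undecoded bytes, while a helper that goes through the wrapper's own readline() gets decoded text.

```python
import io


class DecodingReader:
    """Toy stand-in for a codec-aware wrapper (hypothetical, not codecs itself)."""

    def __init__(self, stream, encoding, errors="strict"):
        self.stream = stream
        self.encoding = encoding
        self.errors = errors

    def readline(self):
        # Defined on the wrapper, so the bytes are decoded before returning.
        return self.stream.readline().decode(self.encoding, self.errors)

    def __getattr__(self, name):
        # Only reached for attributes the wrapper lacks: the lookup falls
        # through to the raw stream, which knows nothing about the encoding.
        return getattr(self.stream, name)


def iter_lines(obj):
    """Hypothetical helper: calls readline() on whatever object it is given."""
    while True:
        line = obj.readline()
        if not line:
            break
        yield line


raw = io.BytesIO(b"\xe9\xe0\xe7\xf9\n")
reader = DecodingReader(raw, "latin-1")

# readlines() is not defined on the wrapper, so it is delegated: raw bytes.
delegated = reader.readlines()

raw.seek(0)
# The helper goes through the wrapper's own readline(): decoded text.
decoded = list(iter_lines(reader))

print(delegated)
print(decoded)
```

If this mirrors what happens in the thread, f.xreadlines() corresponds to the delegated call, and xreadlines.xreadlines(f) corresponds to the helper that is passed the wrapper itself.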

Peter

Jul 19 '05 #3


"Eric Brunel" <er*********@despammed.com> wrote in message news:op**************@eb.pragmadev...

Replying to myself. One more funny thing:
import codecs, xreadlines
f = codecs.open('foo.txt', 'r', 'utf-8', 'replace')
[l for l in xreadlines.xreadlines(f)]

[u'\ufffd\ufffd']


You've specified utf-8 as the encoding instead of iso8859-1,
by the way.
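
A minimal sketch of the difference the encoding makes, written for modern Python 3 rather than the 2.x interpreters used in the thread, and decoding the bytes directly instead of going through codecs.open. The byte string is the one shown in the original post; the exact number of replacement characters depends on the decoder version, so the example only checks that some appear.

```python
data = b"\xe9\xe0\xe7\xf9\n"  # latin1-encoded bytes from the original post

# Decoding with the encoding the file was actually written in keeps the text.
as_latin1 = data.decode("iso-8859-1")

# Decoding as UTF-8 with errors="replace" substitutes U+FFFD for the
# byte sequences that are not valid UTF-8.
as_utf8 = data.decode("utf-8", "replace")

print(repr(as_latin1))
print(repr(as_utf8))
```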
Jul 19 '05 #4
