utf8 and ftplib

Richard Lewis

Hi there,

I'm having a problem with unicode files and ftplib (using Python 2.3.5).

I've got this code:

xml_source = codecs.open("foo.xml", 'w+b', "utf8")
#xml_source = file("foo.xml", 'w+b')

ftp.retrbinary("RETR foo.xml", xml_source.write)
#ftp.retrlines("RETR foo.xml", xml_source.write)

It opens a new local file using utf8 encoding and then reads from a file
on an FTP server (also utf8 encoded) into that local file. It comes up
with an error, however, on calling the xml_source.write callback (I
think) saying that:

"  File "myscript.py", line 75, in get_content
    ftp.retrbinary("RETR foo.xml", xml_source.write)
  File "/usr/lib/python2.3/ftplib.py", line 384, in retrbinary
    callback(data)
  File "/usr/lib/python2.3/codecs.py", line 400, in write
    return self.writer.write(data)
  File "/usr/lib/python2.3/codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 76:
ordinal not in range(128)"

I've tried using both the commented lines of code in the above example
(i.e. using file() instead of codecs.open() and retrlines() instead of
retrbinary()). retrlines() makes no difference, but if I use file()
instead of codecs.open() I can open the file, but the extended
characters from the source file (e.g. foreign characters, copyright
symbol, etc.) all appear with an extra character in front of them
(because of the two char width in utf8?).

Is the xml_source.write callback causing the problem here? Or is it
something else? Is there any way that I can correctly retrieve a utf8
encoded file from an FTP server?

Cheers,
Richard

Jul 19 '05 #1

John Roth

"Richard Lewis" <ri**********@fastmail.co.uk> wrote in message
news:ma**************************************@python.org...

> Is the xml_source.write callback causing the problem here? Or is it
> something else? Is there any way that I can correctly retrieve a utf8
> encoded file from an FTP server?

It looks like there are at least two problems here. The major one
is that you seem to have a misconception about utf-8 encoding.

The _disk_ version of the file is what is encoded in utf-8, and it has
to be decoded to unicode on being read later. In other words,
what you got is what you should have put on disk without any
conversion. As you noted, when you did that, the FTP part of
the process worked.

Whatever program you are using to read it has to then decode
it from utf-8 into unicode. Failure to do this is what is causing
the extra characters on output.
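
A minimal sketch of that failure mode, written for Python 3 (where `bytes` and `str` play the roles that `str` and `unicode` play in the Python 2 code of this thread). It assumes the viewer guesses Latin-1, which is only one possible wrong guess, and the copyright sign is just a sample character.

```python
# UTF-8 bytes as they sit on disk after a binary FTP transfer.
on_disk = "©".encode("utf-8")          # two bytes: 0xC2 0xA9

# A viewer that assumes Latin-1 turns each byte into its own character,
# so an extra character shows up in front of the real one.
wrong = on_disk.decode("latin-1")

# Decoding with the encoding the file really uses gives one character back.
right = on_disk.decode("utf-8")

print(repr(wrong), repr(right))
```

The bytes on disk are identical in both cases; only the reader's assumption about the encoding differs.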

The object returned by codecs.open raised an exception
because it expected a unicode string on input; it got a
character string already encoded in utf-8 format. The
internal mechanism is first going to try to decode that
into unicode before then encoding it into utf-8.
Unfortunately, the default for encoding or decoding
(outside of special contexts) is ASCII-7. So everything
outside of the ASCII range
is invalid.
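
The implicit step described above can be written out by hand. This is a Python 3 sketch, not the Python 2.3 code path itself: Python 3 no longer decodes byte strings implicitly, so the ASCII decode is spelled out here. The sample text is made up.

```python
# Bytes as delivered by ftp.retrbinary(): already UTF-8 encoded.
chunk = "© Example".encode("utf-8")

# What the Python 2 stream writer effectively attempted first:
# decode the byte string with the default ASCII codec.
failed_at = None
bad_byte = None
try:
    chunk.decode("ascii")
except UnicodeDecodeError as exc:
    failed_at = exc.start          # index of the first non-ASCII byte
    bad_byte = chunk[exc.start]    # lead byte of the UTF-8 pair for the sign

print(failed_at, bad_byte)
```

The byte that trips the ASCII codec is the same 0xc2 lead byte that the traceback in the original post reports.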

Amusingly, this would have worked:

xml_source = codecs.EncodedFile(file("foo.xml", 'w+b'), "utf-8", "utf-8")

It is, of course, an expensive way of doing nothing, but
it at least has the virtue of being good documentation.

HTH

John Roth

Jul 19 '05 #2

John Machin

Richard Lewis wrote:

> I've tried using both the commented lines of code in the above example
> (i.e. using file() instead of codecs.open() and retrlines() instead of
> retrbinary()). retrlines() makes no difference, but if I use file()
> instead of codecs.open() I can open the file, but the extended
> characters from the source file (e.g. foreign characters, copyright
> symbol, etc.) all appear with an extra character in front of them
> (because of the two char width in utf8?).

Saying "appear with an extra character in front of them" is close to
useless for diagnostic purposes -- print repr(sample_string) would be
more informative.
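
For illustration, a Python 3 sketch of that diagnostic (in the Python 2 of this thread the statement form `print repr(...)` applies instead). The sample value is invented, not taken from the poster's file.

```python
# Hypothetical bytes read back from the downloaded file.
sample_string = "©".encode("utf-8")

# repr() shows the exact bytes instead of whatever glyphs a terminal picks.
shown = repr(sample_string)
print(shown)
```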

In any case, the file with the "foreign" [attitude?] characters may well
be what you want.

> Is the xml_source.write callback causing the problem here? Or is it
> something else? Is there any way that I can correctly retrieve a utf8
> encoded file from an FTP server?

To get an exact copy of a file via FTP -- doesn't matter whether it's
encoded in utf8 or ESCII or whatever -- use the following combination:

xml_source = file("foo.xml", 'w+b')
ftp.retrbinary("RETR foo.xml", xml_source.write)

If you were using a command-line FTP client, you would use the "binary"
command before doing a "get" or "mget".
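
The same combination as a small Python 3 sketch. The helper name `fetch_exact_copy` is hypothetical, and the host and credentials in the usage comment are placeholders; only `ftplib.FTP.retrbinary` and a file opened in binary mode are needed.

```python
import ftplib


def fetch_exact_copy(ftp, remote_name, local_name):
    """Copy remote_name byte for byte into local_name.

    `ftp` is expected to be a connected, logged-in ftplib.FTP instance.
    """
    # 'wb' keeps Python from translating anything on the way to disk.
    with open(local_name, "wb") as local_file:
        ftp.retrbinary("RETR " + remote_name, local_file.write)


# Usage (placeholders, not a real server):
# ftp = ftplib.FTP("ftp.example.com")
# ftp.login("user", "password")
# fetch_exact_copy(ftp, "foo.xml", "foo.xml")
# ftp.quit()
```

No codec appears anywhere in the transfer, which is the point: the encoding of the file only matters to whatever reads it afterwards.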

HTH,
John

Jul 19 '05 #3

Richard Lewis

On Thu, 16 Jun 2005 12:06:50 -0600, "John Roth"
<ne********@jhrothjr.com> said:

> It looks like there are at least two problems here. The major one
> is that you seem to have a misconception about utf-8 encoding.

Who doesn't? ;-)

> Whatever program you are using to read it has to then decode
> it from utf-8 into unicode. Failure to do this is what is causing
> the extra characters on output.
>
> Amusingly, this would have worked:
>
> xml_source = codecs.EncodedFile(file("foo.xml", 'w+b'), "utf-8", "utf-8")
>
> It is, of course, an expensive way of doing nothing, but
> it at least has the virtue of being good documentation.

OK, I've fiddled around a bit more but I still haven't managed to get it
to work. I get the fact that it's not the FTP operation that's causing the
problem so it must be either the xml.dom.minidom.parse() function (and
whatever sort of file I give that) or the way that I write my results to
output files after I've done my DOM processing. I'll post some more
detailed code:

def open_file(self, file_name):
    # Fetch the remote file byte for byte into a local file.
    ftp = ftplib.FTP(self.host)
    ftp.login(self.login, self.passwd)

    content_file = file(file_name, 'w+b')
    ftp.retrbinary("RETR " + self.path, content_file.write)
    ftp.quit()
    content_file.close()

    ## Case 1: let the parser open the file itself
    #self.document = parse(file_name)

    ## Case 2: hand the parser a decoding reader
    #self.document = parse(codecs.open(file_name, 'r+b', "utf-8"))

    # Case 3: wrap the decoding reader in EncodedFile
    content_file = codecs.open(file_name, 'r', "utf-8")
    self.document = parse(codecs.EncodedFile(content_file, "utf-8",
                                             "utf-8"))
    content_file.close()

In Case 1 I get the incorrectly encoded characters.

In Case 2 I get the exception:
"UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in
position 5208: ordinal not in range(128)"
when it calls the xml.dom.minidom.parse() function.

In Case 3 I get the exception:
"UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in
position 5208: ordinal not in range(128)"
when it calls the xml.dom.minidom.parse() function.

The character at position 5208 is an 'a' (assuming Emacs' goto-char
function has the same idea about file positions as
xml.dom.minidom.parse()?). When I first tried these two new cases it came up
with an unencodable character at another position. By replacing the
large dash at this position with an ordinary minus sign I stopped it
from raising the exception at that point in the file. I checked the
character xe6 and (assuming I know what I'm doing) it's a small ae
ligature.

Anyway, later on in the program I create a *very* large unicode string
after doing some playing with the DOM tree. I then write this to a file
using:

html_file = codecs.open(file_name, "w+b", "utf8")
html_file.write(very_large_unicode_string)

The problem could be here?

Cheers,
Richard

Jul 19 '05 #4

John Roth

"Richard Lewis" <ri**********@fastmail.co.uk> wrote in message
news:ma**************************************@python.org...

>> It looks like there are at least two problems here. The major one
>> is that you seem to have a misconception about utf-8 encoding.
>
> Who doesn't? ;-)

Lots of people. It's not difficult to understand, it just takes a
bit of attention to the messy details.

The basic concept is that Unicode is _always_ processed using
a unicode string _in the program_. On disk or across the internet,
it's _always_ stored in an encoded form, frequently but not always
utf-8. A regular string _never_ stores raw unicode; it's always
some encoding.

When you read text data from the internet, it's _always_ in some
encoding. If that encoding is one of the utf- encodings, it needs
to be converted to unicode to be processed, but it does not need
to be changed at all to write it to disk.
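
A short Python 3 sketch of that division of labour, with `bytes` standing in for the "regular string" and `str` for the "unicode string" of the Python 2 discussion. The sample text and variable names are invented for illustration.

```python
# Bytes as they arrive from the network (UTF-8 in this sketch).
received = "æ and ©".encode("utf-8")

# Storing: the bytes go to disk unchanged, no codec involved.
stored = received

# Processing: decode once at the boundary, work on text inside the program.
text = received.decode("utf-8")
shouted = text.upper()

# Output: encode once on the way out.
outgoing = shouted.encode("utf-8")

print(len(received), len(text))
```

The byte count and the character count differ because each non-ASCII character here takes two bytes in UTF-8.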

> OK, I've fiddled around a bit more but I still haven't managed to get it
> to work. I get the fact that it's not the FTP operation that's causing the
> problem so it must be either the xml.dom.minidom.parse() function (and
> whatever sort of file I give that) or the way that I write my results to
> output files after I've done my DOM processing. I'll post some more
> detailed code:

Please post _all_ of the relevant code. It wastes people's time
when you post incomplete examples. The critical issue is frequently
in the part that you didn't post.

> In Case 1 I get the incorrectly encoded characters.
>
> In Case 2 I get the exception:
> "UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in
> position 5208: ordinal not in range(128)"
> when it calls the xml.dom.minidom.parse() function.
>
> In Case 3 I get the exception:
> "UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in
> position 5208: ordinal not in range(128)"
> when it calls the xml.dom.minidom.parse() function.

That's exactly what you should expect. In the first case, the file
on disk is encoded as utf-8, and this is apparently what minidom
is expecting.

The documentation shows a simple read; it does not show any
kind of encoding or decoding.

> Anyway, later on in the program I create a *very* large unicode string
> after doing some playing with the DOM tree. I then write this to a file
> using:
>
> html_file = codecs.open(file_name, "w+b", "utf8")
> html_file.write(very_large_unicode_string)
>
> The problem could be here?

That should work. The problem, as I said in the first post,
is that whatever program you are using to render the file
to screen or print is _not_ treating the file as utf-8 encoded.
It either needs to be told that the file is in utf-8 encoding,
or you need to get a better rendering program.
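
One way to check that the write step itself is sound is to read the raw bytes back. The sketch below is Python 3 and uses the built-in `open()` with `encoding=`, which plays the part of the `codecs.open()` call quoted above; the temporary path and the tiny stand-in string are assumptions for illustration.

```python
import os
import tempfile

very_large_unicode_string = "<p>æ ©</p>"   # stand-in for the real HTML

path = os.path.join(tempfile.mkdtemp(), "out.html")

# Text goes in; the file object encodes it to UTF-8 on the way to disk.
with open(path, "w", encoding="utf-8") as html_file:
    html_file.write(very_large_unicode_string)

# Reading the raw bytes back shows the multi-byte UTF-8 sequences.
with open(path, "rb") as check:
    raw = check.read()

print(raw)
```

If the bytes on disk decode cleanly as UTF-8, any odd characters seen later come from the program displaying the file, not from the write.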

Many renderers, including most renderers inside of
programming tools like file inspectors and debuggers,
assume that the encoding is latin-1 or windows-1252.
This will throw up funny characters if you try to read
a utf-8 (or any multi-byte encoded) file using them.

One trick that sometimes works is to ensure that the first
character is the BOM (byte order mark, or unicode signature).
Properly written Windows programs will use this as an
encoding signature. Unixoid programs frequently won't,
but that's arguably a violation of the Unicode standard.
This is a single unicode character which is three bytes
in utf-8 encoding.
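
A Python 3 sketch of the signature, for illustration. It relies on the `utf-8-sig` codec, which is newer than the Python 2.3 used elsewhere in this thread, so treat it as an assumption about your environment rather than something available to the original poster.

```python
import codecs

bom_char = "\ufeff"                      # the single Unicode character
bom_bytes = bom_char.encode("utf-8")     # its three-byte UTF-8 form

# The 'utf-8-sig' codec writes the signature when encoding...
signed = "æ".encode("utf-8-sig")

# ...and strips it again when decoding.
text = signed.decode("utf-8-sig")

print(bom_bytes, signed, ascii(text))
```

`codecs.BOM_UTF8` holds the same three bytes, which is handy when checking whether an existing file starts with the signature.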

John Roth

Jul 19 '05 #5

Fredrik Lundh

Richard Lewis wrote:

> In Case 1 I get the incorrectly encoded characters.

case 1 is the only one where you use the XML parser as it is designed to
be used (on the stream level, XML is defined in terms of encoded text,
not Unicode characters. the parser will decode things for you)

given that the XML tree returned by the parser contains *decoded* Unicode
characters (in Unicode string objects), what makes you so sure that
you're getting "incorrectly encoded characters" from the parser?
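
a small Python 3 sketch of case 1, for illustration. the file content and temporary path are made up; the calls used are `xml.dom.minidom.parse()` with a file name and `toxml()` with an explicit encoding.

```python
import os
import tempfile
from xml.dom.minidom import parse

# A UTF-8 encoded XML file on disk, as a binary FTP transfer would leave it.
path = os.path.join(tempfile.mkdtemp(), "foo.xml")
with open(path, "wb") as xml_file:
    xml_file.write('<?xml version="1.0" encoding="utf-8"?><p>æ</p>'.encode("utf-8"))

# Case 1: hand the parser the file name; it reads the bytes and decodes them.
document = parse(path)
decoded = document.documentElement.firstChild.data

# Inspect with ascii() rather than trusting a terminal's rendering.
print(ascii(decoded))

# Serialising with an explicit encoding yields encoded bytes again.
encoded_again = document.toxml(encoding="utf-8")
```

the text node holds one decoded character, while the serialised form holds its two-byte UTF-8 encoding; checking both separates parser problems from display problems.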

</F>

(I wonder why this is so hard for so many people? hardly any programmer has
any problem telling the difference between, say, a 32-bit binary floating point
value on disk, a floating point object, and the string representation of a float.
but replace the float with a Unicode character, and anglocentric programmers
immediately resort to poking-with-a-stick-in-the-dark programming. I'll figure
it out, some day...)

Jul 19 '05 #6
