Bytes IT Community

Get document as normal text and not as binary data

Hi.

I used urllib2 to load an HTML document over HTTP. But my problem
is:
the loaded contents are returned as raw binary data, which means that
non-ASCII characters are displayed garbled, like lÀt, for example. How
can I get the contents as normal text?

My script was:

import urllib2
req = urllib2.Request(url)
f = urllib2.urlopen(req)
contents = f.read()
print contents
f.close()

Thanks!

Markus
Jul 18 '05 #1
8 Replies


Markus Franz wrote:
Hi.

I used urllib2 to load an HTML document over HTTP. But my problem
is:
the loaded contents are returned as raw binary data, which means that
non-ASCII characters are displayed garbled, like lÀt, for example. How
can I get the contents as normal text?


You get what the server sends. That is always binary - either it _is_ a
binary file, or it is text in an encoding you don't know yet.

--
Regards,

Diez B. Roggisch
Jul 18 '05 #2

Markus Franz wrote:
I used urllib2 to load an HTML document over HTTP. But my problem
is: the loaded contents are returned as raw binary data, which means that
non-ASCII characters are displayed garbled, like l?t, for example. How
can I get the contents as normal text?

My script was:

import urllib2
req = urllib2.Request(url)
f = urllib2.urlopen(req)
adding

print f.headers

and checking the header fields (especially Content-Type) may help you
figure out what's going on...
contents = f.read()
print contents
f.close()


</F>

Jul 18 '05 #3
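As a concrete sketch of that advice: the character set usually arrives as a
parameter of the Content-Type header, and (in modern Python) it can be parsed
without any hand-rolled string splitting. The header value below is
hypothetical, standing in for whatever `print f.headers` would show:

```python
from email.message import Message

# Hypothetical Content-Type value, as a web server might send it.
header_value = "text/html; charset=utf-8"

# email.message.Message knows how to parse MIME-style headers.
msg = Message()
msg["Content-Type"] = header_value

print(msg.get_content_type())    # text/html
print(msg.get_param("charset"))  # utf-8
```

If the charset parameter is missing, `get_param` returns None and you have to
fall back to guessing, or to a `<meta>` tag inside the HTML itself.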

Diez B. Roggisch wrote:
You get what the server sends. That is always binary - either it _is_ a
binary file, or maybe in an unknown encoding.


And how can I convert those binary data to a "normal" string with
"normal" characters?

Best regards

Markus
Jul 18 '05 #4

Markus Franz wrote:
Diez B. Roggisch wrote:
You get what the server sends. That is always binary - either it _is_ a
binary file, or maybe in an unknown encoding.


And how can I convert those binary data to a "normal" string with
"normal" characters?


There is no "normal" - it's just bytes, and in Python 2 a string is just
bytes. No difference, no translation necessary.

As others have said: look at the HTTP headers to see what the server is
transmitting - maybe an image. The Content-Type header tells you that.

Or use wget to fetch the URL and look at what you get - it shouldn't look
different.
--
Regards,

Diez B. Roggisch
Jul 18 '05 #5
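"A string is just bytes" was true in Python 2, where this thread takes place;
for readers on modern Python 3 the two types are distinct, which a minimal
check makes visible:

```python
data = b"hello"   # bytes, like the result of f.read() from a network response
text = "hello"    # a Unicode string

print(type(data))                    # <class 'bytes'>
print(type(text))                    # <class 'str'>
print(data == text)                  # False - bytes and str never compare equal in Python 3
print(data.decode("ascii") == text)  # True once the bytes are decoded
```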

Addendum: If you give us the url you're fetching data from, we might be able
to look at the delivered data ourselves.
--
Regards,

Diez B. Roggisch
Jul 18 '05 #6

Markus Franz wrote:
Hi.

I used urllib2 to load an HTML document over HTTP. But my problem
is:
the loaded contents are returned as raw binary data, which means that
non-ASCII characters are displayed garbled, like lÀt, for example. How
can I get the contents as normal text?
My guess is the HTML is UTF-8 encoded - your sample looks like UTF-8 interpreted as Latin-1. Try
contents = f.read().decode('utf-8')

Kent

My script was:

import urllib2
req = urllib2.Request(url)
f = urllib2.urlopen(req)
contents = f.read()
print contents
f.close()

Thanks!

Markus

Jul 18 '05 #7

Kent Johnson wrote:
My guess is the HTML is UTF-8 encoded - your sample looks like
UTF-8 interpreted as Latin-1. Try
contents = f.read().decode('utf-8')


YES! That helped!

I used the following:

....
contents = f.read().decode('utf-8')
contents = contents.encode('iso-8859-15')
....

That was the perfect solution for my problem! Thanks a lot!

Best regards

Markus
Jul 18 '05 #8
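Markus's fix can be replayed as a small round trip. The word "löst" below is
only a stand-in, since the actual text of his document is unknown; the snippet
uses modern Python 3 syntax rather than the Python 2 of the thread:

```python
original = "löst"                      # stand-in for the real document text
utf8_bytes = original.encode("utf-8")  # what the server actually sent

# Misreading the UTF-8 bytes as Latin-1 produces mojibake like the poster saw:
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)  # lÃ¶st

# Decoding with the correct codec recovers the text ...
text = utf8_bytes.decode("utf-8")
print(text)      # löst

# ... which can then be re-encoded for a Latin-9 (ISO-8859-15) environment:
latin9 = text.encode("iso-8859-15")
```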

Diez B. Roggisch wrote:
Addendum: If you give us the url you're fetching data from, we might be able
to look at the delivered data ourselves.


To guess my problem please have a look at the document title of
<http://portal.suse.de/sdb/de/1997/01/xntp.html>

Markus
Jul 18 '05 #9
