Get document as normal text and not as binary data

Hi.

I used urllib2 to load an HTML document over HTTP. But my problem
is:
The loaded contents are returned as binary data, which means that
characters are displayed like lĂ€Ăt, for example. How can I get the
contents as normal text?

My script was:

import urllib2
req = urllib2.Request(url)
f = urllib2.urlopen(req)
contents = f.read()
print contents
f.close()

Thanks!

Markus
Jul 18 '05 #1
Markus Franz wrote:
Hi.

I used urllib2 to load an HTML document over HTTP. But my problem
is:
The loaded contents are returned as binary data, which means that
characters are displayed like lĂ€Ăt, for example. How can I get the
contents as normal text?


You get what the server sends. That is always binary - either it _is_ a
binary file, or maybe in an unknown encoding.

--
Regards,

Diez B. Roggisch
Jul 18 '05 #2
Markus Franz wrote:
I used urllib2 to load an HTML document over HTTP. But my problem
is: The loaded contents are returned as binary data, which means that
characters are displayed like lĂ€Ăt, for example. How can I get the
contents as normal text?

My script was:

import urllib2
req = urllib2.Request(url)
f = urllib2.urlopen(req)

Adding

print f.headers

at this point and checking the header fields (especially Content-Type) may help you
figure out what's going on.

contents = f.read()
print contents
f.close()


</F>

Jul 18 '05 #3
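
The header check suggested above can be sketched as follows. This is only a minimal sketch in Python 3, where urllib.request replaces the thread's Python 2 urllib2. The helper names describe_content_type and fetch_content_type are made up for illustration, and the sample header values are assumptions, not output from the poster's server.

```python
import urllib.request
from email.message import Message


def describe_content_type(header_value):
    """Split a Content-Type header value into (media type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()


def fetch_content_type(url):
    """Open `url` and report what the server claims to send (needs network)."""
    with urllib.request.urlopen(url) as response:
        headers = response.headers  # an email.message.Message subclass
        return headers.get_content_type(), headers.get_content_charset()


if __name__ == "__main__":
    # Hypothetical header values, not taken from the thread.
    print(describe_content_type("text/html; charset=UTF-8"))
    print(describe_content_type("image/png"))
```

A text media type with a charset tells you how to decode the body. If no charset is declared, you have to guess or look inside the document itself.
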
Diez B. Roggisch wrote:
You get what the server sends. That is always binary - either it _is_ a
binary file, or maybe in an unknown encoding.


And how can I convert those binary data to a "normal" string with
"normal" characters?

Best regards

Markus
Jul 18 '05 #4
Markus Franz wrote:
Diez B. Roggisch wrote:
You get what the server sends. That is always binary - either it _is_ a
binary file, or maybe in an unknown encoding.


And how can I convert those binary data to a "normal" string with
"normal" characters?


There is no "normal" - it's just bytes, and a string is just bytes. No
difference, no translation necessary.

As others have said: look at the HTTP headers to see what the server is trying to
transmit - maybe an image. The Content-Type header tells you that.

Or use wget to fetch the URL and look at what you get - it shouldn't look
any different.
--
Regards,

Diez B. Roggisch
Jul 18 '05 #5
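
To illustrate the point that the server only ever sends bytes: the same byte sequence turns into readable or garbled text depending on which encoding is used to decode it. A small Python 3 sketch, where the sample word is an assumption and not data from the thread:

```python
# The same bytes, decoded two different ways.
text = "f\u00fcr"                   # "fuer" with u-umlaut: one non-ASCII character
raw = text.encode("utf-8")          # what a UTF-8 server would put on the wire

as_utf8 = raw.decode("utf-8")       # right guess: round-trips to the original
as_latin1 = raw.decode("latin-1")   # wrong guess: every byte becomes one character

print(len(text), len(raw))          # the umlaut takes two bytes in UTF-8
print(as_utf8 == text)
print(as_latin1)                    # two odd characters appear in place of the umlaut
```

The garbled variant is one character longer than the original, which is the typical signature of UTF-8 being read as a single-byte encoding.
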
Addendum: If you give us the url you're fetching data from, we might be able
to look at the delivered data ourselves.
--
Regards,

Diez B. Roggisch
Jul 18 '05 #6
Markus Franz wrote:
Hi.

I used urllib2 to load an HTML document over HTTP. But my problem
is:
The loaded contents are returned as binary data, which means that
characters are displayed like lĂ€Ăt, for example. How can I get the
contents as normal text?

My guess is that the HTML is UTF-8 encoded - your sample looks like UTF-8 interpreted as Latin-1. Try:

contents = f.read().decode('utf-8')

Kent

Jul 18 '05 #7
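
Kent's suggestion hard-codes UTF-8. A slightly more general sketch, again in Python 3, takes the charset from the Content-Type header when the server declares one and falls back to UTF-8 otherwise. The helper name decode_body and the fallback policy are assumptions for illustration, not something proposed in the thread.

```python
from email.message import Message


def decode_body(raw, content_type, default="utf-8"):
    """Decode response bytes using the declared charset, else `default`."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or default
    try:
        # errors="replace" keeps going on bad bytes instead of raising.
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The server named an encoding that Python does not know.
        return raw.decode(default, errors="replace")


# Possible use with urllib.request (needs network, so left as a comment):
#   with urllib.request.urlopen(url) as f:
#       contents = decode_body(f.read(), f.headers.get("Content-Type", "text/html"))
```

The result is a text string. Bytes that do not fit the chosen encoding show up as U+FFFD replacement characters rather than stopping the script.
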
Kent Johnson wrote:
My guess is the html is utf-8 encoded - your sample looks like
utf-8-interpreted-as-latin-1. Try
contents = f.read().decode('utf-8')


YES! That helped!

I used the following:

....
contents = f.read().decode('utf-8')
contents = contents.encode('iso-8859-15')
....

That was the perfect solution for my problem! Thanks a lot!

Best regards

Markus
Jul 18 '05 #8
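
The two-step fix above, decode and then re-encode, can be wrapped in a small function. One caveat: ISO-8859-15 cannot represent every Unicode character, so the encode step raises UnicodeEncodeError on such input unless an error handler is given. The sketch below is Python 3. transcode is a hypothetical helper name, and xmlcharrefreplace is just one possible handler, chosen here because the data is HTML.

```python
def transcode(raw, source="utf-8", target="iso-8859-15"):
    """Re-encode bytes from `source` to `target`.

    Characters missing from `target` become numeric character
    references, which an HTML consumer can still render.
    """
    text = raw.decode(source)
    return text.encode(target, errors="xmlcharrefreplace")


if __name__ == "__main__":
    # Sample input is an assumption: u-umlaut and the euro sign both exist in ISO-8859-15.
    print(transcode("f\u00fcr \u20ac".encode("utf-8")))
```

If the text is only going to be printed on a terminal, it may be simpler to skip the second step and let the output stream handle the encoding.
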
Diez B. Roggisch wrote:
Addendum: If you give us the url you're fetching data from, we might be able
to look at the delivered data ourselves.


To see my problem, please have a look at the document title of
<http://portal.suse.de/sdb/de/1997/01/xntp.html>

Markus
Jul 18 '05 #9
