save gb-2312 web page in a .html file

I am trying to read a web page and save it in a .html file. The problem is
that the web page is GB-2312 encoded, and I want to save it to the file with
the same encoding or unicode. I have some code like this:
import urllib2

url = 'http://blah/'
headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' }

req = urllib2.Request(url, None, headers)
page = urllib2.urlopen(req).read()

file = open('btchina.html','wb')
file.write(page.encode('gb-2312'))
file.close()

It is obviously not working, and I am hoping someone can help me.

Dec 26 '07 #1
Peter Pei wrote:
> I am trying to read a web page and save it in a .html file. The problem is
> that the web page is GB-2312 encoded, and I want to save it to the file with
> the same encoding or unicode. I have some code like this:
> url = 'http://blah/'
> headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' }
>
> req = urllib2.Request(url, None, headers)
> page = urllib2.urlopen(req).read()
>
> file = open('btchina.html','wb')
> file.write(page.encode('gb-2312'))
> file.close()
>
> It is obviously not working, and I am hoping someone can help me.

.read() returns the bytes exactly how it downloads them. It doesn't
interpret them. If those bytes are GB-2312-encoded text, that's what
they are. There's no need to reencode them. Just .write(page) (of
course, this way you don't verify that it's correct).

(BTW, don't use 'file' as a variable name. It shadows the built-in 'file'
type, which is what 'open()' returns.)
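
A minimal sketch of the "just write the bytes" advice, assuming Python 3, where urllib2 has become urllib.request. The function name save_page is hypothetical and not from the thread. The point it illustrates is that nothing is decoded or encoded: the file is opened in binary mode and receives exactly the bytes that read() returned.

```python
from urllib.request import Request, urlopen


def save_page(url, path):
    """Download url and store the response body byte for byte in path."""
    headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    request = Request(url, None, headers)

    # read() returns bytes; they are not interpreted in any way.
    with urlopen(request) as response:
        page = response.read()

    # 'wb' keeps the bytes untouched, so a GB2312 page stays GB2312 on disk.
    with open(path, 'wb') as html_file:
        html_file.write(page)

    return page
```

Usage would be something like save_page('http://blah/', 'btchina.html'). Calling page.encode(...) on the result is unnecessary, because the data is already encoded.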
Dec 26 '07 #2
You must be right, since I tried one page and it worked. But there is
something wrong with this particular page:
http://overseas.btchina.net/?categoryid=-1. When I open the saved file (with
IE7), it is all messed up.

url = 'http://overseas.btchina.net/?categoryid=-1'
headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' }
req = urllib2.Request(url, None, headers)
page = urllib2.urlopen(req).read()

htmlfile = open('btchina.html','w')
htmlfile.write(page)
htmlfile.close()

Dec 26 '07 #3
Peter Pei wrote:
> You must be right, since I tried one page and it worked. But there is
> something wrong with this particular page:
> http://overseas.btchina.net/?categoryid=-1. When I open the saved file (with
> IE7), it is all messed up.
>
> url = 'http://overseas.btchina.net/?categoryid=-1'
> headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' }
> req = urllib2.Request(url, None, headers)
> page = urllib2.urlopen(req).read()
>
> htmlfile = open('btchina.html','w')
> htmlfile.write(page)
> htmlfile.close()

I dunno. The file does specify its charset, so unless IE ignores that
and tries to guess and fails, it should work fine.
Dec 26 '07 #4
> .read() returns the bytes exactly how it downloads them. It doesn't
> interpret them. If those bytes are GB-2312-encoded text, that's what
> they are. There's no need to reencode them. Just .write(page) (of
> course, this way you don't verify that it's correct).

Alternatively, if the page is *not* gb-2312, you must first *decode*
it from its original encoding. Suppose the original encoding is
windows-1252; then you do

page = page.decode("windows-1252")
page = page.encode("gb2312")

Of course, for HTML, that may be tricky, as the file may include
an encoding declaration (XML declaration or http-equiv header). So if
you recode it, you might have to change such declarations as well.
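
A minimal sketch of that recoding step, assuming Python 3 and a page whose encoding is declared with a charset=... attribute. The function name recode_html is hypothetical, and the regular expression is a deliberately naive assumption: it rewrites only the first charset declaration it finds and is not a real HTML parser.

```python
import re

# Matches 'charset=' plus an optional quote, followed by the encoding name.
CHARSET_DECL = re.compile(r'(charset\s*=\s*["\']?)[\w-]+', re.IGNORECASE)


def recode_html(raw, source_encoding, target_encoding):
    """Decode raw bytes, update the declared charset, and re-encode."""
    text = raw.decode(source_encoding)

    # Keep the 'charset=' prefix and swap in the new encoding name.
    text = CHARSET_DECL.sub(
        lambda match: match.group(1) + target_encoding, text, count=1)

    return text.encode(target_encoding)
```

For example, recode_html(page, 'gb2312', 'utf-8') would cover the "save it as unicode" case from the original question: the bytes on disk become UTF-8 and the declaration inside the file says so too.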

Regards,
Martin
Dec 26 '07 #5
I "view sourced" the original web page in IE7, and it does specify:

<meta http-equiv="MSThemeCompatible" content="Yes">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">

So sounds like the encoding is gb2312...
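
The declaration can also be checked from Python instead of by viewing the source in a browser. The sketch below assumes Python 3; the names declared_charset and decodes_cleanly are hypothetical. It reads the charset out of the meta tag and then tries a strict decode, which is the verification step that plain .write(page) skips. One possible explanation for a garbled page, offered as an assumption and not as a diagnosis of this one, is that a page labelled gb2312 contains characters from the larger GBK set, in which case the strict gb2312 decode fails while gbk succeeds.

```python
import re

# Looks for charset=... inside a single <meta ...> tag in the raw bytes.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)


def declared_charset(raw):
    """Return the charset named in a meta tag, lowercased, or None."""
    match = META_CHARSET.search(raw)
    if match is None:
        return None
    return match.group(1).decode('ascii').lower()


def decodes_cleanly(raw, encoding):
    """Return True if raw is valid in encoding under strict decoding."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True
```

A typical check would be decodes_cleanly(page, declared_charset(page)), guarding against the case where no declaration is found.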
Dec 26 '07 #6

----- Original Message -----
From: "Peter Pei" <ya****@telus.com>
Newsgroups: comp.lang.python
To: <py*********@python.org>
Sent: Wednesday, December 26, 2007 8:22 PM
Subject: Re: save gb-2312 web page in a .html file

You must be right, since I tried one page and it worked. But there is
something wrong with this particular page:
http://overseas.btchina.net/?categoryid=-1. When I open the saved file (with
IE7), it is all messed up.

url = 'http://overseas.btchina.net/?categoryid=-1'
headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' }
req = urllib2.Request(url, None, headers)
page = urllib2.urlopen(req).read()

htmlfile = open('btchina.html','w')
htmlfile.write(page)
htmlfile.close()

Dec 27 '07 #7
