
save gb-2312 web page in a .html file

I am trying to read a web page and save it in a .html file. The problem is
that the web page is GB-2312 encoded, and I want to save it to the file with
the same encoding or unicode. I have some code like this:
url = 'http://blah/'
headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' }

req = urllib2.Request(url, None, headers)
page = urllib2.urlopen(req).read()

file = open('btchina.html','wb')
file.write(page.encode('gb-2312'))
file.close()

It is obviously not working, and I am hoping someone can help me.

Dec 26 '07 #1
.read() returns the bytes exactly as they were downloaded. It doesn't
interpret them. If those bytes are GB-2312-encoded text, that's what
they are, and there's no need to re-encode them. Just .write(page) (of
course, this way you don't verify that the encoding is correct).

(BTW, don't use 'file' as a variable name. In Python 2 it shadows the
built-in 'file' type, which is what 'open()' returns.)
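
A minimal sketch of the advice above, written for Python 3 (where urllib2 became urllib.request) and using stand-in bytes instead of a live download so that it runs offline. The helper names save_raw and decodes_as are hypothetical, and "gb2312" is the spelling of the codec name that Python accepts. It writes the bytes unchanged in binary mode and, as an optional extra step, checks that they really decode as GB2312.

```python
import os
import tempfile


def save_raw(path, data):
    """Write downloaded bytes to disk unchanged (binary mode, no re-encoding)."""
    with open(path, "wb") as out:
        out.write(data)


def decodes_as(data, encoding):
    """Return True if the bytes are valid text in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Stand-in for urlopen(req).read(): two Chinese characters encoded as GB2312.
page = "\u4e2d\u6587".encode("gb2312")

path = os.path.join(tempfile.mkdtemp(), "btchina.html")
save_raw(path, page)

with open(path, "rb") as saved:
    assert saved.read() == page  # the file holds exactly the bytes received

print(decodes_as(page, "gb2312"))
```

The decode check is only a sanity test: it confirms the bytes are valid in that encoding, not that the page was actually authored in it.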
--
Dec 26 '07 #2
You must be right, since I tried one page and it worked. But there is
something wrong with this particular page:
http://overseas.btchina.net/?categoryid=-1. When I open the saved file (with
IE7), it is all messed up.

url = 'http://overseas.btchina.net/?categoryid=-1'
headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' }
req = urllib2.Request(url, None, headers)
page = urllib2.urlopen(req).read()

htmlfile = open('btchina.html','w')
htmlfile.write(page)
htmlfile.close()

Dec 26 '07 #3
I dunno. The file does specify its charset, so unless IE ignores that
and tries to guess and fails, it should work fine.
--
Dec 26 '07 #4
.read() returns the bytes exactly how it downloads them. It doesn't
interpret them. If those bytes are GB-2312-encoded text, that's what
they are. There's no need to reencode them. Just .write(page) (of
course, this way you don't verify that it's correct).
Alternatively, if the page is *not* gb-2312, you must first *decode*
it from its original encoding. Suppose the original encoding is
windows-1252; then you do

page = page.decode("windows-1252")
page = page.encode("gb2312")

Of course, for HTML, that may be tricky, as the file may include
an encoding declaration (XML declaration or http-equiv meta tag). So if
you recode it, you might have to change such declarations as well.
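
A rough sketch of recoding plus fixing the declaration, in Python 3. The function name recode_html and the regular expression are my own assumptions, not anything from the standard library, and UTF-8 is picked as the target only for illustration. The regex rewrites the first "charset=..." it finds, so it would be fooled by a page that mentions "charset=" in ordinary text before the real declaration.

```python
import re

# Matches the charset value inside a declaration such as
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
CHARSET_RE = re.compile(r'(charset\s*=\s*["\']?)([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def recode_html(data, source_encoding, target_encoding):
    """Decode bytes from source_encoding, update the declared charset,
    and return the page encoded in target_encoding."""
    text = data.decode(source_encoding)
    # Rewrite only the first declaration found in the document.
    text = CHARSET_RE.sub(lambda m: m.group(1) + target_encoding, text, count=1)
    return text.encode(target_encoding)


# Stand-in page: a GB2312 declaration followed by two Chinese characters.
original = (
    '<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
    "<p>\u4e2d\u6587</p>"
).encode("gb2312")

converted = recode_html(original, "gb2312", "utf-8")
print(converted.decode("utf-8"))
```

Note that encode() raises UnicodeEncodeError if the target encoding cannot represent a character in the page, which can happen when recoding into a smaller character set such as GB2312.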

Regards,
Martin
Dec 26 '07 #5
I "view sourced" the original web page in IE7, and it does specify:

<meta http-equiv="MSThemeCompatible" content="Yes">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">

So sounds like the encoding is gb2312...
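
For what it's worth, the same check can be done in code instead of with "view source". This is a small Python 3 sketch; declared_charset and the 2048-byte scan limit are hypothetical choices of mine, and the sample bytes stand in for the downloaded page.

```python
import re

# Finds a charset value inside a <meta ...> tag; works on raw bytes so the
# page does not need to be decoded first.
META_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_-]+)', re.IGNORECASE
)


def declared_charset(data, scan_limit=2048):
    """Return the charset named in the first matching <meta> tag, or None."""
    match = META_CHARSET_RE.search(data[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None


# Stand-in for the downloaded page, using the two meta tags quoted above.
page = (
    b'<meta http-equiv="MSThemeCompatible" content="Yes">\n'
    b'<meta http-equiv="Content-Type" content="text/html; charset=gb2312">\n'
    b"<p>\xd6\xd0\xce\xc4</p>"
)

charset = declared_charset(page)
print(charset)

# Decoding with the declared charset confirms the bytes match the declaration.
text = page.decode(charset)
```

Browsers generally give a charset in the HTTP Content-Type header priority over the meta tag, and a file opened from disk has no such header, so a saved copy may be interpreted differently from the live page.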
Dec 26 '07 #6

