Can't get the real contents from a page on the internet because of the "no-cache" tag

A web browser gets the page's content normally, but when I use
urllib2.urlopen("http://tech.163.com/2004w11/12732/2004w11_1100059465339.html").read()

the result is

<html><head><META HTTP-EQUIV=REFRESH
CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">
<META http-equiv="Pragma"
content="no-cache"></HEAD><body>?y?ú'ò?aò3??...</body></html>

I think the reason is the no-cache tag. Can anyone help me?

Mar 23 '06 #1
dongdong wrote:
using web browser can get page's content formally, but when use
urllib2.open("http://tech.163.com/2004w11/12732/2004w11_1100059465339.html").read()

the result is

<html><head><META HTTP-EQUIV=REFRESH
CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">
<META http-equiv="Pragma"
content="no-cache"></HEAD><body>?y?ú'ò?aò3??...</body></html>


The page is in Chinese (I think), when you print the data it is printing
in your console encoding which is apparently not Chinese. What did you
expect to see?
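
To illustrate the encoding point, here is a minimal sketch in Python 3 (where urllib2 became urllib.request and read() returns bytes). It assumes the page uses a GB2312-family charset, which the thread never states; the sample bytes and the helper name decode_body are made up for the illustration.

```python
# Sample bytes standing in for a response body: "网易科技" encoded as GBK.
# (Assumption: the real page uses a GB2312-family charset.)
raw = "网易科技".encode("gbk")


def decode_body(raw_bytes, charset="gb18030"):
    """Decode response bytes; gb18030 is a superset of GB2312 and GBK."""
    return raw_bytes.decode(charset, errors="replace")


text = decode_body(raw)

# Decoding the same bytes with the wrong codec gives mojibake, not Chinese text.
garbled = raw.decode("latin-1")

print(text)
print(repr(garbled))
```

Garbled characters like the ones in the <body> above are what a mismatch between the page's charset and the console's encoding tends to look like.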

Kent
Mar 23 '06 #2
Yes, you are right, the page is in Chinese. (I'm Chinese too. ^_^)

Using urllib2.urlopen('............').read(), I can't get the contents
between '<body>' and '</body>'. I think the reason isn't the Chinese
encoding but the 'no-cache' setting.

I want to get the contents between....

Can you see why I can't read the contents? Thanks.

Mar 23 '06 #3
I V
dongdong wrote:
using web browser can get page's content formally, but when use
urllib2.open("http://tech.163.com/2004w11/12732/2004w11_1100059465339.html").read()

the result is

<html><head><META HTTP-EQUIV=REFRESH
CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">


This line here instructs the browser to go to
http://tech.163.com/04/1110/12/14QUR2BR0009159H.html . If you try
loading that with urllib2, do you get the right content?

If the people behind that web page knew how to use the web, they
wouldn't use the META HTTP-EQUIV hack, and instead would have
instructed their web server to return an HTTP redirect response (a 3xx
status such as 301 or 302), which would have allowed urllib2 to follow
the redirect and get the right content automatically. If you have any
influence with them, you could try to persuade them to set up their
web server properly.
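
A minimal sketch of that difference, written for Python 3 (urllib.request is the successor of urllib2). The local demo server, its /old and /new paths, and the function name fetch_via_redirect are all hypothetical; the only point is that a real HTTP redirect status is followed without any extra code.

```python
import http.server
import threading
import urllib.request


class RedirectDemoHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical demo server: /old answers with a 302, /new has the content."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"real content"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_via_redirect():
    # Port 0 asks the operating system for any free port.
    server = http.server.HTTPServer(("127.0.0.1", 0), RedirectDemoHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = "http://127.0.0.1:%d/old" % server.server_address[1]
        # An empty ProxyHandler keeps any configured proxy out of the demo.
        # The default opener already includes a redirect handler.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url, timeout=5) as response:
            return response.geturl(), response.read()
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_via_redirect())
```

The request goes to /old, but the URL reported by geturl() and the body both come from /new.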

Mar 23 '06 #4
"dongdong" <do***********@hotmail.com> wrote:

using web browser can get page's content formally, but when use
urllib2.open("http://tech.163.com/2004w11/12732/2004w11_1100059465339.html").read()

the result is

<html><head><META HTTP-EQUIV=REFRESH
CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">
<META http-equiv="Pragma"
content="no-cache"></HEAD><body>?y?ú'ò?aò3??...</body></html>

,I think the reson is the no-cache, are there person would help me?


No, that's not the reason. The reason is that this includes a redirect.

As an HTML consumer, you are supposed to parse that content and notice the
<meta http-equiv> tag, which says "here is something that should have been
one of the HTTP headers".

In this case, it wants you to act as though you saw:
Refresh: 0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html
Pragma: no-cache

In this case, the "Refresh" header means that you are supposed to go fetch
the contents of that new page immediately. Try using urllib2.urlopen on THAT
address, and you should get your content.

This is one way to handle a web site reorganization and still allow older
URLs to work.
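
One way to do that parsing, as a rough sketch in Python 3 using the standard html.parser module. The names HttpEquivCollector, http_equiv_headers and refresh_target are made up here, and the sample page is the one from the question with the garbled body text left out.

```python
from html.parser import HTMLParser


class HttpEquivCollector(HTMLParser):
    """Collect <meta http-equiv=... content=...> pairs as pseudo-headers."""

    def __init__(self):
        super().__init__()
        self.headers = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)  # tag and attribute names arrive lowercased
        name = attr_map.get("http-equiv")
        content = attr_map.get("content")
        if name and content is not None:
            # Normalise the case so REFRESH and Refresh share one key.
            self.headers[name.title()] = content


def http_equiv_headers(html_text):
    collector = HttpEquivCollector()
    collector.feed(html_text)
    return collector.headers


def refresh_target(headers):
    """Return the URL from a Refresh value such as '0;URL=http://...', or None."""
    value = headers.get("Refresh")
    if not value:
        return None
    _, _, rest = value.partition(";")
    key, _, url = rest.strip().partition("=")
    if key.strip().lower() != "url" or not url:
        return None
    return url.strip().strip("'\"")


page = (
    '<html><head><META HTTP-EQUIV=REFRESH\n'
    'CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">\n'
    '<META http-equiv="Pragma"\n'
    'content="no-cache"></HEAD><body></body></html>'
)
headers = http_equiv_headers(page)
print(headers)
print(refresh_target(headers))
```

The collected dictionary holds the same two entries listed above, Refresh and Pragma, and refresh_target pulls out the address to fetch next.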
--
- Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
Mar 23 '06 #5
dongdong wrote:
using web browser can get page's content formally, but when use
urllib2.open("http://tech.163.com/2004w11/12732/2004w11_1100059465339.html").read()
the result is

<html><head><META HTTP-EQUIV=REFRESH
CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">
<META http-equiv="Pragma"
content="no-cache"></HEAD><body>?y?ú'ò?aò3??...</body></html>

,I think the reson is the no-cache, are there person would help me?


No, the reason is the <META HTTP-EQUIV=REFRESH
CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">

that redirects you to the real site. Extract that URL from the page and
request it. Or maybe you can use webunit, which acts more like a "real"
HTTP client by interpreting such content.
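
A rough sketch of the "extract that URL and request it" approach in Python 3. The regular expression is a simplification (it expects http-equiv to come before content, as in the page above), and the names META_REFRESH, fetch_bytes and fetch_following_refresh, the hop limit, and the example.test addresses are all made up for the illustration.

```python
import re
import urllib.request
from urllib.parse import urljoin

# Simplified pattern for the URL inside a meta refresh CONTENT value.
# A real HTML parser is more robust; this is for illustration only.
META_REFRESH = re.compile(
    r"""<meta[^>]*http-equiv\s*=\s*["']?refresh["']?[^>]*"""
    r"""content\s*=\s*["']?\s*\d+\s*;\s*url\s*=\s*([^"'>\s]+)""",
    re.IGNORECASE,
)


def fetch_bytes(url):
    """Default fetcher: return the raw response body."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def fetch_following_refresh(url, fetch=fetch_bytes, max_hops=5):
    """Fetch url, following meta refresh redirects up to max_hops times."""
    for _ in range(max_hops + 1):
        body = fetch(url)
        # latin-1 maps every byte to a character, so the ASCII markup can be
        # scanned without knowing the page's real charset.
        match = META_REFRESH.search(body.decode("latin-1"))
        if match is None:
            return url, body
        # The target may be relative, so resolve it against the current URL.
        url = urljoin(url, match.group(1))
    raise RuntimeError("too many meta refresh hops")


# Offline demonstration with a stand-in fetcher instead of the network.
pages = {
    "http://example.test/old.html":
        b'<html><head><META HTTP-EQUIV=REFRESH\n'
        b'CONTENT="0;URL=/new.html"></HEAD><body></body></html>',
    "http://example.test/new.html":
        b"<html><body>real content</body></html>",
}
print(fetch_following_refresh("http://example.test/old.html",
                              fetch=pages.__getitem__))
```

The hop limit guards against pages that refresh to each other forever; against a live site, calling the function without the fetch argument uses urllib.request.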

diez
Mar 23 '06 #6
Oh! My thanks to Tim Roberts and everyone above!
I see now: the cause is the different URLs.
The contents can only be fetched from the later (real) URL.
I made the mistake of not noticing that two different URLs were involved.

Mar 23 '06 #7
"dongdong" <do***********@hotmail.com> writes:
oh~~~! offer my thanks to Tim Roberts and all persons above!
I see now, it's the different url causes!
contents can only be got from the later (real ) url.
I made a mistick not to look at the different urls taking effect.


If you use ClientCookie.urlopen() in place of urllib2.urlopen(), it
will handle Refreshes and HTTP-EQUIV for you transparently.

Actually, you have to explicitly ask for that functionality:

    import ClientCookie

    # Build an opener that honours <meta http-equiv> tags and Refresh headers.
    opener = ClientCookie.build_opener(ClientCookie.HTTPEquivProcessor,
                                       ClientCookie.HTTPRefreshProcessor,
                                       )
    ClientCookie.install_opener(opener)

    # url is the address of the page you want to fetch
    print ClientCookie.urlopen(url).read()

If you want to do even less of this stuff "by hand", class Browser
from module mechanize is a subclass of the class of "opener" above,
but behaves much more like a web browser in various ways. Still
alpha, but very near now to stable release.

FWIW, you can also use ClientCookie.HTTPRefreshProcessor,
ClientCookie.HTTPEquivProcessor etc. with Python 2.4's urllib2, as
long as you follow the instructions under the heading "Notes about
ClientCookie, urllib2 and cookielib" in the ClientCookie README file
(specifically, if you want to use ClientCookie.HTTPRefreshProcessor with
Python 2.4's urllib2, you must also use
ClientCookie.HTTPRedirectHandler).

John

Mar 23 '06 #8
