Can't get the real contents from a page on the internet because of the "no-cache" tag

Using a web browser I can get the page's content normally, but when I use
urllib2.urlopen("http://tech.163.com/2004w11/12732/2004w11_1100059465339.html").read()

the result is

<html><head><META HTTP-EQUIV=REFRESH
CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">
<META http-equiv="Pragma"
content="no-cache"></HEAD><body>?y?ú 'ò?aò3??...</body></html>

I think the reason is the no-cache tag. Could anyone help me?

Mar 23 '06 #1
dongdong wrote:
Using a web browser I can get the page's content normally, but when I use
urllib2.urlopen("http://tech.163.com/2004w11/12732/2004w11_1100059465339.html").read()

the result is

<html><head><META HTTP-EQUIV=REFRESH
CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">
<META http-equiv="Pragma"
content="no-cache"></HEAD><body>?y?ú 'ò?aò3??...</body></html>


The page is in Chinese (I think). When you print the data, it is printed
in your console's encoding, which is apparently not Chinese. What did you
expect to see?
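
To illustrate the encoding point, here is a minimal sketch in Python 3 (where urllib2 became urllib.request). It assumes the page is encoded as GBK/GB2312, which this thread does not confirm, and the sample bytes and the helper name decode_page are made up for the illustration.

```python
def decode_page(raw, encoding="gbk"):
    """Decode fetched bytes with the page's encoding.

    GBK is only an assumption here; check the page's charset declaration
    or the Content-Type header to find the real one.
    """
    return raw.decode(encoding, errors="replace")


# Hypothetical sample: two Chinese characters encoded as GBK.
sample = b"\xd6\xd0\xce\xc4"

# Decoding with the wrong codec gives mojibake rather than an error.
as_latin1 = decode_page(sample, "latin-1")

# Decoding with the matching codec recovers the original text.
as_gbk = decode_page(sample, "gbk")
```

The bytes are the same in both cases; only the codec used to interpret them changes, which is why the body looks like garbage on a console that is not set up for Chinese.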

Kent
Mar 23 '06 #2
Yeah, you're right, the page is in Chinese (I'm Chinese too ^_^).

Using urllib2.urlopen('............').read(), I can't get the contents
between '<body>' and '</body>'. I think the reason isn't the Chinese
encoding but the 'no-cache' setting.

I want to get the contents between....

Can you find out why I can't read the contents? Thanks.

Mar 23 '06 #3
I V
dongdong wrote:
Using a web browser I can get the page's content normally, but when I use
urllib2.urlopen("http://tech.163.com/2004w11/12732/2004w11_1100059465339.html").read()

the result is

<html><head><META HTTP-EQUIV=REFRESH
CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">


This line here instructs the browser to go to
http://tech.163.com/04/1110/12/14QUR2BR0009159H.html . If you try
loading that with urllib2, do you get the right content?

If the people behind that web page knew how to use the web, they
wouldn't use the META HTTP-EQUIV hack, and instead would have
instructed their web server to return a 3xx redirect response (such as
301 or 302), which would have allowed urllib2 to follow the redirect and
get the right content automatically. If you have any influence with them,
you could try to persuade them to set up their web server properly.
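
For comparison, here is a rough sketch of what a server-side redirect looks like from the client's side. It is written for Python 3, where urllib2 became urllib.request. The handler class, the /old and /new paths and the helper name fetch_via_redirect are all made up for this illustration; it runs a throwaway server on localhost and has nothing to do with the site in question.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class RedirectDemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: /old answers with a 302, anything else serves content."""

    def do_GET(self):
        if self.path == "/old":
            # A real HTTP redirect: status 302 plus a Location header.
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"real content"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_via_redirect():
    """Start a local server, request /old, and return (final_url, body)."""
    server = HTTPServer(("127.0.0.1", 0), RedirectDemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # An empty ProxyHandler avoids sending the localhost request to a proxy.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        url = "http://127.0.0.1:%d/old" % server.server_address[1]
        with opener.open(url, timeout=5) as response:
            return response.geturl(), response.read()
    finally:
        server.shutdown()
        server.server_close()
```

Calling fetch_via_redirect() requests /old but comes back with the body served at /new, with no extra code on the client. A META refresh gets no such treatment because it lives in the HTML body, which the HTTP layer never parses.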

Mar 23 '06 #4
"dongdong" <do***********@hotmail.com> wrote:

Using a web browser I can get the page's content normally, but when I use
urllib2.urlopen("http://tech.163.com/2004w11/12732/2004w11_1100059465339.html").read()

the result is

<html><head><META HTTP-EQUIV=REFRESH
CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">
<META http-equiv="Pragma"
content="no-cache"></HEAD><body>?y?ú 'ò?aò3??...</body></html>

I think the reason is the no-cache tag. Could anyone help me?


No, that's not the reason. The reason is that this includes a redirect.

As an HTML consumer, you are supposed to parse that content and notice the
<meta http-equiv> tag, which says "here is something that should have been
one of the HTTP headers".

In this case, it wants you to act as though you saw:
Refresh: 0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html
Pragma: no-cache
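
As a rough sketch of that "act as though you saw these headers" step, the snippet below collects the http-equiv pairs from a page into a dictionary. It is written for Python 3 using the standard html.parser module; the class name HttpEquivCollector and the function http_equiv_headers are made up for this example.

```python
from html.parser import HTMLParser


class HttpEquivCollector(HTMLParser):
    """Collect <meta http-equiv=... content=...> pairs as pseudo-headers."""

    def __init__(self):
        super().__init__()
        self.headers = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("http-equiv")
        content = attr_map.get("content")
        if name and content is not None:
            # Normalise "REFRESH" / "refresh" to "Refresh".
            self.headers[name.title()] = content


def http_equiv_headers(html_text):
    """Return a dict of the pseudo-headers declared in html_text."""
    collector = HttpEquivCollector()
    collector.feed(html_text)
    return collector.headers
```

Fed the page from the original post, this should give a dictionary with a "Refresh" entry and a "Pragma" entry, matching the two header lines shown above.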

In this case, the "Refresh" header means that you are supposed to go fetch
the contents of that new page immediately. Try using urllib2.urlopen on THAT
address, and you should get your content.
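
To get at THAT address programmatically, the Refresh value has to be split into its delay and its URL. Here is a minimal sketch in Python 3; the function name parse_refresh and the regular expression are my own, and the pattern only covers the common "seconds;URL=target" shape, not every variant browsers tolerate.

```python
import re

# delay, then optionally ';' or ',' followed by an optional "url=" and the target
_REFRESH_RE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(?:[;,]\s*(?:url\s*=\s*)?(.*?))?\s*$",
    re.IGNORECASE,
)


def parse_refresh(value):
    """Split a Refresh value such as '0;URL=http://example.com/' into (delay, url).

    url is None when the value only carries a delay (reload the same page).
    """
    match = _REFRESH_RE.match(value)
    if match is None:
        raise ValueError("not a Refresh value: %r" % value)
    delay = float(match.group(1))
    url = match.group(2) or None
    if url:
        url = url.strip("'\"")  # some pages quote the target
    return delay, url
```

The returned URL can then be passed to urlopen exactly as suggested above.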

This is one way to handle a web site reorganization and still allow older
URLs to work.
--
- Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
Mar 23 '06 #5
dongdong wrote:
Using a web browser I can get the page's content normally, but when I use
urllib2.urlopen("http://tech.163.com/2004w11/12732/2004w11_1100059465339.html").read()

the result is

<html><head><META HTTP-EQUIV=REFRESH
CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">
<META http-equiv="Pragma"
content="no-cache"></HEAD><body>?y?à º'ò?aò3??...</body></html>

I think the reason is the no-cache tag. Could anyone help me?


No, the reason is the <META HTTP-EQUIV=REFRESH
CONTENT="0;URL=http://tech.163.com/04/1110/12/14QUR2BR0009159H.html">

that redirects you to the real page. Extract that URL from the page and
request it. Or maybe you can use webunit, which acts more like a "real"
HTTP client and interprets such content.
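
The "extract that URL and request it" approach can be sketched as a small loop. This is Python 3 (urllib.request instead of urllib2), and it uses a regular expression as a shortcut rather than a real HTML parser, which assumes http-equiv comes before content inside the tag, as it does on the page above. The names fetch_bytes, fetch_following_refresh and max_hops are made up for the example.

```python
import re
import urllib.request
from urllib.parse import urljoin

# Regex shortcut: finds the target of <meta http-equiv=refresh content="N;url=...">.
META_REFRESH_RE = re.compile(
    r"""<meta[^>]*http-equiv\s*=\s*["']?refresh["']?[^>]*?"""
    r"""content\s*=\s*["']?\s*\d+\s*;\s*url\s*=\s*([^"'>\s]+)""",
    re.IGNORECASE,
)


def fetch_bytes(url):
    """Default fetcher: a plain HTTP GET via urllib.request."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def fetch_following_refresh(url, fetch=fetch_bytes, max_hops=5):
    """Fetch url, following meta-refresh pages at most max_hops times.

    Returns (final_url, body). The fetch argument can be replaced for testing.
    """
    for _ in range(max_hops + 1):
        body = fetch(url)
        # latin-1 maps every byte to a character, so the markup can be
        # searched without knowing the page's real encoding.
        match = META_REFRESH_RE.search(body.decode("latin-1"))
        if match is None:
            return url, body
        # The target may be relative, so resolve it against the current URL.
        url = urljoin(url, match.group(1))
    raise RuntimeError("too many meta refresh hops")
```

The hop limit is there because a page can refresh to itself, which would otherwise loop forever.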

diez
Mar 23 '06 #6
Oh! My thanks to Tim Roberts and everyone above!
I see now, the cause is the different URL.
The contents can only be fetched from the latter (real) URL.
I made the mistake of not noticing that a different URL was taking effect.

Mar 23 '06 #7
"dongdong" <do***********@hotmail.com> writes:
Oh! My thanks to Tim Roberts and everyone above!
I see now, the cause is the different URL.
The contents can only be fetched from the latter (real) URL.
I made the mistake of not noticing that a different URL was taking effect.


If you use ClientCookie.urlopen() in place of urllib2.urlopen(), it
will handle Refreshes and HTTP-EQUIV for you transparently.

Actually, you have to explicitly ask for that functionality:

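    import ClientCookie

    # Build an opener that understands META HTTP-EQUIV and Refresh.
    opener = ClientCookie.build_opener(ClientCookie.HTTPEquivProcessor,
                                       ClientCookie.HTTPRefreshProcessor,
                                       )
    ClientCookie.install_opener(opener)

    # url is the address of the page you want to fetch.
    print ClientCookie.urlopen(url).read()
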
If you want to do even less of this stuff "by hand", class Browser
from module mechanize is a subclass of the class of "opener" above,
but behaves much more like a web browser in various ways. Still
alpha, but very near now to stable release.

FWIW, you can also use ClientCookie.HTTPRefreshProcessor,
ClientCookie.HTTPEquivProcessor etc. with Python 2.4's urllib2, as
long as you follow the instructions under the heading "Notes about
ClientCookie, urllib2 and cookielib" in the ClientCookie README file
(specifically, if you want to use ClientCookie.RefreshProcessor with
Python 2.4's urllib2, you must also use
ClientCookie.HTTPRedirectHandler).

John

Mar 23 '06 #8
