What is the best way to "get" a web page?

I have the following code:
>>> web_page = urllib.urlopen("http://www.python.org")
>>> file = open("temp.html", "w")
>>> web_page_contents = web_page.read()
>>> file.write(web_page_contents)
>>> file.close
<built-in method close of file object at 0xb7cc76e0>
>>>
The file "temp.html" is created, but it doesn't look like the page at
www.python.org. I'm guessing there are multiple frames and my code did
not get everything. Can anyone point me to a tutorial or other
reference on how to "get" all of the html contents at a particular
page?

Why did Python print the line after "file.close"?

Thanks,
Pete

Sep 24 '06 #1
A. You didn't actually invoke the close method, you simply referenced it,
which is why you got the output line after file.close. Python is not VB.
To call close, you have to follow it with ()'s, as in:

file.close()

This will have the added benefit of flushing the output to temp.html,
probably containing the missing content you were looking for.

B. Don't name variables "file", or "list", "str", "dict", "int", etc. Doing
so masks global names of builtin data types. Try "tempFile" instead.
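
Both points can be checked without touching the network. What follows is a minimal sketch, assuming Python 3; the scratch path and the name `temp_file` are made up for illustration and are not from the original post.

```python
import os
import tempfile

# Hypothetical scratch location, not from the thread.
path = os.path.join(tempfile.mkdtemp(), "temp.html")

temp_file = open(path, "w")      # "temp_file" does not mask any builtin name
temp_file.write("<html></html>")

temp_file.close                  # bare reference: yields a method object, calls nothing
open_after_reference = not temp_file.closed

temp_file.close()                # the parentheses make the call and flush buffered data
closed_after_call = temp_file.closed

# A with-block closes the file automatically, even if an error occurs inside it.
with open(path) as handle:
    contents = handle.read()

print(open_after_reference, closed_after_call, contents)
```

In Python 3 `file` is no longer a builtin, but `list`, `str`, `dict` and `int` still are, so the naming advice still applies to those.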

-- Paul
Sep 24 '06 #2
> A. You didn't actually invoke the close method, you simply referenced it,
> which is why you got the output line after file.close. Python is not VB.
> To call close, you have to follow it with ()'s, as in:
>
> file.close()

Ahhhh. Thank you very much!

> This will have the added benefit of flushing the output to temp.html,
> probably containing the missing content you were looking for.
>
> B. Don't name variables "file", or "list", "str", "dict", "int", etc. Doing
> so masks global names of builtin data types. Try "tempFile" instead.

Oh. Thanks again!

The file "temp.html" is definitely different than the first run, but
still not anything close to www.python.org. Any other suggestions?

Thanks,
Pete
Sep 24 '06 #3
Pete wrote:
> The file "temp.html" is definitely different than the first run, but
> still not anything close to www.python.org. Any other suggestions?

If you mean that the page looks different in a browser, for one thing
you have to download the css files too. Here's the relevant extract
from the main page:

<link media="screen" href="styles/screen-switcher-default.css"
type="text/css" id="screen-switcher-stylesheet" rel="stylesheet" />
<link media="scReen" href="styles/netscape4.css" type="text/css"
rel="stylesheet" />
<link media="print" href="styles/print.css" type="text/css"
rel="stylesheet" />
<link media="screen" href="styles/largestyles.css" type="text/css"
rel="alternate stylesheet" title="large text" />
<link media="screen" href="styles/defaultfonts.css" type="text/css"
rel="alternate stylesheet" title="default fonts" />

You may either hardcode the urls of the css files, or parse the page,
extract the css links and normalize them to absolute urls. The first is
simpler but the second is more robust, in case a new css is added or an
existing one is renamed or removed.
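
The second option (parse the page, extract the css links, normalize them to absolute urls) might look roughly like this. It is only a sketch, assuming Python 3 and nothing beyond the standard library; the class name `StylesheetLinkParser` is invented here, and the sample markup is a shortened copy of the extract above rather than a live download.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StylesheetLinkParser(HTMLParser):
    """Collect absolute URLs of <link> tags whose rel includes 'stylesheet'."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.stylesheet_urls = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> also arrive here by default.
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_tokens = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "stylesheet" in rel_tokens and href:
            # urljoin resolves the relative href against the page URL.
            self.stylesheet_urls.append(urljoin(self.base_url, href))


sample = """
<link media="screen" href="styles/screen-switcher-default.css"
      type="text/css" id="screen-switcher-stylesheet" rel="stylesheet" />
<link media="print" href="styles/print.css" type="text/css" rel="stylesheet" />
<link media="screen" href="styles/largestyles.css" type="text/css"
      rel="alternate stylesheet" title="large text" />
"""

parser = StylesheetLinkParser("http://www.python.org/")
parser.feed(sample)
parser.close()

for url in parser.stylesheet_urls:
    print(url)
```

Each collected URL could then be fetched and saved the same way as the main page.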

George

Sep 24 '06 #4
> Can anyone point me to a tutorial or other reference on how to "get" all
> of the html contents at a particular page?

Why not use httrack?

http://www.satzbau-gmbh.de/staff/abel/httrack-py/

Sincerely,

Wolfgang Keller

--
My email-address is correct.
Do NOT remove ".nospam" to reply.

Sep 24 '06 #5
Thanks for the information on CSS. I'll look into that later, but now
my question is on the first two lines of HTML code. Here's my latest
python code:
>>> import urllib
>>> web_page = urllib.urlopen("http://www.python.org")
>>> fileTemp = open("temp.html", "w")
>>> web_page_contents = web_page.read()
>>> fileTemp.write(web_page_contents)
>>> fileTemp.close()

Here are the first two lines of temp.html:

1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
2 <html lang="en" xml:lang="en"
xmlns="http://www.w3.org/1999/xhtml">

Here are the first two lines of www.python.org as saved from Firefox:

1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
2 <html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"
lang="en"><head>

Lines one are identical. Lines two are different. Why would lines two
differ? Hmmmm...

Thanks,
Pete

Sep 24 '06 #6
Wolfgang Keller wrote:
> Why not use httrack?
>
> http://www.satzbau-gmbh.de/staff/abel/httrack-py/

Thanks for the tip. I'll check that out. Is that your code?
--
Pete

Sep 24 '06 #7

Pete wrote:
> Lines one are identical. Lines two are different. Why would lines two
> differ? Hmmmm...

Functionally they are the same, but Firefox has pulled the start of the
third line (<head>) into the second. Opera's View Source command
produces the same result as Python. It looks like Firefox makes some
cosmetic changes to the source, but nothing that would change the way
the code works. Notice that the attributes in the second line are only
rearranged?
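
The "re-arranged in order only" observation can be checked by comparing the attributes as a dictionary instead of as text. This is a rough sketch, assuming Python 3 and the standard library `html.parser`; the helper names `FirstTagAttributes` and `first_tag` are made up for this example, and the two strings are the second lines quoted in the thread with the trailing `<head>` left off.

```python
from html.parser import HTMLParser


class FirstTagAttributes(HTMLParser):
    """Record the name and attributes of the first start tag encountered."""

    def __init__(self):
        super().__init__()
        self.tag = None
        self.attributes = None

    def handle_starttag(self, tag, attrs):
        if self.tag is None:
            self.tag = tag
            self.attributes = dict(attrs)   # a dict ignores attribute order


def first_tag(markup):
    parser = FirstTagAttributes()
    parser.feed(markup)
    parser.close()
    return parser.tag, parser.attributes


from_python = '<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">'
from_firefox = '<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" lang="en">'

same_tag = first_tag(from_python) == first_tag(from_firefox)
same_text = from_python == from_firefox
print(same_tag, same_text)
```

The text differs while the parsed tag is the same, which matches the description of a purely cosmetic change.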
Sep 24 '06 #8
24 Sep 2006 10:09:16 -0700, Rainy <ak@silmarill.org>:
> Functionally they are the same, but third line included in Firefox.
> Opera View Source command produces the same result as Python.
> [snip]

It's better to compare with the output of a download-only tool (rather
than a parser), like wget on Unix. That way you'll get exactly the same
bytes (assuming the page is static).
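
Once both copies are on disk, the byte-for-byte check can be done from Python as well. This is a small sketch, assuming Python 3; the file names and their contents are stand-ins written by the example itself, not real downloads, and `same_bytes` is a made-up helper name.

```python
import filecmp
import os
import tempfile


def same_bytes(path_a, path_b):
    """Return True when both files hold exactly the same bytes."""
    # shallow=False forces a content comparison instead of trusting os.stat() data.
    return filecmp.cmp(path_a, path_b, shallow=False)


# Stand-ins for temp.html, a wget download and a browser "save as" copy.
folder = tempfile.mkdtemp()
python_copy = os.path.join(folder, "temp.html")
wget_copy = os.path.join(folder, "index.html")
browser_copy = os.path.join(folder, "saved_by_browser.html")

for path, data in [
    (python_copy, b'<html lang="en" xml:lang="en">'),
    (wget_copy, b'<html lang="en" xml:lang="en">'),
    (browser_copy, b'<html xml:lang="en" lang="en">'),
]:
    with open(path, "wb") as handle:
        handle.write(data)

matches_wget = same_bytes(python_copy, wget_copy)
matches_browser = same_bytes(python_copy, browser_copy)
print(matches_wget, matches_browser)
```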

--
Felipe.
Sep 24 '06 #9
Ahhhh. wget - most cool. My temp.html matches wget. Now to capture that
pesky css stuff...
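
For anyone repeating the download step on Python 3: `urllib.urlopen` as used in this thread now lives at `urllib.request.urlopen`, and `read()` returns bytes. Below is a minimal sketch under those assumptions; `save_page` is a hypothetical helper name, and the demonstration fetches a local file through a file:// URL so that it runs offline.

```python
import os
import pathlib
import tempfile
import urllib.request


def save_page(url, destination):
    """Download url and write the raw bytes to destination; return the byte count."""
    with urllib.request.urlopen(url) as response:
        body = response.read()
    # Binary mode keeps the bytes exactly as served (no newline translation).
    with open(destination, "wb") as handle:
        handle.write(body)
    return len(body)


# Offline demonstration: a local file stands in for http://www.python.org.
folder = tempfile.mkdtemp()
source = pathlib.Path(folder, "source.html")
source.write_bytes(b"<html><head></head></html>")

target = os.path.join(folder, "temp.html")
size = save_page(source.as_uri(), target)
print(size)
```

Against the real site the call would be `save_page("http://www.python.org", "temp.html")`. Writing the bytes unchanged is what makes the result comparable with a wget download.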

Thanks,
Pete

Sep 24 '06 #10
