What is the best way to "get" a web page?

I have the following code:

>>> web_page = urllib.urlopen("http://www.python.org")
>>> file = open("temp.html", "w")
>>> web_page_contents = web_page.read()
>>> file.write(web_page_contents)
>>> file.close
<built-in method close of file object at 0xb7cc76e0>
>>>

The file "temp.html" is created, but it doesn't look like the page at
www.python.org. I'm guessing there are multiple frames and my code did
not get everything. Can anyone point me to a tutorial or other
reference on how to "get" all of the html contents at a particular
page?

Why did Python print the line after "file.close"?

Thanks,
Pete

Sep 24 '06 #1

Paul McGuire

"Pete" <ha**************@post.com> wrote in message
news:11**********************@i3g2000cwc.googlegroups.com...

> I have the following code:
>
> >>> web_page = urllib.urlopen("http://www.python.org")
> >>> file = open("temp.html", "w")
> >>> web_page_contents = web_page.read()
> >>> file.write(web_page_contents)
> >>> file.close
> <built-in method close of file object at 0xb7cc76e0>
> >>>
>
> The file "temp.html" is created, but it doesn't look like the page at
> www.python.org. I'm guessing there are multiple frames and my code did
> not get everything. Can anyone point me to a tutorial or other
> reference on how to "get" all of the html contents at a particular
> page?
>
> Why did Python print the line after "file.close"?
>
> Thanks,
> Pete

A. You didn't actually invoke the close method; you simply referenced it,
which is why you got the output line after file.close. Python is not VB.
To call close, you have to follow it with parentheses, as in:

file.close()

This will have the added benefit of flushing the output to temp.html,
probably containing the missing content you were looking for.

B. Don't name variables "file", or "list", "str", "dict", "int", etc. Doing
so masks global names of builtin data types. Try "tempFile" instead.
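
A minimal sketch of points A and B, assuming Python 3 (the code in this thread is Python 2). The function names write_and_close and write_with_context_manager are made up for illustration. It shows that referencing close leaves the file open, that calling it closes the file, and that a with block does the closing for you:

```python
def write_and_close(path, text):
    """Write text to path, showing the difference between close and close()."""
    # temp_file rather than a name like list, str or dict,
    # so that no builtin name is masked.
    temp_file = open(path, "w")
    temp_file.write(text)

    # Referencing the method only produces a bound method object;
    # nothing is flushed or closed at this point.
    method_reference = temp_file.close
    open_after_reference = not temp_file.closed

    # Calling the method flushes buffered output and closes the file.
    temp_file.close()
    return method_reference, open_after_reference, temp_file.closed


def write_with_context_manager(path, text):
    """Same write, but the with block calls close() automatically on exit."""
    with open(path, "w") as temp_file:
        temp_file.write(text)
    return temp_file.closed
```

The with form is often preferred because the file is closed even if the write raises an exception.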

-- Paul

Sep 24 '06 #2

Pete

> > I have the following code:
> >
> > >>> web_page = urllib.urlopen("http://www.python.org")
> > >>> file = open("temp.html", "w")
> > >>> web_page_contents = web_page.read()
> > >>> file.write(web_page_contents)
> > >>> file.close
> > <built-in method close of file object at 0xb7cc76e0>
> > >>>
> >
> > The file "temp.html" is created, but it doesn't look like the page at
> > www.python.org. I'm guessing there are multiple frames and my code did
> > not get everything. Can anyone point me to a tutorial or other
> > reference on how to "get" all of the html contents at a particular
> > page?
> >
> > Why did Python print the line after "file.close"?
> >
> > Thanks,
> > Pete
>
> A. You didn't actually invoke the close method, you simply referenced it,
> which is why you got the output line after file.close. Python is not VB.
> To call close, you have to follow it with ()'s, as in:
>
> file.close()

Ahhhh. Thank you very much!

> This will have the added benefit of flushing the output to temp.html,
> probably containing the missing content you were looking for.
>
> B. Don't name variables "file", or "list", "str", "dict", "int", etc. Doing
> so masks global names of builtin data types. Try "tempFile" instead.

Oh. Thanks again!
The file "temp.html" is definitely different than the first run, but
still not anything close to www.python.org. Any other suggestions?

Thanks,
Pete

> -- Paul

Sep 24 '06 #3

George Sakkis

Pete wrote:

> The file "temp.html" is definitely different than the first run, but
> still not anything close to www.python.org. Any other suggestions?

If you mean that the page looks different in a browser, for one thing
you have to download the css files too. Here's the relevant extract
from the main page:

<link media="screen" href="styles/screen-switcher-default.css"
type="text/css" id="screen-switcher-stylesheet" rel="stylesheet" />
<link media="scReen" href="styles/netscape4.css" type="text/css"
rel="stylesheet" />
<link media="print" href="styles/print.css" type="text/css"
rel="stylesheet" />
<link media="screen" href="styles/largestyles.css" type="text/css"
rel="alternate stylesheet" title="large text" />
<link media="screen" href="styles/defaultfonts.css" type="text/css"
rel="alternate stylesheet" title="default fonts" />

You may either hardcode the urls of the css files, or parse the page,
extract the css links and normalize them to absolute urls. The first is
simpler but the second is more robust, in case a new css is added or an
existing one is renamed or removed.
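
The second approach (parse the page, extract the css links, normalize them to absolute urls) could look roughly like the sketch below. It assumes Python 3 and uses only the standard library; the names StylesheetLinkParser and stylesheet_urls are illustrative, and treating "alternate stylesheet" links as stylesheets is a choice made here, not something the page requires:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StylesheetLinkParser(HTMLParser):
    """Collect the href of every <link> whose rel attribute mentions 'stylesheet'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> also arrive here.
        if tag != "link":
            return
        attr_map = dict(attrs)
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "stylesheet" in rel_tokens and href:
            self.hrefs.append(href.strip())


def stylesheet_urls(html_text, base_url):
    """Return absolute urls for the css files linked from html_text."""
    parser = StylesheetLinkParser()
    parser.feed(html_text)
    parser.close()
    # urljoin resolves relative paths like "styles/print.css" against base_url.
    return [urljoin(base_url, href) for href in parser.hrefs]
```

Calling stylesheet_urls(page_source, "http://www.python.org/") on the downloaded page would then give a list of urls that can each be fetched and saved next to temp.html.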

George

Sep 24 '06 #4

Wolfgang Keller

> Can anyone point me to a tutorial or other reference on how to "get" all
> of the html contents at a particular page?

Why not use httrack?

http://www.satzbau-gmbh.de/staff/abel/httrack-py/

Sincerely,

Wolfgang Keller

--
My email-address is correct.
Do NOT remove ".nospam" to reply.

Sep 24 '06 #5

Pete

> > The file "temp.html" is definitely different than the first run, but
> > still not anything close to www.python.org. Any other suggestions?
>
> If you mean that the page looks different in a browser, for one thing
> you have to download the css files too. Here's the relevant extract
> from the main page:
>
> <link media="screen" href="styles/screen-switcher-default.css"
> type="text/css" id="screen-switcher-stylesheet" rel="stylesheet" />
> <link media="scReen" href="styles/netscape4.css" type="text/css"
> rel="stylesheet" />
> <link media="print" href="styles/print.css" type="text/css"
> rel="stylesheet" />
> <link media="screen" href="styles/largestyles.css" type="text/css"
> rel="alternate stylesheet" title="large text" />
> <link media="screen" href="styles/defaultfonts.css" type="text/css"
> rel="alternate stylesheet" title="default fonts" />
>
> You may either hardcode the urls of the css files, or parse the page,
> extract the css links and normalize them to absolute urls. The first is
> simpler but the second is more robust, in case a new css is added or an
> existing one is renamed or removed.
>
> George

Thanks for the information on CSS. I'll look into that later, but now
my question is on the first two lines of HTML code. Here's my latest
python code:

>>> import urllib
>>> web_page = urllib.urlopen("http://www.python.org")
>>> fileTemp = open("temp.html", "w")
>>> web_page_contents = web_page.read()
>>> fileTemp.write(web_page_contents)
>>> fileTemp.close()

Here are the first two lines of temp.html:

1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
2 <html lang="en" xml:lang="en"
xmlns="http://www.w3.org/1999/xhtml">

Here are the first two lines of www.python.org as saved from Firefox:

1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
2 <html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"
lang="en"><head>

Lines one are identical. Lines two are different. Why would lines two
differ? Hmmmm...

Thanks,
Pete

Sep 24 '06 #6

Pete

> > Can anyone point me to a tutorial or other reference on how to "get" all
> > of the html contents at a particular page?
>
> Why not use httrack?
>
> http://www.satzbau-gmbh.de/staff/abel/httrack-py/
>
> Sincerely,
>
> Wolfgang Keller
>
> --
> My email-address is correct.
> Do NOT remove ".nospam" to reply.

Thanks for the tip. I'll check that out. Is that your code?
--
Pete

Sep 24 '06 #7

Rainy

Pete wrote:

> > > The file "temp.html" is definitely different than the first run, but
> > > still not anything close to www.python.org. Any other suggestions?
> >
> > If you mean that the page looks different in a browser, for one thing
> > you have to download the css files too. Here's the relevant extract
> > from the main page:
> >
> > <link media="screen" href="styles/screen-switcher-default.css"
> > type="text/css" id="screen-switcher-stylesheet" rel="stylesheet" />
> > <link media="scReen" href="styles/netscape4.css" type="text/css"
> > rel="stylesheet" />
> > <link media="print" href="styles/print.css" type="text/css"
> > rel="stylesheet" />
> > <link media="screen" href="styles/largestyles.css" type="text/css"
> > rel="alternate stylesheet" title="large text" />
> > <link media="screen" href="styles/defaultfonts.css" type="text/css"
> > rel="alternate stylesheet" title="default fonts" />
> >
> > You may either hardcode the urls of the css files, or parse the page,
> > extract the css links and normalize them to absolute urls. The first is
> > simpler but the second is more robust, in case a new css is added or an
> > existing one is renamed or removed.
> >
> > George
>
> Thanks for the information on CSS. I'll look into that later, but now
> my question is on the first two lines of HTML code. Here's my latest
> python code:
>
> >>> import urllib
> >>> web_page = urllib.urlopen("http://www.python.org")
> >>> fileTemp = open("temp.html", "w")
> >>> web_page_contents = web_page.read()
> >>> fileTemp.write(web_page_contents)
> >>> fileTemp.close()
>
> Here are the first two lines of temp.html:
>
> 1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> 2 <html lang="en" xml:lang="en"
> xmlns="http://www.w3.org/1999/xhtml">
>
> Here are the first two lines of www.python.org as saved from Firefox:
>
> 1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> 2 <html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"
> lang="en"><head>
>
> Lines one are identical. Lines two are different. Why would lines two
> differ? Hmmmm...

Functionally they are the same, but Firefox's line 2 also includes the
start of the third line (<head>). Opera's View Source command produces
the same result as Python. It looks like Firefox makes some cosmetic
changes to the source when it saves a page, but nothing that would
change the way the code works. Notice that the attributes in the second
line are only rearranged in order?
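
One way to check the "only rearranged" claim is to parse both versions of the tag and compare the attributes as a dictionary, where order does not matter. This is a sketch assuming Python 3's standard html.parser; FirstTagParser and first_tag are names invented for this example:

```python
from html.parser import HTMLParser


class FirstTagParser(HTMLParser):
    """Remember the name and attributes of the first start tag seen."""

    def __init__(self):
        super().__init__()
        self.first = None

    def handle_starttag(self, tag, attrs):
        if self.first is None:
            # dict() drops the attribute order, which is exactly
            # the difference we want to ignore.
            self.first = (tag, dict(attrs))


def first_tag(html_text):
    """Return (tag_name, attribute_dict) for the first tag in html_text."""
    parser = FirstTagParser()
    parser.feed(html_text)
    parser.close()
    return parser.first


if __name__ == "__main__":
    from_python = '<html lang="en" xml:lang="en"\nxmlns="http://www.w3.org/1999/xhtml">'
    from_firefox = '<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"\nlang="en"><head>'
    print(from_python == from_firefox)                        # the raw text differs
    print(first_tag(from_python) == first_tag(from_firefox))  # the parsed tags match
```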

> Thanks,
> Pete

Sep 24 '06 #8

Felipe Almeida Lessa

24 Sep 2006 10:09:16 -0700, Rainy <ak@silmarill.org>:

> Functionally they are the same, but third line included in Firefox.
> Opera View Source command produces the same result as Python.

[snip]

It's better to compare with the result of a downloader-only (instead
of a parser), like wget on Unix. That way you'll get exactly the same
bytes (assuming the page is static).
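
A rough sketch of that comparison, assuming Python 3 (where urllib.urlopen has become urllib.request.urlopen). The helper names fetch_bytes, sha256_of_file and same_bytes are made up for this example, and the wget output is assumed to have been saved to a file beforehand:

```python
import hashlib
import urllib.request


def fetch_bytes(url):
    """Download the raw response body without decoding or rewriting it."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def sha256_of_file(path):
    """Hash a file in binary mode so that no newline translation is applied."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b):
    """True if the two files have the same SHA-256 digest, i.e. the same content."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```

For a fair comparison, write the result of fetch_bytes to temp.html with open("temp.html", "wb"), then call same_bytes("temp.html", "index.html"), where index.html stands for whatever file wget produced.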

--
Felipe.

Sep 24 '06 #9

Pete

> > Functionally they are the same, but third line included in Firefox.
> > Opera View Source command produces the same result as Python.
> [snip]
>
> It's better to compare with the result of a downloader-only (instead
> of a parser), like wget on Unix. That way you'll get exactly the same
> bytes (assuming the page is static).
>
> --
> Felipe.

Ahhhh. wget - most cool. My temp.html matches wget. Now to capture that
pesky css stuff...

Thanks,
Pete

Sep 24 '06 #10
