Hello,
I am trying to read a spreadsheet of possible domain names and output the length of the source code of each webpage (if it exists). In doing this, I have three small questions (I am a newbie and apologize if the questions are simple):
1. How do I convert the length of the page to a string? I have looked around the web for Python 'tostring' and found several individually created functions, but I tried a few and had problems.
2. What is the best way to handle errors when a domain phrase doesn't lead to a good website? I think this will happen at the line z=br.open('http://www.'+domainTerm), where domainTerm might not lead to an active website.
3. Instead of getting the total number of characters in the page source (which I get from len(page)), is there any way to get the number of lines?
Thank you,
Mitch
    from mechanize import Browser
    import re, time, urllib2

    def MakeBrowser():
        b = Browser()
        headerString = 'mozilla/5.0 (x11; u; linux i686; en-us; rv:1.7.12) ' + \
                       'gecko/20050922 firefox/1.0.7 (debian package 1.0.7-1)'
        h = [('User-agent', headerString)]
        b.addheaders = h
        b.set_handle_robots(False)
        return(b)

    f = open('bizornot1.csv','r')
    lines = f.readlines()
    f.close()
    f2 = open('bizornot1_new.csv','w')
    f2.write(lines[0].rstrip()+',PageSize'+"\n")
    print(lines[0].rstrip()+",PageSize")
    for i in range(len(lines)-1):
        domainTerm = domainTerms[i]
        br = MakeBrowser()
        z=br.open('http://www.'+domainTerm)
        page=z.read()
        f2.write(lines[i+1].rstrip()+','+len(page)+"\n")
        print(lines[i+1].rstrip()+','+len(page))
    f2.close()
bvdet (Expert Mod):
I am not familiar with the 'mechanize' module. You can do the following with the 'urllib' module:

    from urllib import urlopen

    h = urlopen('http://www.somewebsite.com/')

    # source = h.read()        # read page into a string
    lineList = h.readlines()   # read page into a list of strings
    info = h.info()
    trueURL = h.geturl()

    print ('The number of lines is %d' % len(lineList))
    print 'The number of words is %d' % sum([len(line.strip().split()) for line in lineList])

    h.close()

    try:
        h = urlopen('http://www.invalidURL.com/')
    except IOError, e:
        print e

    '''
    >>> The number of lines is 110
    The number of words is 807
    [Errno socket error] (7, 'getaddrinfo failed')

    >>> info
    <httplib.HTTPMessage instance at 0x00E7A3C8>
    >>> print info
    Date: Thu, 09 Aug 2007 02:25:43 GMT
    Server: Apache
    Last-Modified: Tue, 07 Aug 2007 02:41:23 GMT
    ETag: "744141-2b29-46b7dbd3"
    Accept-Ranges: bytes
    Content-Length: 11049
    Connection: close
    Content-Type: text/html

    >>> trueURL
    'http://www.bvdetailing.com/'
    >>>
    '''
1. How do I convert the length of the page to a string? I have looked around the web for Python 'tostring' and found several individually created functions, but I tried a few and had problems.
Please clarify. If you have read the page into a variable, it is (likely) already a string. The str() function converts to a string but I do not think that is what you want.
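For reference, the concatenation in the original script fails because len(page) is an int, and Python will not add an int to a string. Below is a minimal sketch of the str() fix. The row and page values are made-up stand-ins for one CSV line and one downloaded page, not data from the original post.

```python
# Hypothetical stand-ins for one CSV row and one downloaded page.
row = 'example.com,some other column\n'
page = '<html>\n<body>hello</body>\n</html>\n'

# len() returns an int; str() turns it into text so it can be concatenated.
out_line = row.rstrip() + ',' + str(len(page)) + '\n'

# String formatting performs the conversion as part of building the line.
out_line_fmt = '%s,%d\n' % (row.rstrip(), len(page))

print(out_line.rstrip())
```

Either form produces the same output line, so the choice is mostly a matter of taste.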
2. What is the best way to handle errors when a domain phrase doesn't lead to a good website?
What error does your module spit back? Wrap the code you have in a try/except block specifying this error, and then do what you want when it happens. For example, if the error is BadSillyError, do this:

    try:
        pass  # code to run
    except BadSillyError:
        pass  # what to do on failure
    else:
        pass  # continue with rest of code if error does not happen
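Applied to the loop in the question, the pattern might look like the sketch below. It assumes the failure surfaces as an IOError, as it does in the urllib example earlier in the thread; mechanize may raise something more specific, so check the traceback first. The names page_size and broken_opener are hypothetical helpers, and the opener is passed in as an argument so the example runs without a network connection.

```python
def page_size(open_url, url):
    """Return the length of the page at url, or None if it cannot be opened.

    open_url is any callable that returns a file-like object,
    for example br.open or urllib's urlopen (supplied by the caller).
    """
    try:
        response = open_url(url)
    except IOError as e:  # assumption: the opener signals failure with IOError
        print('Could not open %s: %s' % (url, e))
        return None
    else:
        # Only runs when open_url() did not raise.
        page = response.read()
        response.close()
        return len(page)


def broken_opener(url):
    # Hypothetical opener that always fails, standing in for a dead domain.
    raise IOError('getaddrinfo failed')


size = page_size(broken_opener, 'http://www.invalidURL.com/')
# Write an empty cell when the site could not be reached.
cell = '' if size is None else str(size)
```

Returning None for a dead domain lets the calling loop keep going and decide what to write in the PageSize column, rather than stopping at the first bad name.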
3. Instead of getting the total number of characters on the sourcepage (which I get by looking at len(page) ), is there any way to get the number of lines?
If you know lines are separated by newline characters ('\n'), you can do something like:

    lines = page.split('\n')
    number_of_lines = len(lines)

But then you might want to eliminate blank lines. Or comments.
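A short sketch of both counts follows, using a made-up page string as a stand-in for the result of z.read(). It swaps in str.splitlines() for split('\n'), since splitlines() also copes with '\r\n' line endings and does not produce a trailing empty entry when the page ends with a newline.

```python
# Hypothetical page source; the real one would come from z.read().
page = '<html>\r\n\r\n<body>\r\n  hello\r\n</body>\r\n</html>\r\n'

lines = page.splitlines()
number_of_lines = len(lines)

# Skip lines that are empty or contain only whitespace.
non_blank = [line for line in lines if line.strip()]
number_of_non_blank_lines = len(non_blank)

print('%d lines, %d non-blank' % (number_of_lines, number_of_non_blank_lines))
```

Filtering out comments would need an HTML-aware approach, which is beyond this sketch.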
Thank you very much for your comments. The 'tostring' function I was looking for turned out to be just the str() function you mentioned. I tried out many of your suggestions today; I will try more tomorrow and repost if I still have questions.