Bytes IT Community

Getting the size of sourcecode

P: 7
Hello,

I am trying to input a spreadsheet of possible domain names and output the length of the sourcecode of the webpage (if it exists). In doing this, I have three small questions (I am a newbie and apologize if the questions are simple):

1. How do I convert the length of the page to a string? I have looked around the web for Python 'tostring' and found several individually created functions, but I tried a few and had problems.

2. What is the best way to handle errors when a domain phrase doesn't lead to a good website? This will happen (I think) with the line z=br.open('http://www.'+domainTerm)
for which the domainTerm might not lead to an active website.

3. Instead of getting the total number of characters on the sourcepage (which I get by looking at len(page) ), is there any way to get the number of lines?

Thank you,
Mitch

from mechanize import Browser
import re, time, urllib2

def MakeBrowser():
    b = Browser()
    headerString = 'mozilla/5.0 (x11; u; linux i686; en-us; rv:1.7.12) ' + \
        'gecko/20050922 firefox/1.0.7 (debian package 1.0.7-1)'
    h = [('User-agent', headerString)]
    b.addheaders = h
    b.set_handle_robots(False)
    return(b)

f = open('bizornot1.csv','r')
lines = f.readlines()
f.close()
f2 = open('bizornot1_new.csv','w')
f2.write(lines[0].rstrip()+',PageSize'+"\n")
print(lines[0].rstrip()+",PageSize")

for i in range(len(lines)-1):
    domainTerm = domainTerms[i]
    br = MakeBrowser()
    z=br.open('http://www.'+domainTerm)
    page=z.read()
    f2.write(lines[i+1].rstrip()+','+len(page)+"\n")
    print(lines[i+1].rstrip()+','+len(page))

f2.close()
Aug 9 '07 #1
4 Replies


bvdet
Expert Mod 2.5K+
P: 2,851
I am not familiar with the 'mechanize' module. You can do the following with the 'urllib' module:
from urllib import urlopen

h = urlopen('http://www.somewebsite.com/')

# source = h.read() # read page into a string
lineList = h.readlines() # read page into a list of strings
info = h.info()
trueURL = h.geturl()

print 'The number of lines is %d' % len(lineList)

print 'The number of words is %d' % sum([len(line.strip().split()) for line in lineList])

h.close()

try:
    h = urlopen('http://www.invalidURL.com/')
except IOError, e:
    print e
>>> The number of lines is 110
The number of words is 807
[Errno socket error] (7, 'getaddrinfo failed')

>>> info
<httplib.HTTPMessage instance at 0x00E7A3C8>
>>> print info
Date: Thu, 09 Aug 2007 02:25:43 GMT
Server: Apache
Last-Modified: Tue, 07 Aug 2007 02:41:23 GMT
ETag: "744141-2b29-46b7dbd3"
Accept-Ranges: bytes
Content-Length: 11049
Connection: close
Content-Type: text/html

>>> trueURL
'http://www.bvdetailing.com/'
>>>
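For readers on Python 3, the counting part of the snippet above might look like the sketch below. It assumes Python 3, where `urlopen` lives in `urllib.request` and the response yields bytes rather than str. The helper name `count_lines_and_words` is made up for illustration, and an in-memory `io.BytesIO` stands in for the response so the example runs without a network connection.

```python
import io


def count_lines_and_words(response):
    """Return (line_count, word_count) for a file-like object such as
    the one returned by urllib.request.urlopen()."""
    line_list = response.readlines()  # one bytes object per line
    word_count = sum(len(line.split()) for line in line_list)
    return len(line_list), word_count


# Stand-in for a real response (assumption: no network in this sketch).
# With a live site this would be roughly:
#     from urllib.request import urlopen
#     with urlopen('http://www.somewebsite.com/') as h:
#         print(count_lines_and_words(h))
fake_response = io.BytesIO(b"<html>\n<p>hello world</p>\n</html>\n")
print("lines=%d words=%d" % count_lines_and_words(fake_response))
```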
Aug 9 '07 #2

P: 5
1. How do I convert the length of the page to a string? I have looked around the web for Python 'tostring' and found several individually created functions, but I tried a few and had problems.
Please clarify. If you have read the page into a variable, it is (likely) already a string. The str() function converts to a string but I do not think that is what you want.
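If the goal is to write the page length into the CSV row, a minimal sketch follows. It assumes `page` is the string returned by `z.read()`; the helper name `row_with_size` and the sample values are hypothetical. Concatenating a str with the int returned by `len(page)` raises a TypeError, so the number is converted with `str()` first.

```python
def row_with_size(row, page):
    """Append the length of `page` to a CSV row as a new column."""
    # len(page) is an int; str() turns it into text so '+' can join it.
    return row.rstrip() + ',' + str(len(page)) + '\n'


# Hypothetical sample data standing in for a CSV line and a fetched page.
print(row_with_size('example.com\n', '<html></html>'), end='')
```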

2. What is the best way to handle errors when a domain phrase doesn't lead to a good website?
What error does your module spit back? Wrap the code you have in a try/except block specifying this error, and then do what you want when it happens. For example, if the error is BadSillyError, do this:
try:
    pass  # code to run
except BadSillyError:
    pass  # what to do on failure
else:
    pass  # continue with the rest of the code if the error does not happen

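As a concrete version of that pattern, here is a sketch for the dead-domain case. It assumes Python 3's `urllib.request` rather than mechanize, whose exceptions are not shown in this thread. `page_size` and the injected `opener` parameter are hypothetical names, and the 10-second timeout is arbitrary. In Python 3, `urllib.error.URLError` is a subclass of `OSError`, so catching `OSError` also covers failed DNS lookups and refused connections.

```python
import urllib.request


def page_size(url, opener=urllib.request.urlopen):
    """Return the number of bytes in the page at `url`, or None on failure."""
    try:
        response = opener(url, timeout=10)
    except (OSError, ValueError) as exc:
        # OSError: DNS failure, refused connection, HTTP error status.
        # ValueError: the URL itself is malformed.
        print('skipping %s: %s' % (url, exc))
        return None
    else:
        # Runs only when opening succeeded.
        with response:
            return len(response.read())
```

In the original loop, a call like this could stand in for the bare `br.open(...)` line, with an empty PageSize column written whenever the result is None.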
3. Instead of getting the total number of characters on the sourcepage (which I get by looking at len(page) ), is there any way to get the number of lines?
If you know the lines are separated by newline characters ('\n'), you can do something like:
lines = page.split('\n')
number_of_lines = len(lines)

But then you might want to eliminate blank lines, or comments.
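One caveat with `split('\n')`: a page that ends with a newline yields a trailing empty string, so the count comes out one too high. Below is a sketch using `str.splitlines()` with an optional blank-line filter. The function name `count_lines` and its `skip_blank` flag are hypothetical, and filtering out comments is not attempted here.

```python
def count_lines(page, skip_blank=False):
    """Count the lines in `page`, optionally ignoring whitespace-only lines."""
    lines = page.splitlines()  # handles \n, \r\n and \r; no trailing ''
    if skip_blank:
        lines = [line for line in lines if line.strip()]
    return len(lines)


# Made-up sample page with one blank line and mixed line endings.
sample = '<html>\n\n<body></body>\r\n</html>\n'
print(count_lines(sample), count_lines(sample, skip_blank=True))
```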
Aug 9 '07 #3

P: 5
I have just published a full Line Of Code Counter that you can adapt to your purpose.
Aug 9 '07 #4

P: 7
Thank you very much for your comments. What I meant for the destring function was actually just the str() function you provided. I tried out many of your suggestions today and, after trying out more tomorrow, if I continue to have questions, I will repost.
Aug 10 '07 #5
