Getting the size of sourcecode

Hello,

I am trying to input a spreadsheet of possible domain names and output the length of the sourcecode of the webpage (if it exists). In doing this, I have three small questions (I am a newbie and apologize if the questions are simple):

1. How do I convert the length of the page to a string? I have looked around the web for Python 'tostring' and found several individually created functions, but I tried a few and had problems.

2. What is the best way to handle errors when a domain phrase doesn't lead to a good website? This will happen (I think) with the line z=br.open('http://www.'+domainTerm), for which the domainTerm might not lead to an active website.

3. Instead of getting the total number of characters on the sourcepage (which I get by looking at len(page) ), is there any way to get the number of lines?

Thank you,
Mitch

from mechanize import Browser
import re, time, urllib2

def MakeBrowser():
    b = Browser()
    headerString = 'mozilla/5.0 (x11; u; linux i686; en-us; rv:1.7.12) ' + \
                   'gecko/20050922 firefox/1.0.7 (debian package 1.0.7-1)'
    h = [('User-agent', headerString)]
    b.addheaders = h
    b.set_handle_robots(False)
    return(b)

f = open('bizornot1.csv','r')
lines = f.readlines()
f.close()
f2 = open('bizornot1_new.csv','w')
f2.write(lines[0].rstrip()+',PageSize'+"\n")
print(lines[0].rstrip()+",PageSize")

for i in range(len(lines)-1):
    domainTerm = domainTerms[i]
    br = MakeBrowser()
    z=br.open('http://www.'+domainTerm)
    page=z.read()
    f2.write(lines[i+1].rstrip()+','+len(page)+"\n")
    print(lines[i+1].rstrip()+','+len(page))

f2.close()
Aug 9 '07 #1
bvdet
I am not familiar with the 'mechanize' module. You can do the following with the 'urllib' module:
from urllib import urlopen

h = urlopen('http://www.somewebsite.com/')

# source = h.read() # read page into a string
lineList = h.readlines() # read page into a list of strings
info = h.info()
trueURL = h.geturl()

print ('The number of lines is %d' % len(lineList))

print 'The number of words is %d' % sum([len(line.strip().split()) for line in lineList])

h.close()

try:
    h = urlopen('http://www.invalidURL.com/')
except IOError, e:
    print e
>>> The number of lines is 110
The number of words is 807
[Errno socket error] (7, 'getaddrinfo failed')

>>> info
<httplib.HTTPMessage instance at 0x00E7A3C8>
>>> print info
Date: Thu, 09 Aug 2007 02:25:43 GMT
Server: Apache
Last-Modified: Tue, 07 Aug 2007 02:41:23 GMT
ETag: "744141-2b29-46b7dbd3"
Accept-Ranges: bytes
Content-Length: 11049
Connection: close
Content-Type: text/html

>>> trueURL
'http://www.bvdetailing.com/'
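For anyone reading this with a newer interpreter: the code above is Python 2 only (`from urllib import urlopen`, the `print` statement, `except IOError, e`). A rough Python 3 counterpart might look like the sketch below. The helper names `count_lines_and_words` and `describe_page` are made up for illustration, and I am assuming the standard `urllib.request` module rather than mechanize.

```python
from urllib.request import urlopen


def count_lines_and_words(line_list):
    """Return (number of lines, number of whitespace-separated words)."""
    words = sum(len(line.split()) for line in line_list)
    return len(line_list), words


def describe_page(url):
    """Fetch `url`; return its final URL, headers, line count and word count."""
    with urlopen(url) as h:
        line_list = h.readlines()  # a list of bytes objects in Python 3
        n_lines, n_words = count_lines_and_words(line_list)
        # geturl() and info() mirror the calls used in the Python 2 code above
        return h.geturl(), h.info(), n_lines, n_words
```

Calling `describe_page('http://www.somewebsite.com/')` needs a network connection; `count_lines_and_words` can be tried on any list of lines.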
Aug 9 '07 #2
1. How do I convert the length of the page to a string? I have looked around the web for Python 'tostring' and found several individually created functions, but I tried a few and had problems.
Please clarify. If you have read the page into a variable, it is (likely) already a string. The str() function converts to a string but I do not think that is what you want.
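If the trouble is the `f2.write(...)` line in the original script, then str() may be what is needed after all: len(page) is an integer, and adding an integer to a string raises a TypeError. A small sketch, assuming Python 3 syntax and made-up stand-in values for `row` and `page`:

```python
# Hypothetical stand-ins for one CSV row and one downloaded page.
row = "example.com,widgets\n"
page = "<html>\n<body>hello</body>\n</html>\n"

# len() returns an int; "text" + 5 raises TypeError,
# so convert the number explicitly with str() ...
out_concat = row.rstrip() + "," + str(len(page)) + "\n"

# ... or let string formatting do the conversion.
out_format = "%s,%d\n" % (row.rstrip(), len(page))

print(out_concat, end="")
```

Both forms build the same output line, so the choice is a matter of taste.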

2. What is the best way to handle errors when a domain phrase doesn't lead to a good website?
What error does your module spit back? Wrap the code you have in a try/except block specifying this error, and then do what you want when it happens. For example, if the error is BadSillyError, do this:
try:
    # code to run
    pass
except BadSillyError:
    # what to do on failure
    pass
else:
    # continue with the rest of the code if the error does not happen
    pass
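To make the pattern concrete, here is a sketch using the standard library instead of mechanize (which exceptions mechanize raises is something to check by running the failing line). It assumes Python 3, and `page_size` and its `opener` parameter are hypothetical names added so the function can be tried without a network connection.

```python
import urllib.error
import urllib.request


def page_size(url, opener=urllib.request.urlopen):
    """Return the page length in bytes, or None if the URL cannot be fetched.

    `opener` is a hypothetical hook for testing; by default it is
    urllib.request.urlopen.
    """
    try:
        response = opener(url)
    except (urllib.error.URLError, ValueError) as err:
        # URLError covers DNS failures, refused connections and HTTP errors;
        # ValueError is raised for malformed URLs.
        print("could not open %s: %s" % (url, err))
        return None
    else:
        # Runs only when opening succeeded.
        with response:
            return len(response.read())
```

In the loop from the original post, a None result could be written out as an empty PageSize cell so that one dead domain does not stop the whole run.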
3. Instead of getting the total number of characters on the sourcepage (which I get by looking at len(page) ), is there any way to get the number of lines?
If you know lines are separated by newline characters ('\n') you can do something like:
lines = page.split('\n')
number_of_lines = len(lines)
But then you might want to eliminate blank lines. Or comments.
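One way to handle the blank-line case, sketched with str.splitlines() rather than split('\n'), since pages served with '\r\n' line endings or a trailing newline make split('\n') over-count. The function name `count_lines` and the sample `page` are invented for the example, and the syntax assumes Python 3.

```python
def count_lines(page, skip_blank=False):
    """Count the lines in `page`, optionally ignoring blank ones."""
    # splitlines() handles \n, \r\n and \r, and does not produce a
    # trailing empty entry when the text ends with a newline.
    lines = page.splitlines()
    if skip_blank:
        lines = [line for line in lines if line.strip()]
    return len(lines)


page = "<html>\r\n\r\n<body>hi</body>\r\n</html>\r\n"
print(count_lines(page))                   # 4
print(count_lines(page, skip_blank=True))  # 3
print(len(page.split("\n")))               # 5, counts the empty tail
```

Skipping comments would need an HTML-aware approach and is not attempted here.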
Aug 9 '07 #3
I have just published a full Line Of Code Counter that you can adapt to your purpose.
Aug 9 '07 #4
mh121
Thank you very much for your comments. What I meant for the destring function was actually just the str() function you provided. I tried out many of your suggestions today and, after trying out more tomorrow, if I continue to have questions, I will repost.
Aug 10 '07 #5
