Bytes | Software Development & Data Engineering Community

wgetting the crawled links only

Hi everyone,
this is my first thread since I just joined. Does anyone know how to crawl a particular URL using Python? I tried to build a breadth-first sort of crawler but have had little success.
With wget, if you are more familiar with it than I am, how can I get it to output the crawled links (the links, not the actual HTML content) to a file?

currently I have something like:
wget -q -E -O outfile --proxy-user=username --proxy-password=mypassword -r http://www.museum.vic.gov.au

but it writes the actual crawled HTML content to 'outfile', and I only want the crawled links.

Thank you in advance
Jul 23 '07 #1
bartonc
The closest thing that my searches turned up was this thread, but it didn't get very far along.

If kudos were around, he'd know what to tell you.
Sorry to not be of more help than that.

Welcome to the Python Forum on TheScripts.
Jul 23 '07 #2
Motoma
You can use the urllib module to read the raw HTML from a page. You could then run a regex or a find to snag all of the href tags from pages and append them to your list of URLs to search.
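That href-snagging idea can be sketched in a few lines. This is a minimal illustration only: the regex handles just double-quoted href attributes, and a real crawler would still need to resolve relative URLs.

```python
import re

def extract_links(html):
    # grab the value of every double-quoted href="..." attribute
    return re.findall(r'href="([^"]*)"', html)

sample = '<a href="/exhibitions/">Exhibits</a> and <a href="contact.html">Contact</a>'
extract_links(sample)  # → ['/exhibitions/', 'contact.html']
```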

I am interested in what you are doing. Would you mind posting your code, or giving an explanation of what you are working on?
Jul 23 '07 #3
ymic8
Hi there my friends, thank you all for the replies.
My current code actually uses urllib and a regex such as re.findall(r'<p.*p>', page) (which greedily grabs all the paragraph text, but somehow doesn't work for some HTML pages; I wonder why...)
I am working on an honours year-long project in partnership with the Melbourne Museum, Australia. The project develops a software prototype that automatically and appropriately recommends exhibits for a user to visit next, if and only if the recommended exhibit is judged to match the user's personal needs and interests. In comparison to the research proposal, the goals of this software have been refined as:
1. To predict when and where should a recommendation be made. This is the automatic component of the software;
2. To choose an exhibit that best represents or satisfies the user's information need at a given time, based on a dynamically determined profile for this particular user. This is the user-personalisation component of the software.
...
I just copied and pasted some of that from my progress report. For the web crawling part, I just need to crawl the museum webspace and get all the museum links (maybe not all; I just want ~5000 of them) so I can extract the semantic content of the pages via these URLs and process that content in some way.

wget won't give me the links; it only gives me the content, which I don't actually want at this stage. I only want the links, so I can first of all filter out the .mp3s, .exes, .phps ... etc.
w3mir doesn't get me much further either, because every time I try to get the links it gives me 'connection-timeout'.
...
that's why I want to use my own 'personalised' version of a crawler, which works fine, but that's not the centre of my project; it's still interesting though...
thank you for your attention to a newbie like me :-)
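The extension filtering mentioned above can be sketched with the standard library (a hypothetical helper; Python 3 module names are shown, and in the Python 2 of this thread `urlparse` is a top-level module):

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
import posixpath

SKIP_EXTENSIONS = {'.mp3', '.exe', '.php', '.gif', '.jpg', '.pdf'}

def keep_link(url):
    # test only the extension of the URL's path component, so a substring
    # like '.php' inside a directory name cannot reject the whole URL
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    return ext not in SKIP_EXTENSIONS

keep_link('http://www.museum.vic.gov.au/exhibitions/index.asp')  # kept
keep_link('http://www.museum.vic.gov.au/audio/tour.mp3')         # filtered out
```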
Jul 28 '07 #4
ymic8
8
my crawler is BFS by the way; my initial version was DFS, which got caught in a spider trap on the web.
there are still some minor modifications to make; after that, if anyone wants, I will post it here.
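That breadth-first idea can be sketched minimally as below (`get_links` is a hypothetical stand-in for whatever function extracts a page's outlinks). The visited set is what stops a cycle in the link graph from trapping the crawler, which is the failure mode a naive DFS runs into.

```python
from collections import deque

def bfs_crawl(start, get_links, max_pages=5000):
    # visit pages level by level; the 'seen' set guarantees each URL is
    # queued at most once, so cycles (spider traps) cannot loop forever
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for child in get_links(url):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order

# toy link graph with a cycle: 'a' links to 'b', and 'b' links back to 'a'
graph = {'a': ['b', 'c'], 'b': ['a'], 'c': []}
bfs_crawl('a', lambda url: graph.get(url, []))  # → ['a', 'b', 'c']
```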
Jul 28 '07 #5
bartonc
Yes, please do. Many good things may come from sharing ideas and experience in this way.

Thank you for the wonderful details of your project, too.
Jul 28 '07 #6
ymic8
no worries. As I said, I am a newbie, and I am sure many of you can see that I have a lot of redundant or clumsy code; if you know some easy, Python-built-in methods that I could have used, please feel free to point them out, and I will thank you in advance.

actually, before the code, can I just ask something else?
I want to grab the textual content of http://www.museum.vic.gov.au/dinosau...immensity.html
now, I tried:

page = urlopen('http://www.museum.vic.gov.au/dinosaurs/time-immensity.html').read()
print re.findall(r'<p>.*p>', page)

so I want it to give me all the content (including some tagged stuff) between the leftmost <p> and the rightmost </p>. This works for some URLs, but not the one above; can anyone suggest why, and give me a better regex? thx

This is museum_crawler.py; please comment on my amateur code, thx
import re, string, os, sys
from nltk_lite.corpora import brown, extract, stopwords
from nltk_lite.wordnet import *
from nltk_lite.probability import FreqDist
from urllib import urlopen

docs = {}
pagerank = {}
MAX_DOCS = 3000
MIN_CHILDLINK_LEN = 8
irresites = ["mvmembers", "e-news", "education", "scienceworks", "immigration",
             "tenders", "http://www.museum.vic.gov.au/about", "bfa", "search",
             "ed-online", "whatson", "whats_on", "privacy", "siteindex",
             "rights", "disclaimer", "contact", "volunteer"]
mvic = 'http://www.museum.vic.gov.au'
mmel = 'http://melbourne.museum.vic.gov.au'

def purify_irr(link):
    """Only keep the content links related to exhibits."""
    for term in irresites:
        if term in link:
            return False
    return True

def haspunc(word):
    """Return the offset of the first punctuation char in 'word', else None."""
    offset = 0
    for char in word:
        if char in string.punctuation:
            return offset
        offset += 1
    return None

def nextnonpunc(word):
    """Return the position of the first char that is not punctuation."""
    for i in range(len(word)):
        if word[i] not in string.punctuation:
            return i
    return 100

def valid_page(url):
    """Reject URLs that have 'irrelevant' meanings and extensions (types)."""
    link = url.lower()
    if purify_irr(link):
        for ext in ['.gif', '.pdf', '.css', '.jpg', '.ram', '.mp3', '.exe', '.sit', '.php']:
            if ext in link:
                return False
        return True
    return False

def informative_page(link, page):
    """Reject URLs that refer to the central-information sites."""
    if 'page not found' not in page.lower() and valid_page(link):
        if not (link in [mvic, mvic+'/', mvic+'/index.asp']) and \
           not (link in [mmel, mmel+'/', mmel+'/index.asp']):
            return True
    return False

def count(word, char):
    """Count the number of occurrences of 'char' in 'word'."""
    total = 0
    for ch in word:
        if ch == char:
            total += 1
    return total

def rootify(link):
    """Force a URL to be truncated to its 'directory URL', i.e. ending with '/'."""
    root = link
    if link[-1] != '/':
        if count(link, '/') < 3 and 'http' not in link:
            return
        for i in range(len(link)):
            if count(link, '/') < 3:
                return link + '/'
            if link[-i-1] == '/':
                root = link[:-i]
                break
    else:
        root = link
    return root

def tail(st):
    """Get the end of a URL after the rightmost '/': tail('aaa/bbb/ccc') => 'ccc'."""
    if st[-1] == '/':
        st = st[:-1]
    for i in range(len(st)):
        if st[-i] == '/':
            break
    return st[-i+1:]

def getContent(link):
    """Get the content of the page, checking its type first."""
    try:
        if not valid_page(link):
            return None
        return urlopen(link).read()
    except IOError:
        return None

def outlinks(link):
    """Return the outlinks of 'link' (full URLs, even for relative links)."""
    print '=>', link
    givup = False
    temp = []
    link = link[link.index('http'):]
    if link[-1] == '"':
        link = link[:-1]

    root = rootify(link)
    page = getContent(link)
    if page == None:
        return None
    if page not in docs.values() and informative_page(link, page):
        docs[link] = page
        temp.append(link)
    children = re.findall(r'href=.+?"', page)
    children = [child for child in children if valid_page(child)]

    for link in children:
        if 'http' in link:
            link = link[link.index('http'):]
            if link[-1] == '"':
                link = link[:-1]
            if 'museum' in link.lower():
                page = getContent(link)
                if page == None:
                    return None
                if page not in docs.values():
                    if informative_page(link, page):
                        temp.append(link)
                        docs[link] = page
        elif len(link) < MIN_CHILDLINK_LEN:
            continue
        else:
            if link[6:-1][0] == '/':
                rest = link[7:-1]
            else:
                rest = link[6:-1]
            start = nextnonpunc(rest)
            if start == 100:
                continue
            link_option = ''
            rest = rest[start:]
            if '/' in rest and '/' in root:
                child_first_comp = rest[:rest.index('/')]
                parent_last_comp = tail(root)
                if child_first_comp.lower() == parent_last_comp.lower():
                    # if the relative link has an overlapping component with root
                    # e.g. /CSIRAC/history... and /history should be the same
                    # relative links, but they result in diff URLs, thus need to
                    # have a 'backup' link to be urlopened in some cases
                    link_option = root + rest[rest.index('/')+1:]
            link = root + rest
            if not givup and 'museum' in link.lower():
                page = getContent(link)
                if (page != None) and page not in docs.values():
                    if informative_page(link, page):
                        temp.append(link)
                        docs[link] = page
                else:
                    if link_option != '':
                        page = getContent(link_option)
                        if (page != None) and page not in docs.values():
                            if informative_page(link_option, page):
                                temp.append(link_option)
                                docs[link_option] = page
    pagerank[root] = temp
    for link in temp:
        print "  --  ", link
    return temp


def crawler(link):
    link = link[link.index('http'):]
    if link[-1] == '"':
        link = link[:-1]
    root = link
    if root[-1] != '/':
        if count(root, '/') < 3:
            return
        for i in range(len(root)):
            if root[-i-1] == '/':
                root = root[:-i]
                break

    page = getContent(link)
    if page == None:
        return
    if page not in docs.values():
        docs[link] = page
    to_scan = [root]
    while len(to_scan) > 0 and len(docs) < MAX_DOCS:
        childlinks = []
        for parent in to_scan:
            out = outlinks(parent)
            if out:
                # just in case some links are dead or invalid
                childlinks.extend(out)
        to_scan = list(set(childlinks))

if __name__ == '__main__':
    crawler("http://melbourne.museum.vic.gov.au/exhibitions/")
    fp = open('docs.py', 'a')
    for key in docs:
        fp.write(key + '\n')
    fp.close()
    fp = open('pagerank.py', 'a')
    for key in pagerank:
        fp.write(key + '\n')
        for outlink in pagerank[key]:
            fp.write('    ' + outlink + '\n')
    fp.close()
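As a side note on the hand-rolled rootify/tail logic above: the standard library can resolve relative links against a base URL, which sidesteps the component-overlap special case entirely. A sketch (Python 3 spelling shown; in the Python 2 used in this thread the function lives in the `urlparse` module):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = 'http://melbourne.museum.vic.gov.au/exhibitions/'
urljoin(base, 'csirac/history.asp')  # resolved relative to the /exhibitions/ directory
urljoin(base, '/about/')             # a leading '/' resolves against the site root
urljoin(base, '../whatson/')         # '..' climbs out of the base directory
```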
Jul 28 '07 #7
ymic8
how come it doesn't show up? maybe my last post is too long?
Jul 28 '07 #8
ymic8
why doesn't my post show up?
Jul 28 '07 #9
ymic8
(ha, so i missed a [/code] and it refuses to post my thread, lol)
no worries, as i said, I am a newbie, and I am sure many of you could see that I have many redundant or clumsy codes, or if you know some easy/Python-built-in methods that I could've used, please feel free to point them out, and i will thank you in advance

actually before the code, can i just ask something else?:
I want to grab the textual content of http://www.museum.vic.gov.au/dinosaurs/time-immensity.html
now, i tried
Expand|Select|Wrap|Line Numbers
  1. page =urlopen('http://www.museum.vic.gov.au/dinosaurs/time-immensity.html').read()
  2. print re.findall(r'<p>.*p>', page) 
  3.  
so I want it to give me all the content (including some tagged stuff) between the leftmost <p> and rightmost </p>. Now, this works for some URLs, but not the above one, can anyone please suggest why and give me a better regex? thx

This is the museum_crawler.py : please comment on my amateur code thx
Expand|Select|Wrap|Line Numbers
  1. import re, string, os, sys
  2. from nltk_lite.corpora import brown, extract, stopwords
  3. from nltk_lite.wordnet import *
  4. from nltk_lite.probability import FreqDist
  5. from urllib import urlopen
  6.  
  7. docs = {}
  8. pagerank = {}
  9. MAX_DOCS = 3000 
  10. MIN_CHILDLINK_LEN = 8
  11. irresites = ["mvmembers","e-news", "education", "scienceworks", "immigration", "tenders", "http://www.museum.vic.gov.au/about","bfa","search", "ed-online", "whatson","whats_on","privacy", "siteindex", "rights", "disclaimer", "contact", "volunteer"]
  12. mvic = 'http://www.museum.vic.gov.au'
  13. mmel = 'http://melbourne.museum.vic.gov.au'
  14.  
  15. """
  16. only concern the contentual links related to exhibits
  17. """
  18. def purify_irr(link):
  19.     for term in irresites:
  20.         if term in link:
  21.             return False
  22.     return True
  23.  
  24. """
  25. checks if the word has a punctuation char in it
  26. """
  27. def haspunc(word):
  28.     offset = 0
  29.     for char in word:
  30.         if char in string.punctuation:
  31.             return offset
  32.         offset += 1
  33.     return None
  34.  
  35. """
  36. returns the position of the char that is not punctuation
  37. """
  38. def nextnonpunc(word): 
  39.     for i in range(len(word)):
  40.         if word[i] not in string.punctuation:
  41.             return i
  42.     return 100
  43.  
  44. """
  45. don't want URLs that have 'irrelevant' meanings and extensions (types)
  46. """
  47. def valid_page(url):
  48.     link = url.lower()
  49.     if purify_irr(link):
  50.         for ext in ['.gif','.pdf', '.css', '.jpg', '.ram', '.mp3','.exe','.sit','.php']:
  51.             if ext in link:
  52.                 return False
  53.         return True
  54.     return False
  55.  
  56. """
  57. don't want URLs that refer to the central-information sites
  58. """
  59. def informative_page(link, page):
  60.     if 'page not found' not in page.lower() and valid_page(link):
  61.         if not (link in [mvic, mvic+'/', mvic+'/index.asp']) and \
  62.            not (link in [mmel, mmel+'/', mmel+'/index.asp']):
  63.  
  64.             return True
  65.     return False 
  66.  
  67. """
  68. count the number of occurences of 'char' in 'word'
  69. """
  70. def count(word, char):
  71.     count = 0
  72.     for ch in word:
  73.         if ch == char:
  74.             count += 1
  75.     return count
  76.  
  77. """
  78. force an URL to be truncated as 'directory-URL', i.e. ends with '/'
  79. """
  80. def rootify(link):
  81.     root = link
  82.     if link[-1] != '/':
  83.         if count(link, '/') < 3 and 'http' not in link:
  84.             return
  85.         for i in range(len(link)):
  86.             if count(link, '/') < 3:
  87.                 return link+'/'
  88.             if link[-i-1] == '/':
  89.                 root = link[:-i]
  90.                 break;
  91.     else:
  92.         root = link
  93.     return root
  94.  
  95. """
  96. gets the end-string of a URL after the rightmost '/'
  97. tail('aaa/bbb/ccc') => 'ccc'
  98. """
  99. def tail(st):
  100.     if st[-1] == '/':
  101.         st = st[:-1]
  102.     for i in range(len(st)):
  103.         if st[-i] == '/':
  104.             break;
  105.     return st[-i+1:]
  106.  
  107. """
  108. get the content of the page and check for its type
  109. """
  110. def getContent(link):
  111.     try:
  112.         if not valid_page(link):
  113.             return None
  114.         page = urlopen(link).read()
  115.         return page
  116.     except IOError:
  117.         return None
  118.  
  119. """
  120. returns the outlinks (if relative links, then the full URLs of these links) of
  121. the 'link'.
  122. """
  123. def outlinks(link):
  124.     print '=>', link
  125.     givup = False
  126.     temp = []
  127.     r = link
  128.     link = link[link.index('http'):]
  129.     if link[-1] == '"':
  130.         link = link[:-1]
  131.  
  132.     root = rootify(link)
  133.     page = getContent(link)
  134.     if (page == None):
  135.         return None
  136.     if page not in docs.values() and informative_page(link, page):
  137.         docs[link] = page
  138.         temp.append(link)
  139.     outlinks = re.findall(r'href=.+?"', page)
  140.     outlinks = [link for link in outlinks if valid_page(link)]
  141.  
  142.     for link in outlinks:
  143.         com = None
  144.         if 'http' in link:
  145.             link = link[link.index('http'):]
  146.             if link[-1] == '"':
                link = link[:-1]
            if 'museum' in link.lower():
                page = getContent(link)
                if page is None:
                    return None
                if page not in docs.values():
                    if informative_page(link, page):
                        temp.append(link)
                        docs[link] = page

        elif len(link) < MIN_CHILDLINK_LEN:
            continue
        else:
            if link[6:-1][0] == '/':
                rest = link[7:-1]
            else:
                rest = link[6:-1]
            com = rest
            start = nextnonpunc(rest)
            if start == 100:
                continue
            link_option = ''
            rest = rest[start:]
            if '/' in rest and '/' in root:
                child_first_comp = rest[:rest.index('/')]
                parent_last_comp = tail(root)
                if child_first_comp.lower() == parent_last_comp.lower():
                    # If the relative link has an overlapping component with
                    # root (e.g. /CSIRAC/history... and /history should be the
                    # same relative link but resolve to different URLs), keep
                    # a 'backup' link to be urlopened in some cases.
                    link_option = root + rest[rest.index('/') + 1:]
            link = root + rest
            if not givup and 'museum' in link.lower():
                page = getContent(link)
                if page is not None and page not in docs.values():
                    if informative_page(link, page):
                        temp.append(link)
                        docs[link] = page
                else:
                    if link_option != '':
                        page = getContent(link_option)
                        if page is not None and page not in docs.values():
                            if informative_page(link, page):
                                temp.append(link_option)
                                docs[link] = page
    pagerank[root] = temp
    for link in temp:
        print "  --  ", link
    return temp


def crawler(link):
    link = link[link.index('http'):]
    if link[-1] == '"':
        link = link[:-1]
    root = link
    if root[-1] != '/':
        # too few path components to derive a root directory
        if root.count('/') < 3:
            return
        # trim the last path component so root ends with '/'
        for i in range(len(root)):
            if root[-i-1] == '/':
                root = root[:-i]
                break

    page = getContent(link)
    if page is None:
        return
    if page not in docs.values():
        docs[link] = page
    to_scan = [root]
    # breadth-first: scan one layer of pages, collect their outlinks,
    # then make that (deduplicated) set the next layer
    while len(to_scan) > 0 and len(docs) < MAX_DOCS:
        childlinks = []
        for parent in to_scan:
            out = outlinks(parent)
            if out:
                # just in case some links are dead or invalid
                childlinks.extend(out)

        to_scan = list(set(childlinks))

if __name__ == '__main__':
    crawler("http://melbourne.museum.vic.gov.au/exhibitions/")
    fp = open('docs.py', 'a')
    for key in docs:
        fp.write(key + '\n')
    fp.close()
    fp = open('pagerank.py', 'a')
    for key in pagerank:
        fp.write(key + '\n')
        for outlink in pagerank[key]:
            fp.write('    ' + outlink + '\n')
    fp.close()
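To make the breadth-first frontier loop in crawler() easier to follow, here is a minimal self-contained sketch of the same logic run against a made-up in-memory link graph instead of live HTTP fetches (GRAPH, crawl and the example.org URLs below are illustrative only, not part of the crawler above):

```python
# A toy link graph standing in for getContent()/outlinks(); the URLs
# and structure are invented for illustration.
GRAPH = {
    'http://example.org/': ['http://example.org/a', 'http://example.org/b'],
    'http://example.org/a': ['http://example.org/b', 'http://example.org/c'],
    'http://example.org/b': [],
    'http://example.org/c': ['http://example.org/'],
}

def crawl(root, max_docs=10):
    seen = set([root])      # plays the role of the docs dict
    to_scan = [root]        # current breadth-first layer
    while to_scan and len(seen) < max_docs:
        childlinks = []
        for parent in to_scan:
            for link in GRAPH.get(parent, []):
                if link not in seen:    # skip pages already collected
                    seen.add(link)
                    childlinks.append(link)
        to_scan = list(set(childlinks)) # deduplicated next layer
    return seen
```

Each pass of the while loop is one breadth-first layer; deduplicating childlinks with set() keeps the frontier from revisiting pages it has already queued.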
  242.  
Jul 28 '07 #10
ymic8
8
Oh I see: the reason the re.findall business didn't work is that the page source contains \r, \n, and \t characters, which is why the regex missed those links.
Now I'm trying to work out a succinct regex to strip out all those whitespace characters.
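One way to do that (a sketch, assuming the stray \r, \n and \t inside the markup can safely be dropped before matching; the HTML snippet and href pattern below are just illustrative) is to collapse them with re.sub before running re.findall:

```python
import re

# a snippet with a newline and tab inside the href value, as described above
html = '<a href="/exhibitions/\r\n\tindex.asp">Exhibitions</a>'

# drop the \r, \n and \t characters that were breaking the match
cleaned = re.sub(r'[\r\n\t]+', '', html)

links = re.findall(r'href="([^"]+)"', cleaned)
# links is now ['/exhibitions/index.asp']
```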

By the way, my crawler code is not well commented or documented, so if you are having trouble understanding it, I will document it and post it again soon; for now I need to get the other components working.

thank you all
Jul 29 '07 #11
bartonc
6,596 Expert 4TB
why doesn't my thread show up?
Sorry. There's a bug in the way this site displays long code blocks. Putting the code block inside quotes seems to make it worse. Your most recent post seems to have worked out well. Thanks for being persistent with this sometimes finicky site.
Jul 29 '07 #12
