Ideas for a Better or More Optimized Web Crawler? Code Below.

```python
import argparse
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

# sets of unique links found so far
internal_urls = set()
external_urls = set()

total_urls_visited = 0


def is_valid(url):
    """
    Checks whether `url` is a valid URL (has both a scheme and a host).
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)


def get_all_website_links(url):
    """
    Returns all URLs found on `url` that belong to the same website.
    """
    # all URLs of `url`
    urls = set()
    # domain name of the URL without the protocol
    domain_name = urlparse(url).netloc
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for a_tag in soup.find_all("a"):
        href = a_tag.attrs.get("href")
        if href == "" or href is None:
            # href empty tag
            continue
        # join the URL if it's relative (not absolute link)
        href = urljoin(url, href)
        if not is_valid(href):
            # not a valid URL; checked before rebuilding the string below,
            # otherwise "mailto:" links would be turned into "mailto://..."
            continue
        parsed_href = urlparse(href)
        # remove URL GET parameters, URL fragments, etc.
        href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
        if href in internal_urls:
            # already in the set
            continue
        # compare against the host only, so a domain name that merely
        # appears in the path of another site is not treated as internal
        if domain_name not in parsed_href.netloc:
            # external link
            if href not in external_urls:
                print(f"External link: {href}")
                external_urls.add(href)
            continue
        print(f"Internal link: {href}")
        urls.add(href)
        internal_urls.add(href)
    return urls


def crawl(url, max_urls=30):
    """
    Crawls a web page and extracts all links.
    You'll find all links in `external_urls` and `internal_urls` global set variables.
    params:
        max_urls (int): number of max urls to crawl, default is 30.
    """
    global total_urls_visited
    total_urls_visited += 1
    links = get_all_website_links(url)
    for link in links:
        # `>=` so that no more than `max_urls` pages are fetched
        if total_urls_visited >= max_urls:
            break
        crawl(link, max_urls=max_urls)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Link Extractor Tool with Python")
    parser.add_argument("url", help="The URL to extract links from.")
    parser.add_argument("-m", "--max-urls", help="Number of max URLs to crawl, default is 30.", default=30, type=int)

    args = parser.parse_args()
    url = args.url
    max_urls = args.max_urls

    crawl(url, max_urls=max_urls)

    print("Total Internal links:", len(internal_urls))
    print("Total External links:", len(external_urls))
    print("Total URLs:", len(external_urls) + len(internal_urls))

    domain_name = urlparse(url).netloc

    # save the internal links to a file
    with open(f"{domain_name}_internal_links.txt", "w") as f:
        for internal_link in internal_urls:
            print(internal_link.strip(), file=f)

    # save the external links to a file
    with open(f"{domain_name}_external_links.txt", "w") as f:
        for external_link in external_urls:
            print(external_link.strip(), file=f)
```
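One direction that might be worth trying is to replace the recursive `crawl` with an explicit queue, so that deep sites cannot hit Python's recursion limit and the state lives in local variables instead of globals. The following is only a sketch under assumptions: `crawl_iterative` and the `fetch_links` callable are hypothetical names that are not part of the code above, and hosts are compared by exact match, which (unlike a substring test) treats subdomains as external.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl_iterative(start_url, fetch_links, max_urls=30):
    """
    Breadth-first crawl using a queue instead of recursion.

    `fetch_links` is a hypothetical callable supplied by the caller:
    given a page URL, it returns an iterable of raw href strings.
    Returns (internal_urls, external_urls, pages_visited).
    """
    root_host = urlparse(start_url).netloc
    queue = deque([start_url])
    queued = {start_url}          # pages already scheduled for a visit
    internal, external = set(), set()
    visited = 0

    while queue and visited < max_urls:
        page = queue.popleft()
        visited += 1
        for href in fetch_links(page):
            if not href:
                continue
            parsed = urlparse(urljoin(page, href))
            # skip links without a scheme or host (e.g. "mailto:")
            if not (parsed.scheme and parsed.netloc):
                continue
            # drop query string and fragment, as in the original code
            link = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
            if parsed.netloc != root_host:
                external.add(link)
                continue
            internal.add(link)
            if link not in queued:
                queued.add(link)
                queue.append(link)
    return internal, external, visited
```

To use it against a real site, `fetch_links` could wrap the same `requests` and `BeautifulSoup` calls as `get_all_website_links`, ideally through one shared `requests.Session()` and with a `timeout` passed to `get` so that a slow server cannot stall the whole crawl. Keeping the fetching separate also makes the traversal easy to try against an in-memory dictionary of pages.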
Jul 13 '20 #1
