Simple Site Crawler

Mar 3, 2020

I had a small task at work that required me to build a list of internal URLs on a website that had to be properly redirected. Rather than go through thousands of URLs manually, I decided to develop a prototype outside of work that, if successful, I could use for this very task, whilst also making it a personal mini project and open-sourcing it :). Like everything I create, I wanted the user experience to be as simple as possible, but with the ability to tweak it for more unusual requirements. That is why the script can run with one import line and two lines of code, but can also be configured by passing in a few additional parameters if needed.

Over the course of developing it, I realised that having a single thread read a page, find links, move to the next link and repeat the process was incredibly slow. So I made it multi-threaded: a user-defined number of threads share the workload and speed up the process. I also added the ability to pass an argument to the crawl() method so that it returns internal URLs, external URLs, or both. This might be handy for some.
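
To give a feel for the multi-threaded idea, here is a minimal sketch of several worker threads sharing one queue of pages. It is not the code from the repository: the function name crawl_with_threads and the fetch_links callable (which takes a URL and returns the links found on that page) are made up for illustration, and the real script may well organise its threads differently.

```python
import threading
from queue import Queue


def crawl_with_threads(start_url, fetch_links, num_threads=4):
    """Visit every URL reachable from start_url using num_threads workers.

    fetch_links is a hypothetical callable supplied by the caller: it takes
    a URL and returns an iterable of the links found on that page.
    """
    seen = {start_url}            # every URL discovered so far
    seen_lock = threading.Lock()  # protects `seen` across threads
    pending = Queue()             # URLs waiting to be read
    pending.put(start_url)

    def worker():
        while True:
            url = pending.get()
            if url is None:       # sentinel: no more work for this thread
                pending.task_done()
                return
            try:
                links = fetch_links(url)
            except Exception:
                links = []        # a page that fails to load adds no links
            for link in links:
                with seen_lock:
                    if link in seen:
                        continue
                    seen.add(link)
                pending.put(link)  # queued before task_done(), so join() waits for it
            pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    pending.join()                # returns once every queued URL has been processed
    for _ in threads:
        pending.put(None)         # one stop signal per worker
    for thread in threads:
        thread.join()
    return sorted(seen)
```

Each URL is added to the shared set under a lock before it is queued, so no page is read twice however many threads are running.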

Below is how to use it:

from SiteUrlCrawler import SiteUrlCrawler

crawler = SiteUrlCrawler("https://strongscot.com")

for url in crawler.crawl():
    print("Found: " + url)

Simple, huh? The crawl() method returns a Python list of all the URLs it found.

If you only want to find specific types of URLs on your chosen site, you can pass one of the following modes to the crawl() method:

  • SiteUrlCrawler.Mode.INTERNAL: Find only internal URLs
  • SiteUrlCrawler.Mode.EXTERNAL: Find only external URLs
  • SiteUrlCrawler.Mode.ALL: Find all URLs

Example:

for url in crawler.crawl(SiteUrlCrawler.Mode.INTERNAL):
    print("Found: " + url)
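
For anyone wondering how a link might be sorted into internal or external, here is a rough sketch using only the standard library. The helper name classify_links is hypothetical, and treating "same host as the start URL" as internal is an assumption on my part; the actual script may use a different rule (for example, counting subdomains as internal).

```python
from urllib.parse import urljoin, urlparse


def classify_links(base_url, links):
    """Split links into (internal, external) lists relative to base_url.

    Assumption: a link is internal when its host matches the host of
    base_url exactly. Non-HTTP links such as mailto: are skipped.
    """
    base_host = urlparse(base_url).netloc.lower()
    internal, external = [], []
    for link in links:
        absolute = urljoin(base_url, link)  # resolve relative links like /about
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue                        # ignore mailto:, javascript:, etc.
        if parsed.netloc.lower() == base_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external
```

With a split like this, the INTERNAL mode would return the first list, EXTERNAL the second, and ALL both together.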

Note: The one major thing the script won’t do is crawl external URLs, as this would essentially make it a small Google, and we don’t want that. If you did want that, though, it would be easy enough to tweak the source.

Source: https://github.com/strongscot/simple-python-url-crawler