How I created my first Web Crawler!
Updated: Oct 16, 2022
What is a web crawler?
A web crawler is a bot that systematically browses the internet, indexing and downloading the contents of websites for scraping. Web crawlers are also called web spiders or crawling bots. A web crawler needs to be provided with a list of initial websites to start from; it indexes those pages and then follows the links found on them to discover new pages.
The Library Analogy
To give an analogy, let’s consider all the websites on the internet as books in a library. A web crawler is a librarian whose job is to enter each book’s information in a catalog so that it is easy to find the books when required. To organize the books, the librarian stores the title, description, and category of each book in the catalog. A web crawler does the same thing for web pages. The goal of a web crawler is accomplished when it has indexed all the pages on the internet, something that is practically impossible to achieve!
Creating a Web Crawler
In this blog, I will be coding in Python. There are a couple of web crawling and web scraping frameworks available in Python. I will be using Scrapy.
Installing Scrapy:
$ pip install scrapy
1. Create a Python application using Scrapy
To create a Scrapy project, run the following command. Here the name of my application is my_first_web_crawler:
$ scrapy startproject my_first_web_crawler
This will generate Scrapy boilerplate code and a folder structure that should look like this:
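The generated layout (from Scrapy's standard project template) typically looks like this:

```
my_first_web_crawler/
├── scrapy.cfg                  # deploy configuration
└── my_first_web_crawler/
    ├── __init__.py
    ├── items.py                # item definitions
    ├── middlewares.py          # spider/downloader middlewares
    ├── pipelines.py            # item pipelines
    ├── settings.py             # project settings
    └── spiders/                # your spiders live here
        └── __init__.py
```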
2. Creating a Web Crawler
The folder named spiders contains the files that Scrapy uses to crawl websites. I will create a file named spider1.py in this directory and write the following lines of code:
You can find the above code here: https://github.com/gouravdhar/my-first-web-crawler/blob/main/test_spider.py
I have provided the URLs of the web pages I will be crawling; these pages contain links to my blogs. You can provide any number of URLs, since they are passed as a list.
The above code crawls the web pages provided in the list and downloads them.
To execute the code, run the following command:
scrapy crawl <your-spider-name>
My spider name is blogs (defined in the name attribute of the spider class).
And tada!!! The pages behind those links have been downloaded into the project folder.
But that’s not enough. I want to also download the data of the links this page points to. For that, I have to scrape all the links present on the main page and crawl through them. I will be using the Scrapy shell to write and test the scraping code.
Note: The Scrapy shell is an interactive shell where you can try out and debug scraping code very quickly.
To start the Scrapy shell, just write:
$ scrapy shell 'https://gourav-dhar.com'
i.e. scrapy shell followed by the URL.
Once the shell is opened, type response to confirm that the fetch succeeded; you should see a response object with status 200.
The outgoing links are generally located in the href attribute of <a> tags in the HTML. I need to scrape all these values, so I will run this to see the output:
>>> response.css('a::attr(href)')
This returns a list of selectors for the href attributes on the page. To get a clean list of only the links, use the getall() function:
>>> response.css('a::attr(href)').getall()
The result is a plain list of all the href values found on the page.
To download all the pages in this list, I will modify the parse function in the spider code to extract the links using the above command and follow them. The modified parse function looks something like this:
The GitHub link for the project can be found here: https://github.com/gouravdhar/my-first-web-crawler
Now run the following command in the terminal again:
$ scrapy crawl blogs
And I was able to download the content of all the links my homepage points to. This approach can be extended indefinitely: by following links recursively, you could in principle crawl through every reachable website on the internet.
Summarising Web Crawler
A web crawler is a powerful tool for storing and indexing the contents of web pages, and its applications are numerous.
Note: You can also control which bots are allowed to crawl your site by listing allowed/disallowed paths and user agents in your site’s robots.txt file.
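For example, a site’s robots.txt might allow most crawlers while blocking a specific bot (the paths and bot name here are illustrative):

```
# Allow all crawlers, but keep them out of /private/
User-agent: *
Disallow: /private/

# Block one specific crawler entirely
User-agent: BadBot
Disallow: /
```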
Search engines use web crawling to index and store meta titles and descriptions in their databases so they can quickly show results for the queries a user enters. Examples of major search engines are Google, Bing, Yahoo, and DuckDuckGo. Each search engine also adds its own recommendation system on top of these results, which is what makes their algorithms different.
Web crawling is also used for copyright/plagiarism violation detection. Web analytics and data mining are other major applications. It can also be used to detect web malware, such as phishing sites. Suppose you own facebook.com: you could crawl the internet to check whether anyone else is running a look-alike website that could be used for phishing attacks.
This blog was originally published on Gourav’s personal blog: https://gourav-dhar.com