Web scraping using Python

Introduction to web scraping

Web scraping is defined as the process of finding web documents and extracting usable information from them. Web scraping is different from web crawling: web crawling is the method of iteratively fetching links starting from a basic seed URL, and web scraping can be viewed as a subset of that process. We shall consider this distinction in detail in this article.

Let us talk about crawling first. Crawling is what a search engine does: it visits pages across the web, indexes the information it finds, and returns relevant results to the user. Web scraping, on the other hand, targets specific websites in order to extract specific data for the project or application at hand. For example, if we want to know the prices of dishes on Zomato’s website, we can write a web scraper to obtain those prices. Web scraping is often considered an art: since creating web pages is itself an art, every page has a different layout, and extracting information from the majority of websites with highly efficient code takes genuine craft.

A website declares which of its pages crawlers may visit in a file called robots.txt, and a well-behaved search-engine crawler respects those rules. Web scrapers, on the other hand, often have no regard for the robots.txt file. They get away with this because the scraping program typically presents itself as an ordinary browser (a consumer looking for something), and is therefore in disguise.

Have you ever wondered why you are asked to click on images, solve captchas, and so on? It is partly because web scrapers can fill in forms just as easily as they can extract information. Some scrapers even execute JavaScript to further enhance their disguise as a human user.

Let us look at some of the key differences between data scraping and data crawling. 

| Data Scraping | Data Crawling |
| --- | --- |
| Involves extracting data from various sources, including the web | Refers to downloading pages from the web |
| Can be done at any scale | Mostly done at a large scale |
| Deduplication is not necessarily a part | Deduplication is an essential part |
| Needs a crawl agent and a parser | Needs only a crawl agent |

Applications of web scraping

Collecting Data for Market Research

Web scraping tools help in analysing market and industry trends and can support decision-making for the company. Google Analytics is a business built on providing detailed insights through the use of advanced proprietary technology.

Extract Contact Information

Although it might be illegal, many people automate the task of obtaining leads for their businesses by building scraping agents. There are various loopholes in the system, and programmers are the first ones to capitalise on them.

Download Solutions from StackOverflow

A web scraping tool can also be used to search for answers to queries on websites like StackOverflow, Wikipedia, etc., so we can get more data in less time. Running a summarisation algorithm on the scraped data might result in the best answering machine ever made.

Looking for Jobs or Candidates

Imagine you got a list of jobs that contained all the keywords you are looking for: machine learning, computer vision, natural language processing, big data, etc. A personalised job search across multiple websites is just a click away.

Track Prices from Multiple Markets

As we considered the example of Zomato earlier, let us build on that. We want to spend the least. What do we do? Compare Zomato, Swiggy, Uber Eats, and many other food delivery platforms. Imagine you could track the prices from all of these websites at once. Trivago built an entire business on exactly this idea by comparing hotel prices across multiple websites. That is the power of the internet.

Some popular tools for web scraping are Beautiful Soup, Scrapy, Selenium, and the requests library.

Web Scraping with Python

Many kinds of data files can be used as input for a machine learning or deep learning implementation in Python. The following are some examples of these data source files (a short loading sketch follows the list):

  • MS Excel file
  • Comma-separated values file [CSV]
  • Text file
  • Website
  • JavaScript Object Notation [JSON]
  • Images [JPEG, PNG, etc.]
  • Hierarchical Data Format [HDF5]
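As a quick illustration, several of these formats can be loaded with one-line readers from the pandas library. This is a minimal sketch, assuming pandas is installed (plus openpyxl for Excel and PyTables for HDF5); every file name below is a placeholder.

```python
import pandas as pd

# Each reader returns a DataFrame; all the file names are hypothetical.
df_excel = pd.read_excel("data.xlsx")         # MS Excel file
df_csv = pd.read_csv("data.csv")              # comma-separated values
df_text = pd.read_csv("data.txt", sep="\t")   # text file, tab-separated
df_json = pd.read_json("data.json")           # JSON
df_hdf = pd.read_hdf("data.h5", key="table")  # HDF5 (key is hypothetical)
```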

From the above list, let us pick up the topic of extracting data from a website. The following are the approaches we can use to extract data from a live, hosted website (a minimal illustration follows the list):

  • API: Get the data from the website’s API if one is accessible. For example, Facebook exposes the Facebook Graph API, which gives programmatic access to data on Facebook.
  • Web scraping: Get access to the HTML of the webpage. Trace the class where the data is present. Pass this information to the web scraping function to extract the data present on the website.
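Here is a minimal sketch of the second approach, assuming the requests and beautifulsoup4 packages are installed; the URL and the class name article-body are placeholders for the real page and the class you trace in its HTML.

```python
import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML of the page (placeholder URL)
html = requests.get("https://example.com").text

# Parse the HTML and extract every element carrying the traced class
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all(class_="article-body"):  # hypothetical class name
    print(tag.get_text(strip=True))
```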

Python has many libraries and methods that can perform web scraping. The following sections explain these concepts through two separate examples, using different Python utilities and approaches.

Web scraping example 1

 Let us understand web scraping using the following example:

Business need: A social media analytics company wants to find differences between a number of presidents based on the speeches delivered by each.

Problem: The team has to get their hands on the speech transcripts available on one of the websites [https://millercenter.org].

Solution: Web scraping can be employed here to pinpoint all the URLs where the speech transcripts are present. 

Algorithm (a code sketch follows these steps):

  • Load all the URLs into a list “URL_Links”
  • Define a function which will perform web scraping
  • Import the requests utility to get the contents of the requested URL as text
  • Import the Beautiful Soup utility, a library for pulling data out of HTML and XML files; it works with a parser to provide a simple way of navigating, searching, and modifying the parse tree
  • From the downloaded data, mention the HTML class where the transcript is present
  • Print the URL to confirm that the data has been read and loaded
  • Return the extracted text from the function
  • Call the function by passing all the URLs one by one
  • Define a list of who these transcripts belong to, in the same order as the URLs in the URL_Links list
  • Import the pickle module to store the downloaded transcripts on the local machine as text files
  • Open the text files in Python
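Putting these steps together gives a sketch like the one below. The class name transcript-inner and the speaker label are assumptions made for illustration; record the real class using the inspect-element steps described next.

```python
import pickle
import requests
from bs4 import BeautifulSoup

# Load all the URLs into a list "URL_Links" (one shown; add the rest)
URL_Links = [
    "https://millercenter.org/the-presidency/presidential-speeches/january-8-2020-statement-iran",
]

def url_to_transcript(url):
    """Scrape the transcript text out of a speech page."""
    page = requests.get(url).text
    soup = BeautifulSoup(page, "html.parser")
    # "transcript-inner" is an assumed class name; replace it with the
    # class you record while inspecting the page
    block = soup.find(class_="transcript-inner")
    text = [p.text for p in block.find_all("p")]
    print(url)  # confirm that the data has been read and loaded
    return text

transcripts = [url_to_transcript(u) for u in URL_Links]

# Who the transcripts belong to, in the same order as URL_Links (assumed)
presidents = ["trump"]

# Store the downloaded transcripts on the local machine with pickle
for name, transcript in zip(presidents, transcripts):
    with open(name + ".txt", "wb") as file:
        pickle.dump(transcript, file)
```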

The transcript lives inside a specific HTML class, which is what the above function references to locate and scrape the text. To find that class, follow the steps below:

  • Visit link: https://millercenter.org/the-presidency/presidential-speeches/january-8-2020-statement-iran
  • Right-click on the page and choose the Inspect option. An alternative method is to press Ctrl+Shift+C.
  • On the right-hand side, you will find the Elements panel
  • Hover your mouse over the web page [left side] to highlight the corresponding class structure in the Elements window [right side]
  • Record the appropriate class so you can pass it to the above web scraping function

Once the data is loaded in Python, the team can use NLP and deep learning to perform word checks, sentiment analysis, word cloud displays, words-per-minute counts for each president, and analysis of each president’s ideology and beliefs.

Web scraping example 2

Let us understand web scraping using the following example: 

Algorithm (a code sketch follows these steps):

  • Load all the URLs
  • Get the data from the URL using the ‘urllib’ library
  • Download the complete data from the URL using the ‘Article’ utility
  • The parse() function splits the given sequence of characters (text) into smaller parts based on rules
  • Display the data summary
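A minimal sketch of this approach, assuming the newspaper3k package (which provides the Article utility) is installed; the URL is reused from example 1 purely for illustration, and the nlp() step also needs NLTK’s punkt tokenizer to be downloaded.

```python
from newspaper import Article  # pip install newspaper3k

url = ("https://millercenter.org/the-presidency/"
       "presidential-speeches/january-8-2020-statement-iran")

article = Article(url)
article.download()  # download the complete data from the URL
article.parse()     # split the raw page into title, authors, text, etc.

print(article.title)
print(article.text[:500])  # first 500 characters of the extracted text

# Build keywords and a short summary of the scraped text
article.nlp()
print(article.summary)
```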

Once the data is loaded in Python, the team can apply the same NLP and deep learning analyses described in example 1.

Conclusion

There are many ways to extract information from live, hosted websites, especially when no ready-made web API is available for getting the logs or data. In such cases, you can use web scraping to download the web data onto your local machine using Python.

If you want to dive deep into the concepts and applications of Python in various domains, upskill with Great Learning’s programs in Machine Learning and Data Science.

About the author:

Krishnav Dave is a certified data scientist with 7+ years of industry experience. He specialises in implementing artificial intelligence across the development, testing, operations, and service domains.

Great Learning Team
Great Learning's Blog covers the latest developments and innovations in technology that can be leveraged to build rewarding careers. You'll find career guides, tech tutorials and industry news to keep yourself updated with the fast-changing world of tech and business.
