Introduction to web scraping
Web scraping is the process of finding web documents and extracting usable information from them. It is different from web crawling, which is the method of iteratively fetching links starting from a seed URL. The two are closely related, and a scraper often relies on crawling to reach the pages it needs. We shall consider both in detail in this article.
Let us talk about crawling first. Crawling is what a search engine does: it visits pages across the web, indexes them, and lets users search for particular information. Web scraping, on the other hand, targets specific websites to extract specific data related to the project or application at hand. For example, if we want to know the prices of food on Zomato’s website, we can write a web scraper to obtain them. Web scraping is considered an art because creating web pages is an art: every web page has a different layout. Extracting information from the majority of websites with efficient code is definitely an art, isn’t it?
A website states which of its pages automated crawlers may or may not visit in a file called robots.txt. A search engine’s crawler usually honours these rules and skips the disallowed pages. Many web scrapers, on the other hand, pay no regard to the robots.txt file. Scrapers often go unnoticed because their requests come from an ordinary computer that identifies itself as a browser (a consumer looking for something), and are therefore in disguise.
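For a scraper that does want to respect robots.txt, the check can be automated. The following is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt rules, the `example.com` URLs and the `my-scraper` agent name are made up for illustration; a real file would be fetched from the site itself.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration only. A real file lives at
# https://<site>/robots.txt and is normally loaded with set_url() + read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def build_parser(robots_text):
    """Return a parser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

def is_allowed(parser, url, user_agent="my-scraper"):
    """True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)

parser = build_parser(ROBOTS_TXT)
print(is_allowed(parser, "https://example.com/menu/prices"))
print(is_allowed(parser, "https://example.com/private/admin"))
```

Calling `is_allowed` before each request is one simple way to keep a scraper within the limits a site has published.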
Let us look at some of the key differences between data scraping and data crawling.
|Data Scraping|Data Crawling|
|Involves extracting data from various sources, including the web|Refers to downloading pages from the web|
|Can be done at any scale|Mostly done at a large scale|
|Deduplication is not necessarily a part|Deduplication is an essential part|
|Needs crawl agent and parser|Needs only crawl agent|
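To make the crawling side of the table concrete, here is a small sketch of a crawl loop that starts from a seed URL and deduplicates links with a set. The "web" is an in-memory dictionary of made-up `example.com` URLs, and `crawl` and `toy_links` are hypothetical names; a real crawler would download each page and parse its links instead.

```python
from collections import deque

# Toy "web": each URL maps to the links found on that page.
# These URLs are made up; a real crawler would fetch and parse each page.
TOY_WEB = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/"],
    "https://example.com/b": ["https://example.com/c"],
    "https://example.com/c": [],
}

def toy_links(url):
    """Stand-in for 'fetch the page and return the links it contains'."""
    return TOY_WEB.get(url, [])

def crawl(seed, get_links, max_pages=100):
    """Breadth-first crawl from the seed, visiting each URL at most once."""
    seen = {seed}             # deduplication: URLs already queued or visited
    frontier = deque([seed])  # URLs waiting to be visited
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

visited = crawl("https://example.com/", toy_links)
print(visited)
```

Without the `seen` set, the link from page `a` back to the seed would send the crawler round in circles, which is why deduplication is essential at crawl scale.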
Applications of web scraping
Collecting Data for Market Research
Web scraping tools help in analysing market and industry trends, and can support decision-making for the company. Google Analytics is a business built on providing detailed insights through the use of advanced proprietary technology.
Extract Contact Information
Although it might be illegal, many people automate the task of obtaining leads for their businesses by building scraping agents. There are various loopholes in the system, and programmers are the first ones to capitalise on them.
Download Solutions from StackOverflow
A web scraping tool can also be used to search for answers on websites like StackOverflow, Wikipedia etc., so we can get more data in less time. Using a summarisation algorithm on the scraped data may result in the best answering machine ever made.
Looking for Jobs or Candidates
Imagine you got a list of jobs that contained all the keywords you are looking for: machine learning, computer vision, natural language processing, big data etc. A personalised job search across multiple websites is just a click away.
Track Prices from Multiple Markets
Let us build on the Zomato example we considered earlier. We want to spend the least, so what do we do? We compare Zomato, Swiggy, Uber Eats, and many other food delivery platforms. Imagine you could track all the prices from multiple websites. Trivago, for instance, compares prices from multiple websites: a business built on such a simple idea. That is the power of the internet.
Some popular tools for web scraping are:
Web Scraping with Python
- MS Excel file
- Comma-separated values file [csv]
- Text file
- Images [jpeg, jpg, png, dcim etc.]
- Hierarchical Data Format [HDF5]
Of the possible data sources, let us focus on extracting data from a website. The following are the approaches by which we can extract data from a live hosted website:
- API: Get the data from the API of the website, if one is accessible. For example, Facebook has the Facebook Graph API, which provides programmatic access to data on Facebook.
- Web scraping: Get access to the HTML of the webpage. Find the class of the element where the data is present. Pass this class to the web scraping function to extract the data present on the website.
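To illustrate the first approach: an API usually returns structured data such as JSON, so no HTML class has to be traced. The sketch below only builds a request URL and parses a response body; it does not contact any server. The endpoint `https://api.example.com/v1/restaurants`, its `city` and `limit` parameters, and the sample payload are all hypothetical and do not describe any real API.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameters, used only for illustration.
BASE_URL = "https://api.example.com/v1/restaurants"

def build_request_url(city, limit):
    """Compose the request URL with URL-encoded query parameters."""
    return BASE_URL + "?" + urlencode({"city": city, "limit": limit})

def extract_prices(json_text):
    """Map each item name to its price from a JSON response body."""
    payload = json.loads(json_text)
    return {item["name"]: item["price"] for item in payload["items"]}

# Made-up response body standing in for what such an API might return.
SAMPLE_RESPONSE = (
    '{"items": [{"name": "Dosa", "price": 80},'
    ' {"name": "Thali", "price": 150}]}'
)

print(build_request_url("New Delhi", 2))
print(extract_prices(SAMPLE_RESPONSE))
```

When such an API exists, it is generally the more stable choice, because a change in page layout does not break the client the way it can break a scraper.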
Python has many libraries that can perform web scraping. The rest of this blog explains these concepts using two separate examples, each taking a different approach with different Python utilities.
Web scraping example 1
Let us understand web scraping using the following example:
Business need: A social media analytics company wants to compare several presidents based on the speeches delivered by each.
Problem: The team has to get their hands on the speech transcripts available on one of the websites. [https://millercenter.org]
Solution: Web scraping can be employed here to pinpoint all the URLs where the speech transcripts are present.
- Load all the URLs into a list “URL_Links”
- Define a function which will perform the web scraping
- Import the requests library to get the contents of the requested URL as text
- Import the Beautiful Soup library, which pulls data out of HTML and XML files. It works with a parser to provide a simple way of navigating, searching, and modifying the parse tree.
- From the downloaded data, specify the HTML class where the transcript is present
- Print the URL to confirm that the data has been read and loaded
- Return the text as the value of the function
- Call the function by passing all the URLs one by one
- Define a list of the presidents these transcripts belong to, in the same order as the URLs in the URL_Links list
- Import the pickle module to store the downloaded transcripts as a file on the local machine
- Open the stored file in Python
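The steps above can be sketched as follows. This is not the original code: to keep the example runnable without extra installs, it swaps requests and Beautiful Soup for the standard library's `urllib.request` and `html.parser` (with Beautiful Soup, the extraction step would be roughly a `find(class_=...)` call followed by `get_text()`). The class name `transcript-inner`, the helper names and the sample page are assumptions for illustration; the real class has to be read from the page as described in the inspection steps.

```python
import os
import pickle
import tempfile
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element carrying a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0    # > 0 while we are inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.css_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

def fetch_html(url):
    """Download a page as text (stand-in for requests.get(url).text)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")

def url_to_transcript(url, css_class, fetch=fetch_html):
    """Return the text found inside elements of class css_class at url."""
    extractor = ClassTextExtractor(css_class)
    extractor.feed(fetch(url))
    print(url)  # confirm that the page was read and loaded
    return " ".join(extractor.chunks)

def save_transcripts(names, transcripts, path):
    """Pickle a {name: transcript} dictionary to the local machine."""
    with open(path, "wb") as handle:
        pickle.dump(dict(zip(names, transcripts)), handle)

def load_transcripts(path):
    """Read the pickled dictionary back into Python."""
    with open(path, "rb") as handle:
        return pickle.load(handle)

URL_Links = [
    "https://millercenter.org/the-presidency/presidential-speeches/"
    "january-8-2020-statement-iran",
]
presidents = ["Trump"]  # same order as URL_Links

# Offline demonstration: a made-up page is returned instead of downloading.
# Drop the fetch argument to download the real pages with fetch_html.
SAMPLE_PAGE = ('<html><body><div class="nav">Menu</div>'
               '<div class="transcript-inner"><p>First paragraph.</p>'
               '<p>Second paragraph.</p></div></body></html>')
transcripts = [
    url_to_transcript(url, "transcript-inner", fetch=lambda _: SAMPLE_PAGE)
    for url in URL_Links
]
path = os.path.join(tempfile.mkdtemp(), "transcripts.pkl")
save_transcripts(presidents, transcripts, path)
print(load_transcripts(path))
```

The simple depth counter assumes reasonably well-formed HTML with matching closing tags; a full parser such as Beautiful Soup is more forgiving of messy pages.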
The screenshot below shows the class that holds the data [transcript]; this class is referenced in the above function to locate and scrape the text.
To get this view, follow the steps below:
- Visit link: https://millercenter.org/the-presidency/presidential-speeches/january-8-2020-statement-iran
- Right-click on the page and choose the Inspect option. An alternative method is to press Ctrl+Shift+C.
- On the right-hand side, you will find the Elements panel
- Hover your mouse over the web page [left side] to highlight the corresponding class structure in the Elements window [right side]
- Record the appropriate class to pass it to the above web scraping function
Once the data is loaded in Python, the team can use NLP and deep learning to perform word checks, sentiment analysis, and word cloud display, and to study the words spoken per minute and the ideology and beliefs of each president.
Web scraping example 2
Let us understand web scraping using the following example:
- Load all the URLs
- Get the data from the URL using the ‘urllib’ library
- Download the complete data from the URL using the ‘Article’ utility
- The parse() function splits the given sequence of characters or values (text) into smaller parts based on rules
- Display the data summary
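The steps above can be approximated with the standard library alone. The sketch below does not reproduce the ‘Article’ utility, whose library the text does not name. Instead, `ArticleParser`, `fetch_html` and `summarise` are hypothetical stand-ins built on `urllib.request` and `html.parser`, and the "summary" is simply the title, paragraph count, word count and opening text of the page. The sample page is made up.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

class ArticleParser(HTMLParser):
    """Collect the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._current = None  # "title", "p", or None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._current = tag
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == self._current:
            # Join the collected pieces and normalise whitespace.
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            elif text:
                self.paragraphs.append(text)
            self._current = None

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

def fetch_html(url):
    """Download a page as text using urllib."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")

def summarise(html):
    """Parse the page and return a small summary of what was found."""
    parser = ArticleParser()
    parser.feed(html)
    text = " ".join(parser.paragraphs)
    return {
        "title": parser.title,
        "paragraphs": len(parser.paragraphs),
        "words": len(text.split()),
        "preview": text[:60],
    }

# Made-up page so the example runs offline; for a live page, pass
# fetch_html(url) to summarise() instead.
SAMPLE_PAGE = ("<html><head><title>Sample article</title></head><body>"
               "<p>Web scraping extracts data.</p>"
               "<p>It works on <b>public</b> pages.</p></body></html>")
summary = summarise(SAMPLE_PAGE)
print(summary)
```

A dedicated article-extraction library does considerably more than this, such as stripping navigation and advertisements, so the sketch is only meant to show the shape of the download, parse and summarise steps.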
Web scraped data summary
Here’s a screenshot of the source web page from which the data needs to be downloaded
As in the first example, once the data is loaded in Python, the team can use NLP and deep learning to perform word checks, sentiment analysis, and word cloud display, and to study the words spoken per minute and the ideology and beliefs of each president.
There are many ways to extract information from live hosted websites. Web scraping is most useful when a ready-made web API for getting the logs or data is not available. In that case, you can use web scraping to download the web data onto your local machine using Python.
About the author:
Krishnav Dave is a certified data scientist with 7+ years of industry experience. He specialises in applying artificial intelligence to the development, testing, operations and service domains.