scrapy tutorial
  1. Web Scraper
  2. Web Crawler
  3. Scrapy
  4. Scrapy Installation
  5. Scrapy Packages
  6. Scrapy File Structure
  7. Scrapy Command Line Tool
  8. Global Commands
  9. Project-only Commands
  10. Spiders
  11. Selectors
  12. Items
  13. Working with Item Objects
  14. Item Loaders
  15. Scrapy Shell
  16. Item Pipeline
  17. Feed Exporters
  18. Requests and Responses
  19. Link Extractors
  20. Settings
  21. Exceptions

Web Scraper

A web scraper is a tool that extracts data from a website.

Scraping involves the following steps:

  1. Identify the target website.
  2. Get the URLs of the pages from which the data needs to be extracted.
  3. Obtain the HTML/CSS/JS of those pages.
  4. Find the locators (XPath expressions, CSS selectors or regular expressions) of the data that needs to be extracted.
  5. Save the data in a structured format such as a JSON or CSV file.
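
The steps above can be sketched with the Python standard library alone. This is only an illustration: the HTML string stands in for a downloaded page, and the tag and class names are made up for the example.

```python
import json
from html.parser import HTMLParser

# Hypothetical page content standing in for step 3 (the downloaded HTML).
HTML = """
<html><body>
  <h2 class="title">First movie</h2>
  <h2 class="title">Second movie</h2>
</body></html>
"""


class TitleParser(HTMLParser):
    """Collects the text of every <h2 class="title"> element (step 4)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "h2" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())


parser = TitleParser()
parser.feed(HTML)

# Step 5: serialise the extracted records in a structured format (JSON).
records = [{"title": title} for title in parser.titles]
output = json.dumps(records)
print(output)
```

Scrapy replaces the hand-written parser with selectors and the manual JSON step with feed exports, which are covered later in this tutorial.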

Web Crawler

A web crawler collects the URLs of a website and of the pages that it links to. A search engine's crawler records (or copies) the pages it finds and stores them on the search engine's servers as a search index. The search engine then uses this index to find and rank the pages, and the pages are displayed to the user according to that ranking.

The web crawler can also be called a web spider, spider bot, crawler or web bot.

Scrapy

Scrapy does the work of both a web crawler and a web scraper. This makes it handy for crawling a site, extracting its data and storing the data in a structured format. Scrapy can also extract data through APIs.

Scrapy provides:

  1. Selectors based on XPath and CSS expressions, with regex support, for selecting and extracting data.
  2. Scrapy shell, an interactive console in which we can try out scraping code without running the entire spider. It can be used to write, debug or check the code before the final spider is run.
  3. Facility to export the data in structured formats such as:
    • JSON
    • JSON Lines
    • CSV
    • XML
    • Pickle
    • Marshal
  4. Facility to store the extracted data in:
    • Local filesystems
    • FTP
    • S3
    • Google Cloud Storage
    • Standard output
  5. Facility to use the API or signals (functions that are called when a certain event occurs)
  6. Facility to handle:
    • HTTP features
    • User-agent spoofing
    • Robots.txt
    • Crawl depth restriction
  7. Telnet console – a Python console that runs inside the Scrapy process to introspect and debug the crawler.
  8. And more

Scrapy Installation

Scrapy can be installed by: 

Using Anaconda / Miniconda.

Type the following command in the Conda shell:

conda install -c conda-forge scrapy

Alternatively, you could do the following.

pip install Scrapy

Scrapy Packages

  1. lxml  – XML and HTML parser
  2. parsel – HTML/XML data extraction library that is built on top of lxml
  3. w3lib – Multi-purpose helper for dealing with URLs and web page encodings
  4. twisted – asynchronous networking framework
  5. cryptography and pyOpenSSL –  for network-level security needs.

Scrapy File Structure

A Scrapy project has two parts.

  1. Configuration file – scrapy.cfg, an ini-style file that holds the settings for the project. The directory that contains it is the project root directory. Scrapy looks for the cfg file in the following places:
  • System-wide – /etc/scrapy.cfg or c:\scrapy\scrapy.cfg
  • Global (user-wide) – ~/.config/scrapy.cfg ($XDG_CONFIG_HOME) and ~/.scrapy.cfg ($HOME)
  • Scrapy project root – scrapy.cfg

Settings from these files are merged with the following precedence, highest first:

  • Project-wide settings
  • User-defined values
  • System-wide defaults

Environment variables through which Scrapy can be controlled are:

  • SCRAPY_SETTINGS_MODULE
  • SCRAPY_PROJECT
  • SCRAPY_PYTHON_SHELL
  2. A project folder – It contains the following files:
  • __init__.py
  • items.py
  • middlewares.py
  • pipelines.py
  • settings.py
  • spiders – folder. It is the place where the spiders that we create get stored.

A project root directory (the one that contains scrapy.cfg) can be shared by multiple Scrapy projects, each with its own settings module.

SCRAPY COMMAND LINE TOOL

The Scrapy command line provides many commands.  Those commands can be classified into two groups.  

  1. Global commands
  2. Project-only commands

To see all the commands available type the following in the shell:

scrapy -h

To see the help for a particular command, type:

scrapy <command> -h

The general syntax of a command is:

scrapy <command> [options] [args]

Global Commands

These are the commands that work without an active Scrapy project.

  • startproject
scrapy startproject <project_name> [project_dir]

Usage: It is used to create a project with the specified project name under the specified project directory. If the directory is not mentioned, then the project directory will be the same as the project name.

Example:

scrapy startproject tutorial

This will create a directory named “tutorial” that contains the configuration file and a project named “tutorial”.

  • genspider
scrapy genspider [-t template] <name> <domain>

Usage: This is used to create a new spider in the current folder or, when run inside a project, in the project's spiders folder. The spider's name is given by the <name> parameter, and <domain> is used to generate the “start_urls” and “allowed_domains” attributes.

Example:

scrapy genspider tuts https://www.imdb.com/chart/top/

This will create a spider file named tuts.py that contains a spider named “tuts”, with “allowed_domains” and “start_urls” generated from the given address.

  • settings
scrapy settings [options]

Usage: It prints the value of a Scrapy setting. Inside a project it shows the project's value; outside a project it shows Scrapy's default value.

The following options can be used with the settings command:

--help                  show the help message and exit
--get=SETTING           print raw setting value
--getbool=SETTING       print setting value, interpreted as a Boolean
--getint=SETTING        print setting value, interpreted as an integer
--getfloat=SETTING      print setting value, interpreted as a float
--getlist=SETTING       print setting value, interpreted as a list

The following global options are accepted by all Scrapy commands:

--logfile=FILE          log file; if omitted, stderr will be used
--loglevel=LEVEL        log level
--nolog                 disable logging completely
--profile=FILE          write Python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set NAME=VALUE        set/override setting (short form: -s)
--pdb                   enable pdb on failure

Example:

scrapy settings --get BOT_NAME

The global options work with any command, for example:

scrapy crawl tuts -s LOG_FILE=scrapy.log

  • runspider
scrapy runspider <spider.py>

Usage: It runs a spider that is self-contained in a Python file, without having to create a project.

Example:

scrapy runspider tuts.py

  • shell
scrapy shell [url]

Usage: It starts the Scrapy shell for the given URL, or an empty shell if no URL is given.

Options:

--spider=SPIDER      (The mentioned spider will be used and auto-detection gets bypassed)

-c code              (Evaluates the code in the shell, prints the result and exits)

--no-redirect        (Does not follow HTTP 3xx redirects)

Example:

scrapy shell https://www.imdb.com/chart/top/

Scrapy will start the shell on https://www.imdb.com/chart/top/ page.

  • fetch
scrapy fetch <url>

Usage:     

Scrapy downloads the page using the Scrapy downloader and writes its content to standard output.

Options:

--spider=SPIDER      (The mentioned spider will be used and auto-detection gets bypassed)

--headers            (Prints the response's HTTP headers instead of the response's body)

--no-redirect        (Does not follow HTTP 3xx redirects)

Example:

scrapy fetch https://www.imdb.com/chart/top/

Scrapy will download the https://www.imdb.com/chart/top/ page.

  • view
scrapy view <url>

Usage:     

Scrapy opens the given URL in the default browser, as the spider would “see” it. This helps to check what the spider receives, which can differ from what a regular user sees.

Options:

--spider=SPIDER      (The mentioned spider will be used, and auto-detection gets bypassed)

--no-redirect        (Does not follow HTTP 3xx redirects)

Example:

scrapy view https://www.imdb.com/chart/top/

Scrapy will open https://www.imdb.com/chart/top/ page in the default browser.

  • version

Syntax: scrapy version [-v]

Usage:     

Prints the Scrapy version. With -v it also prints Python, Twisted and platform information.

Project-only Commands

These are the commands that work only inside an active Scrapy project.

  1. crawl

Syntax:

scrapy crawl <spider>

Usage:     

This will start the crawling.

Example:

scrapy crawl tuts

Scrapy will crawl the domains mentioned in the spider.

  2. check

Syntax: 

scrapy check [-l] <spider>

Usage:     

Runs the contract checks that are defined for the spider's callbacks. With -l it only lists the available contracts.

Example:

scrapy check tuts

Scrapy will run the contracts of the tuts spider and report “OK” if all of them pass.

  3. list

Syntax: 

scrapy list

Usage:     

Lists the names of all the spiders that are present in the project, one per line.

Example:

scrapy list

Scrapy will print the names of all the spiders in the current project.

  4. edit

Syntax: 

scrapy edit <spider>

Usage:     

This command opens the spider in an editor. The editor named in the EDITOR environment variable is used or, if that is not set, the one in the EDITOR setting. By default this is vi on UNIX systems and IDLE on Windows. The command is only a convenience shortcut; the developer is free to use any other editor or IDE.

Example:

scrapy edit tuts

Scrapy will open the tuts spider in the editor.

  5. parse

Syntax: 

scrapy parse <url> [options]

Usage:     

Scrapy fetches the given URL and parses it with the spider that handles it. The method passed with --callback is used; if none is given, parse is used.

Options:

--spider=SPIDER      (The mentioned spider will be used, and auto-detection gets bypassed)

-a NAME=VALUE        (Sets a spider argument; may be repeated)

--callback or -c     (Spider method to use as the callback for parsing the response)

--cbkwargs           (Additional keyword arguments that are passed to the callback, given as a JSON string)

--meta or -m         (Additional request meta that is passed to the callback request, given as a JSON string)

--pipelines          (Processes items through pipelines)

--rules or -r        (Uses CrawlSpider rules to discover the callback for parsing the response)

--noitems            (Hides scraped items)

--nocolour           (Removes colours from the output)

--nolinks            (Hides extracted links)

--depth or -d        (The depth level to which the requests should be followed recursively; the default is 1)

--verbose or -v      (Displays information for each depth level)

--output or -o       (Dumps the scraped items to a file)

Example:

scrapy parse https://www.imdb.com/chart/top/

Scrapy will parse the https://www.imdb.com/chart/top/ page.

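  6. bench

Syntax: scrapy bench

Usage:

Runs a quick benchmark test.

Custom commands

Custom project commands can be added with the COMMANDS_MODULE setting, which names the module in which Scrapy looks for them (mybot.commands is an example module path):

COMMANDS_MODULE = 'mybot.commands'

Commands can also be added from an external library through a scrapy.commands entry point in the library's setup.py file.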

SPIDERS

The spiders folder contains the spider classes, which define how a site is crawled and how data is scraped from its pages. They can be customised as required.

SPIDER SCRAPING CYCLE

There are different types of Spiders available for various purposes.

Scrapy.Spider

Class:  scrapy.spiders.Spider

It is the simplest spider, and the one from which every other spider inherits. Its default start_requests() method sends a request for each URL in start_urls and calls parse() for each resulting response.

name – The name of the spider. It must be unique within a project, although more than one instance of the same spider can be instantiated. It is best practice to name the spider after the website that it crawls.

allowed_domains – An optional list of the domains that the spider is allowed to crawl. When OffsiteMiddleware is enabled, requests for URLs that do not belong to these domains (or their subdomains) are not followed.

start_urls – The list of URLs from which the spider starts to crawl.

custom_settings – A dictionary of settings that override the project-wide settings when this spider runs. It must be defined as a class attribute, because the settings are updated before the spider is instantiated.

crawler – The from_crawler() method sets this attribute. It links the Crawler object with the spider object.

settings – The settings with which the spider is run.

logger – A Python logger created with the spider's name. It is used to send log messages from the spider.

from_crawler(crawler, *args, **kwargs) – The class method that Scrapy uses to create spiders. It sets the crawler and settings attributes.

A. crawler – the Crawler object to which the spider is bound

B. args – arguments that are passed to __init__()

C. kwargs – keyword arguments that are passed to __init__()

start_requests() – Returns the first requests to crawl. It is called only once, and by default it generates a Request for each URL in start_urls.

parse(response) – The default callback. It processes the response and returns the scraped data and/or more URLs to follow.

log(message, level, component) – Sends a log message through the spider's logger.

closed(reason) – Called when the spider closes. It is a shortcut to signals.connect() for the spider_closed signal.

Spider Arguments

Arguments can be given to spiders. The arguments are passed through the crawl command using  -a option.

The __init__() will take these arguments and apply them as attributes.

Example:

scrapy crawl tuts -a category=electronics

If the spider defines its own __init__(), it should accept category as an argument for this to work.

Generic Spiders

These spiders can be used for rule-based crawling, crawling Sitemaps, or parsing XML/CSV feed.

CrawlSpider

Class – scrapy.spiders.CrawlSpider

This spider crawls by following links according to a set of rules that can be custom written.

Attributes: 

  1. rules – A list of Rule objects that define the crawling behaviour.
  2. parse_start_url(response, **kwargs) – Called for each response produced for the URLs in start_urls. It must return an item object, a Request object, or an iterable containing any of them.

Crawling Rules:

class scrapy.spiders.Rule(link_extractor=None, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None, errback=None)

link_extractor – A Link Extractor object that defines how links are extracted from each crawled page. A Request object is then created for each extracted link.

callback – Called for each response obtained from a link extracted by this rule. It receives the response as its first argument and must return an iterable of items and/or requests.

cb_kwargs – A dict of keyword arguments for the callback function.

follow – A Boolean that specifies whether links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.

process_links – Called with the list of links extracted from each response; it is mainly used for filtering.

process_request – Called for every request extracted by this rule. It must return a request or None.

errback – Called if an exception is raised while processing a request generated by the rule.

XMLFeedSpider

Class – scrapy.spiders.XMLFeedSpider

It is used to parse XML feeds by iterating over the nodes with a particular node name. The iterator can be iternodes, xml or html; iternodes is recommended for performance reasons.

The following class attributes must be defined to set the iterator and tag name:

  1. iterator – The iterator to be used, i.e. iternodes, html or xml. The default is iternodes.
  2. itertag – The name of the node (tag) to iterate over.
  3. namespaces – A list of (prefix, uri) tuples that define the namespaces in the document that will be processed by this spider.

The following overridable methods are available as well:

  1. adapt_response(response) – Receives the response before the spider starts parsing it and can change the response body. It receives a response and must return a response.
  2. parse_node(response, selector) – Called for each node that matches the given itertag. It must be overridden for the spider to work, and it must return an item object, a Request object, or an iterable containing any of them.
  3. process_results(response, results) – Called for each result returned by the spider, for any last-minute processing that is required.

CSVFeedSpider

Class – scrapy.spiders.CSVFeedSpider

This spider iterates over the rows of a CSV feed. parse_row() is called for each row.

delimiter: The separator character between the fields of the CSV file. Default is “,”.

quotechar: The enclosure (quote) character for each field. Default is ‘"’.

headers: The list of column names in the CSV file.

parse_row(response, row): Receives the response and a dict (representing the row) with a key for each header of the CSV file. adapt_response and process_results can also be overridden in this spider for pre- and post-processing.

SitemapSpider 

Class – scrapy.spiders.SitemapSpider

It is used for crawling a site by discovering its URLs through Sitemaps. It can also discover sitemap URLs from robots.txt.

  1. sitemap_urls – A list of URLs that point to the sitemaps to be crawled. A URL can also point to a robots.txt file, from which the sitemap URLs are extracted.
  2. sitemap_rules – A list of (regex, callback) tuples. URLs extracted from the sitemaps that match regex are processed by callback.
  3. sitemap_follow – A list of regexes for the sitemaps that should be followed.
  4. sitemap_alternate_links – Specifies whether alternate links for a URL should be followed. This is disabled by default.
  5. sitemap_filter(entries) – A method that can be overridden to filter sitemap entries based on their attributes.

Selectors

Scrapy uses CSS or XPath expressions to select HTML elements. 

Querying can be done using response.css() or response.xpath().

Example:

response.css("div::text").get()

Selector() can also be used directly if needed.

.get() or .getall() is called on the result of a query to extract the data.

.get() – returns a single result, or None if nothing matched.

.getall() – returns a list of all the matches.

CSS pseudo-elements can be used to select text or attribute nodes.

.get() has an alias, .extract_first().

A default value can be returned instead of None with .get(default='value').

.attrib can also be used to read the attributes of the first matched element.

Example:

response.css('a').attrib['href']

Non-standard pseudo-elements that are essential for web scraping are:

  1. ::text   – selects the text nodes
  2. ::attr(name) – selects attributes values.

*::text selects all the descendant text nodes of the current selector context.

*::text

foo::text returns no results if the foo element exists but contains no text.

Nesting Selectors  

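The selection methods (.xpath() and .css()) return a list of selectors of the same type, so the selection methods can be called again on those selectors. This is called nesting of selectors.

Example:

divs = response.css("div")

divs.css("a::attr(href)").getall()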

Selecting element attributes   

Attributes of an element can be obtained using XPath or CSS selectors.

XPath – The advantage of XPath is that the @attribute syntax is a standard XPath feature and can also be used as a filter.

Example : response.xpath("//a/@href").get()

CSS Selector : ::attr(…) can be used to get attribute values as well.  

Example : response.css('img::attr(src)').get()

Or the .attrib property can also be used

Example : response.css('img').attrib['src']

Using Selectors with regular expressions

.re() can be used to extract data along with Xpath or with CSS.

Example : response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')

.re_first()  can also be used to extract the first element.

Some equivalents

Selection                       Equivalent Value Used these days
SelectorList.extract_first()    SelectorList.get()
SelectorList.extract()          SelectorList.getall()
Selector.extract()              Selector.get()

Selector.getall() is also available and returns a list.

.get() always returns a single result.

.getall() always returns a list.

.extract() returns a list when it is called on a SelectorList and a single result when it is called on a Selector, which made its output less predictable. To get a single result, either extract() or extract_first() had to be called, depending on the object.

Working with relative XPATHS

Absolute XPath – When selectors are nested, an XPath that starts with ‘/’ is absolute to the document and not relative to the selector it is called on.

The proper way to make it relative is to put a “.” in front of the ‘/’.

Example:

divs = response.xpath("//div")

for p in divs.xpath(".//p"):  # all <p> elements inside the divs
    print(p.get())

or  

for p in divs.xpath("p"):  # only the direct <p> children
    print(p.get())

More details on XPath can be found at https://www.w3.org/TR/xpath/all/#location-paths

Querying the elements by Class: Use CSS

An element can have several classes, so querying by class with XPath quickly becomes complicated.

If ‘@class="someclass"’ is used, elements that also have other classes will be missing from the output.

If ‘contains(@class, "someclass")’ is used, more elements than needed might come up in the result, because class names that merely contain the string “someclass” also match.

As Scrapy allows chaining of selectors, a CSS selector can be used to select by class and an XPath can then be chained to it to select the required elements.

Example:

response.css(".shout").xpath('./div').getall()

A “.” should be put before the ‘/’ in the XPath that follows the CSS selector.

Difference between //node[1] and (//node)[1]

(//node)[1] – selects all the nodes in the document first, and then only the first element of that list is selected.

//node[1] – selects every node that is the first such child of its parent.

Text nodes under condition

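When a node-set is passed to a string function such as contains() or starts-with(), it is converted to a string using only its first node. An expression such as .//text() yields a collection of text nodes, so only the first text node is checked and the condition may not match even though the text is present in the element. Hence it is better to use “.” alone instead of “.//text()”.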

Variables in Xpath expressions

$somevariable is used as a reference to a variable. Its value is passed as a keyword argument and substituted into the query.

Example:

response.xpath('//div[count(a)=$cnt]/@id', cnt=5).get()

More examples on https://parsel.readthedocs.io/en/latest/usage.html#variables-in-xpath-expressions

Removing namespaces

The selector.remove_namespaces() method removes all the namespaces from the document, so that elements can be selected with simple, namespace-less XPaths.

Example:

response.selector.remove_namespaces()

Scrapy does not remove namespaces by default, because the operation has to iterate over the whole document and is therefore expensive, and because the namespaces of the page are sometimes needed. So this method is called only when needed.

Using EXSLT extensions

Prefix    Namespace                               Usage
re        http://exslt.org/regular-expressions    Regular expression
set       http://exslt.org/sets                   Set manipulation

Regular Expressions

The test() function is useful when XPath's starts-with() and contains() are not sufficient.

Set operations

These are handy for excluding parts of a document tree before extracting the data.

Example

scope.xpath('''set:difference(./descendant::*/@itemprop, .//*[@itemscope]/*/@itemprop)''')

Other XPath extensions

has-class returns True for nodes that have all of the given HTML classes and False for nodes that do not.

response.xpath('//p[has-class("foo")]')

Built-in Selectors reference

  1. Selector objects

Class – scrapy.selector.Selector(*args, **kwargs)

response – An HtmlResponse or XmlResponse object from which the data is selected.

text – A Unicode string or UTF-8 encoded text, used in cases where a response is not available.

type – The selector type: “html” for HtmlResponse, “xml” for XmlResponse, or None to let Scrapy choose based on the response.

xpath(query, namespaces=None, **kwargs) – Finds the nodes that match the XPath query and returns them as a SelectorList with all elements flattened. namespaces is an optional dictionary of prefix-to-URI mappings, used in addition to those registered with register_namespace(prefix, uri).

css(query) – Applies the CSS selector given as the query argument and returns a SelectorList.

get() – Serializes the matched nodes and returns them as a single string.

attrib – Returns the attributes dictionary of the underlying element.

re(regex, replace_entities=True) – Applies the regex and returns a list of strings with the matches. If replace_entities is True, character entity references are replaced by their corresponding characters.

re_first(regex, default=None, replace_entities=True) – Returns the first matching string, or the default value if there is no match.

register_namespace(prefix, uri) – Registers a namespace to be used in this Selector.

remove_namespaces() – Removes all namespaces.

__bool__() – Returns True if there is any real content selected.

getall() – Serializes the matched node and returns it in a one-element list.

  2. SelectorList objects

xpath(query, namespaces=None, **kwargs) – Calls xpath() for each element in the list and returns the results flattened as another SelectorList.

css(query) – Calls css() for each element in the list and returns the results flattened as another SelectorList.

get() – Returns the result of get() for the first element in the list, or None if the list is empty.

getall() – Calls get() for each element in the list and returns the results as a list.

re(regex, replace_entities=True) – Calls re() for each element in the list and returns the results flattened.

re_first(regex, default=None, replace_entities=True) – Returns the first match found by re(), or the default value if there is none.

attrib – Returns the attributes dictionary of the first element in the list.

ITEMS

A spider returns the extracted data as items, which are Python objects that define key-value pairs. Scrapy supports several types of items.

Item Types

  1. Dictionaries – dict is convenient and familiar.
  2. Item Objects 

Class – scrapy.item.Item([arg])

Item provides a dict-like API plus additional features:

  • KeyError – Field names are declared in advance, so a KeyError is raised when an undefined field name is used. This guards against typos.
  • Item exporters – All the declared fields can be exported by default, even if the first scraped object does not have values for all of them.

Item also allows field metadata to be defined. trackref tracks Item objects in order to help find memory leaks.

Additional Item API members that can be used are copy(), deepcopy() and fields.

  3. Dataclass objects  

Item classes with field names can be defined with dataclass(). The type and default value of each field can be defined, and custom field metadata can be defined through dataclasses.field().

  4. attr.s objects

Item classes with field names can be defined with attr.s(). The type and default value of each field can be defined, and custom field metadata can be defined through attr.ib().

Working with Item Objects

Declaring Item subclasses

Simple class definition and Field objects can be used to declare Item subclasses.

Example:

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()

Declaring Fields

Field objects are used to specify metadata for each field. The metadata can be of any kind, and each component that uses it decides which keys it reads.

Class – scrapy.item.Field

Example

Creating items

product = Product(name='Desktop PC', price=1000)

Getting field values

product['price']

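Setting field values

product['price'] = 1200

Setting a field that is not declared in the Item, for example product['lala'] = 'test', raises a KeyError.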

Accessing all populated values

product.keys()
product.items()

Copying items

product2 = product.copy()
product2 = product.deepcopy()

Extending Item Subclass

Items can also be extended by defining a subclass of the original item.

Metadata can be extended with previous metadata.

Supporting all Item Types

Class – itemadapter.ItemAdapter(item:Any)

ItemAdapter is a common interface to extract and set data without having to take the type of the item into account.

itemadapter.is_item(obj: Any) -> bool

Returns True if the given object belongs to one of the supported item types.

ITEM LOADERS

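Item Loaders provide a convenient mechanism for populating items.

Using Item Loaders to populate items

An Item Loader is instantiated either with an item object or without one, in which case an item is created automatically in __init__ from the class given in ItemLoader.default_item_class. Values are then collected into the Item Loader, typically using selectors. More than one value can be added to the same field; the Item Loader joins them later using processing functions.

add_xpath(), add_css() and add_value() are all used to collect data into an Item Loader. ItemLoader.load_item() then returns the item populated with the data collected by add_xpath(), add_css() and add_value().

Working with data class items

By default, dataclass items require all the fields to be passed when they are created. With Item Loaders the fields are populated incrementally through add_xpath(), add_css() and add_value(), so the fields should be given default values with field().

Input and output processors

An Item Loader has one input processor and one output processor for each item field.

The input processor processes the extracted data as soon as it is received through add_xpath(), add_css() or add_value(), and the result is collected and kept inside the Item Loader.

After all the data is collected, ItemLoader.load_item() is called to populate the item. It runs the output processor on the collected data.

The result of the output processor is the final value that gets assigned to the item.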

Declaring Item Loaders

Input processors are declared using the _in suffix on the field name (for example name_in).

Output processors are declared using the _out suffix (for example name_out).

Default processors can also be declared using ItemLoader.default_input_processor and ItemLoader.default_output_processor.

Declaring Input and Output processors

Input/Output processors can also be declared using Item Field metadata.

Precedence order:

  1. Item loader field specific attributes
  2. Field metadata
  3. Item Loader defaults

Item Loader Context

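The Item Loader Context is a dict of arbitrary key-value pairs that is shared among all the input and output processors of an Item Loader. It can be passed when declaring, instantiating or using the Item Loader, and it is used to modify the behaviour of the processors.

A processor function that accepts a loader_context argument, such as the parse_length function in the Scrapy documentation, receives the currently active context through it.

The context values can be modified in three ways:

  1. By modifying the currently active context through the Item Loader's context attribute
  2. On Item Loader instantiation, through keyword arguments
  3. On Item Loader declaration, for the processors that support being instantiated with a context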

Item Loader Object

If no item is given, one is instantiated automatically using the class in default_item_class.

item – The item object that is parsed by this Item Loader.
context – The currently active context of this Item Loader.
default_item_class – The item class that is instantiated when no item is given in __init__().
default_input_processor – The default input processor for the fields that do not specify one.
default_output_processor – The default output processor for the fields that do not specify one.
default_selector_class – The class used to construct the selector of this Item Loader if only a response is given in __init__(). It is ignored if a selector is given in __init__().
selector – The Selector object from which the data is extracted.
add_css(field_name, css, *processors, **kw) – The CSS selector is used to extract a list of strings, which is added to the field.
add_value(field_name, value, *processors, **kw) – The value is first passed through get_value() with the given processors and kw, then through the field input processor, and the result is appended to the data collected for the field.
add_xpath(field_name, xpath, *processors, **kw) – The XPath is used to extract a list of strings, which is added to the field.
get_collected_values(field_name) – Returns the values collected for the field.
get_css(css, *processors, **kw) – The CSS selector is used to extract a list of strings.
get_output_value(field_name) – Returns the collected values of the field parsed through the output processor.
get_value(value, *processors, **kw) – The given value is processed by the given processors and keyword arguments.
get_xpath(xpath, *processors, **kw) – The XPath is used to extract a list of strings.
load_item() – Populates the item with the data collected so far and returns it.
nested_css(css, **context) – Creates a nested loader with a CSS selector.
nested_xpath(xpath, **context) – Creates a nested loader with an XPath selector.
replace_css(field_name, css, *processors, **kw) – Like add_css(), but replaces the collected data instead of adding to it.
replace_value(field_name, value, *processors, **kw) – Like add_value(), but replaces the collected data instead of adding to it.
replace_xpath(field_name, xpath, *processors, **kw) – Like add_xpath(), but replaces the collected data instead of adding to it.

Nested Loaders

Nested loaders are useful when related values need to be parsed from a subsection of a document.

Reusing and Extending Item Loaders

Scrapy provides the support for python class inheritance and hence item loaders can be reused and extended.

SCRAPY SHELL

Scrapy shell can be used for testing and evaluating spiders before running the entire spider. Individual queries can be checked in this.

Configuring the shell

Scrapy shell works well with IPython and also supports bpython. IPython is recommended as it provides auto-completion and colorized output.

The shell to use can be set through the SCRAPY_PYTHON_SHELL environment variable or in scrapy.cfg:

[settings]

shell = bpython

Launch the shell

To launch the shell

scrapy shell <url>

Using the shell

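It is just a regular Python shell with some additional shortcuts.

Available shortcuts

  1. shelp() – prints help with the list of available objects and shortcuts
  2. fetch(url[, redirect=True]) – fetches a new response from the given URL and updates all related objects
  3. fetch(request) – fetches a new response from the given request and updates all related objects
  4. view(response) – opens the given response in the local web browser

Available scrapy objects

  1. crawler – current Crawler object
  2. spider – the spider that is known to handle the URL, or a Spider object if no spider is found for the current URL
  3. request – Request object of the last fetched page
  4. response – Response object containing the last fetched page
  5. settings – current Scrapy settings

Invoking shell from spiders to inspect responses

To inspect the response that reaches a certain point of a spider, call the following function from the spider's callback:

scrapy.shell.inspect_response(response, self)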

ITEM PIPELINE

After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.

Typical uses of item pipelines:

  1. cleansing HTML data
  2. validating scraped data
  3. checking for duplicates (and dropping them)
  4. storing the scraped data

Writing item pipeline

Item pipeline components are Python classes.

  1. process_item(self, item, spider) – Called for every item pipeline component. It must return an item object, return a Deferred, or raise a DropItem exception. item is the scraped item and spider is the spider that scraped the item.
  2. open_spider(self, spider) – Called when the spider is opened.
  3. close_spider(self, spider) – Called when the spider is closed.
  4. from_crawler(cls, crawler) – A class method that, if present, is called to create the pipeline instance from a Crawler. It must return a new instance of the pipeline.

Example application:

  1. price validation and dropping items with no prices
  2. write items to json file
  3. write items to mongodb
  4. take a screenshot of item
  5. duplicates filter

To activate a pipeline component, its class has to be added to the ITEM_PIPELINES setting.

FEED EXPORTS

Scrapy supports feed exports, that is, exporting the scraped data to storage in multiple formats.

Serialization formats

Item exporters are used for this process.  The supported formats are :

Serialization format    Feed setting format key    Exporter
JSON                    json                       JsonItemExporter
JSON lines              jsonlines                  JsonLinesItemExporter
CSV                     csv                        CsvItemExporter
XML                     xml                        XmlItemExporter
Pickle                  pickle                     PickleItemExporter
Marshal                 marshal                    MarshalItemExporter

Storages

Supported backend storage:

  1. Local filesystem
  2. FTP
  3. S3
  4. Google cloud storage
  5. Standard output

Storage URI parameters

%(time)s – replaced by a timestamp when the feed is created

%(name)s – replaced by the spider name

Storage backends

Storage backend | URI scheme | Example URI | Required external library
Local filesystem | file | file:///tmp/export.csv | None
FTP | ftp | ftp://user:pass@ftp.example.com/path/to/export.csv | None
Amazon S3 | s3 | s3://mybucket/path/to/export.csv | botocore >= 1.4.87
Google Cloud Storage | gs | gs://mybucket/path/to/export.csv | google-cloud-storage
Standard output | stdout | stdout: | None

Notes:

FTP – supports two connection modes, active and passive. The default is passive. For an active connection set FEED_STORAGE_FTP_ACTIVE = True.

Amazon S3 – AWS credentials can be passed through the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY settings. A custom ACL can be set with FEED_STORAGE_S3_ACL.

Google Cloud Storage – the project and the access control list (ACL) are set with GCS_PROJECT_ID and FEED_STORAGE_GCS_ACL.

Delayed File Delivery

Storage backends that use delayed file delivery are:

  1. FTP
  2. S3
  3. Google Cloud Storage

With these backends, the file content is uploaded to the feed URI only after all of it has been collected, that is, once the spider has finished crawling.

To start item delivery earlier, use FEED_EXPORT_BATCH_ITEM_COUNT to split the output into multiple files.

Settings

Settings for feed exporters

  1. FEEDS (mandatory)
  2. FEED_EXPORT_ENCODING
  3. FEED_STORE_EMPTY
  4. FEED_EXPORT_FIELDS
  5. FEED_EXPORT_INDENT
  6. FEED_STORAGES
  7. FEED_STORAGE_FTP_ACTIVE
  8. FEED_STORAGE_S3_ACL
  9. FEED_EXPORTERS
  10. FEED_EXPORT_BATCH_ITEM_COUNT

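FEEDS

Default: {}

FEEDS is a dictionary in which every key is a feed URI and every value is a nested dictionary with the parameters for that feed.

Accepted key | Fallback value
format | none (this key is mandatory)
batch_item_count | FEED_EXPORT_BATCH_ITEM_COUNT
encoding | FEED_EXPORT_ENCODING
fields | FEED_EXPORT_FIELDS
indent | FEED_EXPORT_INDENT
item_export_kwargs | none; a dict with keyword arguments for the corresponding item exporter class
overwrite | depends on the storage backend (see below)
store_empty | FEED_STORE_EMPTY
uri_params | FEED_URI_PARAMS

overwrite decides whether a file that already exists is overwritten (True) or appended to (False). Its default depends on the storage backend:

Storage backend | Default for overwrite
Local filesystem | False
FTP | True
S3 | True
Standard output | False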
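As an illustration, a FEEDS setting that writes the same items to two feeds could look like the sketch below. The file names and field names are assumptions; the keys are the ones listed above, and %(name)s and %(time)s are the storage URI parameters.

```python
# settings.py (file names and field names below are hypothetical)
FEEDS = {
    "items.json": {
        "format": "json",
        "encoding": "utf8",
        "indent": 4,
        "overwrite": True,
    },
    "exports/%(name)s/%(time)s.csv": {
        "format": "csv",
        "fields": ["title", "date", "author"],
    },
}
```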

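FEED_EXPORT_ENCODING

Default: None

The encoding to be used for the feed. If it is unset or set to None, UTF-8 is used for every format except JSON output, which uses safe numeric escapes (\uXXXX sequences). Set it to utf-8 if you want UTF-8 for JSON too.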

FEED_EXPORT_FIELDS

Default: None

Use FEED_EXPORT_FIELDS to define the fields to export and their order.

When FEED_EXPORT_FIELDS is empty or None, Scrapy uses the fields defined in the item objects yielded by the spider.

FEED_EXPORT_INDENT

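Default: 0

The number of spaces used to indent the output on each level.

If this is a positive integer, array elements and object members are pretty-printed with that indent level.

If this is 0 or negative, each item is put on a new line.

None selects the most compact representation.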

FEED_STORE_EMPTY

Default : False

FEED_STORAGES

Default : {}

FEED_STORAGE_FTP_ACTIVE

Default: False

Whether to use an active connection (True) or a passive connection (False) when exporting feeds to FTP.

FEED_STORAGE_S3_ACL

Default: '' (empty string)

A string containing a custom ACL for feeds exported to Amazon S3.

FEED_STORAGES_BASE

Dict containing the built-in feed storage backends supported by Scrapy.

FEED_EXPORTERS

Default: {}

Dict containing additional exporters supported by your project.

FEED_EXPORTERS_BASE

Dict containing the built-in feed exporters supported by Scrapy.

FEED_EXPORT_BATCH_ITEM_COUNT

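Default: 0

If set to a number greater than 0, Scrapy generates multiple output files, storing up to that number of items in each file. In this case the feed URI must contain %(batch_time)s or %(batch_id)d so that every file gets a unique name.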
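A minimal sketch of batched output is shown below, assuming a hypothetical exports folder. The URI is expanded with printf-style formatting, which the last line imitates by hand for the first batch.

```python
# settings.py (the output path is hypothetical)
FEED_EXPORT_BATCH_ITEM_COUNT = 100
FEEDS = {
    "exports/items-%(batch_id)d.json": {"format": "json"},
}

# What the first file name looks like once the placeholder is filled in.
first_batch_name = "exports/items-%(batch_id)d.json" % {"batch_id": 1}
```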

FEED_URI_PARAMS

Default: None

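A string with the import path of a function that sets the parameters applied to the feed URI with printf-style string formatting. The function receives the default parameters and the spider, and returns the dict of parameters to use.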
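Such a function could be sketched as follows. The module path in the comment and the extra spider_name key are assumptions; the stand-in spider object exists only so that the sketch runs outside a crawl.

```python
from types import SimpleNamespace


def uri_params(params, spider):
    # Keep the default parameters and add one more for use in the feed URI.
    return {**params, "spider_name": spider.name}


# In settings.py this would be referenced by its import path, for example:
# FEED_URI_PARAMS = "scrape.utils.uri_params"  (hypothetical path)

# Stand-in for a real spider, used only for this demonstration.
fake_spider = SimpleNamespace(name="posts")
feed_uri = "exports/%(spider_name)s.json" % uri_params({}, fake_spider)
```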

REQUESTS AND RESPONSES

Requests and responses are made for crawling the site.

Request Objects

PARAMETERS

  1. url – URL of the request
  2. callback – the function that is called with the response of this request as its first argument
  3. method – HTTP method of the request. Default: 'GET'
  4. meta – initial values for the Request.meta dictionary
  5. body – the request body. If it is not given, an empty bytes object is stored
  6. headers – headers of the request
  7. cookies – request cookies
  8. encoding – encoding of the request (default: 'utf-8')
  9. priority – priority of the request (default: 0)
  10. dont_filter – if True, the scheduler does not filter this request out as a duplicate
  11. errback – function that is called if an exception is raised while processing the request
  12. flags – flags sent to the request, which can be used for logging
  13. cb_kwargs – dict whose entries are passed to the callback as keyword arguments

Passing additional data to callback functions

Request.cb_kwargs can be used to pass arguments to a callback function. This makes it possible to collect data in one callback and receive it as keyword arguments in a second callback later.

Using errbacks to catch exceptions in request processing

The errback receives a Failure as its first parameter, which can be used to track errors such as connection timeouts, DNS errors and so on.
The keyword arguments of the failed request can be accessed through failure.request.cb_kwargs.

Request.meta special keys

Special keys:

  • dont_redirect
  • dont_retry
  • handle_httpstatus_list
  • handle_httpstatus_all
  • dont_merge_cookies
  • cookiejar
  • dont_cache
  • redirect_reasons
  • redirect_urls
  • bindaddress
  • dont_obey_robotstxt
  • download_timeout
  • download_maxsize
  • download_latency
  • download_fail_on_dataloss
  • proxy
  • ftp_user  
  • ftp_password 
  • referrer_policy
  • max_retry_times

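bindaddress – the outgoing IP address to use for performing the request

download_timeout – the time, in seconds, that the downloader waits before timing out

download_latency – the time spent fetching the response since the request was started (read-only)

download_fail_on_dataloss – whether or not to fail on broken responses

max_retry_times – the maximum number of retries for this request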

Stopping the download of a response

Raising a StopDownload exception from a handler of the bytes_received or headers_received signal stops the download of the response.

Request subclasses

List of request subclasses

  • FormRequest Objects

Parameters:

  • formdata

classmethod from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])

Parameters:

  1. response
  2. formname
  3. formid
  4. formxpath
  5. formcss
  6. formnumber
  7. formdata
  8. clickdata
  9. dont_click

Examples:

Using FormRequest to send data via HTTP POST

Using FormRequest.from_response() to simulate a user login

  • JsonRequest

Parameters:

  • data
  • dumps_kwargs

Response Objects

These are HTTP responses.

Parameters, attributes and methods:

  1. url
  2. status
  3. headers
  4. body
  5. flags
  6. request
  7. certificate
  8. ip_address
  9. cb_kwargs
  10. copy()
  11. replace([url, status, headers, body, request, flags, cls])
  12. urljoin(url)
  13. follow(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, cb_kwargs=None, flags=None)
  14. follow_all(urls, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, cb_kwargs=None, flags=None)

Response subclasses

List of subclasses:

  1. TextResponse objects
  2. HtmlResponse objects
  3. XmlResponse objects

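LINK EXTRACTORS

A link extractor extracts links from responses.

LxmlLinkExtractor.extract_links returns a list of matching Link objects.

Link Extractor Reference

The link extractor class is scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor. For convenience it can also be imported as scrapy.linkextractors.LinkExtractor.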

LxmlLinkExtractor

Parameters:

  1. allow
  2. deny
  3. allow_domains
  4. deny_domains
  5. deny_extensions
  6. restrict_xpaths
  7. restrict_css
  8. restrict_text
  9. tags
  10. attrs
  11. canonicalize
  12. unique
  13. process_value
  14. strip
  15. extract_links(response) – a method rather than a parameter; it returns the list of Link objects found in the response

Link

Link objects represent the links extracted by a link extractor.

Parameters:

  1. url
  2. text
  3. fragment
  4. nofollow

SETTINGS

Scrapy settings can be adjusted as needed to customize the behaviour of all Scrapy components.

Designating the settings

The SCRAPY_SETTINGS_MODULE environment variable tells Scrapy which settings module to use.

Populating the settings

Settings can be populated through the following mechanisms, listed in decreasing order of precedence:

  1. Command line options – “-s” (or “--set”) overrides any other setting
  2. Settings per-spider – defined through the spider's “custom_settings” attribute
  3. Project settings module – the “settings.py” file of the project
  4. Default settings per-command – defined in the “default_settings” attribute of the command class
  5. Default global settings – located in the scrapy.settings.default_settings module

Import Paths and Classes

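When a setting references a callable object to be imported by Scrapy, such as a class or a function, it can be specified in two ways:

  1. A string containing the import path of the object
  2. The object itself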

How to access settings

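Inside a spider, settings are available through “self.settings”. In extensions, middlewares and item pipelines, they are available through the “settings” attribute of the Crawler (“scrapy.crawler.Crawler.settings”) that is passed to “from_crawler”.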

Rationale for setting names

Setting names are usually prefixed with the name of the component that they configure.

Built-in settings reference

AWS_ACCESS_KEY_ID | AWS_SECRET_ACCESS_KEY | AWS_ENDPOINT_URL | AWS_USE_SSL
AWS_VERIFY | AWS_REGION_NAME | ASYNCIO_EVENT_LOOP | BOT_NAME | CONCURRENT_ITEMS
CONCURRENT_REQUESTS | CONCURRENT_REQUESTS_PER_DOMAIN | DEFAULT_ITEM_CLASS | DEFAULT_REQUEST_HEADERS | DEPTH_LIMIT
DEPTH_PRIORITY | DEPTH_STAT_VERBOSE | DNSCACHE_ENABLED | DNSCACHE_SIZE | DNS_RESOLVER
DOWNLOADER | DOWNLOADER_HTTPCLIENTFACTORY | DOWNLOADER_CLIENTCONTEXTFACTORY | DOWNLOADER_CLIENT_TLS_CIPHERS | DOWNLOADER_CLIENT_TLS_METHOD
DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING | DOWNLOADER_MIDDLEWARES | DOWNLOADER_MIDDLEWARES_BASE | DOWNLOADER_STATS | DOWNLOAD_DELAY
DOWNLOAD_HANDLERS | DOWNLOAD_HANDLERS_BASE | DOWNLOAD_TIMEOUT | DOWNLOAD_MAXSIZE | DOWNLOAD_WARNSIZE
DOWNLOAD_FAIL_ON_DATALOSS | DUPEFILTER_CLASS | DUPEFILTER_DEBUG | EDITOR | EXTENSIONS
EXTENSIONS_BASE | FEED_TEMPDIR | FEED_STORAGE_GCS_ACL | FTP_PASSIVE_MODE | FTP_PASSWORD
FTP_USER | GCS_PROJECT_ID | ITEM_PIPELINES | ITEM_PIPELINES_BASE | LOG_ENABLED
LOG_FILE | LOG_FORMAT | LOG_DATEFORMAT | LOG_FORMATTER | LOG_LEVEL
LOG_STDOUT | LOG_SHORT_NAMES | LOGSTATS_INTERVAL | MEMDEBUG_ENABLED | MEMDEBUG_NOTIFY
MEMUSAGE_ENABLED | MEMUSAGE_LIMIT_MB | MEMUSAGE_CHECK_INTERVAL_SECONDS | MEMUSAGE_WARNING_MB | NEWSPIDER_MODULE
RANDOMIZE_DOWNLOAD_DELAY | REACTOR_THREADPOOL_MAXSIZE | REDIRECT_PRIORITY_ADJUST | RETRY_PRIORITY_ADJUST | ROBOTSTXT_OBEY
ROBOTSTXT_PARSER | ROBOTSTXT_USER_AGENT | SCHEDULER | SCHEDULER_DEBUG | SCHEDULER_DISK_QUEUE
SCHEDULER_MEMORY_QUEUE | SCHEDULER_PRIORITY_QUEUE | SCRAPER_SLOT_MAX_ACTIVE_SIZE | SPIDER_CONTRACTS | SPIDER_CONTRACTS_BASE
SPIDER_LOADER_CLASS | SPIDER_LOADER_WARN_ONLY | SPIDER_MIDDLEWARES | SPIDER_MIDDLEWARES_BASE | SPIDER_MODULES
STATS_CLASS | STATS_DUMP | STATSMAILER_RCPTS | TELNETCONSOLE_ENABLED | TEMPLATES_DIR
TWISTED_REACTOR | URLLENGTH_LIMIT | USER_AGENT

EXCEPTIONS

Built-in Exceptions reference

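  1. CloseSpider – raised from a spider callback to request that the spider be closed
  2. DontCloseSpider – raised in a spider_idle signal handler to prevent the spider from being closed
  3. DropItem – raised by an item pipeline stage to stop processing the item
  4. IgnoreRequest – raised by the scheduler or a downloader middleware to indicate that the request should be ignored
  5. NotConfigured – raised by an extension, item pipeline, downloader middleware or spider middleware to indicate that the component will remain disabled
  6. NotSupported – raised to indicate that a feature is not supported
  7. StopDownload – raised from a bytes_received or headers_received signal handler to indicate that no further bytes should be downloaded for the response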

A sample tutorial to try

1. Open the command prompt and navigate to the folder where you want to store the scraped data.

2. Create a project named “scrape” by typing the following in the conda shell:

scrapy startproject scrape

The above command creates a folder named scrape that contains another scrape folder (the project package) and a scrapy.cfg file.

3. Navigate into the scrape project folder.

4. Go inside the folder called spiders and create a file called “project.py”.

Type the following inside it:

import scrapy


# Every spider has to extend scrapy.Spider.
class PostsSpider(scrapy.Spider):
    # Unique name that identifies the spider; used by "scrapy crawl posts".
    name = "posts"
    start_urls = ["https://blog.scrapinghub.com"]

    # Called with the downloaded response of every requested page.
    def parse(self, response):
        # Loop over every post listed on the page.
        for post in response.css("div.post-item"):
            yield {
                # Extract the title (None if the selector matches nothing).
                "title": post.css(".post-header h2 a::text").get(),
                # Extract the date (second link text in the post header).
                "date": post.css(".post-header a::text")[1].get(),
                # Extract the author name (third link text in the post header).
                "author": post.css(".post-header a::text")[2].get(),
            }
        # Look for the link to the next page of posts.
        next_page = response.css("a.next-posts-link::attr(href)").get()
        # If there is a next page, request it and parse it with this same method.
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

5. Save the file.
6. In the command prompt, run the spider from the project folder with the following command:

scrapy crawl posts

7. Every listing page is crawled by following the next-page links, and the title, date and author of each post are extracted along the way.

This brings us to the end of the Scrapy Tutorial. We hope that you were able to gain a comprehensive understanding of the same. If you wish to learn more such skills, check out the pool of Free Online Courses offered by Great Learning Academy.
