{"id":30778,"date":"2021-04-26T16:38:00","date_gmt":"2021-04-26T11:08:00","guid":{"rendered":"https:\/\/www.mygreatlearning.com\/blog\/scrapy-tutorial\/"},"modified":"2022-10-20T16:51:52","modified_gmt":"2022-10-20T11:21:52","slug":"scrapy-tutorial","status":"publish","type":"post","link":"https:\/\/www.mygreatlearning.com\/blog\/scrapy-tutorial\/","title":{"rendered":"Scrapy Tutorial - An Introduction"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\" id=\"web-scraper\"><strong>Web Scraper <\/strong><\/h2>\n\n\n\n<p>A web scraper is a tool that is used to extract data from a website.<\/p>\n\n\n\n<p>It involves the following process:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Figure out the target website.<\/li><li>Get the URLs of the pages from which the data needs to be extracted.<\/li><li>Obtain the HTML\/CSS\/JS of those pages.<\/li><li>Find the locators, such as XPath expressions, CSS selectors or regexes, for the data that needs to be extracted.<\/li><li>Save the data in a structured format such as a JSON or CSV file.<\/li><\/ol>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"web-crawler\"><strong>Web Crawler <\/strong><\/h2>\n\n\n\n<p>A web crawler is used to collect the URLs of websites and their corresponding child pages.&nbsp; The crawler collects all the links associated with a website, then records (or copies) them and stores them in the search engine\u2019s servers as a search index.&nbsp; This helps the search engine find websites easily.&nbsp; The search engine then uses this index to rank the pages accordingly. The pages are displayed to the user based on the ranking given by the search engine.<\/p>\n\n\n\n<p>The web crawler can also be called a web spider, spider bot, crawler or web bot.<\/p>\n\n\n\n<p>Also Read: <a href=\"https:\/\/www.mygreatlearning.com\/blog\/web-scraping-tutorial\/\" target=\"_blank\" rel=\"noreferrer noopener\">Web Scraping Tutorial | What is Web Scraping? 
<\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"scrapy\"><strong>Scrapy <\/strong><\/h2>\n\n\n\n<p>Scrapy does the work of both a web crawler and a web scraper. Hence, Scrapy is quite handy for crawling a site, then extracting data and storing it in a structured format. Scrapy can also work with APIs to extract data.<\/p>\n\n\n\n<p>Scrapy provides:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Built-in support for selecting and extracting data from locators such as XPath or CSS selectors, as well as with regular expressions.<\/li><li>Scrapy shell, an interactive console that we can use to execute spider commands without running the entire code. It can be used to debug, write or simply check the Scrapy code before the final spider file execution.<\/li><li>Facility to store the data in structured formats such as:<ul><li>JSON<\/li><li>JSON Lines<\/li><li>CSV<\/li><li>XML<\/li><li>Pickle<\/li><li>Marshal<\/li><\/ul><\/li><li>Facility to store the extracted data in:<ul><li>Local filesystems<\/li><li>FTP<\/li><li>S3<\/li><li>Google Cloud Storage<\/li><li>Standard output<\/li><\/ul><\/li><\/ol>\n\n\n\n<ol class=\"wp-block-list\" start=\"5\"><li>Facility to use API or signals (functions that are called when an event occurs)<\/li><li>Facility to handle:<ul><li>HTTP features<\/li><li>User-agent spoofing<\/li><li>Robots.txt<\/li><li>Crawl depth restriction<\/li><\/ul><\/li><\/ol>\n\n\n\n<ol class=\"wp-block-list\" start=\"7\"><li>Telnet console \u2013 a Python console that runs inside Scrapy to introspect the running process.<\/li><li>And more<\/li><\/ol>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"scrapy-installation\"><strong>Scrapy Installation<\/strong><\/h2>\n\n\n\n<p>Scrapy can be installed by:&nbsp;<\/p>\n\n\n\n<p><strong>Using Anaconda \/ Miniconda.<\/strong><\/p>\n\n\n\n<p>Type the following command in the Conda shell:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nconda install -c 
conda-forge scrapy \n<\/pre><\/div>\n\n\n<p>Alternatively, you could do the following.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\npip install Scrapy\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\" id=\"scrapy-packages\"><strong>Scrapy Packages<\/strong><\/h2>\n\n\n\n<ol class=\"wp-block-list\"><li>lxml \u2013 XML and HTML parser<\/li><li>parsel \u2013 HTML\/XML extraction library that lies on top of lxml<\/li><li>w3lib \u2013 deals with webpages<\/li><li>twisted \u2013 asynchronous networking framework<\/li><li>cryptography and pyOpenSSL \u2013 for network-level security needs<\/li><\/ol>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"scrapy-file-structure\"><strong>Scrapy File Structure<\/strong><\/h2>\n\n\n\n<p>A Scrapy project will have two parts.<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Configuration file \u2013 scrapy.cfg, which resides in the <em>project root directory<\/em>. It holds the settings for the project. The .cfg file can be found in the following places:<\/li><\/ol>\n\n\n\n<ul class=\"wp-block-list\"><li>System wide \u2013 \/etc\/scrapy.cfg or c:\\scrapy\\scrapy.cfg<\/li><li>Global \u2013 ~\/.config\/scrapy.cfg ($XDG_CONFIG_HOME) and ~\/.scrapy.cfg ($HOME)<\/li><li>Scrapy project root \u2013 scrapy.cfg<\/li><\/ul>\n\n\n\n<p>Settings from these files are merged with the following precedence (highest first):<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Project-wide settings<\/li><li>User-defined values (global)<\/li><li>System-wide defaults<\/li><\/ul>\n\n\n\n<p>Environment variables through which Scrapy can be controlled are:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>SCRAPY_SETTINGS_MODULE&nbsp;<\/li><li>SCRAPY_PROJECT<\/li><li>SCRAPY_PYTHON_SHELL<\/li><\/ul>\n\n\n\n<ol class=\"wp-block-list\" start=\"2\"><li>A project folder \u2013 It contains the following files :<\/li><\/ol>\n\n\n\n<ul 
class=\"wp-block-list\"><li>__init__.py<\/li><li>items.py<\/li><li>middlewares.py<\/li><li>pipelines.py<\/li><li>settings.py<\/li><li>spiders \u2013 folder.&nbsp; It is the place where the spiders that we create get stored.&nbsp;<\/li><\/ul>\n\n\n\n<p>A project\u2019s configuration file can be shared between multiple projects, each having its own settings module.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"scrapy-command-line-tool\"><strong>SCRAPY COMMAND LINE TOOL<\/strong><\/h2>\n\n\n\n<p>The Scrapy command line provides many commands.&nbsp; Those commands can be classified into two groups.<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Global commands<\/li><li>Project-only commands<\/li><\/ol>\n\n\n\n<p>To see all the available commands, type the following in the shell:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy -h\n<\/pre><\/div>\n\n\n<p>The syntax to see the help for a particular command is:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy &lt;command&gt; &#x5B;options] &#x5B;args]\n<\/pre><\/div>\n\n\n<h2 class=\"wp-block-heading\" id=\"global-commands\"><strong>Global Commands<\/strong><\/h2>\n\n\n\n<p>These are the commands that work without an active Scrapy project.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>startproject<\/li><\/ul>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy startproject &lt;project_name&gt; &#x5B;project_dir]\n<\/pre><\/div>\n\n\n<p>Usage: It is used to create a project with the specified project name under the specified project directory. If the directory is not mentioned, then the project directory will be the same as the project name. 
<\/p>\n\n\n\n<p>Example:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy startproject tutorial\n<\/pre><\/div>\n\n\n<p>This will create a directory named \u201ctutorial\u201d containing the project module (also named \u201ctutorial\u201d) and the configuration file.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>genspider<\/li><\/ul>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy genspider &#x5B;-t template] &lt;name&gt; &lt;domain&gt;\n<\/pre><\/div>\n\n\n<p>Usage: This is used to create a new spider in the current folder.&nbsp; It is always best practice to create the spider after traversing inside the project\u2019s spiders folder. The spider\u2019s name is given by the &lt;name&gt; parameter, while &lt;domain&gt; generates \u201cstart_urls\u201d and \u201callowed_domains\u201d.&nbsp;<\/p>\n\n\n\n<p>Example: <\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy genspider tuts https:\/\/www.imdb.com\/chart\/top\/\n<\/pre><\/div>\n\n\n<p>This will create a spider file named tuts.py, with \u201cimdb\u201d as the allowed domain. 
Use this command after traversing into the project\u2019s spiders folder.&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>settings<\/li><\/ul>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy settings &#x5B;options]\n<\/pre><\/div>\n\n\n<p>Usage: It shows the default Scrapy settings outside a project and the project\u2019s settings inside a project.<\/p>\n\n\n\n<p>The following options can be used with settings:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>--help \u2013 show the help message and exit<\/li><li>--get=SETTING \u2013 print raw setting value<\/li><li>--getbool=SETTING \u2013 print setting value, interpreted as a Boolean<\/li><li>--getint=SETTING \u2013 print setting value, interpreted as an integer<\/li><li>--getfloat=SETTING \u2013 print setting value, interpreted as a float<\/li><li>--getlist=SETTING \u2013 print setting value, interpreted as a list<\/li><li>--logfile=FILE \u2013 log file; if omitted, stderr will be used<\/li><li>--loglevel=LEVEL \u2013 log level<\/li><li>--nolog \u2013 disable logging completely<\/li><li>--profile=FILE \u2013 write python cProfile stats to file<\/li><li>--pidfile=FILE \u2013 write the process ID to file<\/li><li>--set NAME=VALUE \u2013 set\/override a setting<\/li><li>--pdb \u2013 enable pdb on failure<\/li><\/ul>\n\n\n\n<p>Example:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy crawl tuts -s LOG_FILE=scrapy.log\n<\/pre><\/div>\n\n\n<ul class=\"wp-block-list\"><li>runspider<\/li><\/ul>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy runspider &lt;spider.py&gt;\n<\/pre><\/div>\n\n\n<p>Usage: To run a spider contained in a Python file, without having to create a project<\/p>\n\n\n\n<p>Example: <\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy runspider tuts.py\n<\/pre><\/div>\n\n\n<ul class=\"wp-block-list\"><li>shell<\/li><\/ul>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy shell &#x5B;url]\n<\/pre><\/div>\n\n\n<p>Usage: The shell will start for the given URL.<\/p>\n\n\n\n<p>Options: <\/p>\n\n\n\n<p>--spider=SPIDER&nbsp; &nbsp; &nbsp; (The mentioned spider will be used and auto-detection is bypassed)<\/p>\n\n\n\n<p>-c code&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (Evaluates the code, prints the result and exits)<\/p>\n\n\n\n<p>--no-redirect&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (Does not follow HTTP 3xx redirects)<\/p>\n\n\n\n<p>Example: <\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy shell https:\/\/www.imdb.com\/chart\/top\/\n<\/pre><\/div>\n\n\n<p>Scrapy will start the shell on the <a rel=\"noreferrer noopener\" href=\"https:\/\/www.imdb.com\/chart\/top\/\" target=\"_blank\">https:\/\/www.imdb.com\/chart\/top\/<\/a>&nbsp;page.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>fetch<\/li><\/ul>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy fetch 
&lt;url&gt;\n<\/pre><\/div>\n\n\n<p>Usage:<\/p>\n\n\n\n<p>Scrapy\u2019s downloader will download the page and print the output.<\/p>\n\n\n\n<p>Options:<\/p>\n\n\n\n<p>--spider=SPIDER&nbsp; &nbsp; &nbsp; (The mentioned spider will be used and auto-detection is bypassed)<\/p>\n\n\n\n<p>--headers&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (The response\u2019s HTTP headers will be shown instead of the body)<\/p>\n\n\n\n<p>--no-redirect&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (Does not follow HTTP 3xx redirects)<\/p>\n\n\n\n<p>Example:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy fetch https:\/\/www.imdb.com\/chart\/top\/\n<\/pre><\/div>\n\n\n<p>Scrapy will download the <a href=\"https:\/\/www.imdb.com\/chart\/top\/\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/www.imdb.com\/chart\/top\/<\/a>&nbsp;page.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>view<\/li><\/ul>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy view &lt;url&gt;\n<\/pre><\/div>\n\n\n<p>Usage:<\/p>\n\n\n\n<p>Scrapy will open the mentioned URL in the default browser.&nbsp; This helps to view the page from the spider\u2019s perspective.<\/p>\n\n\n\n<p>Options:<\/p>\n\n\n\n<p>--spider=SPIDER&nbsp; &nbsp; &nbsp; (The mentioned spider will be used, and auto-detection is bypassed)<\/p>\n\n\n\n<p>--no-redirect&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (Does not follow HTTP 3xx redirects)<\/p>\n\n\n\n<p>Example:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy view https:\/\/www.imdb.com\/chart\/top\/\n<\/pre><\/div>\n\n\n<p>Scrapy will open <a rel=\"noreferrer noopener\" href=\"https:\/\/www.imdb.com\/chart\/top\/\" 
target=\"_blank\">https:\/\/www.imdb.com\/chart\/top\/<\/a>&nbsp;page in the default browser.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>version<\/li><\/ul>\n\n\n\n<p>Syntax: scrapy version -v<\/p>\n\n\n\n<p>Usage:<\/p>\n\n\n\n<p>Prints the version of Scrapy.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"project-only-commands\"><strong>Project-only Commands<\/strong><\/h2>\n\n\n\n<p>These are the commands that work only inside an active Scrapy project.<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>crawl<\/li><\/ol>\n\n\n\n<p>Syntax:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy crawl &lt;spider&gt;\n<\/pre><\/div>\n\n\n<p>Usage:<\/p>\n\n\n\n<p>This will start the crawling.<\/p>\n\n\n\n<p>Example:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy crawl tuts\n<\/pre><\/div>\n\n\n<p>Scrapy will crawl the domains mentioned in the spider.<\/p>\n\n\n\n<ol class=\"wp-block-list\" start=\"2\"><li>check<\/li><\/ol>\n\n\n\n<p>Syntax:&nbsp;<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy check &#x5B;-l] &lt;spider&gt;\n<\/pre><\/div>\n\n\n<p>Usage:<\/p>\n\n\n\n<p>Runs contract checks on what\u2019s returned by the spider<\/p>\n\n\n\n<p>Example:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy check tuts\n<\/pre><\/div>\n\n\n<p>Scrapy will check the crawled output of the spider and return the result as \u201cOK\u201d.<\/p>\n\n\n\n<ol class=\"wp-block-list\" start=\"3\"><li>list<\/li><\/ol>\n\n\n\n<p>Syntax:&nbsp;<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy 
list\n<\/pre><\/div>\n\n\n<p>Usage:<\/p>\n\n\n\n<p>The names of all the spiders present in the project are returned.<\/p>\n\n\n\n<p>Example:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy list\n<\/pre><\/div>\n\n\n<p>Scrapy will return all the spiders that are there in the project.<\/p>\n\n\n\n<ol class=\"wp-block-list\" start=\"4\"><li>edit<\/li><\/ol>\n\n\n\n<p>Syntax:&nbsp;<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy edit &lt;spider&gt;\n<\/pre><\/div>\n\n\n<p>Usage:<\/p>\n\n\n\n<p>This command is used to edit the spider.&nbsp; The editor mentioned in the EDITOR environment variable will open up. If it is not set, IDLE will open up on Windows and vi will open up on UNIX. The developer is not restricted to this editor and can use any editor.<\/p>\n\n\n\n<p>Example:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy edit tuts\n<\/pre><\/div>\n\n\n<p>Scrapy will open the tuts spider in the editor.<\/p>\n\n\n\n<ol class=\"wp-block-list\" start=\"5\"><li>parse<\/li><\/ol>\n\n\n\n<p>Syntax:&nbsp;<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy parse &lt;url&gt; &#x5B;options]\n<\/pre><\/div>\n\n\n<p>Usage:<\/p>\n\n\n\n<p>Scrapy will parse the URL mentioned with the spider. 
The method mentioned in --callback will be used; if not, parse() will be used.<\/p>\n\n\n\n<p>Options:<\/p>\n\n\n\n<p>--spider=SPIDER&nbsp; &nbsp; &nbsp; (The mentioned spider will be used, and auto-detection is bypassed)<\/p>\n\n\n\n<p>-a NAME=VALUE&nbsp; &nbsp; &nbsp; (To set a spider argument)<\/p>\n\n\n\n<p>--callback&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (Spider method to use for parsing)<\/p>\n\n\n\n<p>--cb_kwargs&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (Additional keyword arguments for the callback)<\/p>\n\n\n\n<p>--meta&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (Request meta for the callback request)<\/p>\n\n\n\n<p>--pipelines&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (To process items via pipelines)<\/p>\n\n\n\n<p>--rules&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (Use CrawlSpider rules to discover the callback)<\/p>\n\n\n\n<p>--noitems&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (Hides scraped items)<\/p>\n\n\n\n<p>--nocolour&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (Removes colours)<\/p>\n\n\n\n<p>--nolinks&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (Hides links)<\/p>\n\n\n\n<p>--depth&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (The depth to which the requests need to be followed recursively)<\/p>\n\n\n\n<p>--verbose&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (Displays information for each depth level)<\/p>\n\n\n\n<p>--output&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (Stores the output in a file)<\/p>\n\n\n\n<p>Example:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nscrapy parse https:\/\/www.imdb.com\/chart\/top\/\n<\/pre><\/div>\n\n\n<p>Scrapy will parse the <a rel=\"noreferrer noopener\" 
href=\"https:\/\/www.imdb.com\/chart\/top\/\" target=\"_blank\">https:\/\/www.imdb.com\/chart\/top\/<\/a>&nbsp;page.<\/p>\n\n\n\n<ol class=\"wp-block-list\" start=\"6\"><li>bench<\/li><\/ol>\n\n\n\n<p>Syntax: scrapy bench<\/p>\n\n\n\n<p>Usage:<\/p>\n\n\n\n<p>To run a benchmark test.<\/p>\n\n\n\n<p>Custom commands can be added through the COMMANDS_MODULE setting:&nbsp;<\/p>\n\n\n\n<p>COMMANDS_MODULE = \u2018command_name\u2019<\/p>\n\n\n\n<p>The scrapy.commands section can also be used in setup.py entry points to add commands from an external library.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"spiders\"><strong>SPIDERS<\/strong><\/h2>\n\n\n\n<p>The spiders folder contains the classes that are needed for scraping data and for crawling the site. They can be customised as per the requirement.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"spider-scraping-cycle\">SPIDER SCRAPING CYCLE<\/h3>\n\n\n\n<p>There are different types of Spiders available for various purposes.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"scrapy-spider\">Scrapy.Spider<\/h2>\n\n\n\n<p><strong>Class:&nbsp; scrapy.spiders.Spider<\/strong><\/p>\n\n\n\n<p>It is the simplest spider.&nbsp; It has the default method&nbsp; <strong>start_requests()<\/strong>, which sends requests to the URLs in <strong>start_urls<\/strong> and calls <strong>parse<\/strong> for each resulting response.<\/p>\n\n\n\n<p><strong>name -&nbsp; <\/strong>Name of the spider is given in this.&nbsp; It should be unique within a project, though more than one instance of a spider can be instantiated.&nbsp; It\u2019s the best practice to keep the spider\u2019s name the same as the name of the website that\u2019s crawled.<\/p>\n\n\n\n<p><strong>allowed_domains -&nbsp; <\/strong>Only the domains that are mentioned in this list are allowed to be crawled.&nbsp; To crawl domains that are not mentioned in the list, \u201cOffsiteMiddleware\u201d should be disabled.<\/p>\n\n\n\n<p><strong>start_urls \u2013 <\/strong>A list of the URLs that need to be crawled is mentioned over here<\/p>\n\n\n\n<p><strong>custom_settings&nbsp; - 
<\/strong>Settings that need to be overridden for this spider are given here.&nbsp; It should be defined as a class attribute, as the settings are updated before crawling starts.<\/p>\n\n\n\n<p><strong>crawler \u2013 <\/strong>The from_crawler()&nbsp; method sets this attribute.&nbsp; It links the crawler object with the spider object.<\/p>\n\n\n\n<p><strong>settings \u2013 <\/strong>The settings for running the spider\/project are mentioned over here<\/p>\n\n\n\n<p><strong>logger \u2013 <\/strong>A logger with the same name as the spider\u2019s name; it will have all the logs of the spider.<\/p>\n\n\n\n<p><strong>from_crawler(crawler,*args,**kwargs) \u2013 <\/strong>Sets the crawler and the settings attributes. It creates spiders.<\/p>\n\n\n\n<p>A. crawler&nbsp; - the object that binds the spider and the crawler<\/p>\n\n\n\n<p>B. args -&nbsp; arguments that are passed to __init__()<\/p>\n\n\n\n<p>C. kwargs \u2013 keyword arguments that are passed to&nbsp; __init__()<\/p>\n\n\n\n<p><strong>start_requests() \u2013 <\/strong>Used to start scraping the website.&nbsp; It\u2019s called only once, and it generates a Request() for each URL in start_urls.<\/p>\n\n\n\n<p><strong>parse(response) \u2013 <\/strong>The default callback method; it processes the response and returns the scraped data.<\/p>\n\n\n\n<p><strong>log(message,level,component) \u2013 <\/strong>Sends the log through the \u201clogger\u201d<\/p>\n\n\n\n<p><strong>closed(reason) \u2013 <\/strong>Called when the spider closes; it is a shortcut for signal.connect() on the spider_closed signal.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"spider-arguments\">Spider Arguments<\/h2>\n\n\n\n<p>Arguments can be given to spiders.&nbsp;The arguments are passed through the crawl command using the&nbsp; -a option.<\/p>\n\n\n\n<p>The __init__() method will take these arguments and apply them as attributes.<\/p>\n\n\n\n<p>Example:<\/p>\n\n\n\n<p>scrapy crawl tuts -a category=electronics<\/p>\n\n\n\n<p>__init__() should have category as an argument for this command to work&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" 
id=\"generic-spiders\">Generic Spiders<\/h2>\n\n\n\n<p>These spiders can be used for rule-based crawling, crawling Sitemaps, or parsing XML\/CSV feeds.<\/p>\n\n\n\n<p><strong>CrawlSpider<\/strong><\/p>\n\n\n\n<p>Class \u2013 scrapy.spiders.CrawlSpider<\/p>\n\n\n\n<p>This is the spider that crawls based on rules that can be custom written.<\/p>\n\n\n\n<p><strong>Attributes:&nbsp;<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>rules&nbsp; - A list of <strong>Rule<\/strong> objects that define the crawling behaviour.<\/li><li>parse_start_url(response, **kwargs) - &nbsp; This is called whenever a response is received for the start URL requests. It must return an item object, a Request, or an iterable containing them.<\/li><\/ol>\n\n\n\n<p><strong>Crawling Rules:<\/strong><\/p>\n\n\n\n<p><strong>class scrapy.spiders.Rule(link_extractor=None,&nbsp;callback=None,&nbsp;cb_kwargs=None,&nbsp;follow=None,&nbsp;<\/strong><\/p>\n\n\n\n<p><strong>process_links=None,&nbsp;process_request=None,&nbsp;errback=None)<\/strong><\/p>\n\n\n\n<p>link_extractor \u2013 How each link is to be extracted is defined here. A Request object is then created for each generated link<\/p>\n\n\n\n<p>callback \u2013 This is called for each link extracted. It receives a response as its first argument and must return an iterable of items and\/or Requests.<\/p>\n\n\n\n<p>cb_kwargs \u2013 keyword arguments for the callback function<\/p>\n\n\n\n<p>follow \u2013 A Boolean that specifies whether links should be followed from each response. If callback is None, follow defaults to True; otherwise, it defaults to False.<\/p>\n\n\n\n<p>process_links \u2013 Called for each list of links extracted from each response.<\/p>\n\n\n\n<p>process_request \u2013 called for each request generated by the rule.<\/p>\n\n\n\n<p>errback \u2013 called if an exception is raised while processing a request generated by the rule.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"xmlfeedspider\"><strong>XMLFeedSpider<\/strong><\/h3>\n\n\n\n<p>Class \u2013 scrapy.spiders.XMLFeedSpider<\/p>\n\n\n\n<p>It is used to parse XML feeds. 
It parses the feed by iterating over its nodes with a particular node name, using the iternodes, xml or html iterator; iternodes is the default for performance reasons.<\/p>\n\n\n\n<p>The following class attributes must be defined to set the iterator and the tag name:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>iterator&nbsp; &nbsp; -&nbsp; Tells which iterator is to be used, i.e. iternodes, html or xml. The default is iternodes.<\/li><li>itertag&nbsp; &nbsp; &nbsp; - &nbsp; The name of the node (tag) that needs to be iterated over.<\/li><li>namespaces \u2013 A list of (prefix, uri) tuples for the namespaces mentioned in the document; they will be processed automatically by this spider.<\/li><\/ol>\n\n\n\n<p>The following overridable methods are available as well :<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>adapt_response(response) \u2013 It can change the response body before parsing. It receives a response and returns a response.<\/li><li>parse_node(response,selector) \u2013 This must be overridden for the spider to work; it is called for each node matching itertag. It should return an item object, a Request, or an iterable containing them.<\/li><li>process_results(response, results) \u2013 Does last-minute processing if required.<\/li><\/ol>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"csvfeedspider\"><strong>CSVFeedSpider<\/strong><\/h3>\n\n\n\n<p>Class \u2013 scrapy.spiders.CSVFeedSpider<\/p>\n\n\n\n<p>This spider iterates over CSV rows. parse_row() will be called for each row.&nbsp;<\/p>\n\n\n\n<p>delimiter:&nbsp; the separator character for the fields. Default is \u201c,\u201d<\/p>\n\n\n\n<p>quotechar: It defines the enclosure character. Default is \u2018 \u201c \u2018.<\/p>\n\n\n\n<p>headers: Column names in the CSV file.<\/p>\n\n\n\n<p>parse_row(response,row) : Called for each row; adapt_response and process_results can also be overridden for pre- and post-processing. 
It receives a dict with a key for each header of the CSV file.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"sitemapspider\"><strong>SitemapSpider&nbsp;<\/strong><\/h3>\n\n\n\n<p>Class \u2013 scrapy.spiders.SitemapSpider<\/p>\n\n\n\n<p>It is used for crawling a site.&nbsp; It discovers sitemap URLs from robots.txt<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>sitemap_urls \u2013 This will contain the list of URLs.&nbsp; These URLs usually point to the sitemap or the robots.txt that needs to be crawled.<\/li><li>sitemap_rules -&nbsp; &nbsp; Its value is defined by a list of (regex, callback) tuples.&nbsp; The callback is used for the URLs that match the regex.<\/li><li>sitemap_follow \u2013 It contains regexes of the sitemaps that should be followed.<\/li><li>sitemap_alternate_links \u2013 Alternate links get specified here. This is disabled by default.<\/li><li>sitemap_filter(entries)&nbsp; -&nbsp; Can be used when there is a need to filter sitemap entries or override their attributes.<\/li><\/ol>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"selectors\"><strong>Selectors<\/strong><\/h2>\n\n\n\n<p>Scrapy uses CSS or XPath to select HTML elements.&nbsp;<\/p>\n\n\n\n<p>Querying can be done using response.css() or response.xpath().<\/p>\n\n\n\n<p>Example:<\/p>\n\n\n\n<p>response.css(\u201cdiv::text\u201d).get()<\/p>\n\n\n\n<p>Selector() can also be used directly if needed.<\/p>\n\n\n\n<p>.get()&nbsp; or .getall() is used along with the response to extract the data.<\/p>\n\n\n\n<p>.get()&nbsp; - will give a single result. 
None if nothing gets matched.<\/p>\n\n\n\n<p>.getall()&nbsp; - will give a list of matches.<\/p>\n\n\n\n<p>CSS pseudo-elements can be used to select text or attribute nodes.<\/p>\n\n\n\n<p>.get()&nbsp; has an alias&nbsp; .extract_first().<\/p>\n\n\n\n<p>.get() returns None if no match is found.&nbsp; A default value can be given to replace None with some other value with the help of .get(default=\u2019value\u2019)<\/p>\n\n\n\n<p>.attrib[] can also be used to query via the attributes of a tag for CSS selectors.<\/p>\n\n\n\n<p><strong>Example:<\/strong><\/p>\n\n\n\n<p>response.css(\u2018a\u2019).attrib[\u2018href\u2019]<\/p>\n\n\n\n<p><strong>Non-standard pseudo-elements that are essential for web scraping are:<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>::text&nbsp; - selects the text nodes<\/li><li>::attr(name) \u2013 selects attribute values.<\/li><\/ol>\n\n\n\n<p>Adding a&nbsp; *&nbsp; in front of&nbsp; ::text will help to select the text of all the descendant elements of the node.<\/p>\n\n\n\n<p>*::text<\/p>\n\n\n\n<p>foo::text&nbsp; can be used to check for an empty result in case the element is present but does not have any text value.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"nesting-selectors\"><strong>Nesting Selectors<\/strong>&nbsp;&nbsp;<\/h3>\n\n\n\n<p>Selection methods return selectors of the same type, so selection can be done on them again; this is the nesting of selectors.<\/p>\n\n\n\n<p>Example:<\/p>\n\n\n\n<p>val = response.css(\u201cdiv::text\u201d)<\/p>\n\n\n\n<p>val.getall()<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"selecting-element-attributes\"><strong>Selecting element attributes<\/strong><\/h3>\n\n\n\n<p>The attributes of an element can be obtained using XPath or CSS selectors.<\/p>\n\n\n\n<p>XPath \u2013 The advantage with XPath is that&nbsp; @attributes can be used as a filter, and it\u2019s a standard feature as well.<\/p>\n\n\n\n<p>Example : response.xpath(\u201c\/\/a\/@href\u201d).get()<\/p>\n\n\n\n<p>CSS Selector&nbsp; :&nbsp; &nbsp; ::attr(\u2026)&nbsp; can be 
used to get attribute values as well.&nbsp;&nbsp;<\/p>\n\n\n\n<p>Example :&nbsp; response.css(\u2018img::attr(src)\u2019).get()<\/p>\n\n\n\n<p>Or the&nbsp; .attrib property can also be used<\/p>\n\n\n\n<p>Example : &nbsp; response.css(\u2018img\u2019).attrib[\u2018src\u2019]<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"using-selectors-with-regular-expressions\"><strong>Using Selectors with regular expressions<\/strong><\/h3>\n\n\n\n<p>.re() can be used to extract data along with XPath or with CSS.<\/p>\n\n\n\n<p>Example : response.xpath(\u2018\/\/a[contains(@href,\u201dimage\u201d)]\/text()\u2019).re(r\u2019Name:\\s*(.*)\u2019)<\/p>\n\n\n\n<p>.re_first()&nbsp; can also be used to extract just the first match.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"some-equivalents\"><strong>Some equivalents<\/strong><\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>Selection<\/strong><\/td><td><strong>Equivalent Value Used These Days<\/strong><\/td><\/tr><tr><td>SelectorList.extract_first()<\/td><td>SelectorList.get()<\/td><\/tr><tr><td>SelectorList.extract()<\/td><td>SelectorList.getall()<\/td><\/tr><tr><td>Selector.extract()<\/td><td>Selector.get()<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Selector.getall()&nbsp; - will return a list.<\/p>\n\n\n\n<p>.get()&nbsp; returns a single output<\/p>\n\n\n\n<p>.getall() \u2013 returns a list<\/p>\n\n\n\n<p>.extract() will return either a single output or a list as the output. 
To get a single result, either extract() or extract_first() can be called.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"working-with-relative-xpaths\"><strong>Working with relative XPaths<\/strong><\/h3>\n\n\n\n<p>Absolute XPath \u2013 an absolute XPath gets created whenever a nested XPath starts with \u2018\/\u2019.<\/p>\n\n\n\n<p>The proper way to make it relative is to use \u201c.\u201d in front of \u2018\/\u2019.<\/p>\n\n\n\n<p><strong>Example:<\/strong><\/p>\n\n\n\n<p>divs = response.xpath(\u201c\/\/div\u201d)<\/p>\n\n\n\n<p>for p in divs.xpath(\u201c.\/\/p\u201d):<\/p>\n\n\n\n<p>print(p.get())<\/p>\n\n\n\n<p>or<\/p>\n\n\n\n<p>for p in divs.xpath(\u201cp\u201d):<\/p>\n\n\n\n<p>print(p.get())<\/p>\n\n\n\n<p>More details on XPath can be found at <a href=\"https:\/\/www.w3.org\/TR\/xpath\/all\/#location-paths\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/www.w3.org\/TR\/xpath\/all\/#location-paths<\/a><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"querying-the-elements-by-class-use-css\"><strong>Querying the elements by Class \u2013 Use CSS<\/strong><\/h3>\n\n\n\n<p>If done with XPath alone, the resulting query ends up being complicated.<\/p>\n\n\n\n<p>If \u2018@class = \u201csomeclass\u201d\u2019 is used, the output might have missing elements.<\/p>\n\n\n\n<p>If \u2018contains(@class, \u201csomeclass\u201d)\u2019 is used, then more elements than needed might come up in the result.<\/p>\n\n\n\n<p>As Scrapy allows chaining of selectors, a CSS selector can be used to select the class element and then XPath can be chained to it to select the required elements instead.<\/p>\n\n\n\n<p><strong>Example:<\/strong><\/p>\n\n\n\n<p>response.css(\u201c.shout\u201d).xpath(\u2018.\/div\u2019).getall()<\/p>\n\n\n\n<p>\u201c.\u201d should be appended before \u2018\/\u2019 in the XPath that follows the CSS selector.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" 
id=\"difference-between-node1-and-node1\"><strong>Difference between \/\/node[1] and (\/\/node)[1]<\/strong><\/h3>\n\n\n\n<p>(\/\/node)[1]&nbsp; - selects all the nodes first then the first element from that list will get selected.<\/p>\n\n\n\n<p>\/\/node[1]&nbsp; - First node of all the parent node will get selected.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"text-nodes-under-condition\"><strong>Text nodes under condition<\/strong><\/h3>\n\n\n\n<p>.\/\/text()&nbsp; when passed to contains() or starts-with() will result in a collection of text elements. The resulting node set will not give any result even if it gets converted to a string . And hence it is better to use \u201c.\u201d&nbsp; alone instead of \u201c.\/\/text()\u201d.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"variables-in-xpath-expressions\"><strong>Variables in Xpath expressions<\/strong><\/h3>\n\n\n\n<p>$somevariable is used as a reference variables. It\u2019s value will be passed to the query after substitution.<\/p>\n\n\n\n<p>Example:<\/p>\n\n\n\n<p>response.xpath('\/\/div[count(a)=$cnt]\/@id', cnt=5).get()<\/p>\n\n\n\n<p>More examples on <a href=\"https:\/\/parsel.readthedocs.io\/en\/latest\/usage.html#variables-in-xpath-expressions\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/parsel.readthedocs.io\/en\/latest\/usage.html#variables-in-xpath-expressions<\/a><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"removing-namespaces\"><strong>Removing namespaces<\/strong><\/h3>\n\n\n\n<p>selector.namespaces()&nbsp; method can be used so that all the namespaces of that html file can be used.&nbsp;<\/p>\n\n\n\n<p><strong>Example:<\/strong><\/p>\n\n\n\n<p>response.selector.namespaces()<\/p>\n\n\n\n<p>Namespaces are not removed by default by scrapy because namespaces of the page are needed at times and not need at times. 
So this method is called only when needed.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"using-exslt-extensions\"><strong>Using EXSLT extensions<\/strong><\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>Prefix<\/strong><\/td><td><strong>Namespace<\/strong><\/td><td><strong>Usage<\/strong><\/td><\/tr><tr><td>re<\/td><td>http:\/\/exslt.org\/regular-expressions<\/td><td>Regular expressions<\/td><\/tr><tr><td>set<\/td><td>http:\/\/exslt.org\/sets<\/td><td>Set manipulation<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"regular-expressions\"><strong>Regular Expressions<\/strong><\/h3>\n\n\n\n<p>The EXSLT test() function is used when starts-with() and contains() are not sufficient<\/p>\n\n\n\n<p><strong>Set operations<\/strong><\/p>\n\n\n\n<p>These are used when there is a need to exclude data before extraction.<\/p>\n\n\n\n<p>Example<\/p>\n\n\n\n<p>scope.xpath(\u2018\u2019\u2019set:difference(.\/descendant::*\/@itemprop, .\/\/*[@itemscope]\/*\/@itemprop)\u2019\u2019\u2019)<\/p>\n\n\n\n<p><strong>Other Xpath extensions<\/strong><\/p>\n\n\n\n<p>has-class returns False for nodes that do not match the given HTML classes and True for nodes that do.<\/p>\n\n\n\n<p>response.xpath('\/\/p[has-class(\"foo\")]')<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"built-in-selectors-reference\"><strong>Built-in Selectors reference<\/strong><\/h3>\n\n\n\n<ol class=\"wp-block-list\"><li><strong>Selector objects<\/strong><\/li><\/ol>\n\n\n\n<p>Class \u2013 scrapy.selector.Selector(*args,**kwargs)<\/p>\n\n\n\n<p><strong>response<\/strong> \u2013 It is an HtmlResponse or an XmlResponse.<\/p>\n\n\n\n<p><strong>text<\/strong> \u2013 It is a Unicode string or UTF-8 encoded text<\/p>\n\n\n\n<p><strong>type<\/strong> \u2013 type can be \u201chtml\u201d for HtmlResponse, \u201dxml\u201d for XmlResponse, or None&nbsp;<\/p>\n\n\n\n<p><strong>xpath(query,namespaces=None,**kwargs)<\/strong> \u2013 SelectorList will be returned with flattened elements, where query is 
the XPath query. Namespaces are optional and are nothing but dictionaries that are registered with register_namespace(prefix,uri)&nbsp;<\/p>\n\n\n\n<p><strong>css(query)<\/strong> \u2013 A SelectorList is returned after applying the CSS selector given as the query argument.&nbsp;<\/p>\n\n\n\n<p><strong>get()<\/strong> \u2013 Matched nodes will be returned.<\/p>\n\n\n\n<p><strong>attrib<\/strong> \u2013 The element\u2019s attributes will be returned.<\/p>\n\n\n\n<p><strong>re(regex, replace_entities=True)<\/strong> \u2013 Returns a list of Unicode strings after applying the regex. regex contains the regex query, and character entity references are replaced when replace_entities is True.&nbsp;<\/p>\n\n\n\n<p><strong>re_first(regex, default=None, replace_entities=True)<\/strong> \u2013 The default value will be returned if there is no match; the first Unicode string will be returned if there is a match<\/p>\n\n\n\n<p><strong>register_namespace(prefix,uri) <\/strong>\u2013 To register the namespaces<\/p>\n\n\n\n<p><strong>remove_namespaces() <\/strong>\u2013 Removes all namespaces<\/p>\n\n\n\n<p><strong>__bool__() <\/strong>\u2013 Returns True if there is any real content selected<\/p>\n\n\n\n<p><strong>getall() \u2013 <\/strong>Returns a list of matched content<strong>&nbsp;<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\" start=\"2\"><li><strong>SelectorList objects \u2013<\/strong><\/li><\/ol>\n\n\n\n<p><strong>&nbsp;xpath(query,namespaces=None,**kwargs)<\/strong> \u2013 SelectorList will be returned with flattened elements, where query is the XPath query. 
Namespaces are optional and are nothing but dictionaries that are registered with register_namespace(prefix,uri)&nbsp;<\/p>\n\n\n\n<p><strong>css(query)<\/strong> \u2013 A SelectorList is returned after applying the CSS selector given as the query argument.&nbsp;<\/p>\n\n\n\n<p><strong>get()<\/strong> \u2013 returns the result for the first element in the list<\/p>\n\n\n\n<p><strong>getall() \u2013 <\/strong>get() is called for each element in the list.<strong>&nbsp;<\/strong><\/p>\n\n\n\n<p><strong>re(regex, replace_entities=True)<\/strong> \u2013 re() is called for each element in the list and the results are flattened into one list<\/p>\n\n\n\n<p><strong>re_first(regex, default=None, replace_entities=True)<\/strong> \u2013 returns the first matching result, or the default value if there is none<\/p>\n\n\n\n<p><strong>attrib<\/strong> \u2013 the first element\u2019s attributes are returned.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"items\"><strong>ITEMS<\/strong><\/h2>\n\n\n\n<p>Scraped data is usually returned as a dict of key-value pairs.&nbsp; There are different types of items.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"item-types\">Item Types<\/h2>\n\n\n\n<ol class=\"wp-block-list\"><li><strong>Dictionaries \u2013 <\/strong>dict is convenient and familiar.<\/li><li><strong>Item Objects&nbsp;<\/strong><\/li><\/ol>\n\n\n\n<p>Class \u2013 scrapy.item.Item([arg])<\/p>\n\n\n\n<p>Item behaves the same way as the standard dict API, with the field names declared in the Item class. The differences are:&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>KeyError \u2013 Raised when undefined field names are called.<\/li><li>Item exporters \u2013 Export all declared fields by default<\/li><\/ul>\n\n\n\n<p>Item allows metadata definition. 
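<\/p>\n\n\n\n<p>The KeyError behavior described above can be mimicked with a plain-Python sketch (a stand-in for scrapy.Item, not Scrapy\u2019s actual class):<\/p>

```python
# Plain-Python stand-in for scrapy.Item's field checking (illustration only).
class ToyItem(dict):
    fields = ("name", "price")  # declared field names

    def __setitem__(self, key, value):
        # Like scrapy.Item, assigning an undeclared field raises KeyError.
        if key not in self.fields:
            raise KeyError("ToyItem does not support field: " + key)
        super().__setitem__(key, value)


item = ToyItem()
item["name"] = "Desktop PC"   # declared field: works like a normal dict
try:
    item["lala"] = "test"     # undeclared field
except KeyError as err:
    print(err)
```

<p>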
trackref&nbsp; can track item objects in order to find memory leaks.<\/p>\n\n\n\n<p>Additional Item API members that can be used are copy(), deepcopy() and fields<\/p>\n\n\n\n<ol class=\"wp-block-list\" start=\"3\"><li><strong>Dataclass objects <\/strong>&nbsp;<\/li><\/ol>\n\n\n\n<p>Item classes with field names can be defined with dataclass().&nbsp; A default value and type can be defined for each field.&nbsp; dataclasses.field() can be used to define custom field metadata.<\/p>\n\n\n\n<ol class=\"wp-block-list\" start=\"4\"><li><strong>attr.s objects<\/strong><\/li><\/ol>\n\n\n\n<p>Item classes with field names can be defined with attr.s().&nbsp; The type of each field and custom field metadata can also be defined.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"working-with-item-objects\"><strong>Working with Item Objects <\/strong><\/h2>\n\n\n\n<p><strong>Declaring Item subclasses<\/strong><\/p>\n\n\n\n<p>A simple class definition and Field objects can be used to declare Item subclasses.<\/p>\n\n\n\n<p>Example:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nimport scrapy\n\nclass Product(scrapy.Item):\n    name = scrapy.Field()\n    price = scrapy.Field()\n\n<\/pre><\/div>\n\n\n<p><strong>Declaring Fields<\/strong><\/p>\n\n\n\n<p>Field objects are used to specify any kind of metadata for each field. 
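<\/p>\n\n\n\n<p>Since scrapy.Field is essentially a dict that holds per-field metadata, the idea can be sketched in plain Python (a stand-in, not Scrapy\u2019s class; the serializer key is just an example of metadata a component might look up):<\/p>

```python
# Stand-in for scrapy.Field: a plain dict carrying per-field metadata
# (scrapy.Field itself subclasses dict; this is only an illustration).
class Field(dict):
    pass


fields = {
    "name": Field(serializer=str),  # 'serializer' is example metadata
    "price": Field(),               # no metadata declared
}

print(fields["name"]["serializer"])      # <class 'str'>
print("serializer" in fields["price"])   # False
```

<p>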
Different components can use the Field object.&nbsp;<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nClass \u2013 scrapy.item.Field\n<\/pre><\/div>\n\n\n<p><strong>Examples<\/strong><\/p>\n\n\n\n<p><strong>Creating items<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nproduct = Product(name=&#039;Desktop PC&#039;, price=1000)\n\n<\/pre><\/div>\n\n\n<p><strong>Getting field values<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nproduct&#x5B;&#039;price&#039;]\n\n<\/pre><\/div>\n\n\n<p><strong>Setting field values<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nproduct&#x5B;&#039;lala&#039;] = &#039;test&#039;\n\n<\/pre><\/div>\n\n\n<p>Note: since lala is not a declared field of Product, this assignment raises a KeyError.<\/p>\n\n\n\n<p><strong>Accessing all populated values<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nproduct.keys()\nproduct.items()\n\n<\/pre><\/div>\n\n\n<p><strong>Copying items<\/strong><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nproduct2 = product.copy()\nproduct2 = product.deepcopy()\n\n<\/pre><\/div>\n\n\n<h3 class=\"wp-block-heading\" id=\"extending-item-subclass\"><strong>Extending Item Subclass<\/strong><\/h3>\n\n\n\n<p>Items can also be extended by defining a subclass of the original item.<\/p>\n\n\n\n<p>Field metadata in the subclass extends the previously declared metadata.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"supporting-all-item-types\">Supporting all Item Types<\/h3>\n\n\n\n<p>Class \u2013 itemadapter.ItemAdapter(item:Any)<\/p>\n\n\n\n<p>A common interface to extract and set data, regardless of the item type<\/p>\n\n\n\n<p>itemadapter.is_item(obj: Any) -&gt; bool<\/p>\n\n\n\n<p>If the item belongs to the 
supported types then True will be returned.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"item-loaders\"><strong>ITEM LOADERS<\/strong><\/h2>\n\n\n\n<p>Item loaders are used to populate the items.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"using-item-loaders-to-populate-items\">Using Item Loaders to populate items<\/h3>\n\n\n\n<p>An item loader is instantiated by passing an item class to its __init__(). Selectors load values into the item loader, and the item loader then merges them using processing functions.<\/p>\n\n\n\n<p>add_xpath(), add_css() and add_value() are all used to collect data into an item loader. ItemLoader.load_item() then populates the item with the data collected by add_xpath(), add_css() and add_value().<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"working-with-data-class-items\">Working with data class items<\/h3>\n\n\n\n<p>The passing of values can be controlled using field() when data class items are used with item loaders, which load the item automatically through the methods add_xpath(), add_css() and add_value().<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"input-and-output-processors\">Input and output processors<\/h3>\n\n\n\n<p>Each item loader field has one input processor and one output processor.&nbsp;<\/p>\n\n\n\n<p>The input processor processes the data as it enters the item loader through add_xpath(), add_css() and add_value().<\/p>\n\n\n\n<p>ItemLoader.load_item() then populates the item with the collected data.<\/p>\n\n\n\n<p>The output processor then assigns the processed value to the item.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"declaring-item-loaders\">Declaring Item Loaders<\/h3>\n\n\n\n<p>Input processors are declared using the _in suffix.<\/p>\n\n\n\n<p>Output processors are declared using the _out suffix.<\/p>\n\n\n\n<p>They can also be declared using ItemLoader.default_input_processor and ItemLoader.default_output_processor.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"declaring-input-and-output-processors\">Declaring Input and Output processors<\/h3>\n\n\n\n<p>Input\/Output 
processors can also be declared using Item Field metadata.<\/p>\n\n\n\n<p>Precedence order:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Item loader field-specific attributes<\/li><li>Field metadata<\/li><li>Item Loader defaults<\/li><\/ol>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"item-loader-context\">Item Loader Context<\/h3>\n\n\n\n<p>The Item Loader Context can modify the behavior of the input\/output processors.&nbsp; It can be passed at any time and it is of dict type.<\/p>\n\n\n\n<p>loader_context passes the currently active context, and a processor such as parse_length can use it.<\/p>\n\n\n\n<p>To modify it:&nbsp;<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Modify the Item Loader context attribute<\/li><li>On loader instantiation<\/li><li>On item loader declaration<\/li><\/ol>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"item-loader-object\">Item Loader Object<\/h3>\n\n\n\n<p>If no item is given, default_item_class gets instantiated.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>item<\/strong> \u2013 The object that\u2019s parsed by the item loader<\/td><td><strong>context<\/strong> \u2013 current active context<\/td><\/tr><tr><td><strong>default_item_class<\/strong> \u2013 instantiated when no item is given in __init__()<\/td><td><strong>default_input_processor<\/strong> \u2013 Default input processor for fields that don\u2019t specify one<\/td><\/tr><tr><td><strong>default_output_processor<\/strong> \u2013 Default output processor for fields that don\u2019t specify one<\/td><td><strong>default_selector_class<\/strong> \u2013 Ignored if a selector is given in __init__(); if not, the item loader\u2019s selector is constructed with it<\/td><\/tr><tr><td><strong>selector<\/strong> \u2013 This object extracts the data.<\/td><td><strong>add_css(field_name,css,*processors,**kw)<\/strong> \u2013 the css selector given here extracts a list of Unicode strings<\/td><\/tr><tr><td><strong>add_value(field_name,value,*processors,**kw)<\/strong> \u2013 processors and kw pass the value to get_value(), then to the field input processors 
and then appended to the data collected.<\/td><td><strong>add_xpath(field_name,xpath,*processors,**kw)<\/strong> \u2013 the xpath will be used to extract a list of strings<\/td><\/tr><tr><td><strong>get_collected_values(field_name)<\/strong> \u2013 Collected values will be returned<\/td><td><strong>get_css(css,*processors,**kw) <\/strong>\u2013 the CSS selector will be used to extract a list of Unicode strings<\/td><\/tr><tr><td><strong>get_output_value(field_name)<\/strong> \u2013 collected values parsed through the output processor are returned.<\/td><td><strong>get_value(value,*processors,**kw)<\/strong> \u2013 the given value is processed by the processors.<\/td><\/tr><tr><td><strong>get_xpath(xpath,*processors,**kw)<\/strong> \u2013 the xpath will extract a list of Unicode strings&nbsp;<\/td><td><strong>load_item() <\/strong>\u2013 Used to populate the item&nbsp;<\/td><\/tr><tr><td><strong>nested_css(css,**context)<\/strong> \u2013 a css selector creates a nested loader<\/td><td><strong>nested_xpath(xpath,**context)<\/strong> \u2013 an xpath selector creates a nested loader<\/td><\/tr><tr><td><strong>replace_css(field_name,css,*processors,**kw)<\/strong> \u2013 replaces collected data&nbsp;<\/td><td><strong>replace_value(field_name,value,*processors,**kw)<\/strong> \u2013 replaces collected data<\/td><\/tr><tr><td><strong>replace_xpath(field_name,xpath,*processors,**kw)<\/strong> \u2013 replaces collected data<\/td><td><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"nested-loaders\">Nested Loaders<\/h3>\n\n\n\n<p>Nested loaders can be used when the values of a subsection of the document need to be parsed.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"reusing-and-extending-item-loaders\">Reusing and Extending Item Loaders<\/h3>\n\n\n\n<p>Scrapy provides support for Python class inheritance, and hence item loaders can be reused and extended.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" 
id=\"scrapy-shell\"><strong>SCRAPY SHELL<\/strong><\/h2>\n\n\n\n<p>Scrapy shell can be used for testing and evaluating spiders before running the entire spider. Individual queries can be checked in this.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"configuring-the-shell\">Configuring the shell<\/h3>\n\n\n\n<p>Scrapy works wonderful with IPython, and can support bpython. IPython is recommended as it provides auto-completion and colorized output.<\/p>\n\n\n\n<p>The setting can be changed by<\/p>\n\n\n\n<p>[settings]<\/p>\n\n\n\n<p>shell = bpython<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"launch-the-shell\">Launch the shell<\/h3>\n\n\n\n<p>To launch the shell<\/p>\n\n\n\n<p>scrapy shell &lt;url&gt;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"using-the-shell\">Using the shell<\/h3>\n\n\n\n<p>It just a regular python shell with additional shortcuts<\/p>\n\n\n\n<p><strong>Available shortcuts<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>shelp() &nbsp; - print list of available objects and lits<\/li><li>fetch(url,[.redirect=True]) \u2013 fetch response from URL<\/li><li>fetch(request) \u2013 fetch response from given request<\/li><li>view(response) \u2013 open the given response in the local browse<\/li><\/ol>\n\n\n\n<p><strong>Available scrapy objects<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>crawler \u2013 current crawler object<\/li><li>spider \u2013 that which can handle URL<\/li><li>request \u2013 Request object of last fetched page<\/li><li>response \u2013 response object containing last fetched item<\/li><li>settings \u2013 current scrapy settings<\/li><\/ol>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"invoking-shell-from-spiders-to-inspect-responses\">Invoking shell from spiders to inspect responses<\/h3>\n\n\n\n<p>To see the response use:<\/p>\n\n\n\n<p>scrapy.shell.inspect_response<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"item-pipeline\"><strong>ITEM PIPELINE<\/strong><\/h2>\n\n\n\n<p>Post scraping item pipeline processes 
them.&nbsp;<\/p>\n\n\n\n<p>Item pipelines:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>cleanse HTML data<\/li><li>validate scraped data<\/li><li>check for duplicates<\/li><li>store the scraped data<\/li><\/ol>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"writing-item-pipeline\">Writing item pipeline<\/h3>\n\n\n\n<p>Item pipeline components are Python classes.<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>process_item(self,item,spider) \u2013 This method is called for every item pipeline component; it must return an item object, return a Deferred object or raise a DropItem exception. item is the scraped item; spider is the spider that scraped the item<\/li><li>open_spider(self,spider) \u2013 called when the spider is opened.&nbsp;<\/li><li>close_spider(self,spider) \u2013 called when the spider is closed.<\/li><li>from_crawler(cls,crawler) \u2013 It receives a crawler and returns a new instance of the pipeline.<\/li><\/ol>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"example-application\">Example applications:<\/h3>\n\n\n\n<ol class=\"wp-block-list\"><li>price validation and dropping items with no prices<\/li><li>write items to a JSON file<\/li><li>write items to MongoDB<\/li><li>take a screenshot of an item<\/li><li>duplicates filter<\/li><\/ol>\n\n\n\n<p>To activate a pipeline, it has to be added to the ITEM_PIPELINES setting.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"feed-exports\">&nbsp;<strong>FEED EXPORTS<\/strong><\/h2>\n\n\n\n<p>Scrapy supports feed exports, that is, exporting the scraped data to storage in multiple formats.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"serialization-formats\">Serialization formats<\/h3>\n\n\n\n<p>Item exporters are used for this process.&nbsp; The supported formats are :<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong><em>Serialization format<\/em><\/strong><\/td><td><strong>Feed setting format key<\/strong><\/td><td><strong>Exporter<\/strong><\/td><\/tr><tr><td><em>JSON<\/em><\/td><td>json<\/td><td>JsonItemExporter<\/td><\/tr><tr><td><em>JSON 
lines<\/em><\/td><td>jsonlines<\/td><td>JsonLinesItemExporter<\/td><\/tr><tr><td><em>CSV<\/em><\/td><td>csv<\/td><td>CsvItemExporter<\/td><\/tr><tr><td><em>XML<\/em><\/td><td>xml<\/td><td>XmlItemExporter<\/td><\/tr><tr><td><em>Pickle<\/em><\/td><td>pickle<\/td><td>PickleItemExporter<\/td><\/tr><tr><td><em>Marshal<\/em><\/td><td>marshal<\/td><td>MarshalItemExporter<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"storages\">Storages<\/h3>\n\n\n\n<p>Supported backend storage:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Local filesystem<\/li><li>FTP<\/li><li>S3<\/li><li>Google cloud storage<\/li><li>Standard output<\/li><\/ol>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"storage-uri-parameters\">Storage URI parameters<\/h3>\n\n\n\n<p>%(time)s \u2013 a timestamp replaces this parameter<\/p>\n\n\n\n<p>%(name)s \u2013 the spider name replaces this parameter<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"storage-backends\">Storage backends<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td>Storage backend<\/td><td>URI scheme<\/td><td>Example URI<\/td><td>Required external library<\/td><td><\/td><\/tr><tr><td>FTP<\/td><td>ftp<\/td><td>ftp:\/\/user:pass@ftp.example.com\/path\/to\/export.csv<\/td><td>None<\/td><td>Two connection modes: active or passive<br>Default connection mode: passive<br>For an active connection: FEED_STORAGE_FTP_ACTIVE = True<\/td><\/tr><tr><td>Amazon S3<\/td><td>s3<\/td><td>s3:\/\/mybucket\/path\/to\/export.csv<\/td><td><a href=\"https:\/\/github.com\/boto\/botocore\" target=\"_blank\" rel=\"noreferrer noopener\">botocore<\/a>&nbsp;&gt;= 1.4.87<\/td><td>AWS credentials can be passed through:<br>AWS_ACCESS_KEY_ID<br>AWS_SECRET_ACCESS_KEY<br>Custom ACL: FEED_STORAGE_S3_ACL<\/td><\/tr><tr><td>Google Cloud Storage<\/td><td>gs<\/td><td>gs:\/\/mybucket\/path\/to\/export.csv<\/td><td><a rel=\"noreferrer noopener\" href=\"https:\/\/cloud.google.com\/storage\/docs\/reference\/libraries#client-libraries-install-python\" 
target=\"_blank\">google-cloud-storage<\/a><\/td><td>Project setting and Access Control Light setting:FEED_STORAGE_GCS_ACLGCS_PROJECT_ID<\/td><\/tr><tr><td>Standard Output<\/td><td>stdout<\/td><td>stdout:<\/td><td>none<\/td><td><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"delayed-file-directory\">Delayed File Directory<\/h3>\n\n\n\n<p>Storage backends that uses delayed file directory are :<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>FTP<\/li><li>S3<\/li><li>Google Cloud Storage<\/li><\/ol>\n\n\n\n<p>File content will be uploaded to the feed URI only if all the contents are collected entirely.<\/p>\n\n\n\n<p>To start the item delivery early use FEED_EXPORT_BATCH_ITEM_COUNT<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"settings\">Settings<\/h2>\n\n\n\n<p>Settings for feed exporters<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>FEEDS&nbsp;(mandatory)<\/li><li>FEED_EXPORT_ENCODING<\/li><li>FEED_STORE_EMPTY<\/li><li>FEED_EXPORT_FIELDS<\/li><li>FEED_EXPORT_INDENT<\/li><li>FEED_STORAGES<\/li><li>FEED_STORAGE_FTP_ACTIVE<\/li><li>FEED_STORAGE_S3_ACL<\/li><li>FEED_EXPORTERS<\/li><li>FEED_EXPORT_BATCH_ITEM_COUNT<\/li><\/ol>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"feeds\">Feeds<\/h2>\n\n\n\n<p>Default : {}<\/p>\n\n\n\n<p>Feed is a dictionary in which all the feed URI are the keys and values are nested parameters.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td>Accepted Keys<\/td><td>Fallback Value<\/td><\/tr><tr><td>format<\/td><td>NIL<\/td><\/tr><tr><td>batch_item_count<\/td><td>FEED_EXPORT_BATCH_ITEM_COUNT<\/td><\/tr><tr><td>encoding<\/td><td>FEED_EXPORT_ENCODING<\/td><\/tr><tr><td>fields<\/td><td>FEED_EXPORT_FIELDS<\/td><\/tr><tr><td>Indent<\/td><td>FEED_EXPORT_INDENT<\/td><\/tr><tr><td>Item_exports_kwargs<\/td><td>dict with keyword arguments to corresponding item exporter class<\/td><\/tr><tr><td>overwrite<\/td><td>If already exists then True or else False<\/td><\/tr><tr><td>Local 
filesystem<\/td><td>False<\/td><\/tr><tr><td>FTP<\/td><td>True<\/td><\/tr><tr><td>S3<\/td><td>True<\/td><\/tr><tr><td>Standard Output<\/td><td>False<\/td><\/tr><tr><td>store_empty<\/td><td>FEED_STORE_EMPTY<\/td><\/tr><tr><td>uri_params<\/td><td>FEED_URI_PARAMS<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"feed-export-encoding\">Feed Export Encoding<\/h3>\n\n\n\n<p>Default: None<\/p>\n\n\n\n<p>Encoding: If unset or set to None, UTF-8 is used for everything except JSON output, which uses safe numeric encoding by default.&nbsp; utf-8 can be set for JSON too if needed.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"feed_export_fields\">FEED_EXPORT_FIELDS<\/h3>\n\n\n\n<p>Default: None<\/p>\n\n\n\n<p>To define the fields to export, use FEED_EXPORT_FIELDS<\/p>\n\n\n\n<p>When FEED_EXPORT_FIELDS is empty, Scrapy uses the fields defined in item objects<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"feed_export_indent\">FEED_EXPORT_INDENT<\/h3>\n\n\n\n<p>Default: 0<\/p>\n\n\n\n<p>If this is a positive integer, array elements and object members are pretty-printed with that indent level<\/p>\n\n\n\n<p>If this is 0 or negative, each item is put on a new line without indentation<\/p>\n\n\n\n<p>None selects the most compact representation<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"feed_store_empty\">FEED_STORE_EMPTY<\/h3>\n\n\n\n<p>Default : False<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"feed_storages\">FEED_STORAGES<\/h3>\n\n\n\n<p>Default : {}<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"feed_storage_ftp_active\">FEED_STORAGE_FTP_ACTIVE<\/h3>\n\n\n\n<p>Default: False<\/p>\n\n\n\n<p>Whether to use an active or passive connection when exporting to FTP<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"feed_storage_s3_acl\">FEED_STORAGE_S3_ACL<\/h3>\n\n\n\n<p>Default: \u2018 \u2019<\/p>\n\n\n\n<p>String containing a custom ACL<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"feed_storages_base\">FEED_STORAGES_BASE<\/h3>\n\n\n\n<p>Dict containing built-in feed storages.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" 
id=\"feed_exporters\">FEED_EXPORTERS<\/h3>\n\n\n\n<p>Default: {}<\/p>\n\n\n\n<p>Dict containing additional exporters<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"feed_exporters_base\">FEED_EXPORTERS_BASE<\/h3>\n\n\n\n<p>Dict having build-in feed exporters<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"feed_export_batch_item_count\">FEED_EXPORT_BATCH_ITEM_COUNT<\/h3>\n\n\n\n<p>Default: 0<\/p>\n\n\n\n<p>Number greater than 0 then scrapy generates multiple file storing to a particular number<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"feed_uri_params\">FEED_URI_PARAMS<\/h3>\n\n\n\n<p>Default: None<\/p>\n\n\n\n<p>String with import path of function.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"requests-and-responses\"><strong>REQUESTS AND RESPONSES<\/strong><\/h2>\n\n\n\n<p>Requests and responses are made for crawling the site.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"request-objects\">Request Objects<\/h3>\n\n\n\n<p><strong>PARAMETERS<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>url \u2013 url of the request<\/li><li>callback \u2013 the function that gets called as a response for a request<\/li><li>method \u2013 Defaut : get.&nbsp; Method for the request<\/li><li>meta \u2013 dictionary values for Request.meta<\/li><li>body \u2013 If not available then bytes is stored.<\/li><li>headers \u2013 headers of the request<\/li><li>cookies \u2013 request cookies<\/li><li>encoding \u2013 encoding of the request<\/li><li>priority \u2013 priority of the request<\/li><li>don\u2019t_filter \u2013 request should not be filtered<\/li><li>errback \u2013 functions gets called if there is an exception<\/li><li>flags \u2013 flags sent for logging<\/li><li>cb_kwargs \u2013 dict passed as keyword arguments<\/li><\/ol>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"passing-additional-data-to-callback-functions\">Passing additional data to callback functions<\/h3>\n\n\n\n<p>Request.cb_kwargs can be used to pass arguments to the callback functions so that these then can 
be passed to the second callback later&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"using-errbacks-to-catch-exceptions-in-request-processing\">Using errbacks to catch exceptions in request processing<\/h3>\n\n\n\n<p>A Failure will be received as the first parameter of the errback; this can then be used to track errors.<br>Additional data can be accessed via Failure.request.cb_kwargs<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"request-meta-special-keys\">Request.meta special keys<\/h3>\n\n\n\n<p>Special keys:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>dont_redirect<\/li><li>dont_retry<\/li><li>handle_httpstatus_list<\/li><li>handle_httpstatus_all<\/li><li>dont_merge_cookies<\/li><li>cookiejar<\/li><li>dont_cache<\/li><li>redirect_reasons<\/li><li>redirect_urls<\/li><li>bindaddress<\/li><li>dont_obey_robotstxt<\/li><li>download_timeout<\/li><li>download_maxsize<\/li><li>download_latency<\/li><li>download_fail_on_dataloss<\/li><li>proxy<\/li><li>ftp_user<\/li><li>ftp_password&nbsp;<\/li><li>referrer_policy<\/li><li>max_retry_times<\/li><\/ul>\n\n\n\n<p><strong>bindaddress \u2013 <\/strong>Outgoing IP address<\/p>\n\n\n\n<p><strong>download_timeout \u2013 <\/strong>time the downloader will wait before timing out<\/p>\n\n\n\n<p><strong>download_latency \u2013 <\/strong>time taken to fetch the response<\/p>\n\n\n\n<p><strong>download_fail_on_dataloss \u2013 <\/strong>whether or not to fail on a broken response<\/p>\n\n\n\n<p><strong>max_retry_times \u2013 <\/strong>to set the retry times per request<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"stopping-the-download-of-response\">Stopping the download of&nbsp; response<\/h3>\n\n\n\n<p>A <strong>StopDownload<\/strong> exception can be raised to stop the download<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"request-subclasses\">Request subclasses<\/h3>\n\n\n\n<p>List of request subclasses<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>FormRequest Objects<\/li><\/ul>\n\n\n\n<p>Parameters:<\/p>\n\n\n\n<ul 
class=\"wp-block-list\"><li>formdata<\/li><\/ul>\n\n\n\n<p><em>classmethod<\/em>from_response(<em>response<\/em>[,&nbsp;<em>formname=None<\/em>,&nbsp;<em>formid=None<\/em>,&nbsp;<em>formnumber=0<\/em>,&nbsp;<em>formdata=None<\/em>,&nbsp;<em>formxpath=None<\/em>,&nbsp;<em>formcss=None<\/em>,&nbsp;<em>clickdata=None<\/em>,&nbsp;<em>dont_click=False<\/em>,&nbsp;<em>...<\/em>]<\/p>\n\n\n\n<p>Parameters:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>response<\/li><li>formname<\/li><li>formid<\/li><li>formxpath<\/li><li>formcss<\/li><li>formnumber<\/li><li>formdata<\/li><li>clickdata<\/li><li>don\u2019t_click<\/li><\/ol>\n\n\n\n<p>Examples:<\/p>\n\n\n\n<p>Fromrequest to send data via HTTP post<\/p>\n\n\n\n<p>To simulate user login<\/p>\n\n\n\n<ul class=\"wp-block-list\" start=\"2\"><li>JsonRequest<\/li><\/ul>\n\n\n\n<p>Parameters:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>data<\/li><li>dumps_kwargs<\/li><\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"response-objects\">Response Objects<\/h3>\n\n\n\n<p>These are HTTP responses.<\/p>\n\n\n\n<p>Parameters:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>url<\/li><li>status<\/li><li>headers<\/li><li>body<\/li><li>flags<\/li><li>request<\/li><li>certificate<\/li><li>ip_address<\/li><li>cb_kwargs<\/li><li>copy()<\/li><li>replace<strong> 
<\/strong>([<em>url<\/em>,&nbsp;<em>status<\/em>,&nbsp;<em>headers<\/em>,&nbsp;<em>body<\/em>,&nbsp;<em>request<\/em>,&nbsp;<em>flags<\/em>,&nbsp;<em>cls<\/em>])<\/li><li>urljoin(url)<\/li><li>follow(url,&nbsp;callback=None,&nbsp;method='GET',&nbsp;headers=None,&nbsp;body=None,&nbsp;cookies=None,&nbsp;meta=None,&nbsp;encoding='utf-8',&nbsp;priority=0,&nbsp;dont_filter=False,&nbsp;errback=None,&nbsp;cb_kwargs=None,&nbsp;flags=None)<\/li><li>follow_all(urls,&nbsp;callback=None,&nbsp;method='GET',&nbsp;headers=None,&nbsp;body=None,&nbsp;cookies=None,&nbsp;meta=None,&nbsp;encoding='utf-8',&nbsp;priority=0,&nbsp;dont_filter=False,&nbsp;errback=None,&nbsp;cb_kwargs=None,&nbsp;flags=None)<\/li><\/ol>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"response-subclasses\">Response subclasses<\/h2>\n\n\n\n<p>List of subclasses:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>TextResponse objects<\/li><li>HtmlResponse objects<\/li><li>XmlResponse objects<\/li><\/ol>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"link-extractors\"><strong>LINK EXTRACTORS<\/strong><\/h2>\n\n\n\n<p>Link extractors extract links from responses.<\/p>\n\n\n\n<p>LxmlLinkExtractor.extract_links returns a list of matching Link objects.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"link-extractor-reference\">Link Extractor Reference<\/h2>\n\n\n\n<p>The link extractor class is scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor<\/p>\n\n\n\n<p><strong>LxmlLinkExtractor<\/strong><\/p>\n\n\n\n<p><strong>Parameters:<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>allow<\/li><li>deny<\/li><li>allow_domains<\/li><li>deny_domains<\/li><li>deny_extensions<\/li><li>restrict_xpaths<\/li><li>restrict_css<\/li><li>restrict_text<\/li><li>tags<\/li><li>attrs<\/li><li>canonicalize<\/li><li>unique<\/li><li>process_value<\/li><li>strip<\/li><li>extract_links(response)<\/li><\/ol>\n\n\n\n<p><strong>Link<\/strong><\/p>\n\n\n\n<p>A Link object represents an extracted link<\/p>\n\n\n\n<p><strong>Parameters:<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>url<\/li><li>text<\/li><li>fragment<\/li><li>nofollow<\/li><\/ol>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"settings\"><strong>SETTINGS<\/strong><\/h2>\n\n\n\n<p>Scrapy settings can be adjusted as needed<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"designating-the-setting\">Designating the setting<\/h3>\n\n\n\n<p>The SCRAPY_SETTINGS_MODULE environment variable is used to designate the settings module.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"populating-the-settings\">Populating the settings<\/h3>\n\n\n\n<p>Settings can be populated with the following precedence:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Command line options \u2013 \u201c-s\u201d or \u201c--set\u201d is used to override the settings<\/li><li>Settings per-spider \u2013 these can be defined through the \u201ccustom_settings\u201d attribute<\/li><li>Project settings module \u2013 this can be changed in the \u201csettings.py\u201d file<\/li><li>Default settings per-command \u2013 \u201cdefault_settings\u201d is used to define this<\/li><li>Default global settings \u2013 scrapy.settings.default_settings is used to set this<\/li><\/ol>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"import-paths-and-classes\">Import Paths and Classes<\/h3>\n\n\n\n<p>Importing can be done through:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>A string containing the import path<\/li><li>An object<\/li><\/ol>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"how-to-access-settings\">How to access settings<\/h3>\n\n\n\n<p>Settings can be accessed through \u201cself.settings\u201d in a spider, or through \u201cscrapy.crawler.Crawler.settings\u201d on the Crawler passed to \u201cfrom_crawler\u201d<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"rationale-for-setting-names\">Rationale for setting names<\/h3>\n\n\n\n<p>Setting names are prefixed with the name of the component they configure.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"built-in-settings-reference\">Built-in settings reference<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td>AWS_ACCESS_KEY_ID<\/td><td>AWS_SECRET_ACCESS_KEY<\/td><td>AWS_ENDPOINT_URL<\/td><td>AWS_SESSION_TOKEN<\/td><td>AWS_USE_SSL<\/td><\/tr><tr><td>AWS_VERIFY<\/td><td>AWS_REGION_NAME<\/td><td>ASYNCIO_EVENT_LOOP<\/td><td>BOT_NAME<\/td><td>CONCURRENT_ITEMS<\/td><\/tr><tr><td>CONCURRENT_REQUESTS<\/td><td>CONCURRENT_REQUESTS_PER_DOMAIN<\/td><td>DEFAULT_ITEM_CLASS<\/td><td>DEFAULT_REQUEST_HEADERS<\/td><td>DEPTH_LIMIT<\/td><\/tr><tr><td>DEPTH_PRIORITY<\/td><td>DEPTH_STATS_VERBOSE<\/td><td>DNSCACHE_ENABLED<\/td><td>DNSCACHE_SIZE<\/td><td>DNS_RESOLVER<\/td><\/tr><tr><td>DOWNLOADER<\/td><td>DOWNLOADER_HTTPCLIENTFACTORY<\/td><td>DOWNLOADER_CLIENTCONTEXTFACTORY<\/td><td>DOWNLOADER_CLIENT_TLS_CIPHERS<\/td><td>DOWNLOADER_CLIENT_TLS_METHOD<\/td><\/tr><tr><td>DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING<\/td><td>DOWNLOADER_MIDDLEWARES<\/td><td>DOWNLOADER_MIDDLEWARES_BASE<\/td><td>DOWNLOADER_STATS<\/td><td>DOWNLOAD_DELAY<\/td><\/tr><tr><td>DOWNLOAD_HANDLERS<\/td><td>DOWNLOAD_HANDLERS_BASE<\/td><td>DOWNLOAD_TIMEOUT<\/td><td>DOWNLOAD_MAXSIZE<\/td><td>DOWNLOAD_WARNSIZE<\/td><\/tr><tr><td>DOWNLOAD_FAIL_ON_DATALOSS<\/td><td>DUPEFILTER_CLASS<\/td><td>DUPEFILTER_DEBUG<\/td><td>EDITOR<\/td><td>EXTENSIONS<\/td><\/tr><tr><td>EXTENSIONS_BASE<\/td><td>FEED_TEMPDIR<\/td><td>FEED_STORAGE_GCS_ACL<\/td><td>FTP_PASSIVE_MODE<\/td><td>FTP_PASSWORD<\/td><\/tr><tr><td>FTP_USER<\/td><td>GCS_PROJECT_ID<\/td><td>ITEM_PIPELINES<\/td><td>ITEM_PIPELINES_BASE<\/td><td>LOG_ENABLED<\/td><\/tr><tr><td>LOG_FILE<\/td><td>LOG_FORMAT<\/td><td>LOG_DATEFORMAT<\/td><td>LOG_FORMATTER<\/td><td>LOG_LEVEL<\/td><\/tr><tr><td>LOG_STDOUT<\/td><td>LOG_SHORT_NAMES<\/td><td>LOGSTATS_INTERVAL<\/td><td>MEMDEBUG_ENABLED<\/td><td>MEMDEBUG_NOTIFY<\/td><\/tr><tr><td>MEMUSAGE_ENABLED<\/td><td>MEMUSAGE_LIMIT_MB<\/td><td>MEMUSAGE_CHECK_INTERVAL_SECONDS<\/td><td>MEMUSAGE_WARNING_MB<\/td><td>NEWSPIDER_MODULE<\/td><\/tr><tr><td>RANDOMIZE_DOWNLOAD_DELAY<\/td><td>REACTOR_THREADPOOL_MAXSIZE<\/td><td>REDIRECT_PRIORITY_ADJUST<\/td><td>RETRY_PRIORITY_ADJUST<\/td><td>ROBOTSTXT_OBEY<\/td><\/tr><tr><td>ROBOTSTXT_PARSER<\/td><td>ROBOTSTXT_USER_AGENT<\/td><td>SCHEDULER<\/td><td>SCHEDULER_DEBUG<\/td><td>SCHEDULER_DISK_QUEUE<\/td><\/tr><tr><td>SCHEDULER_MEMORY_QUEUE<\/td><td>SCHEDULER_PRIORITY_QUEUE<\/td><td>SCRAPER_SLOT_MAX_ACTIVE_SIZE<\/td><td>SPIDER_CONTRACTS<\/td><td>SPIDER_CONTRACTS_BASE<\/td><\/tr><tr><td>SPIDER_LOADER_CLASS<\/td><td>SPIDER_LOADER_WARN_ONLY<\/td><td>SPIDER_MIDDLEWARES<\/td><td>SPIDER_MIDDLEWARES_BASE<\/td><td>SPIDER_MODULES<\/td><\/tr><tr><td>STATS_CLASS<\/td><td>STATS_DUMP<\/td><td>STATSMAILER_RCPTS<\/td><td>TELNETCONSOLE_ENABLED<\/td><td>TEMPLATES_DIR<\/td><\/tr><tr><td>TWISTED_REACTOR<\/td><td>URLLENGTH_LIMIT<\/td><td>USER_AGENT<\/td><td><\/td><td><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"exceptions\"><strong>EXCEPTIONS<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"built-in-exceptions-reference\">Built-in Exceptions reference<\/h3>\n\n\n\n<ol class=\"wp-block-list\"><li>CloseSpider \u2013 raised when the spider needs to be closed<\/li><li>DontCloseSpider \u2013 raised to stop the spider from being closed<\/li><li>DropItem \u2013 raised by an item pipeline to stop processing an item<\/li><li>IgnoreRequest \u2013 raised when a request should be ignored<\/li><li>NotConfigured \u2013 raised by an extension, item pipeline, downloader middleware or spider middleware to indicate that it will remain disabled<\/li><li>NotSupported \u2013 indicates that a feature is not supported<\/li><li>StopDownload \u2013 indicates that nothing further should be downloaded for a response<\/li><\/ol>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"a-sample-tutorial-to-try\">A sample tutorial to try&nbsp;<\/h3>\n\n\n\n<p>1.&nbsp;
Open command prompt and navigate to the folder where you want to store the scraped data.<\/p>\n\n\n\n<p>2.&nbsp; Let\u2019s create the project under the name \u201cscrape\u201d<\/p>\n\n\n\n<p>Type the following in the conda shell:<\/p>\n\n\n\n<p><strong><em>scrapy startproject scrape<\/em><\/strong><\/p>\n\n\n\n<p>The above command will create a folder named scrape containing an inner scrape folder and a scrapy.cfg file.<\/p>\n\n\n\n<ol class=\"wp-block-list\" start=\"3\"><li>Traverse inside this scrape project folder<\/li><li>Go inside the folder called \u201cspiders\u201d and then create a file called \u201cproject.py\u201d<\/li><\/ol>\n\n\n\n<p>Type the following inside it:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nimport scrapy\n\n# scrapy.Spider needs to be extended\nclass scrape(scrapy.Spider):\n    # unique name that identifies the spider\n    name = &quot;posts&quot;\n    start_urls = &#x5B;&#039;https:\/\/blog.scrapinghub.com&#039;]\n\n    # processes each downloaded response\n    def parse(self, response):\n        # iterate over every post on the page\n        for post in response.css(&#039;div.post-item&#039;):\n            yield {\n                # extracts the title\n                &#039;title&#039;: post.css(&#039;.post-header h2 a::text&#039;)&#x5B;0].get(),\n                # extracts the date\n                &#039;date&#039;: post.css(&#039;.post-header a::text&#039;)&#x5B;1].get(),\n                # extracts the author name\n                &#039;author&#039;: post.css(&#039;.post-header a::text&#039;)&#x5B;2].get()\n            }\n        # link to the next page\n        next_page = response.css(&#039;a.next-posts-link::attr(href)&#039;).get()\n        # if there is a next page, this parse method gets called again\n        if next_page is not None:\n            next_page = response.urljoin(next_page)\n            yield scrapy.Request(next_page,
callback=self.parse)\n\n<\/pre><\/div>\n\n\n<p>5. Save the file<br>6. In the command prompt, run the spider with the following command:<br>7. <strong><em>scrapy crawl posts<\/em><\/strong><br>8. All the links get crawled, and the title, date, and author of each post are extracted at the same time.<\/p>\n\n\n\n<p>This brings us to the end of the Scrapy Tutorial. We hope that you were able to gain a comprehensive understanding of Scrapy. If you wish to learn more such skills, check out the pool of <a href=\"https:\/\/www.mygreatlearning.com\/academy\" target=\"_blank\" rel=\"noreferrer noopener\">Free Online Courses offered by Great Learning Academy.<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Web Scraper A web scraper is a tool that is used to extract the data from a website.&nbsp;&nbsp; It&nbsp; involves the following process: Figure out the target website Get the URL of the pages from which the data needs to be extracted. Obtain the HTML\/CSS\/JS of those pages. Find the locators such as XPath or [&hellip;]<\/p>\n","protected":false},"author":41,"featured_media":30870,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_uag_custom_page_level_css":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4
)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[25860],"tags":[],"content_type":[36252],"class_list":["post-30778","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-software","content_type-tutorials"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v27.3 (Yoast SEO v27.3) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>Scrapy Tutorial - An Introduction | Python Scrapy Tutorial<\/title>\n<meta name=\"description\" content=\"Scrapy Tutorial: Scrapy does the work of a web crawler and the work of a web scraper. In this post you will know Scrapy Installation, Scrapy Packages &amp; Scrapy File Structure.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.mygreatlearning.com\/blog\/scrapy-tutorial\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Scrapy Tutorial - An Introduction\" \/>\n<meta property=\"og:description\" content=\"Scrapy Tutorial: Scrapy does the work of a web crawler and the work of a web scraper. 
In this post you will know Scrapy Installation, Scrapy Packages &amp; Scrapy File Structure.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.mygreatlearning.com\/blog\/scrapy-tutorial\/\" \/>\n<meta property=\"og:site_name\" content=\"Great Learning Blog: Free Resources what Matters to shape your Career!\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/GreatLearningOfficial\/\" \/>\n<meta property=\"article:published_time\" content=\"2021-04-26T11:08:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2022-10-20T11:21:52+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/d1m75rqqgidzqn.cloudfront.net\/wp-data\/2021\/04\/15163655\/iStock-913017342.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1365\" \/>\n\t<meta property=\"og:image:height\" content=\"768\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Great Learning Editorial Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@https:\/\/twitter.com\/Great_Learning\" \/>\n<meta name=\"twitter:site\" content=\"@Great_Learning\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Great Learning Editorial Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"25 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/scrapy-tutorial\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/scrapy-tutorial\\\/\"},\"author\":{\"name\":\"Great Learning Editorial Team\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/person\\\/6f993d1be4c584a335951e836f2656ad\"},\"headline\":\"Scrapy Tutorial - An Introduction\",\"datePublished\":\"2021-04-26T11:08:00+00:00\",\"dateModified\":\"2022-10-20T11:21:52+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/scrapy-tutorial\\\/\"},\"wordCount\":6736,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/scrapy-tutorial\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2021\\\/04\\\/iStock-913017342.jpg\",\"articleSection\":[\"IT\\\/Software Development\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/scrapy-tutorial\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/scrapy-tutorial\\\/\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/scrapy-tutorial\\\/\",\"name\":\"Scrapy Tutorial - An Introduction | Python Scrapy 
Tutorial\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/scrapy-tutorial\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/scrapy-tutorial\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2021\\\/04\\\/iStock-913017342.jpg\",\"datePublished\":\"2021-04-26T11:08:00+00:00\",\"dateModified\":\"2022-10-20T11:21:52+00:00\",\"description\":\"Scrapy Tutorial: Scrapy does the work of a web crawler and the work of a web scraper. In this post you will know Scrapy Installation, Scrapy Packages & Scrapy File Structure.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/scrapy-tutorial\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/scrapy-tutorial\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/scrapy-tutorial\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2021\\\/04\\\/iStock-913017342.jpg\",\"contentUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2021\\\/04\\\/iStock-913017342.jpg\",\"width\":1365,\"height\":768,\"caption\":\"scrapy tutorial\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/scrapy-tutorial\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Blog\",\"item\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"IT\\\/Software Development\",\"item\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/software\\\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Scrapy Tutorial &#8211; An 
Introduction\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/\",\"name\":\"Great Learning Blog\",\"description\":\"Learn, Upskill &amp; Career Development Guide and Resources\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#organization\"},\"alternateName\":\"Great Learning\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#organization\",\"name\":\"Great Learning\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/06\\\/GL-Logo.jpg\",\"contentUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/06\\\/GL-Logo.jpg\",\"width\":900,\"height\":900,\"caption\":\"Great Learning\"},\"image\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/GreatLearningOfficial\\\/\",\"https:\\\/\\\/x.com\\\/Great_Learning\",\"https:\\\/\\\/www.instagram.com\\\/greatlearningofficial\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/school\\\/great-learning\\\/\",\"https:\\\/\\\/in.pinterest.com\\\/greatlearning12\\\/\",\"https:\\\/\\\/www.youtube.com\\\/user\\\/beaconelearning\\\/\"],\"description\":\"Great Learning is a leading global ed-tech company for professional training and higher education. 
It offers comprehensive, industry-relevant, hands-on learning programs across various business, technology, and interdisciplinary domains driving the digital economy. These programs are developed and offered in collaboration with the world's foremost academic institutions.\",\"email\":\"info@mygreatlearning.com\",\"legalName\":\"Great Learning Education Services Pvt. Ltd\",\"foundingDate\":\"2013-11-29\",\"numberOfEmployees\":{\"@type\":\"QuantitativeValue\",\"minValue\":\"1001\",\"maxValue\":\"5000\"}},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/person\\\/6f993d1be4c584a335951e836f2656ad\",\"name\":\"Great Learning Editorial Team\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/02\\\/unnamed.webp\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/02\\\/unnamed.webp\",\"contentUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/02\\\/unnamed.webp\",\"caption\":\"Great Learning Editorial Team\"},\"description\":\"The Great Learning Editorial Staff includes a dynamic team of subject matter experts, instructors, and education professionals who combine their deep industry knowledge with innovative teaching methods. 
Their mission is to provide learners with the skills and insights needed to excel in their careers, whether through upskilling, reskilling, or transitioning into new fields.\",\"sameAs\":[\"https:\\\/\\\/www.mygreatlearning.com\\\/\",\"https:\\\/\\\/in.linkedin.com\\\/school\\\/great-learning\\\/\",\"https:\\\/\\\/x.com\\\/https:\\\/\\\/twitter.com\\\/Great_Learning\",\"https:\\\/\\\/www.youtube.com\\\/channel\\\/UCObs0kLIrDjX2LLSybqNaEA\"],\"award\":[\"Best EdTech Company of the Year 2024\",\"Education Economictimes Outstanding Education\\\/Edtech Solution Provider of the Year 2024\",\"Leading E-learning Platform 2024\"],\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/author\\\/greatlearning\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Scrapy Tutorial - An Introduction | Python Scrapy Tutorial","description":"Scrapy Tutorial: Scrapy does the work of a web crawler and the work of a web scraper. In this post you will know Scrapy Installation, Scrapy Packages & Scrapy File Structure.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.mygreatlearning.com\/blog\/scrapy-tutorial\/","og_locale":"en_US","og_type":"article","og_title":"Scrapy Tutorial - An Introduction","og_description":"Scrapy Tutorial: Scrapy does the work of a web crawler and the work of a web scraper. 
In this post you will know Scrapy Installation, Scrapy Packages & Scrapy File Structure.","og_url":"https:\/\/www.mygreatlearning.com\/blog\/scrapy-tutorial\/","og_site_name":"Great Learning Blog: Free Resources what Matters to shape your Career!","article_publisher":"https:\/\/www.facebook.com\/GreatLearningOfficial\/","article_published_time":"2021-04-26T11:08:00+00:00","article_modified_time":"2022-10-20T11:21:52+00:00","og_image":[{"width":1365,"height":768,"url":"https:\/\/d1m75rqqgidzqn.cloudfront.net\/wp-data\/2021\/04\/15163655\/iStock-913017342.jpg","type":"image\/jpeg"}],"author":"Great Learning Editorial Team","twitter_card":"summary_large_image","twitter_creator":"@https:\/\/twitter.com\/Great_Learning","twitter_site":"@Great_Learning","twitter_misc":{"Written by":"Great Learning Editorial Team","Est. reading time":"25 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.mygreatlearning.com\/blog\/scrapy-tutorial\/#article","isPartOf":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/scrapy-tutorial\/"},"author":{"name":"Great Learning Editorial Team","@id":"https:\/\/www.mygreatlearning.com\/blog\/#\/schema\/person\/6f993d1be4c584a335951e836f2656ad"},"headline":"Scrapy Tutorial - An Introduction","datePublished":"2021-04-26T11:08:00+00:00","dateModified":"2022-10-20T11:21:52+00:00","mainEntityOfPage":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/scrapy-tutorial\/"},"wordCount":6736,"commentCount":0,"publisher":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/#organization"},"image":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/scrapy-tutorial\/#primaryimage"},"thumbnailUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2021\/04\/iStock-913017342.jpg","articleSection":["IT\/Software 
Development"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.mygreatlearning.com\/blog\/scrapy-tutorial\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.mygreatlearning.com\/blog\/scrapy-tutorial\/","url":"https:\/\/www.mygreatlearning.com\/blog\/scrapy-tutorial\/","name":"Scrapy Tutorial - An Introduction | Python Scrapy Tutorial","isPartOf":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/scrapy-tutorial\/#primaryimage"},"image":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/scrapy-tutorial\/#primaryimage"},"thumbnailUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2021\/04\/iStock-913017342.jpg","datePublished":"2021-04-26T11:08:00+00:00","dateModified":"2022-10-20T11:21:52+00:00","description":"Scrapy Tutorial: Scrapy does the work of a web crawler and the work of a web scraper. In this post you will know Scrapy Installation, Scrapy Packages & Scrapy File Structure.","breadcrumb":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/scrapy-tutorial\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.mygreatlearning.com\/blog\/scrapy-tutorial\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.mygreatlearning.com\/blog\/scrapy-tutorial\/#primaryimage","url":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2021\/04\/iStock-913017342.jpg","contentUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2021\/04\/iStock-913017342.jpg","width":1365,"height":768,"caption":"scrapy tutorial"},{"@type":"BreadcrumbList","@id":"https:\/\/www.mygreatlearning.com\/blog\/scrapy-tutorial\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Blog","item":"https:\/\/www.mygreatlearning.com\/blog\/"},{"@type":"ListItem","position":2,"name":"IT\/Software 