{"id":17311,"date":"2020-08-01T14:52:01","date_gmt":"2020-08-01T09:22:01","guid":{"rendered":"https:\/\/www.mygreatlearning.com\/blog\/apache-hadoop-tutorial\/"},"modified":"2024-09-03T15:15:20","modified_gmt":"2024-09-03T09:45:20","slug":"apache-hadoop-tutorial","status":"publish","type":"post","link":"https:\/\/www.mygreatlearning.com\/blog\/apache-hadoop-tutorial\/","title":{"rendered":"Apache Hadoop Tutorial |What is Apache Hadoop?"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\" id=\"big-data-introduction\"><strong>Big data \u2013 Introduction<\/strong><\/h2>\n\n\n\n<p>Before we jump into our Hadoop Tutorial, let's understand Big Data. We will start with questions like: What is big data? Why big data? What does big data offer that makes companies and industries move to it from legacy systems? Is it worth learning big data technologies, and are professionals in this field paid well?<a href=\"https:\/\/www.mygreatlearning.com\/apache\/free-courses\" target=\"_blank\" rel=\"noreferrer noopener\"> Learn Apache<\/a> today!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-is-big-data\"><strong>What is Big data?<\/strong><\/h2>\n\n\n\n<p>As the name implies, big data is data of huge size. It arrives in different forms from different sources, human or machine, with high volume, velocity and variety.<\/p>\n\n\n\n<p>Since we are talking about data, let us look at the types of data to understand the logic behind big data. <\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"types-of-data\"><strong>Types of data:<\/strong><\/h2>\n\n\n\n<p>Data can be classified into three types:<\/p>\n\n\n\n<p><strong>Structured data:<\/strong>&nbsp; Data which is represented in a tabular form. The data can be stored, accessed and processed in a fixed format. Ex: databases, tables.<\/p>\n\n\n\n<p><strong>Semi structured data:<\/strong>&nbsp; Data which does not follow a rigid tabular schema but still carries tags or markers that describe its structure. 
Ex: XML files.<\/p>\n\n\n\n<p><strong>Unstructured data:<\/strong> Data which does not have a pre-defined data model. Ex: Text files, web logs.<\/p>\n\n\n\n<p>Learn more about <a href=\"https:\/\/www.mygreatlearning.com\/blog\/structured-and-unstructured-data\/\">Structured, Unstructured, and Semi-Structured Data<\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"let-us-look-at-the-6-vs-of-big-data\"><strong>Let us look at the 6 V's of Big Data<\/strong><\/h2>\n\n\n\n<p><strong>Volume:<\/strong> The amount of data arriving from various sources, now measured in TB, PB or even ZB. Data sizes have grown well beyond the gigabyte range.<\/p>\n\n\n\n<p><strong>Velocity:<\/strong> The speed at which big data is generated, as with high-frequency data such as stock prices.<\/p>\n\n\n\n<p><strong>Veracity:<\/strong> Refers to the biases, noise and abnormalities in data.<\/p>\n\n\n\n<p><strong>Variety:<\/strong> Refers to the different forms of data. Data can come in various forms and shapes, such as visual data (pictures and videos), log data etc. This can be the biggest problem to handle for most businesses.<\/p>\n\n\n\n<p><strong>Variability:<\/strong> To what extent, and how fast, is the structure of your data changing? 
And how often does the meaning or shape of your data change?<\/p>\n\n\n\n<p><strong>Value:<\/strong> This describes the value you can extract from your data, and how big data techniques produce better results from stored data.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"challenges-in-big-data\"><strong>Challenges in Big data<\/strong><\/h2>\n\n\n\n<p><strong>Complexity:<\/strong> No proper understanding of the underlying data.<\/p>\n\n\n\n<p><strong>Storage:<\/strong> How to store amounts of data that are too large to fit on a single physical machine.<\/p>\n\n\n\n<p><strong>Performance:<\/strong> How to process large amounts of data efficiently and effectively.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"big-data-technologies\"><strong>Big Data Technologies<\/strong><\/h2>\n\n\n\n<p>Big Data is a broad field surrounded by many trends and new technology developments. The top emerging technologies listed below help users handle Big Data in a cost-effective manner.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Apache Hadoop<\/li>\n\n\n\n<li>Apache Spark<\/li>\n\n\n\n<li>Apache Hive<\/li>\n<\/ol>\n\n\n\n<p>There are many other technologies, but we will learn about these 3 in detail.<\/p>\n\n\n\n<p>Also Read: <a href=\"https:\/\/www.mygreatlearning.com\/blog\/javascript-tutorial\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"Introduction to JavaScript  (opens in a new tab)\">Introduction to JavaScript <\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"hadoop-tutorial-introduction\"><strong>Hadoop Tutorial Introduction<\/strong><\/h2>\n\n\n\n<p>Hadoop is a distributed parallel processing framework, which facilitates distributed computing.<\/p>\n\n\n\n<p>To dig deeper into this Hadoop Tutorial, we first need to understand \u201cDistributed Computing\u201d. This explains the motivation behind Hadoop and makes the rest of this Hadoop Tutorial easier to follow. 
To learn more, you can also take up <a href=\"https:\/\/www.mygreatlearning.com\/hadoop\/free-courses\" target=\"_blank\" rel=\"noreferrer noopener\">Free Hadoop Courses<\/a> and gain comprehensive knowledge about this in-demand skill. <\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"distributed-computing\"><strong>Distributed Computing<\/strong><\/h2>\n\n\n\n<p>In simple terms, distributed computing means splitting work across several machines that run in parallel. Let's take an example: say we have to paint a room in our house, and we hire a painter who takes approximately 2 hours to paint one surface. With 4 walls and 1 ceiling to paint, one painter working non-stop needs a full day (~10 hours) to finish.<\/p>\n\n\n\n<p>If several painters work at the same time, each on a different surface, the same task can be finished in a fraction of that time. This is a simple real-life example of the logic behind distributed computing.<\/p>\n\n\n\n<p>Now let's take an actual data-related problem and analyse it in the same way.<\/p>\n\n\n\n<p>We have an input file of, let's say, 1 GB of numbers, and we need to calculate their sum. On a single machine the operation may take 50 seconds.<\/p>\n\n\n\n<p>Now divide the dataset into 2 parts and give each part to a different machine. Each machine sums its own part in about 25 seconds, and the two partial sums are then added together to produce the same result.<\/p>\n\n\n\n<p>This is the fundamental idea of parallel processing.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"why-hadoop\"><strong>Why Hadoop?<\/strong><\/h2>\n\n\n\n<p>The idea of parallel processing is not new!<\/p>\n\n\n\n<p>It has existed since the era of supercomputers (back in the 1970s).<\/p>\n\n\n\n<p>Building supercomputers required an army of network engineers and a great deal of cabling, and a few research organizations still use this kind of infrastructure, which is called \u201csuper 
Computers\u201d.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"lets-see-what-were-the-challenges-of-supercomputing\"><strong>Let's see what the challenges of supercomputing were<\/strong><\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A general-purpose, operating-system-like framework for parallel computing did not exist<\/li>\n\n\n\n<li>Companies procuring supercomputers were locked to specific vendors for hardware support<\/li>\n\n\n\n<li>High initial cost of the hardware<\/li>\n\n\n\n<li>Custom software had to be developed for individual use cases<\/li>\n\n\n\n<li>High cost of software maintenance and upgrades, which had to be handled in house by the organizations using a supercomputer<\/li>\n\n\n\n<li>Not simple to scale horizontally<\/li>\n<\/ul>\n\n\n\n<p><strong><em>There had to be a better way!<\/em><\/strong> <strong><em>HADOOP comes to the rescue.<\/em><\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A general-purpose, operating-system-like framework for parallel computing<\/li>\n\n\n\n<li>It's free software (open source), and upgrades are free as well<\/li>\n\n\n\n<li>Opens up the power of distributed computing to a wider audience<\/li>\n\n\n\n<li>Mid-sized organizations need not be locked to specific vendors for hardware support \u2013 Hadoop works on commodity hardware<\/li>\n\n\n\n<li>Organizations no longer have to write proprietary software for every use case.<\/li>\n<\/ul>\n\n\n\n<p>Data is everywhere. People upload videos, take pictures, use several apps on their phones, search the web and more. Machines, too, are generating and keeping more and more data. Existing tools are incapable of processing such large data sets. Hadoop and large-scale distributed data processing, in general, are rapidly becoming an important skill set for many programmers. 
Hadoop is an open-source framework for writing and running distributed applications that process large amounts of data. This \" <a href=\"https:\/\/www.mygreatlearning.com\/academy\/learn-for-free\/courses\/introduction-to-big-data-and-hadoop\" target=\"_blank\" rel=\"noreferrer noopener\">Hadoop map reduce course<\/a>\" introduces Hadoop in terms of distributed systems as well as data processing systems. With this course, get an overview of the MapReduce programming model using a simple word counting mechanism along with existing tools that highlight the challenges of processing data at a large scale. Dig deeper and implement this example using Hadoop to gain a deeper appreciation of its simplicity.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"history-of-hadoop\"><strong>History<\/strong> <strong>of Hadoop<\/strong><\/h2>\n\n\n\n<p>Before getting into the Hadoop Tutorial, let us take a look at the history of Hadoop. <\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The need of the hour was a scalable search engine for the growing internet<\/li>\n\n\n\n<li>Internet Archive search director Doug Cutting and University of Washington graduate student Mike Cafarella set out to build a search engine, a project named NUTCH, in 2001-2002<\/li>\n\n\n\n<li>Google's distributed file system paper came out in 2003 &amp; its first map-reduce paper came out in 2004<\/li>\n\n\n\n<li>In 2006 Doug Cutting joined YAHOO and created an open source framework called HADOOP (named after his son's toy elephant). HADOOP traces its roots back to NUTCH, Google's distributed file system and the map-reduce processing engine.<\/li>\n\n\n\n<li>It went on to become a full-fledged Apache project, and a stable version of Hadoop was used at Yahoo in 2008<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"hadoop-framework-stepping-into-hadoop-tutorial\"><strong>Hadoop Framework: Stepping into Hadoop<\/strong> <strong>Tutorial<\/strong><\/h2>\n\n\n\n<p>Let us look at some key terms 
used in this Hadoop Tutorial.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Commodity hardware: PCs which can be used to make a cluster<\/li>\n\n\n\n<li>Cluster\/grid: Interconnection of systems in a network<\/li>\n\n\n\n<li>Node: A single instance of a computer<\/li>\n\n\n\n<li>Distributed System: A system composed of multiple autonomous computers that communicate through a computer network<\/li>\n\n\n\n<li>ASF: Apache Software Foundation<\/li>\n\n\n\n<li>HA: High Availability<\/li>\n\n\n\n<li>Hot stand-by: Uninterrupted failover. With a cold stand-by, by contrast, there is a noticeable delay: if the system goes down, you will have to reboot<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"master-slave-architecture80\"><strong>Master Slave Architecture<\/strong><\/h2>\n\n\n\n<p>Let's try to understand the architectural components of Hadoop 1.0 in this Hadoop Tutorial.<\/p>\n\n\n\n<p>For example:<\/p>\n\n\n\n<p>Suppose there are 10 machines in your cluster. 3 of them will always work as \u2018Masters\u2019, and they are named:<\/p>\n\n\n\n<p><em>Namenode<\/em><\/p>\n\n\n\n<p><em>Secondary name node<\/em><\/p>\n\n\n\n<p><em>Job tracker<\/em><\/p>\n\n\n\n<p>These 3 will be individual machines and will work in the master mode.<\/p>\n\n\n\n<p>The remaining 7 machines work in \u201cSlave\u201d mode: they wait for instructions from the masters. These nodes are called <em>Data nodes<\/em>.<\/p>\n\n\n\n<p>All of these machines are interconnected and belong to the same cluster.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"hadoop-deployment-modes\"><strong>HADOOP Deployment Modes<\/strong><\/h2>\n\n\n\n<p>HADOOP supports 3 configuration modes when it is implemented on commodity hardware:<\/p>\n\n\n\n<p><strong>Standalone mode:<\/strong> All services run locally on a single machine in a single JVM (seldom used)<\/p>\n\n\n\n<p><strong>Pseudo-distributed mode:<\/strong> All services run 
on the same machine, but each in its own JVM (used for development and testing)<\/p>\n\n\n\n<p><strong>Fully distributed mode:<\/strong> Each service runs on separate hardware (a dedicated server). Used in production setups.<\/p>\n\n\n\n<p>Note: Service here refers to the namenode, secondary name node, job tracker and data node.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"hadoop-ecosystem\"><strong>Hadoop Ecosystem<\/strong><\/h2>\n\n\n\n<p>As we learn more in this Hadoop Tutorial, let us now understand the roles and responsibilities of each <strong>component in the Hadoop ecosystem.<\/strong><\/p>\n\n\n\n<p>We now know that Hadoop has a distributed computing framework; it also needs a distributed file storage system. Hadoop has a built-in distributed file system called HDFS, which will be explained in detail later.<\/p>\n\n\n\n<p>HDFS (Hadoop Distributed File System) \u2013 saves a file across multiple datanodes.<\/p>\n\n\n\n<p>Files in the Hadoop cluster are split into smaller blocks, and these blocks reside on datanodes.<\/p>\n\n\n\n<p>The namenode, on the other hand, holds the information about the size of each file, how many blocks it has, and which datanodes those blocks actually reside on.<\/p>\n\n\n\n<p>The namenode maintains all this information in the form of a file table. So the namenode is the go-to place to find out where a file is located. We hope you're enjoying the Hadoop Tutorial so far! <\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"master-nodes\"><strong>Master nodes<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Name node<\/strong>:&nbsp; Central file system manager. But mind you, the namenode doesn\u2019t save any files. 
All the files reside on datanodes.<br>- <strong>Secondary name node:<\/strong> Periodic checkpoint (backup) of the name node's metadata (not a hot standby)<br>- <strong>Job tracker<\/strong>: Centralized job scheduler<\/li>\n\n\n\n<li><strong>Slave nodes and daemons\/software services<\/strong><\/li>\n\n\n\n<li><strong>Data node<\/strong>: Machine where files get stored and processed<\/li>\n\n\n\n<li><strong>Task tracker:<\/strong> A software service on each slave node which runs the tasks assigned by the job tracker and reports their status back to it<\/li>\n<\/ul>\n\n\n\n<p><strong>Note: <\/strong>Every data node sends a heartbeat signal to the name node every 3 seconds (by default) to state that it's alive. What happens when a data node goes down is discussed later.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-is-a-job-in-the-hadoop-ecosystem\"><strong>What is a job in the Hadoop ecosystem?<\/strong><\/h2>\n\n\n\n<p>A job is usually some <strong>task<\/strong> submitted by the user to the Hadoop cluster.<\/p>\n\n\n\n<p>The job is in the form of a <strong>program or collection of programs<\/strong> (a JAR file) which needs to be executed.<\/p>\n\n\n\n<p>A job has the following attributes:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>The actual program<\/li>\n\n\n\n<li>Input data to the program (a file or collection of files in a directory)<\/li>\n\n\n\n<li>The output directory where the results of execution are collected in files<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"core-features-of-apache-hadoop\"><strong>Core features of Apache Hadoop <\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>HDFS <\/strong>(Hadoop Distributed File System) \u2013 data storage<\/li>\n\n\n\n<li><strong>MapReduce<\/strong> Framework \u2013 compute in a distributed environment<br>- A Java framework responsible for processing jobs in distributed mode<br>- User-defined map phase, which is a parallel, share-nothing processing of input<br>- User-defined reduce phase, which aggregates the output of the 
map phase<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"submitting-and-executing-a-job-in-a-hadoop-cluster\"><strong>Submitting and executing a job in a hadoop cluster<\/strong><\/h2>\n\n\n\n<p>The engineer\/analyst's machine is not a part of the Hadoop cluster. Usually Hadoop would be installed in pseudo-distributed mode on his\/her machine. The job (program\/s) is submitted to the gateway machine.<\/p>\n\n\n\n<p>The gateway machine has the necessary configuration to communicate with the name node and job tracker.<\/p>\n\n\n\n<p>The job is submitted to the job tracker, which consults the name node for the location of the input blocks and is responsible for scheduling the execution of the job on the data nodes in the cluster.<\/p>\n\n\n\n<p><strong>HDFS \u2013 The storage layer in Hadoop<\/strong><\/p>\n\n\n\n<p>HDFS takes the input file from the client, splits it into blocks, and stores those blocks on the Datanodes chosen by the Namenode (which is a part of the Hadoop cluster).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"splitting-of-file-into-blocks-in-hdfs\"><strong>Splitting of file into blocks in HDFS<\/strong><\/h2>\n\n\n\n<p>The default block size is 64 MB in Hadoop 1.x (it can be changed). Suppose the original file size is 200 MB. 
The 200 MB file is split into 4 blocks: N1, N2, N3 and N4.<\/p>\n\n\n\n<p>Blocks N1, N2 and N3 are 64 MB each, and block N4 is just 8 MB (200 - 64*3).<\/p>\n\n\n\n<p>Each block is now a separate file, and N1, N2, N3 and N4 are the file names.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"file-storage-in-hdfs\"><strong>File storage in HDFS<\/strong><\/h2>\n\n\n\n<p>The original file is broken up into blocks on the client machine, not on the name node.<\/p>\n\n\n\n<p>The decision of which block resides on which data node is not made randomly!<\/p>\n\n\n\n<p>The client machine writes the blocks directly to the data nodes once the name node provides the details of those data nodes.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"failure-of-a-data-node\"><strong>Failure of a data node<\/strong><\/h2>\n\n\n\n<p><strong>What happens in the event of a data node failure? (e.g. DN 10 fails)<\/strong><\/p>\n\n\n\n<p>Data saved on that node will be lost. To avoid loss of data, copies of each data block are stored on multiple data nodes. This is called data replication.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"replication-of-data-blocks\"><strong>Replication of Data blocks<\/strong><\/h2>\n\n\n\n<p><strong>How many copies of each block to save?<\/strong><\/p>\n\n\n\n<p>It's decided by the REPLICATION FACTOR (by default it's <strong>3<\/strong>, i.e. every block of data is saved on 2 more machines so that there are 3 copies in total of the same data block on different machines). The replication factor can be set on a per-file basis when the file is first written to HDFS.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"replica-placement-strategy\"><strong>Replica Placement Strategy<\/strong><\/h2>\n\n\n\n<p>Q. 
How does the namenode choose which datanodes to store replicas on?<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Replica placement is rack aware. The namenode uses the network location when determining where to place block replicas.<\/li>\n\n\n\n<li>Tradeoff: reliability vs. read\/write bandwidth, e.g.<\/li>\n<\/ul>\n\n\n\n<p>\u2013&nbsp; \tIf all replicas are on a single node: lowest write bandwidth cost, but no redundancy if the node fails<\/p>\n\n\n\n<p>\u2013&nbsp; \tIf replicas are off-rack: real redundancy, but reads and writes across racks cost more bandwidth (more time)<\/p>\n\n\n\n<p>\u2013&nbsp; \tIf a replica is in another datacenter: best redundancy, at the cost of huge bandwidth<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Hadoop\u2019s default strategy:<\/li>\n<\/ul>\n\n\n\n<p>\u2013&nbsp; \t1st replica on the same node as the client<\/p>\n\n\n\n<p>\u2013&nbsp; \t2nd replica on a random node on a different rack (off-rack)<\/p>\n\n\n\n<p>\u2013&nbsp; \t3rd replica on the same rack as the 2nd, but on a different node<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Clients always read from the nearest node<\/li>\n\n\n\n<li>Once the replica locations are chosen, a pipeline is built taking network topology into account<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"name-node-and-secondary-name-node-in-hadoop-1-0\"><strong>Name node and secondary name node in Hadoop 1.0<\/strong><\/h2>\n\n\n\n<p>Name node: I know where the file blocks are\u2026<\/p>\n\n\n\n<p>Secondary name node: I shall back up the data of the name node<\/p>\n\n\n\n<p><em>But I do not work in HOT STANDBY mode in the event of name node failure<\/em><\/p>\n\n\n\n<p><strong>Note<\/strong>: In Hadoop 1.0, there is no active standby secondary name node.<\/p>\n\n\n\n<p>(HA: High Availability is another term used for HOT\/ACTIVE STANDBY)<\/p>\n\n\n\n<p>If the name node fails, the entire cluster goes down! 
We need to manually restart the name node, and the contents of the secondary name node have to be copied to it.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"hdfs-advantages\"><strong>HDFS Advantages<\/strong><\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Storing large files<\/li>\n\n\n\n<li>Terabytes, Petabytes, etc.<\/li>\n\n\n\n<li>Millions rather than billions of files (a smaller number of large files)<\/li>\n\n\n\n<li>Each file typically 100MB or more<\/li>\n\n\n\n<li>Streaming data<\/li>\n\n\n\n<li>WORM - write once, read many times pattern<\/li>\n\n\n\n<li>Optimized for batch\/streaming reads rather than random reads<\/li>\n\n\n\n<li>Append operation added to Hadoop 0.21<\/li>\n\n\n\n<li>Cheap commodity hardware<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"hdfs-disadvantages\"><strong>HDFS Disadvantages<\/strong><\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Large number of small files<\/li>\n\n\n\n<li>Better for a small number of large files than for many small files<\/li>\n\n\n\n<li>Low latency reads<\/li>\n\n\n\n<li>Many writes: files are written once; there are no random writes, only appends at the end of a file<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"hadoop-installation\"><strong>Hadoop Installation<\/strong><\/h2>\n\n\n\n<p>In this next step of the Hadoop Tutorial, let's look at how to install Hadoop on our machines or work in a Big Data cloud lab.<\/p>\n\n\n\n<p>Please see the installation steps:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>1. Create EC2 instance with Ubuntu 18.04\n2. Create new user, grant root permission(ALL)\nsudo addgroup hadoop\nsudo adduser --ingroup hadoop hduser\nsudo visudo\nhduser ALL=(ALL:ALL) ALL\nsu - hduser \n3. 
Install Java (upload the JDK archive to \/usr\/local using WinSCP) and follow the steps below\nMAKE SURE THAT DOUBLE QUOTES (\"\") ARE REPRESENTED CORRECTLY.\ncd \/usr\/local\nsudo tar xvzf jdk-8u181-linux-x64.tar.gz\nsudo mv jdk1.8.0_181 java\n \nls\ncd ~\nsudo nano ~\/.bashrc\n \nexport JAVA_HOME=\/usr\/local\/java\nexport PATH=$PATH:\/usr\/local\/java\/bin\n \nsource ~\/.bashrc\n \nsudo update-alternatives --install \"\/usr\/bin\/java\" \"java\" \"\/usr\/local\/java\/bin\/java\" 1\nsudo update-alternatives --install \"\/usr\/bin\/javac\" \"javac\" \"\/usr\/local\/java\/bin\/javac\" 1\nsudo update-alternatives --install \"\/usr\/bin\/javaws\" \"javaws\" \"\/usr\/local\/java\/bin\/javaws\" 1\n<\/code><\/pre>\n\n\n\n<p><strong>Verify Java installation<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>sudo update-alternatives --set java \/usr\/local\/java\/bin\/java\nsudo update-alternatives --set javac \/usr\/local\/java\/bin\/javac\nsudo update-alternatives --set javaws \/usr\/local\/java\/bin\/javaws\n \njava -version\n \n4. Passwordless SSH, follow the steps below\n \nssh localhost\n \n$ ssh-keygen -t rsa\n$ cat ~\/.ssh\/id_rsa.pub &gt;&gt; ~\/.ssh\/authorized_keys\n$ chmod 0600 ~\/.ssh\/authorized_keys\n \nThen try ssh localhost\n5. 
Install Hadoop from the Apache web site.\n \ncd \/usr\/local\n \nwget hadoop\n \nsudo tar xvzf hadoop-3.0.2.tar.gz\nsudo mv hadoop-3.0.2 hadoop\n \nFirst, give ownership of the hadoop folder to the user hduser with chown &#91;\u201cThis will give ownership only to hduser for running hadoop services\u201d], then use chmod to change the mode of the hadoop folder to read, write &amp; execute.\n \nsudo chown -R hduser:hadoop \/usr\/local\/hadoop\nsudo chmod -R 777 \/usr\/local\/hadoop\n \n \n \nDisable IPv6\nHadoop &amp; IPv6 do not agree on the meaning of address 0.0.0.0, so we need to disable IPv6 by editing the file\u2026\nsudo nano \/etc\/sysctl.conf\nwith\u2026\nnet.ipv6.conf.all.disable_ipv6=1\nnet.ipv6.conf.default.disable_ipv6=1\nnet.ipv6.conf.lo.disable_ipv6=1\n \nTo confirm whether IPv6 is disabled, execute the command below (1 means disabled; the new settings take effect after running sudo sysctl -p or rebooting).\ncat \/proc\/sys\/net\/ipv6\/conf\/all\/disable_ipv6\n \nApply changes in the .bashrc file to set up the necessary hadoop environment, including the hadoop path. 
The locations of the sbin &#91;\u201cIt stores hadoop\u2019s necessary command location\u201d] &amp; bin directories are essential; otherwise, as a user you always have to change location to hadoop\u2019s sbin or bin to run the required commands.\n \nsudo nano ~\/.bashrc\n \n#HADOOP ENVIRONMENT\nexport HADOOP_PREFIX=\/usr\/local\/hadoop\nexport HADOOP_HOME=\/usr\/local\/hadoop\nexport HADOOP_CONF_DIR=\/usr\/local\/hadoop\/etc\/hadoop\nexport HADOOP_MAPRED_HOME=\/usr\/local\/hadoop\nexport HADOOP_COMMON_HOME=\/usr\/local\/hadoop\nexport HADOOP_HDFS_HOME=\/usr\/local\/hadoop\nexport YARN_HOME=\/usr\/local\/hadoop\nexport PATH=$PATH:\/usr\/local\/hadoop\/bin\nexport PATH=$PATH:\/usr\/local\/hadoop\/sbin\n \n#HADOOP NATIVE PATH:\nexport HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME\/lib\/native\nexport HADOOP_OPTS=\"-Djava.library.path=$HADOOP_PREFIX\/lib\"\n \n \ncd \/usr\/local\/hadoop\/etc\/hadoop\/\n \nsudo nano hadoop-env.sh\n \nexport HADOOP_OPTS=-Djava.net.preferIPv4Stack=true\nexport JAVA_HOME=\/usr\/local\/java\nexport HADOOP_HOME_WARN_SUPPRESS=\"TRUE\"\nexport HADOOP_ROOT_LOGGER=\"WARN,DRFA\"\n \nsudo nano yarn-site.xml\n \n&lt;property&gt;\n&lt;name&gt;yarn.nodemanager.aux-services&lt;\/name&gt;\n&lt;value&gt;mapreduce_shuffle&lt;\/value&gt;\n&lt;\/property&gt;\n&lt;property&gt;\n&lt;name&gt;yarn.nodemanager.aux-services.mapreduce_shuffle.class&lt;\/name&gt;\n&lt;value&gt;org.apache.hadoop.mapred.ShuffleHandler&lt;\/value&gt;\n&lt;\/property&gt;\n \nsudo nano hdfs-site.xml\n \n&lt;property&gt;\n&lt;name&gt;dfs.replication&lt;\/name&gt;\n&lt;value&gt;1&lt;\/value&gt;\n&lt;\/property&gt;\n&lt;property&gt;\n&lt;name&gt;dfs.namenode.name.dir&lt;\/name&gt;\n&lt;value&gt;file:\/usr\/local\/hadoop\/yarn_data\/hdfs\/namenode&lt;\/value&gt;\n&lt;\/property&gt;\n&lt;property&gt;\n&lt;name&gt;dfs.datanode.data.dir&lt;\/name&gt;\n&lt;value&gt;file:\/usr\/local\/hadoop\/yarn_data\/hdfs\/datanode&lt;\/value&gt;\n&lt;\/property&gt;\n \nsudo nano core-site.xml\n 
\n&lt;property&gt;\n&lt;name&gt;hadoop.tmp.dir&lt;\/name&gt;\n&lt;value&gt;\/app\/hadoop\/tmp&lt;\/value&gt;\n&lt;\/property&gt;\n&lt;property&gt;\n&lt;name&gt;fs.default.name&lt;\/name&gt;\n&lt;value&gt;hdfs:\/\/localhost:9000&lt;\/value&gt;\n&lt;\/property&gt;\n \nsudo nano mapred-site.xml\n \n&lt;property&gt;\n&lt;name&gt;mapreduce.framework.name&lt;\/name&gt;\n&lt;value&gt;yarn&lt;\/value&gt;\n&lt;\/property&gt;\n&lt;property&gt;\n&lt;name&gt;mapreduce.jobhistory.address&lt;\/name&gt;\n&lt;value&gt;localhost:10020&lt;\/value&gt;\n&lt;\/property&gt;\n \nsudo mkdir -p \/app\/hadoop\/tmp\nsudo chown -R hduser:hadoop \/app\/hadoop\/tmp\nsudo chmod -R 777 \/app\/hadoop\/tmp\n \nsudo mkdir -p \/usr\/local\/hadoop\/yarn_data\/hdfs\/namenode\nsudo mkdir -p \/usr\/local\/hadoop\/yarn_data\/hdfs\/datanode\nsudo chmod -R 777 \/usr\/local\/hadoop\/yarn_data\/hdfs\/namenode\nsudo chmod -R 700 \/usr\/local\/hadoop\/yarn_data\/hdfs\/datanode\nsudo chown -R hduser:hadoop \/usr\/local\/hadoop\/yarn_data\/hdfs\/namenode\nsudo chown -R hduser:hadoop \/usr\/local\/hadoop\/yarn_data\/hdfs\/datanode\n \nhdfs namenode -format\n \n \nstart-dfs.sh\nstart-yarn.sh\n \n \n \n6. In setting the Hadoop Environment section,\n#HADOOP NATIVE PATH:\nexport HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME\/lib\/native\nexport HADOOP_OPTS=\"-Djava.library.path=$HADOOP_PREFIX\/lib\"\n \nMake sure that HADOOP_COMMON_LIB_NATIVE_DIR has no double quotes.\n \n7. In hadoop-env.sh make sure double quotes are intact.\n \n8. While creating a datanode directory, make sure that the permissions are 700.\n \n9. Start the daemons. Check the namenode web UI at port 9870. Check the resource manager web UI (port 8088 by default).\n<\/code><\/pre>\n\n\n\n<p>Once the installation is successfully done, we can play with some UNIX commands (it is better to have a grasp of the basics).<\/p>\n\n\n\n<p>Basic knowledge of UNIX commands is required to navigate our file system. 
Hope you're keeping up with the Hadoop Tutorial so far!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"hdfs-basic-commands\"><strong>HDFS- Basic commands<\/strong><\/h2>\n\n\n\n<p><strong>Let us see how:<\/strong><\/p>\n\n\n\n<p>pwd \u2013 print working directory<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/u_q2sZlOL04r7kWN_QxUA4UCb967nP26jkNL2trNz64CDSj15S00SUp5Aw53bYrXorDT_x1YrErQh-muXvNLXNrMxP9HSpS9MRbZgQfHhgWWtKAVOlAkPeHpB7pxrXKZMyL0SaO_\" alt=\"\"\/><\/figure>\n\n\n\n<p>Here hduser is the name of the user<\/p>\n\n\n\n<p>\u2018\/\u2019 is root.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>To navigate to the root directory:<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/24GD2zcYO8_FE-_Dnk0VA3PgDgSzAnU8m1U2RH-HbxrG7ZqDQdoGkb1HKY_8Q0Qk2cBQrJq3wp0p2DmGNyahEnHMNpTS-zw7__Mn1YUKlX-9xlw9Vz1fXkvIU1EK-sa0JuyN2PX7\" alt=\"\"\/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li>To list all the contents:<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/Ig6NHEBJGEYm0X2nkhjXdzEy3nCYJH5byn-RIoKJcuNm52N3dIPx87NwRkapF_c8yI9THcQ9BTdnV9E_4PSWJFP7mblsyjCuym0cB5o9zvlrH1rxWCkBw4JKUVjo0EH8geUFmjxN\" alt=\"hadoop tutorial\"\/><\/figure>\n\n\n\n<p>For demonstrating HDFS commands, we need to first start all the Hadoop services. 
Since Hadoop is installed in pseudo-distributed mode, all the Hadoop components (the namenode, the secondary namenode, and the job tracker, which is the resource manager in Hadoop 2.0) run on this one machine.<\/p>\n\n\n\n<p>We will run the built-in script.<\/p>\n\n\n\n<p><strong>start-all.sh<\/strong><\/p>\n\n\n\n<p>Type the command and wait for a while.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/VUAKr9O_bdVptx85r9IFASteE2m0Zrhe86qaROAnWuYdAxtcdZBks9xY3zeLt4_qO7DifvjInl8PG4Or4spM_lvpiNfGxTyYfisASgXFQYmKiI_gUIJmvR_-kbU42lESPSMGVFxj\" alt=\"hadoop tutorial\"\/><\/figure>\n\n\n\n<p>When the Hadoop services start, both the MapReduce engine and HDFS are up and running by default.<\/p>\n\n\n\n<p>See the messages above for a clearer understanding.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Now, how do we check whether all these services are running?<\/li>\n<\/ul>\n\n\n\n<p>Type in the command jps, which lists the Java processes on the system.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh4.googleusercontent.com\/TXE3xURUjKRJPvXJpt5M5Dlp9pprrYdHe83yP0kzUskkKiZbzIYWamrB1ITJdA7jbpeOyxrm_8Atik2FbbK0_CYPUIPBthmLZ4EXtkHhMSmrJtGaatJL-Er3klLathgbluXUfy4h\" alt=\"hadoop tutorial\"\/><\/figure>\n\n\n\n<p>The numbers are the process IDs of each process (they can be ignored).<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>To see what is in the home directory of the Hadoop file system, we cannot use the plain \u2018ls\u2019 command. 
We use the below command<\/li>\n<\/ul>\n\n\n\n<p>hadoop fs -ls<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh5.googleusercontent.com\/aZU2m5k5swcXmzKngruWNY9okFWlkXkiZcA4kM1lWbmgPzKLoY8d-mxTJiXyTeaYXoTTtVhEN1vZDi7F_xg0PS5l3b6CjYvo4gNU47m6HD28C5upvuSHM1rIu9iIlcc6IDWymx6N\" alt=\"\"\/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Now, to see where exactly this sample is stored:<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/m-cXRnwi_Ou4qMfDsTEhLc0HgNMsoNfwTGveoiobwfxIEqLb7Y0hUewlHi_nq0rw43UNVKFZXi2fH4LZz3yc7tNR2-3IHKq4EDD52oo5W2eiSBL8FAbCtS24zFZ3LMr62OPCepe2\" alt=\"\"\/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li>How to make a directory in HDFS?<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh4.googleusercontent.com\/LHK_IqXzsgtoqxmAf8B50yrTtuK6MEQVZcUPtMX_QO-APgqlDkZ1DkxS9tQu9MNs7duqwKYbFQdsLGU1nOUnTNu1rQuNs5DEbj0sVwsksh1KojqpNN61jQXQsLItJNsXTbfKLDXD\" alt=\"hadoop tutorial\"\/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li>To delete files from HDFS<\/li>\n<\/ul>\n\n\n\n<p>Suppose there are 2 folders, sample and sample_new, as listed above.<\/p>\n\n\n\n<p>Ex: to remove the \u2018sample_new\u2019 folder, see below.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh4.googleusercontent.com\/vaHsD95Na-TbStQuAnHx5hbjL-9y2p9PAv7RB4yYa9-EScT3n8Wew1PxExAmF1nOxz6UTsXReMJjzrppPHGrt16Yr2WBEuFYy-rH0MAASWxJur1aEUMpPhQvhZahRlBLHghgshpx\" alt=\"\"\/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li>To copy a file from the local file system to the Hadoop file system<\/li>\n<\/ul>\n\n\n\n<p>Here, \u2018samplefile.txt\u2019 is a small file in the local file system and \u2018sample101\u2019 is the destination folder in HDFS.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" 
src=\"https:\/\/lh6.googleusercontent.com\/zQ3tzZkQZtr5SO8ikoFL02_QCR5ESdpmNi4zZjl8bdBzlFEwHsCrhfg2-hGYZ2JviVx9KahVu8pXf0Xxln2WxtHIkRkr9_fA9hV0pw4rsHw_0MUf3-ElHiLzV0qtVck-4Gki0v6M\" alt=\"\"\/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li>To see the content of the file (here, samplefile.txt)<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh5.googleusercontent.com\/or5xq72Ay8tkpReCiDAUHv37ExOcFbFlkhg_0gqK9tgGChGbnIUFj-74Acr6mbM2_fAgRJTr_MJLrcazdV7Y19Un7dWop8bIauU7Zi8qoPuStGa7DZPnKSXAiNw5au7MD_pRF_qf\" alt=\"\"\/><\/figure>\n\n\n\n<p><strong>Note: This command is not recommended for viewing the content of a big data file in a Hadoop cluster, since it prints the entire file to the terminal.<\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"mapreduce-a-programming-paradigm\"><strong>Mapreduce: A Programming Paradigm<\/strong><\/h2>\n\n\n\n<p>As we now know, Hadoop comes with 2 important components: HDFS (storage) and MapReduce (processing).<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A Java framework for processing parallelizable problems across huge datasets, using commodity hardware, in a distributed environment<\/li>\n\n\n\n<li>Google has used it to process its \u201cbig-data\u201d sets (about 20 PB\/day, as reported in 2008)<\/li>\n\n\n\n<li>Can be implemented in many languages: Java, C++, Ruby, Python etc.<\/li>\n<\/ul>\n\n\n\n<p>Even Apache Spark builds on the MapReduce style of processing, so the idea is going to be useful for understanding Spark as well. Hold on, we will learn Spark in detail.<\/p>\n\n\n\n<p>Now let us understand the basic logic of MapReduce by looking at the wordcount problem.<\/p>\n\n\n\n<p><strong><em>Problem statement: To count the frequency of words in a file.<\/em><\/strong><\/p>\n\n\n\n<p>Input file name: <strong>secret.txt<\/strong>, which has just 2 lines of data. 
Contents of the file: <strong><em>this is not a secret if you read it<\/em><\/strong><\/p>\n\n\n\n<p><strong><em>it is a secret if you do not read it<\/em><\/strong><\/p>\n\n\n\n<p>THE EXPECTED OUTPUT IS AS BELOW<\/p>\n\n\n\n<p>this &nbsp; &nbsp; 1<\/p>\n\n\n\n<p>is &nbsp; &nbsp; 2<\/p>\n\n\n\n<p>not &nbsp; &nbsp; 2<\/p>\n\n\n\n<p>a &nbsp; &nbsp; 2<\/p>\n\n\n\n<p>secret &nbsp; &nbsp; 2<\/p>\n\n\n\n<p>if &nbsp; &nbsp; 2<\/p>\n\n\n\n<p>you &nbsp; &nbsp; 2<\/p>\n\n\n\n<p>do &nbsp; &nbsp; 1<\/p>\n\n\n\n<p>read &nbsp; &nbsp; 2<\/p>\n\n\n\n<p>it &nbsp; &nbsp; 3<\/p>\n\n\n\n<p>The MapReduce framework, as the name implies, is a two-stage approach (Map and Reduce), and within each stage we have sub-stages.<\/p>\n\n\n\n<p>In the Map stage, the first sub-stage is the Record reader, which reads the input file line by line. See the demonstration below.<\/p>\n\n\n\n<p>Approach using Map and Reduce:<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"the-map-stage\"><strong>THE MAP STAGE<\/strong><\/h2>\n\n\n\n<p>&nbsp;<strong><em>this is not a secret if you read it<\/em><\/strong><\/p>\n\n\n\n<p>(The above is the first line in the input file)<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/AaOVdaBEFDfqbDJYjRGw7HRQWU8ixOCCSGcBo4cClGMFs-6h04_YrlyMObim_0mhofozk_Eeb_6PpZX1WVHqLQ4Dwa5OiP7NR6oYCDR_BAwqKW7egSJKz_8BY47mFCIKDQ6_ule5\" alt=\"hadoop tutorial\"\/><\/figure>\n\n\n\n<p>Now, for the sake of naming conventions, we will use the term key to refer to Output 1 and value to refer to Output 2.<\/p>\n\n\n\n<p><strong>Output of the record reader<\/strong><\/p>\n\n\n\n<p>Output 1 (a number), Output 2&nbsp;(the entire line)<\/p>\n\n\n\n<p>Output 1 is always called the <strong>KEY<\/strong><\/p>\n\n\n\n<p>Output 2 is always called the <strong>VALUE<\/strong><\/p>\n\n\n\n<p>(The same naming conventions would be used throughout the discussion 
hereafter)<\/p>\n\n\n\n<p><strong>KEY:<\/strong> 0&nbsp; (file offset)<\/p>\n\n\n\n<p><strong>VALUE:<\/strong> this is not a secret if you read it (first line)<\/p>\n\n\n\n<p>To understand what exactly a line offset in a file is:<\/p>\n\n\n\n<p><strong>Consider this file having 2 lines<\/strong><\/p>\n\n\n\n<p>It\u2019s a new file<\/p>\n\n\n\n<p>Which is almost used for nothing !<\/p>\n\n\n\n<p>Assume each character in the file occupies one byte of data<\/p>\n\n\n\n<p>First character of line one starts at location 0<\/p>\n\n\n\n<p>Number of characters in the first line:<\/p>\n\n\n\n<p>It\u2019s: 4 &nbsp; &nbsp; &nbsp; space: 1 (total 5)<\/p>\n\n\n\n<p>a: 1 &nbsp; &nbsp; &nbsp; space: 1 (total 2)<\/p>\n\n\n\n<p>new: 3 &nbsp; &nbsp; &nbsp; space: 1 (total 4)<\/p>\n\n\n\n<p>file: 4 &nbsp; &nbsp; &nbsp; new line: 1 (total 5)<\/p>\n\n\n\n<p>Total of&nbsp;16 characters or 16 bytes, occupying locations 0 to 15. The next line would therefore begin at location <strong>16.<\/strong> File offset for the next line is <strong>16<\/strong>.<\/p>\n\n\n\n<p><strong>Record reader\u2019s Output to the mapper<\/strong><\/p>\n\n\n\n<p><strong><em>0&nbsp; (KEY)&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; this is not a secret if you read it (VALUE)<\/em><\/strong><\/p>\n\n\n\n<p>The MAPPER can be programmed. It accepts one key value pair at a time as input and produces key value pairs as output.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Mapper can process only one key &amp; value at a time<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li>It produces output in key, value pairs based on what it is programmed to perform<\/li>\n<\/ul>\n\n\n\n<p><strong>Programming the Mapper:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Mapper can be programmed based on the problem statement<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The input is a key value pair (file offset, one line from file)<\/li>\n<\/ul>\n\n\n\n<p><strong>0, this is not a secret if you read 
it<\/strong><\/p>\n\n\n\n<p>In the word count problem, we shall program the mapper to do the following:<\/p>\n\n\n\n<p><strong>Step 1: <\/strong>Ignore the key (file offset)<\/p>\n\n\n\n<p><strong>Step 2: <\/strong>Extract each word from the line<\/p>\n\n\n\n<p><strong>Step 3: <\/strong>Produce the output in key value pairs, where the key is each word of the line and the value is 1 (an integer)<\/p>\n\n\n\n<p><strong>Output of the Mapper:<\/strong><\/p>\n\n\n\n<p>For the input pair (0, this is not a secret if you read it), the mapper produces:<\/p>\n\n\n\n<p>this &nbsp; 1<\/p>\n\n\n\n<p>is &nbsp; 1<\/p>\n\n\n\n<p>not &nbsp; 1<\/p>\n\n\n\n<p>a &nbsp; 1<\/p>\n\n\n\n<p>secret &nbsp; 1<\/p>\n\n\n\n<p>if &nbsp; 1<\/p>\n\n\n\n<p>you &nbsp; 1<\/p>\n\n\n\n<p>read &nbsp; 1<\/p>\n\n\n\n<p>it &nbsp; 1<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"the-sort-operation\"><strong>The sort operation<\/strong><\/h2>\n\n\n\n<p>Output of the mapper is fed into the <strong>sorter<\/strong>, which sorts the mapper output in ascending order of the KEYS (lexicographic or dictionary ordering, since the keys are of string type).<\/p>\n\n\n\n<p><strong>Note: The sorter can be reprogrammed (overridden) to order the keys differently if required. 
This is done with what is called the sort comparator<\/strong>.<\/p>\n\n\n\n<p><strong>Input to the SORT phase (mapper output for both lines of the file):<\/strong><\/p>\n\n\n\n<p>this 1<\/p>\n\n\n\n<p>is 1<\/p>\n\n\n\n<p>not 1<\/p>\n\n\n\n<p>a 1<\/p>\n\n\n\n<p>secret 1<\/p>\n\n\n\n<p>if 1<\/p>\n\n\n\n<p>you 1<\/p>\n\n\n\n<p>read 1<\/p>\n\n\n\n<p>it 1<\/p>\n\n\n\n<p>it 1<\/p>\n\n\n\n<p>is 1<\/p>\n\n\n\n<p>a 1<\/p>\n\n\n\n<p>secret 1<\/p>\n\n\n\n<p>if 1<\/p>\n\n\n\n<p>you 1<\/p>\n\n\n\n<p>do 1<\/p>\n\n\n\n<p>not 1<\/p>\n\n\n\n<p>read 1<\/p>\n\n\n\n<p>it 1<\/p>\n\n\n\n<p><strong>Output of the sort phase<\/strong><\/p>\n\n\n\n<p>a 1<\/p>\n\n\n\n<p>a 1<\/p>\n\n\n\n<p>do 1<\/p>\n\n\n\n<p>if 1<\/p>\n\n\n\n<p>if 1<\/p>\n\n\n\n<p>is 1<\/p>\n\n\n\n<p>is 1<\/p>\n\n\n\n<p>it 1<\/p>\n\n\n\n<p>it 1<\/p>\n\n\n\n<p>it 1<\/p>\n\n\n\n<p>not 1<\/p>\n\n\n\n<p>not 1<\/p>\n\n\n\n<p>read 1<\/p>\n\n\n\n<p>read 1<\/p>\n\n\n\n<p>secret 1<\/p>\n\n\n\n<p>secret 1<\/p>\n\n\n\n<p>this 1<\/p>\n\n\n\n<p>you 1<\/p>\n\n\n\n<p>you 1<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"reduce-stage\"><strong>REDUCE STAGE:<\/strong><\/h2>\n\n\n\n<p><strong>It has 3 sub-stages: merge, shuffle and the reducer operation<\/strong><\/p>\n\n\n\n<p>The output of the several mappers will be merged into a single file at the reduce stage<\/p>\n\n\n\n<p>SHUFFLE\/aggregate phase in REDUCE stage:<\/p>\n\n\n\n<p>Shuffling is a phase where the values of duplicate keys from the input are aggregated under a single key.<\/p>\n\n\n\n<p>Consider this simple example<\/p>\n\n\n\n<p>Input is key value pairs (<strong>contains duplicate keys<\/strong>)<\/p>\n\n\n\n<p>Output is a set of key value pairs <strong>without duplicate keys<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>Key<\/strong><\/td><td><strong>Value<\/strong><\/td><\/tr><tr><td>Apple<\/td><td>2<\/td><\/tr><tr><td>Apple<\/td><td>4<\/td><\/tr><tr><td>Mango<\/td><td>1<\/td><\/tr><tr><td>Orange<\/td><td>11<\/td><\/tr><tr><td>Orange<\/td><td>6<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" 
src=\"https:\/\/lh6.googleusercontent.com\/zMMYyGHMHIHo4u_g29UPjN6g-eLRWwIJ7JljT5PnMWf4aGob0k8By_23KlYuehPDzzql5Vut-21v4y2MVaGHdQRCkdKcd9w04mjBPlt-sUrc4KPjEnEdO1TBUjYi8EKtm8dYlykp\" alt=\"\"\/><\/figure>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>Key<\/strong><\/td><td><strong>Value<\/strong><\/td><\/tr><tr><td>Apple<\/td><td>2, 4<\/td><\/tr><tr><td>Mango<\/td><td>1<\/td><\/tr><tr><td>Orange<\/td><td>11, 6<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>So the Shuffle operation at the reduce stage<\/p>\n\n\n\n<p>a 1<\/p>\n\n\n\n<p>a 1<\/p>\n\n\n\n<p>do 1<\/p>\n\n\n\n<p>if &nbsp; 1<\/p>\n\n\n\n<p>if &nbsp; 1<\/p>\n\n\n\n<p>is &nbsp; 1<\/p>\n\n\n\n<p>is &nbsp; 1<\/p>\n\n\n\n<p>it &nbsp; 1<\/p>\n\n\n\n<p>it &nbsp; 1<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/q7t2Fy5v9mP5WfbpHATcOmXOp-9tZwykvyZSntFDV-Q7HCH9QMYYcekEHrlWY2bfDKlSyfSmjhxMA6N-Ysr2Iz2Hmz2pe0c-xdFOEU0URq8PP4KgEUn2TJ5r6198Busuvquhyy_b\" alt=\"hadoop tutorial\"\/><\/figure>\n\n\n\n<p>a &nbsp; 1,1<\/p>\n\n\n\n<p>do&nbsp; 1<\/p>\n\n\n\n<p>if &nbsp; 1,1<\/p>\n\n\n\n<p>is &nbsp; 1,1<\/p>\n\n\n\n<p>it \t1,1<br><\/p>\n\n\n\n<p><strong>The REDUCER operation in REDUCE stage<\/strong><\/p>\n\n\n\n<p>Reducer accepts the input from the shuffle stage<\/p>\n\n\n\n<p>Reducer produces output in key value pairs based on what it is programmed to do as per the problem statement<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"programming-the-reducer\"><strong>Programming the reducer<\/strong><\/h2>\n\n\n\n<p>Output of the shuffle is the input to the reducer<\/p>\n\n\n\n<p>Reducer can handle only one key value pair at a time<\/p>\n\n\n\n<p>Step 1: Input to the reducer is&nbsp;a 1, 1<\/p>\n\n\n\n<p>Step 2: The reducer must add the list of values from the input i.e<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <strong>sum=1+1 = 2<\/strong><\/p>\n\n\n\n<p>Step 3: Output the key and sum as output key, value pairs&nbsp; 
to an output file. The output would look like a&nbsp; 2.<\/p>\n\n\n\n<p>Step 4: Repeat the above operations (1, 2 &amp; 3) for the entire input<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"final-output-of-the-reducer\"><strong>Final output of the Reducer:<\/strong><\/h4>\n\n\n\n<p>Input to the reducer (output of the shuffle)<\/p>\n\n\n\n<p>a &nbsp; 1,1<\/p>\n\n\n\n<p>do&nbsp; 1<\/p>\n\n\n\n<p>if &nbsp; 1,1<\/p>\n\n\n\n<p>is&nbsp; 1,1<\/p>\n\n\n\n<p>it&nbsp; 1,1,1<\/p>\n\n\n\n<p>not&nbsp; 1,1<\/p>\n\n\n\n<p>read&nbsp; 1,1<\/p>\n\n\n\n<p>secret&nbsp; 1,1<\/p>\n\n\n\n<p>this&nbsp; 1<\/p>\n\n\n\n<p>you &nbsp; 1,1<\/p>\n\n\n\n<p>Reducer output (final output)<\/p>\n\n\n\n<p>a&nbsp; 2<\/p>\n\n\n\n<p>do&nbsp; 1<\/p>\n\n\n\n<p>if &nbsp; 2<\/p>\n\n\n\n<p>is&nbsp; 2<\/p>\n\n\n\n<p>it&nbsp; 3<\/p>\n\n\n\n<p>not &nbsp; 2<\/p>\n\n\n\n<p>read&nbsp; 2<\/p>\n\n\n\n<p>secret&nbsp; 2<\/p>\n\n\n\n<p>this&nbsp; 1<\/p>\n\n\n\n<p>you&nbsp; 2<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"summary\"><strong>Summary<\/strong><\/h2>\n\n\n\n<p>As we're coming to the end of our Hadoop Tutorial, let us summarize. The input file is processed by the record reader, whose output goes to the MAPPER. The mapper output is sorted, then merged and shuffled in the reduce stage, and finally the reducer produces the output.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"closer-look-to-map-reduce\"><strong>A closer look at Map reduce<\/strong><\/h2>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"map-reduce-approach-to-anagram-problem\"><strong>Map-Reduce Approach to Anagram Problem:<\/strong><\/h4>\n\n\n\n<p><strong>Identifying the Anagrams in a Text file:<\/strong><\/p>\n\n\n\n<p><strong>What are anagrams?<\/strong><\/p>\n\n\n\n<p>MARY is a word and ARMY is another word, formed by rearranging the letters of the original word MARY<\/p>\n\n\n\n<p>\u2022 &nbsp; \tMARY and ARMY are <em>anagrams<\/em><\/p>\n\n\n\n<p>\u2022 &nbsp; \tPOOL and LOOP are <em>anagrams<\/em>. 
There could be a lot of such examples.<\/p>\n\n\n\n<p>Note: We are interested in finding anagram combinations in a text document which does not contain irrelevant gibberish words<\/p>\n\n\n\n<p><strong>Problem Statement:<\/strong><\/p>\n\n\n\n<p>To identify and list all the anagrams found in a document, e.g. a book (a novel)<\/p>\n\n\n\n<p>Input file name: sample.txt (a file in text format), which has 2 lines.<\/p>\n\n\n\n<p><strong>File contents:<\/strong> mary worked in army<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;    the loop fell into the pool<\/p>\n\n\n\n<p><strong>Expected output:<\/strong> (must contain all the anagrams)<\/p>\n\n\n\n<p>mary army<\/p>\n\n\n\n<p>loop pool<\/p>\n\n\n\n<p><strong>Output of record Reader:<\/strong><\/p>\n\n\n\n<p>This is going to be the output of the record reader after reading the first line of the file<\/p>\n\n\n\n<p>Contents of the file: <\/p>\n\n\n\n<p>mary worked in army<\/p>\n\n\n\n<p>the loop fell into the pool<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td>KEY<\/td><td>VALUE<\/td><\/tr><tr><td>file offset<\/td><td>entire line of the file<\/td><\/tr><tr><td>0<\/td><td>mary worked in army<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>The above (key-value pair) is now going to be fed into the mapper as an input.<\/p>\n\n\n\n<p><strong>Programming the Mapper<\/strong><\/p>\n\n\n\n<p>Mapper is programmed to do the following<\/p>\n\n\n\n<p>Step 1: Ignore the key from the record reader<\/p>\n\n\n\n<p>Step 2: Split the words in the value (the full line)<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\tmary worked in army<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\t[mary] [worked] [in] [army] (the line is split)<\/p>\n\n\n\n<p>Step 3: Compute the word length of each word<\/p>\n\n\n\n<p>Step 4: Output the word length as key and original word as 
value. The sample output of the mapper would look like:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td>Key<\/td><td>Value<\/td><\/tr><tr><td>4<\/td><td>mary<\/td><\/tr><tr><td>6<\/td><td>worked<\/td><\/tr><tr><td>2<\/td><td>in<\/td><\/tr><tr><td>4<\/td><td>army<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Step 5: Repeat the above steps for all the words in the line<\/p>\n\n\n\n<p><strong>Output of the Mapper after Processing the Entire file.<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td>Key<\/td><td>Value<\/td><\/tr><tr><td>4<\/td><td>mary<\/td><\/tr><tr><td>6<\/td><td>worked<\/td><\/tr><tr><td>2<\/td><td>in<\/td><\/tr><tr><td>4<\/td><td>army<\/td><\/tr><tr><td>3<\/td><td>the<\/td><\/tr><tr><td>4<\/td><td>loop<\/td><\/tr><tr><td>4<\/td><td>fell<\/td><\/tr><tr><td>4<\/td><td>into<\/td><\/tr><tr><td>3<\/td><td>the<\/td><\/tr><tr><td>4<\/td><td>pool<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p><strong>Output after sorting the keys:<\/strong><\/p>\n\n\n\n<p><strong>KEY &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \tVALUE<\/strong><\/p>\n\n\n\n<p>&nbsp;&nbsp;2 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \tin<\/p>\n\n\n\n<p>&nbsp;&nbsp;3 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \tthe<\/p>\n\n\n\n<p>&nbsp;&nbsp;3 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \tthe<\/p>\n\n\n\n<p>&nbsp;&nbsp;4 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \tmary<\/p>\n\n\n\n<p>&nbsp;&nbsp;4 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \tarmy<\/p>\n\n\n\n<p>&nbsp;&nbsp;4 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \tloop<\/p>\n\n\n\n<p>&nbsp;&nbsp;4 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \tfell<\/p>\n\n\n\n<p>&nbsp;&nbsp;4 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \tinto<\/p>\n\n\n\n<p>&nbsp;&nbsp;4&nbsp; 
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \tpool<\/p>\n\n\n\n<p>&nbsp;&nbsp;6 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \tworked<\/p>\n\n\n\n<p><strong>This is the Input to the Reducer (after shuffling):<\/strong><\/p>\n\n\n\n<p><strong>KEY &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \tVALUE<\/strong><\/p>\n\n\n\n<p>2 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \tin<\/p>\n\n\n\n<p>3 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \tthe, the<\/p>\n\n\n\n<p>4 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \tmary, army, loop, fell, into, pool<\/p>\n\n\n\n<p>6 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \tworked<\/p>\n\n\n\n<p><strong>What can we do in the Reducer now to identify the Anagrams?<\/strong><\/p>\n\n\n\n<p>\u2022 &nbsp; \tPick one word at a time from the list of values for every key value pair<\/p>\n\n\n\n<p>\u2022 &nbsp; \tCheck if the same combination of letters is present in every other word in the list<\/p>\n\n\n\n<p>i.e., check whether the letters m, a, r and y are all present in army; if true, then mary and army are anagrams<\/p>\n\n\n\n<p>\u2022 &nbsp; \tHow to resolve <em>the<\/em> and <em>the<\/em>, as both contain the same combination of letters?<\/p>\n\n\n\n<p><em>It\u2019s simple, we can do a 
string comparison and if the strings are identical then we can ignore them!<\/em><\/p>\n\n\n\n<p><strong>Problem with this Approach<\/strong><\/p>\n\n\n\n<p>\u2022 &nbsp; \tThis looks like a solution, however it has several challenges<\/p>\n\n\n\n<p>\u2022 &nbsp; \tConsider the below key value pair<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4 &nbsp; &nbsp; mary, army, loop, fell, into, pool<\/p>\n\n\n\n<p>\u2022 &nbsp; \tTo check whether the letters m, a, r and y are present in one other word takes 4 X 4 = 16 comparisons<\/p>\n\n\n\n<p>\u2022 &nbsp; \tChecking one word against the whole value list therefore costs roughly 16 comparisons multiplied by the number of words in the list = 16 X 6 = 96 comparison operations, and this has to be repeated for each word<\/p>\n\n\n\n<p>\u2022 &nbsp; \tWhat if the list is too long? This just worsens the computation time in the event of large data sets (big data)<\/p>\n\n\n\n<p>\u2022 &nbsp; \tReducer is overloaded here!<\/p>\n\n\n\n<p>\u2022 &nbsp; \tWhat seemed to be a solution is not a practical approach for a BIG DATA set<\/p>\n\n\n\n<p><strong>The Alternate approach would be as follows<\/strong><\/p>\n\n\n\n<p><strong>Output of the Record reader<\/strong><\/p>\n\n\n\n<p>This is going to be the output of the record reader after reading the first line of the file<\/p>\n\n\n\n<p>Contents of the file: mary worked in army<\/p>\n\n\n\n<p>&nbsp;&nbsp;  &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; the loop fell into the pool<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td>Key<\/td><td>Value<\/td><\/tr><tr><td>file offset<\/td><td>entire line of the file<\/td><\/tr><tr><td>0<\/td><td>mary worked in army<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>The above (key-value pair) is now going to be fed into the mapper as an input.<\/p>\n\n\n\n<p><strong>Programming the Mapper:<\/strong><\/p>\n\n\n\n<p>Mapper is programmed to do the following<\/p>\n\n\n\n<p>Step 1: Ignore the key from the record reader<\/p>\n\n\n\n<p>Step 2: Split the words in the value (the 
full line)<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\tmary worked in army<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\t[mary] [worked] [in] [army] (the line is split)<\/p>\n\n\n\n<p>Step 3: Sort the letters of each word in dictionary order (lexicographic ordering)<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\tmary after sorting would become <em>amry<\/em><\/p>\n\n\n\n<p>Step 4: Output the sorted word as key and the original word as value. The sample output of the mapper would look like<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td>Key<\/td><td>Value<\/td><\/tr><tr><td>amry<\/td><td>mary<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Step 5: Repeat the above steps for all the words in the line<\/p>\n\n\n\n<p><strong>Output of the mapper after processing the entire file<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td>Key<\/td><td>Value<\/td><\/tr><tr><td>amry<\/td><td>mary<\/td><\/tr><tr><td>dekorw<\/td><td>worked<\/td><\/tr><tr><td>in<\/td><td>in<\/td><\/tr><tr><td>amry<\/td><td>army<\/td><\/tr><tr><td>eht<\/td><td>the<\/td><\/tr><tr><td>loop<\/td><td>loop<\/td><\/tr><tr><td>efll<\/td><td>fell<\/td><\/tr><tr><td>inot<\/td><td>into<\/td><\/tr><tr><td>eht<\/td><td>the<\/td><\/tr><tr><td>loop<\/td><td>pool<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p><strong>Output after sorting the keys<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td>Key<\/td><td>Value<\/td><\/tr><tr><td>amry<\/td><td>mary<\/td><\/tr><tr><td>amry<\/td><td>army<\/td><\/tr><tr><td>dekorw<\/td><td>worked<\/td><\/tr><tr><td>efll<\/td><td>fell<\/td><\/tr><tr><td>eht<\/td><td>the<\/td><\/tr><tr><td>eht<\/td><td>the<\/td><\/tr><tr><td>in<\/td><td>in<\/td><\/tr><tr><td>inot<\/td><td>into<\/td><\/tr><tr><td>loop<\/td><td>loop<\/td><\/tr><tr><td>loop<\/td><td>pool<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p><strong>Output after shuffling the keys (aggregation of duplicate keys)<\/strong><\/p>\n\n\n\n<figure 
class=\"wp-block-table\"><table><tbody><tr><td>Key<\/td><td>Value<\/td><\/tr><tr><td>amry<\/td><td>mary, army<\/td><\/tr><tr><td>dekorw<\/td><td>worked<\/td><\/tr><tr><td>efll<\/td><td>fell<\/td><\/tr><tr><td>eht<\/td><td>the, the<\/td><\/tr><tr><td>in<\/td><td>in<\/td><\/tr><tr><td>inot<\/td><td>into<\/td><\/tr><tr><td>loop<\/td><td>loop, pool<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p><strong>LOOK CLOSER!<\/strong><\/p>\n\n\n\n<p>There are some keys with more than one value. We need to look only at such key, value pairs<\/p>\n\n\n\n<p><strong>amry<\/strong> &nbsp; &nbsp; mary, army<\/p>\n\n\n\n<p><strong>eht<\/strong> &nbsp; &nbsp; the, the<\/p>\n\n\n\n<p><strong>loop<\/strong> &nbsp; &nbsp; loop, pool<\/p>\n\n\n\n<p><strong>Logic to list the anagrams<\/strong><\/p>\n\n\n\n<p><strong>Problem: <\/strong>We need to print only the values belonging to the keys \u201c<strong>amry<\/strong>\u201d and \u201c<strong>loop<\/strong>\u201d, since only their values qualify as anagrams.<\/p>\n\n\n\n<p>We need to ignore the values belonging to the key \u201c<strong>eht<\/strong>\u201d, since its values are the same word repeated and do not qualify as anagrams.<\/p>\n\n\n\n<p><strong>How to ignore the non-anagram values?<\/strong><\/p>\n\n\n\n<p><strong>We need to program the following into the reducer<\/strong><\/p>\n\n\n\n<p><strong>Step 1:<\/strong> Check if the number of values is &gt; 1 for each key<\/p>\n\n\n\n<p><strong>Step 2: <\/strong>Compare the first and second value in the values list for every key, if they match, ignore 
them.<\/p>\n\n\n\n<p><strong>key &nbsp; &nbsp; val1 &nbsp; &nbsp; val2<\/strong><\/p>\n\n\n\n<p><strong>eht &nbsp; &nbsp; the &nbsp; &nbsp; the<\/strong> &nbsp; (identical values, so this pair is ignored)<\/p>\n\n\n\n<p><strong>Step 3: <\/strong>If the values don\u2019t match in step 2, compose a single string comprising all the values in the list and print it to the output file. This final string is the KEY and the value can be NULL (do not print anything for the value)<\/p>\n\n\n\n<p>&nbsp;\tKEY = \u201cmary army\u201d &nbsp; VALUE = \u201c\u201d<\/p>\n\n\n\n<p><strong>Step 4:<\/strong> Repeat the above steps for all the key value pairs input to the reducer.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"final-output-of-the-reducer\"><strong>Final Output of the Reducer<\/strong><\/h2>\n\n\n\n<p>mary army<\/p>\n\n\n\n<p>loop pool<\/p>\n\n\n\n<p>The above output is for the case of a file with just 2 lines of data.<\/p>\n\n\n\n<p>What if the file is 640MB in size?<\/p>\n\n\n\n<p>How does MapReduce help in speeding up the job completion?<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"how-does-map-reduce-speed-up-the-processing\"><strong>How does Map reduce speed up the processing?<\/strong><\/h2>\n\n\n\n<p><em>What if the input file used in this example is 640MB instead of just containing 2 lines?<\/em><\/p>\n\n\n\n<p>\u2022 &nbsp; \tThe HADOOP framework would first split the entire file into 10 blocks of 64MB each (64MB was the default HDFS block size in Hadoop 1.x; later versions default to 128MB)<\/p>\n\n\n\n<p>\u2022 &nbsp; \tEach 64MB block would be treated as a single file<\/p>\n\n\n\n<p>\u2022 &nbsp; \tThere would be one record-reader and one mapper assigned to each such block, so the blocks are processed in parallel<\/p>\n\n\n\n<p>\u2022 &nbsp; \tOutput of all the mappers would finally reach the reducer (one reducer is used by default), however we can have multiple reducers depending on the degree of optimization required<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" 
id=\"why-map-reduce\"><strong>Why Map Reduce?<\/strong><\/h2>\n\n\n\n<p>\u2022 &nbsp; \tScale <em>out<\/em> not scale <em>up: <\/em>MR is designed to work with commodity hardware<\/p>\n\n\n\n<p>\u2022 &nbsp; \t<em>Move<\/em> code to where the data is: clusters have limited bandwidth<\/p>\n\n\n\n<p>\u2022 &nbsp; \t<em>Hide<\/em> system-level details from developers: no more race conditions, deadlocks etc.<\/p>\n\n\n\n<p>\u2022 &nbsp; \tSeparating the <em>what<\/em> from the <em>how: <\/em>the developer specifies the computation, the framework handles the actual execution<\/p>\n\n\n\n<p>\u2022 &nbsp; \t<em>Failures <\/em>are common and handled automatically<\/p>\n\n\n\n<p>\u2022 &nbsp; \tBatch processing: access data sequentially instead of randomly to avoid locking up<\/p>\n\n\n\n<p>\u2022 &nbsp; \tLinear Scalability: once the MR algorithm is designed, it can work on a cluster of any size<\/p>\n\n\n\n<p>\u2022 &nbsp; \t<em>Divide<\/em> &amp; <em>Conquer<\/em>: MR follows Partition and Combine in the Map\/Reduce phases<\/p>\n\n\n\n<p>\u2022 &nbsp; \tHigh-level <em>system<\/em> <em>details<\/em>: monitoring of the status of data and processing<\/p>\n\n\n\n<p>\u2022 &nbsp; \tEverything happens on top of <em>HDFS<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"use-case-of-mapreduce\"><strong>Use Cases of MapReduce<\/strong><\/h2>\n\n\n\n<p>\u2022 &nbsp; \tMainly used for searching keywords in massive amounts of data<\/p>\n\n\n\n<p>\u2022 &nbsp; \tGoogle uses it for wordcount, adwords, pagerank, indexing data for Google Search, article clustering for Google News<\/p>\n\n\n\n<p>\u2022 &nbsp; \tYahoo: \u201cweb map\u201d powering Search, spam detection for Mail<\/p>\n\n\n\n<p>\u2022 &nbsp; \tSimple algorithms such as grep, text-indexing, reverse indexing<\/p>\n\n\n\n<p>\u2022 &nbsp; \tData mining domain<\/p>\n\n\n\n<p>\u2022 &nbsp; \tFacebook uses it for data mining, ad optimization, spam detection<\/p>\n\n\n\n<p>\u2022 &nbsp; \tFinancial services use it for analytics<\/p>\n\n\n\n<p>\u2022 &nbsp; 
\tAstronomy: Gaussian analysis for locating extra-terrestrial objects<\/p>\n\n\n\n<p>\u2022 &nbsp; \tMost batch-oriented, non-interactive analysis tasks<\/p>\n\n\n\n<p>Now that we know the flow of MapReduce from the word count and anagram examples, see the source code in the link below from the documentation to get a practical approach.<\/p>\n\n\n\n<figure class=\"wp-block-embed\"><div class=\"wp-block-embed__wrapper\">\nhttps:\/\/hadoop.apache.org\/docs\/stable\/hadoop-mapreduce-client\/hadoop-mapreduce-client-core\/MapReduceTutorial.html#Source_Code\n<\/div><\/figure>\n\n\n\n<p>Having discussed two examples, word count and the anagram problem, let's quickly take a practical approach to understand what running a Java MapReduce program would look like.<\/p>\n\n\n\n<p><em>Note: For the detailed Java code explanation, please refer to the above link.<\/em><\/p>\n\n\n\n<p>Also, we will have a look at the MapReduce commands only, as we have already learnt about the HDFS commands.<\/p>\n\n\n\n<p>Step 1: Create a small folder in the local file system in which a text file (the input file) is stored, and put it in an HDFS folder (keep this setup ready)<\/p>\n\n\n\n<p>Step 2: You can run the Java program in Eclipse (IDE)<\/p>\n\n\n\n<p>Step 3: Export the project from Eclipse as a jar file. The jar file, found inside the root folder, contains all 3 Java source files and is the executable we will deploy (example: wc_temp.jar)<\/p>\n\n\n\n<p>Step 4: Now let us run this jar file on a cluster. Open the terminal, and make sure all the Hadoop services are running in the background. 
Check that the input file is present in HDFS.<\/p>\n\n\n\n<p>Step 5: To execute the jar file, first navigate to the folder where it is located.<\/p>\n\n\n\n<p>The command to deploy the jar file on the Hadoop cluster is:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh5.googleusercontent.com\/HuPf5TKpJQXhaqZooMzaexnmqOgkl4pPq8qqu8TPVtq2yKnamQTRXcxJNVUDQfSws8DVXfJ0DOsumjnGfFW0m99aKdn4Sbgdb672Ca9W4meIU2mNNni3nSc6pDeImY30lBPtcnFZ\" alt=\"hadoop tutorial\"\/><\/figure>\n\n\n\n<p>Here, WordCountDrivername is the name of the driver class.<\/p>\n\n\n\n<p>wcin\/sampledata.txt is the path of the input file.<\/p>\n\n\n\n<p>wcout_demo is the destination HDFS folder.<\/p>\n\n\n\n<p>Step 6: Once the execution is done, check whether the output folder has been created.<\/p>\n\n\n\n<p>If the program completed successfully, you should see the output below.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/ZOoQ_7u6gQBbeTR6EDcY3vDqqCnFziqhNFNqCvZx9Wl_Ic0dVcfg_-sL6cguj8to4NjbFmFvAc9wI0ETwEXq6R84EpJ7awndEytcwqatKTuvoJGi-p9va3aDuKVnfus0r_8cSj47\" alt=\"hadoop tutorial\"\/><\/figure>\n\n\n\n<p>You can see the output file in the destination folder.<\/p>\n\n\n\n<p><strong>HADOOP 2.0 \u2013 Introducing YARN<\/strong><\/p>\n\n\n\n<p><strong>YARN \u2013 Yet Another Resource Negotiator<\/strong><\/p>\n\n\n\n<p>YARN (Hadoop 2.0) is the new scheduler and centralized resource manager in the cluster.<\/p>\n\n\n\n<p>\u2219&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; It replaces the JobTracker in Hadoop 2.0<\/p>\n\n\n\n<p>\u2219&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; It started in 2012 as an Apache sub-project; a beta version was released in mid-2013, and stable versions became available from 2014 onwards<\/p>\n\n\n\n<p><strong>Components of YARN<\/strong><\/p>\n\n\n\n<p>It's a two-tiered model, with some components (daemons) operating in master mode and some operating in slave 
mode.<\/p>\n\n\n\n<p>The ResourceManager works in master mode (in a production setup, it runs on dedicated hardware).<\/p>\n\n\n\n<p>The NodeManager works in slave mode, and its services run on the data nodes.<\/p>\n\n\n\n<p>The ResourceManager comprises two components:<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\t1) Scheduler<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\t2) Applications Manager<\/p>\n\n\n\n<p>A NodeManager hosts containers (a container is an encapsulation of resources for running a job) and the per-application ApplicationMaster (app master).<\/p>\n\n\n\n<p>Also Read: <a href=\"https:\/\/www.mygreatlearning.com\/blog\/top-hadoop-interview-questions\/\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"Top 40 Hadoop Interview Questions (opens in a new tab)\">Top 40 Hadoop Interview Questions<\/a><\/p>\n\n\n\n<p><strong>Hadoop 1.0 vs Hadoop 2.0<\/strong><\/p>\n\n\n\n<p><strong>Hadoop 3.0<\/strong><\/p>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/June-29-banner-for-GL-hadoop-2.png\"><a href=\"https:\/\/www.mygreatlearning.com\/academy\/learn-for-free\/courses\/predictive-modeling-and-analytics-regression\" target=\"_blank\" rel=\"noreferrer noopener\"><img decoding=\"async\" width=\"1000\" height=\"242\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/June-29-banner-for-GL-hadoop-2.png\" alt=\"\" class=\"wp-image-17913\" srcset=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/June-29-banner-for-GL-hadoop-2.png 1000w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/June-29-banner-for-GL-hadoop-2-300x73.png 300w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/June-29-banner-for-GL-hadoop-2-768x186.png 768w, 
https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/June-29-banner-for-GL-hadoop-2-696x168.png 696w\" sizes=\"(max-width: 1000px) 100vw, 1000px\" \/><\/a><\/figure>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Big data\u2013 Introduction Before we jump into our Hadoop Tutorial, lets understand Big Data. Will start with questions like What is Big data, Why big data, What big data signifies so that the companies\/industries are moving to big data from legacy systems, is it worth it to learn big data technologies and will professional get [&hellip;]<\/p>\n","protected":false},"author":41,"featured_media":17439,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_uag_custom_page_level_css":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"set","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[9],"tags":[],"content_type":[],"class_list":["post-17311","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v27.3 (Yoast SEO v27.3) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>What is Apache Hadoop &amp; Tutorial? 
| All You Need to Know<\/title>\n<meta name=\"description\" content=\"Apache Hadoop Tutorial: Hadoop is a distributed parallel processing framework, which facilitates distributed computing. Let us learn more through this Hadoop Tutorial!\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.mygreatlearning.com\/blog\/apache-hadoop-tutorial\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Apache Hadoop Tutorial |What is Apache Hadoop?\" \/>\n<meta property=\"og:description\" content=\"Apache Hadoop Tutorial: Hadoop is a distributed parallel processing framework, which facilitates distributed computing. Let us learn more through this Hadoop Tutorial!\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.mygreatlearning.com\/blog\/apache-hadoop-tutorial\/\" \/>\n<meta property=\"og:site_name\" content=\"Great Learning Blog: Free Resources what Matters to shape your Career!\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/GreatLearningOfficial\/\" \/>\n<meta property=\"article:published_time\" content=\"2020-08-01T09:22:01+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-09-03T09:45:20+00:00\" \/>\n<meta property=\"og:image\" content=\"http:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-03.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1000\" \/>\n\t<meta property=\"og:image:height\" content=\"700\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Great Learning Editorial Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@https:\/\/twitter.com\/Great_Learning\" \/>\n<meta name=\"twitter:site\" content=\"@Great_Learning\" 
\/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Great Learning Editorial Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"30 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-hadoop-tutorial\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-hadoop-tutorial\\\/\"},\"author\":{\"name\":\"Great Learning Editorial Team\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/person\\\/6f993d1be4c584a335951e836f2656ad\"},\"headline\":\"Apache Hadoop Tutorial |What is Apache Hadoop?\",\"datePublished\":\"2020-08-01T09:22:01+00:00\",\"dateModified\":\"2024-09-03T09:45:20+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-hadoop-tutorial\\\/\"},\"wordCount\":6255,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-hadoop-tutorial\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2020\\\/07\\\/Blog-Featured-Images-for-Articles-03.jpg\",\"articleSection\":[\"Data Science and Analytics\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-hadoop-tutorial\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-hadoop-tutorial\\\/\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-hadoop-tutorial\\\/\",\"name\":\"What is Apache Hadoop & Tutorial? 
| All You Need to Know\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-hadoop-tutorial\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-hadoop-tutorial\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2020\\\/07\\\/Blog-Featured-Images-for-Articles-03.jpg\",\"datePublished\":\"2020-08-01T09:22:01+00:00\",\"dateModified\":\"2024-09-03T09:45:20+00:00\",\"description\":\"Apache Hadoop Tutorial: Hadoop is a distributed parallel processing framework, which facilitates distributed computing. Let us learn more through this Hadoop Tutorial!\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-hadoop-tutorial\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-hadoop-tutorial\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-hadoop-tutorial\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2020\\\/07\\\/Blog-Featured-Images-for-Articles-03.jpg\",\"contentUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2020\\\/07\\\/Blog-Featured-Images-for-Articles-03.jpg\",\"width\":1000,\"height\":700,\"caption\":\"hadoop tutorial\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-hadoop-tutorial\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Blog\",\"item\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Data Science and 
Analytics\",\"item\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/data-science\\\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Apache Hadoop Tutorial |What is Apache Hadoop?\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/\",\"name\":\"Great Learning Blog\",\"description\":\"Learn, Upskill &amp; Career Development Guide and Resources\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#organization\"},\"alternateName\":\"Great Learning\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#organization\",\"name\":\"Great Learning\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/06\\\/GL-Logo.jpg\",\"contentUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/06\\\/GL-Logo.jpg\",\"width\":900,\"height\":900,\"caption\":\"Great 
Learning\"},\"image\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/GreatLearningOfficial\\\/\",\"https:\\\/\\\/x.com\\\/Great_Learning\",\"https:\\\/\\\/www.instagram.com\\\/greatlearningofficial\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/school\\\/great-learning\\\/\",\"https:\\\/\\\/in.pinterest.com\\\/greatlearning12\\\/\",\"https:\\\/\\\/www.youtube.com\\\/user\\\/beaconelearning\\\/\"],\"description\":\"Great Learning is a leading global ed-tech company for professional training and higher education. It offers comprehensive, industry-relevant, hands-on learning programs across various business, technology, and interdisciplinary domains driving the digital economy. These programs are developed and offered in collaboration with the world's foremost academic institutions.\",\"email\":\"info@mygreatlearning.com\",\"legalName\":\"Great Learning Education Services Pvt. Ltd\",\"foundingDate\":\"2013-11-29\",\"numberOfEmployees\":{\"@type\":\"QuantitativeValue\",\"minValue\":\"1001\",\"maxValue\":\"5000\"}},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/person\\\/6f993d1be4c584a335951e836f2656ad\",\"name\":\"Great Learning Editorial Team\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/02\\\/unnamed.webp\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/02\\\/unnamed.webp\",\"contentUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/02\\\/unnamed.webp\",\"caption\":\"Great Learning Editorial Team\"},\"description\":\"The Great Learning Editorial Staff includes a dynamic team of subject matter experts, instructors, and education professionals who combine their deep industry knowledge with innovative teaching methods. 
Their mission is to provide learners with the skills and insights needed to excel in their careers, whether through upskilling, reskilling, or transitioning into new fields.\",\"sameAs\":[\"https:\\\/\\\/www.mygreatlearning.com\\\/\",\"https:\\\/\\\/in.linkedin.com\\\/school\\\/great-learning\\\/\",\"https:\\\/\\\/x.com\\\/https:\\\/\\\/twitter.com\\\/Great_Learning\",\"https:\\\/\\\/www.youtube.com\\\/channel\\\/UCObs0kLIrDjX2LLSybqNaEA\"],\"award\":[\"Best EdTech Company of the Year 2024\",\"Education Economictimes Outstanding Education\\\/Edtech Solution Provider of the Year 2024\",\"Leading E-learning Platform 2024\"],\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/author\\\/greatlearning\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"What is Apache Hadoop & Tutorial? | All You Need to Know","description":"Apache Hadoop Tutorial: Hadoop is a distributed parallel processing framework, which facilitates distributed computing. Let us learn more through this Hadoop Tutorial!","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.mygreatlearning.com\/blog\/apache-hadoop-tutorial\/","og_locale":"en_US","og_type":"article","og_title":"Apache Hadoop Tutorial |What is Apache Hadoop?","og_description":"Apache Hadoop Tutorial: Hadoop is a distributed parallel processing framework, which facilitates distributed computing. 
Let us learn more through this Hadoop Tutorial!","og_url":"https:\/\/www.mygreatlearning.com\/blog\/apache-hadoop-tutorial\/","og_site_name":"Great Learning Blog: Free Resources what Matters to shape your Career!","article_publisher":"https:\/\/www.facebook.com\/GreatLearningOfficial\/","article_published_time":"2020-08-01T09:22:01+00:00","article_modified_time":"2024-09-03T09:45:20+00:00","og_image":[{"width":1000,"height":700,"url":"http:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-03.jpg","type":"image\/jpeg"}],"author":"Great Learning Editorial Team","twitter_card":"summary_large_image","twitter_creator":"@https:\/\/twitter.com\/Great_Learning","twitter_site":"@Great_Learning","twitter_misc":{"Written by":"Great Learning Editorial Team","Est. reading time":"30 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.mygreatlearning.com\/blog\/apache-hadoop-tutorial\/#article","isPartOf":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/apache-hadoop-tutorial\/"},"author":{"name":"Great Learning Editorial Team","@id":"https:\/\/www.mygreatlearning.com\/blog\/#\/schema\/person\/6f993d1be4c584a335951e836f2656ad"},"headline":"Apache Hadoop Tutorial |What is Apache Hadoop?","datePublished":"2020-08-01T09:22:01+00:00","dateModified":"2024-09-03T09:45:20+00:00","mainEntityOfPage":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/apache-hadoop-tutorial\/"},"wordCount":6255,"commentCount":0,"publisher":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/#organization"},"image":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/apache-hadoop-tutorial\/#primaryimage"},"thumbnailUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-03.jpg","articleSection":["Data Science and 
Analytics"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.mygreatlearning.com\/blog\/apache-hadoop-tutorial\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.mygreatlearning.com\/blog\/apache-hadoop-tutorial\/","url":"https:\/\/www.mygreatlearning.com\/blog\/apache-hadoop-tutorial\/","name":"What is Apache Hadoop & Tutorial? | All You Need to Know","isPartOf":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/apache-hadoop-tutorial\/#primaryimage"},"image":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/apache-hadoop-tutorial\/#primaryimage"},"thumbnailUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-03.jpg","datePublished":"2020-08-01T09:22:01+00:00","dateModified":"2024-09-03T09:45:20+00:00","description":"Apache Hadoop Tutorial: Hadoop is a distributed parallel processing framework, which facilitates distributed computing. 
Let us learn more through this Hadoop Tutorial!","breadcrumb":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/apache-hadoop-tutorial\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.mygreatlearning.com\/blog\/apache-hadoop-tutorial\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.mygreatlearning.com\/blog\/apache-hadoop-tutorial\/#primaryimage","url":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-03.jpg","contentUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-03.jpg","width":1000,"height":700,"caption":"hadoop tutorial"},{"@type":"BreadcrumbList","@id":"https:\/\/www.mygreatlearning.com\/blog\/apache-hadoop-tutorial\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Blog","item":"https:\/\/www.mygreatlearning.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Data Science and Analytics","item":"https:\/\/www.mygreatlearning.com\/blog\/data-science\/"},{"@type":"ListItem","position":3,"name":"Apache Hadoop Tutorial |What is Apache Hadoop?"}]},{"@type":"WebSite","@id":"https:\/\/www.mygreatlearning.com\/blog\/#website","url":"https:\/\/www.mygreatlearning.com\/blog\/","name":"Great Learning Blog","description":"Learn, Upskill &amp; Career Development Guide and Resources","publisher":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/#organization"},"alternateName":"Great Learning","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.mygreatlearning.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.mygreatlearning.com\/blog\/#organization","name":"Great 
Learning","url":"https:\/\/www.mygreatlearning.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.mygreatlearning.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/06\/GL-Logo.jpg","contentUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/06\/GL-Logo.jpg","width":900,"height":900,"caption":"Great Learning"},"image":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/GreatLearningOfficial\/","https:\/\/x.com\/Great_Learning","https:\/\/www.instagram.com\/greatlearningofficial\/","https:\/\/www.linkedin.com\/school\/great-learning\/","https:\/\/in.pinterest.com\/greatlearning12\/","https:\/\/www.youtube.com\/user\/beaconelearning\/"],"description":"Great Learning is a leading global ed-tech company for professional training and higher education. It offers comprehensive, industry-relevant, hands-on learning programs across various business, technology, and interdisciplinary domains driving the digital economy. These programs are developed and offered in collaboration with the world's foremost academic institutions.","email":"info@mygreatlearning.com","legalName":"Great Learning Education Services Pvt. 
Ltd","foundingDate":"2013-11-29","numberOfEmployees":{"@type":"QuantitativeValue","minValue":"1001","maxValue":"5000"}},{"@type":"Person","@id":"https:\/\/www.mygreatlearning.com\/blog\/#\/schema\/person\/6f993d1be4c584a335951e836f2656ad","name":"Great Learning Editorial Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/02\/unnamed.webp","url":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/02\/unnamed.webp","contentUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/02\/unnamed.webp","caption":"Great Learning Editorial Team"},"description":"The Great Learning Editorial Staff includes a dynamic team of subject matter experts, instructors, and education professionals who combine their deep industry knowledge with innovative teaching methods. Their mission is to provide learners with the skills and insights needed to excel in their careers, whether through upskilling, reskilling, or transitioning into new fields.","sameAs":["https:\/\/www.mygreatlearning.com\/","https:\/\/in.linkedin.com\/school\/great-learning\/","https:\/\/x.com\/https:\/\/twitter.com\/Great_Learning","https:\/\/www.youtube.com\/channel\/UCObs0kLIrDjX2LLSybqNaEA"],"award":["Best EdTech Company of the Year 2024","Education Economictimes Outstanding Education\/Edtech Solution Provider of the Year 2024","Leading E-learning Platform 
2024"],"url":"https:\/\/www.mygreatlearning.com\/blog\/author\/greatlearning\/"}]}},"uagb_featured_image_src":{"full":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-03.jpg",1000,700,false],"thumbnail":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-03-150x150.jpg",150,150,true],"medium":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-03-300x210.jpg",300,210,true],"medium_large":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-03-768x538.jpg",768,538,true],"large":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-03.jpg",1000,700,false],"1536x1536":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-03.jpg",1000,700,false],"2048x2048":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-03.jpg",1000,700,false],"web-stories-poster-portrait":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-03.jpg",640,448,false],"web-stories-publisher-logo":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-03.jpg",96,67,false],"web-stories-thumbnail":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-03.jpg",150,105,false]},"uagb_author_info":{"display_name":"Great Learning Editorial Team","author_link":"https:\/\/www.mygreatlearning.com\/blog\/author\/greatlearning\/"},"uagb_comment_info":0,"uagb_excerpt":"Big data\u2013 Introduction Before we jump into our Hadoop Tutorial, lets understand Big Data. 
Will start with questions like What is Big data, Why big data, What big data signifies so that the companies\/industries are moving to big data from legacy systems, is it worth it to learn big data technologies and will professional get&hellip;","_links":{"self":[{"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/posts\/17311","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/users\/41"}],"replies":[{"embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/comments?post=17311"}],"version-history":[{"count":59,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/posts\/17311\/revisions"}],"predecessor-version":[{"id":105222,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/posts\/17311\/revisions\/105222"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/media\/17439"}],"wp:attachment":[{"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/media?parent=17311"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/categories?post=17311"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/tags?post=17311"},{"taxonomy":"content_type","embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/content_type?post=17311"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}