{"id":18222,"date":"2020-08-21T18:02:39","date_gmt":"2020-08-21T12:32:39","guid":{"rendered":"https:\/\/www.mygreatlearning.com\/blog\/apache-spark\/"},"modified":"2024-09-03T15:15:19","modified_gmt":"2024-09-03T09:45:19","slug":"apache-spark","status":"publish","type":"post","link":"https:\/\/www.mygreatlearning.com\/blog\/apache-spark\/","title":{"rendered":"Apache Spark"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\" id=\"spark-intro\"><strong>Spark Intro:<\/strong><\/h2>\n\n\n\n<p>Spark is a parallel processing framework, but it is not just another one. Hadoop, which we have been looking at so far, is a very popular parallel processing framework, but it has a number of shortcomings, especially in the area of machine learning. Spark is a new parallel processing framework that was designed to address many of those shortcomings.<\/p>\n\n\n\n<p>True to its name, Spark has made a lot of tasks much easier.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-is-spark\"><strong>What is Spark?<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Apache Spark is a fast, in-memory data processing engine<\/li>\n\n\n\n<li>Its development APIs support streaming, machine learning, and SQL workloads.<\/li>\n\n\n\n<li>Fast, expressive cluster computing system compatible with Apache Hadoop<\/li>\n\n\n\n<li>Improves efficiency through:\n<ul class=\"wp-block-list\">\n<li>In-memory computing primitives<\/li>\n\n\n\n<li>General computation graphs (DAG)<\/li>\n\n\n\n<li>Up to 100\u00d7 faster than Hadoop MapReduce on some workloads<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Improves usability through:\n<ul class=\"wp-block-list\">\n<li>Rich APIs in Java, Scala, Python<\/li>\n\n\n\n<li>Interactive shell<\/li>\n\n\n\n<li>Often 2-10\u00d7 less code<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Open-source parallel processing framework primarily used for data 
engineering and analytics.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"about-apache-spark\"><strong>About Apache Spark:<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Initially started at UC Berkeley in 2009<\/li>\n\n\n\n<li>Open source cluster computing framework<\/li>\n\n\n\n<li>Written in Scala (gives the power of functional programming)<\/li>\n\n\n\n<li>Provides high-level APIs in\n<ul class=\"wp-block-list\">\n<li>Java<\/li>\n\n\n\n<li>Scala<\/li>\n\n\n\n<li>Python<\/li>\n\n\n\n<li>R<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Integrates with Hadoop and its ecosystem and can read existing Hadoop data.<\/li>\n\n\n\n<li>Designed to be fast for iterative algorithms and interactive queries, for which MapReduce is inefficient.<\/li>\n\n\n\n<li>Especially popular for running iterative machine learning algorithms.<\/li>\n\n\n\n<li>Supports in-memory storage and efficient fault recovery.\n<ul class=\"wp-block-list\">\n<li>Up to 10x (on disk) to 100x (in memory) faster than Hadoop MapReduce<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"why-spark\"><strong>Why Spark?<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Most machine learning algorithms are iterative, because each iteration can improve the results<\/li>\n\n\n\n<li>With a disk-based approach, the output of each iteration is written to disk, which makes it slow<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-is-scala\"><strong>What is Scala?<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A high-level programming language<\/li>\n\n\n\n<li>Supports the functional style of programming<\/li>\n\n\n\n<li>Supports the OO style of programming<\/li>\n\n\n\n<li>It is a multi-paradigm language<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"functional\"><strong>What is functional programming?<\/strong><\/h2>\n\n\n\n<p>Consider the following code snippet:<\/p>\n\n\n\n<p>int a = 10; \/\/ A new variable \u201ca\u201d 
is getting created<\/p>\n\n\n\n<p>a++; \/\/ We are incrementing the variable a<\/p>\n\n\n\n<p>Print(a); \/\/ prints 11 as the output<\/p>\n\n\n\n<p>Consider another way of doing the same thing:<\/p>\n\n\n\n<p>int a = 10;<\/p>\n\n\n\n<p>int a1 = a + 1; \/\/ We are creating a new variable a1<\/p>\n\n\n\n<p>Print(a); \/\/ a still contains the value 10; the state of the variable a is not changed<\/p>\n\n\n\n<p>What is happening behind the scenes here?<\/p>\n\n\n\n<p>A variable \u2018a\u2019 was created, and the value 10 was assigned to it and stored in memory.<\/p>\n\n\n\n<p>When I increment the value stored in the variable \u201ca\u201d using the operator \u201ca++\u201d, which is very common in the C or C++ style of programming, the value is incremented in the same memory location, meaning we are changing the value of the original variable. In other words, the state of the variable changes during the course of program execution. In the second snippet, by contrast, a new variable is created and the original is left untouched.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"functional-programming\"><strong>Functional Programming:<\/strong><\/h3>\n\n\n\n<p>The functional style of programming inherently supports parallelism, because data that never changes can be shared safely between parallel tasks.<\/p>\n\n\n\n<p>The functional style of programming is not restricted to a particular programming language.<\/p>\n\n\n\n<p>It can be implemented in any language, just as OO concepts can be implemented in a programming language like C.<\/p>\n\n\n\n<p>Certain programming languages inherently support the functional style of programming, reducing the programmer\u2019s burden.<\/p>\n\n\n\n<p>Scala is one of the languages that support functional programming.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"spark-in-comparison-to-hadoop\"><strong>Spark in comparison to Hadoop:<\/strong><\/h2>\n\n\n\n<p>Now let us compare Spark to Hadoop, which is the easiest way to understand 
Spark.<\/p>\n\n\n\n<p>However, a lot of things are very different in Spark, and we will learn about them as we go forward.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The high-level architecture of Spark is very similar to that of Hadoop. Hadoop has a master-slave architecture with several data nodes and a few master nodes, and Spark is organized in almost the same way.<\/li>\n\n\n\n<li>A quick way of understanding Spark is by comparing it with Hadoop<\/li>\n\n\n\n<li>Several drawbacks of Hadoop are addressed in Spark, which can give up to a 10x-100x performance improvement over Hadoop on some workloads<\/li>\n\n\n\n<li>It is well suited for interactive and real-time analytics of big data and streaming data<\/li>\n\n\n\n<li>It is highly popular among machine learning users since it outperforms Hadoop on iterative workloads. The framework itself is designed to work well on iterative workloads, whereas Hadoop is a poor fit for them because of the intensive disk I\/O operations that happen behind the scenes in the Hadoop ecosystem<\/li>\n<\/ul>\n\n\n\n<p>Spark puts certain mechanisms in place to minimize disk I\/O operations.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"let-us-know-some-interesting-facts-but-not-myths\"><strong>Let us look at some facts, not myths:<\/strong><\/h3>\n\n\n\n<p>A common myth is that Spark is 10x\/100x faster than Hadoop because it has a completely in-memory architecture. This is false.<\/p>\n\n\n\n<p>Spark does not have a completely in-memory architecture. There are situations where disk I\/O operations become inevitable, but what is kept in memory, and for how long, is under the control of the programmer. This makes it possible to gain better performance than Hadoop. 
<\/p>\n\n\n\n<p>In this way, Spark is an advanced computing architecture, but not a completely in-memory one.<\/p>\n\n\n\n<p>In Hadoop, the data nodes work in slave mode, and the master side consists of the active name node, the standby name node, and the secondary name node. The resource manager does the work that the job tracker did in Hadoop version 1. Centralized scheduling and resource allocation are handled by the resource manager\u2019s scheduler, in conjunction with the node manager and a per-application application master on the data nodes.<\/p>\n\n\n\n<p>Now let us have a quick recap of job execution in Hadoop:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"overview-of-job-execution-in-hadoop\"><strong>Overview of job execution in Hadoop:<\/strong><\/h3>\n\n\n\n<p>A job submitted by the user is picked up by the name node and the resource manager.<\/p>\n\n\n\n<p>The job gets submitted to the name node, and the resource manager is then responsible for scheduling the execution of the job on the data nodes in the cluster.<\/p>\n\n\n\n<p>The data nodes in the cluster hold the data blocks on which the user\u2019s program will be executed in parallel.<\/p>\n\n\n\n<p>The corresponding components in Spark are:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The driver is the starting point of a job submission (this can be compared to the driver code in Java MR)<\/li>\n\n\n\n<li>The cluster manager can be compared to the resource manager in Hadoop<\/li>\n\n\n\n<li>Worker nodes correspond to the data nodes in a Hadoop cluster<\/li>\n\n\n\n<li>The executor can be compared to a node manager in Hadoop 2 or a task tracker in Hadoop 1<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"deployment\"><strong>Spark Deployment modes:<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Standalone (used for learning &amp; development)\n<ul class=\"wp-block-list\">\n<li>This is very similar to the pseudo-distributed mode of Hadoop<\/li>\n\n\n\n<li>Spark services run on multiple 
JVMs<\/li>\n\n\n\n<li>This is also known as a standalone cluster<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Local mode (used for learning &amp; development)\n<ul class=\"wp-block-list\">\n<li>Everything runs in a single JVM (no need for HDFS)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Cluster mode (can work with Mesos or YARN)\n<ul class=\"wp-block-list\">\n<li>Used in production environments; this is the fully distributed mode<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<p>In the map stage, there is a record reader, then a mapper, then an in-memory sort operation, and the output of this is fed to the reduce stage. The reduce stage begins with a merge step, where the outputs of multiple mapper machines are merged as they arrive. After merging, they are subject to a shuffle or aggregation operation, after which the reducer logic, written by the programmer according to the problem statement, is applied.<\/p>\n\n\n\n<p>The final output is written to HDFS, as per the block diagram above.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The above example demonstrates a MapReduce job involving 3 mappers on 3 input splits<\/li>\n\n\n\n<li>There is 1 reducer<\/li>\n\n\n\n<li>Each input split resides on the hard disc of a data node. A mapper reading it involves a disc read operation. 
<\/li>\n\n\n\n<li>There would be 3 disc read operations from all the 3 mappers put together<\/li>\n\n\n\n<li>Merging in the reduce stage involves 1 disc write operation<\/li>\n\n\n\n<li>The reducer writes the final output file to HDFS, which is another disc write operation<\/li>\n\n\n\n<li>In total, there are a minimum of 5 disc I\/O operations in the above example (3 from the map stage and 2 from the reduce stage)<\/li>\n\n\n\n<li>The number of disc read operations from the map stage is equal to the number of input splits<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"calculating-the-number-of-disc-i-o-operations-on-a-large-data-set\"><strong>Calculating the number of disc I\/O operations on a large data set:<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Typically, an HDFS input split size is 128 MB<\/li>\n\n\n\n<li>Consider a file of size 100 TB: the number of file blocks on HDFS would be (100 * 1024 * 1024) \/ 128 = 819,200 (about 8.2 lakh)<\/li>\n\n\n\n<li>Around 8.2 lakh mappers would need to run on the above data set once a job is launched using Hadoop MapReduce<\/li>\n\n\n\n<li>8.2 lakh mappers mean 8.2 lakh disc read operations<\/li>\n\n\n\n<li>Disc read operations are at least an order of magnitude slower than memory read operations<\/li>\n\n\n\n<li>MapReduce does not inherently support iterations on the data set<\/li>\n\n\n\n<li>Several rounds of MapReduce jobs need to be chained to achieve the result of an iterative job in Hadoop<\/li>\n\n\n\n<li>Most machine learning algorithms involve an iterative approach<\/li>\n\n\n\n<li>10 rounds of iteration therefore lead to at least 8.2 lakh x 10 = 82 lakh disc I\/O operations<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"sparks-approach-to-problem-solving\"><strong>Spark\u2019s approach to problem-solving:<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Spark allows the results of a computation to be saved in memory for future re-use 
<\/li>\n\n\n\n<li>Reading data from memory is much faster than reading it from disc<\/li>\n\n\n\n<li>Caching the result in memory is under the programmer\u2019s control<\/li>\n\n\n\n<li>It is not always possible to keep such results completely in memory, especially when the object is too large or memory is low<\/li>\n\n\n\n<li>In such cases, the objects need to be moved to disc<\/li>\n\n\n\n<li>Spark is therefore not a completely in-memory parallel processing platform<\/li>\n\n\n\n<li>Spark is, however, typically 3x to 10x faster than Hadoop for most jobs<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"let-us-now-see-the-major-differences-between-hadoop-and-spark\"><strong>Let us now see the major differences between Hadoop and Spark:<\/strong><\/h3>\n\n\n\n<p>On the left-hand side, we see one round of a MapReduce job. In the map stage, data is read from HDFS (that is, from the hard drives of the data nodes), and after the reduce operation has finished, the result of the computation is written back to HDFS. If the result produced by one MapReduce job has to be consumed by another MapReduce job, the second mapper, at the bottom, has to read the data from the HDFS cluster again. This involves another disk read from the hard drives of the data nodes, and once again the result of the computation in the reduce stage is written back to HDFS.<\/p>\n\n\n\n<p>On the right-hand side, we see a Spark job: a map operation, then a reduce operation, and then the result of the map and join is cached in memory. 
So there is no disk write operation involved here. The cached result can immediately be fed as input to the next job, which might involve a map or a reduce; in this particular example it is a transform operation. In this way we eliminate a disk write operation.<\/p>\n\n\n\n<p>In this example, we eliminate 1 disk write operation. Caching therefore plays a very important role in speeding up computations in the Spark ecosystem. But remember, Spark is not a completely in-memory computation framework: if the cached data is too large to fit in memory, it will eventually have to be written to disk. We try to keep as much as possible in memory, and we will go deeper into how to achieve this further down the line.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"what-spark-gives-hadoop\"><strong>What Spark gives Hadoop?<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A machine learning module that delivers capabilities not easily exploited in Hadoop.<\/li>\n\n\n\n<li>In-memory processing of sizeable data volumes, which remains an important contribution to the capabilities of a Hadoop cluster.<\/li>\n\n\n\n<li>Valuable for enterprise use cases\n<ul class=\"wp-block-list\">\n<li>Spark\u2019s SQL capabilities for interactive analysis over big data<\/li>\n\n\n\n<li>Streaming capabilities (Spark Streaming)<\/li>\n\n\n\n<li>Graph processing capabilities (GraphX)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"what-hadoop-gives-spark\"><strong>What Hadoop gives Spark?<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>YARN resource manager<\/li>\n\n\n\n<li>HDFS (distributed file system)<\/li>\n\n\n\n<li>Disaster recovery capabilities<\/li>\n\n\n\n<li>Data security<\/li>\n\n\n\n<li>A distributed data platform<\/li>\n\n\n\n<li>Leverage existing clusters<\/li>\n\n\n\n<li>Data locality<\/li>\n\n\n\n<li>Manage workloads using advanced policies\n<ul class=\"wp-block-list\">\n<li>Allocate shares to different teams and users<\/li>\n\n\n\n<li>Hierarchical queues<\/li>\n\n\n\n<li>Queue placement policies<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Take advantage of Hadoop\u2019s security\n<ul class=\"wp-block-list\">\n<li>Run on Kerberized clusters<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The driver is the starting point of a job submission (this can be compared to the driver code in Java MR)<\/li>\n\n\n\n<li>The cluster manager can be compared to the resource manager in Hadoop<\/li>\n\n\n\n<li>The worker is a software service running on slave nodes, similar to the data nodes in a Hadoop cluster<\/li>\n\n\n\n<li>The executor is a container which is responsible for running the tasks<\/li>\n<\/ul>\n\n\n\n<p>Spark Deployment 
Modes:<\/p>\n\n\n\n<p><strong>Standalone mode: <\/strong><\/p>\n\n\n\n<p>In the setup described here, all the Spark services run on a single machine but in separate JVMs. It is mainly used for learning and development purposes (something like the pseudo-distributed mode of a Hadoop deployment).<\/p>\n\n\n\n<p><strong>Cluster mode with YARN or Mesos: <\/strong><\/p>\n\n\n\n<p>This is the fully distributed mode of Spark, used in a production environment.<\/p>\n\n\n\n<p><strong>Spark in MapReduce (SIMR): <\/strong><\/p>\n\n\n\n<p>Allows Hadoop MR1 users to run Spark jobs inside their existing MapReduce cluster, without needing YARN or Mesos.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"rdds\"><strong>RDDs:<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Key Spark construct<\/li>\n\n\n\n<li>A distributed collection of elements<\/li>\n\n\n\n<li>Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster<\/li>\n\n\n\n<li>Spark automatically distributes the data in an RDD across the cluster and parallelizes the operations on it<\/li>\n\n\n\n<li>An RDD has the following properties\n<ul class=\"wp-block-list\">\n<li>Immutable<\/li>\n\n\n\n<li>Lazily evaluated<\/li>\n\n\n\n<li>Cacheable<\/li>\n\n\n\n<li>Type inferred<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<p><strong>RDDs as an advantage:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>They are objects in Spark which represent big data<\/li>\n\n\n\n<li>They are distributed<\/li>\n\n\n\n<li>They are fault-tolerant<\/li>\n\n\n\n<li>They are lazily evaluated<\/li>\n\n\n\n<li>They are scalable<\/li>\n\n\n\n<li>They support near real-time stream processing<\/li>\n\n\n\n<li>They have many other benefits, which make them the core element of big data computation in Spark<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"rdd-operations\"><strong>RDD Operations:<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>How to create an RDD:\n<ul class=\"wp-block-list\">\n<li>Loading external data sources\n<ul 
class=\"wp-block-list\">\n<li>lines = sc.textFile(\"readme.txt\")<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Parallelizing a collection in a driver program\n<ul class=\"wp-block-list\">\n<li>lines = sc.parallelize([\"pandas\", \"I like pandas\"])<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Transformation\n<ul class=\"wp-block-list\">\n<li>Transforms one RDD into another RDD by applying a function<\/li>\n\n\n\n<li>Lineage graph (DAG): keeps track of the dependencies between transformed RDDs, so that new RDDs can be created on demand or part of a persisted RDD can be recovered in case of failure.<\/li>\n\n\n\n<li>Examples: map, filter, flatMap, distinct, sample, union, intersection, subtract, cartesian, etc.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Action\n<ul class=\"wp-block-list\">\n<li>The actual output is computed from the transformed RDD only once an action is applied<\/li>\n\n\n\n<li>Returns values to the driver program or writes data to external storage<\/li>\n\n\n\n<li>The entire RDD is recomputed from scratch on each new action call if intermediate results are not persisted. 
<\/li>\n\n\n\n<li>Examples: reduce, collect, count, countByValue, take, top, takeSample, aggregate, foreach, etc.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"immutability\"><strong>Immutability:<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Immutability means that once created, a value never changes<\/li>\n\n\n\n<li>Big data is by default immutable in nature<\/li>\n\n\n\n<li>Immutability helps with\n<ul class=\"wp-block-list\">\n<li>Parallelization<\/li>\n\n\n\n<li>Caching<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"immutability-in-action\"><strong>Immutability in action<\/strong><\/h3>\n\n\n\n<p>const int a = 0; \/\/ immutable<\/p>\n\n\n\n<p>int b = 0; \/\/ mutable<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"updation\"><strong>Updating<\/strong><\/h3>\n\n\n\n<p>b++; \/\/ in place<\/p>\n\n\n\n<p>c = a + 1; \/\/ creates a new value; a is unchanged<\/p>\n\n\n\n<p>Immutability is about the value, not about the reference.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"challenges-of-immutability\"><strong>Challenges of Immutability:<\/strong><\/h3>\n\n\n\n<p>Immutability is great for parallelism but not good for space.<\/p>\n\n\n\n<p>Doing multiple transformations results in:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Multiple copies of data<\/li>\n\n\n\n<li>Multiple passes over data<\/li>\n<\/ul>\n\n\n\n<p>In big data, multiple copies and multiple passes have poor performance characteristics.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" 
id=\"lazy\"><strong>Lazy Evaluation:<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Laziness means not computing a transformation until it is needed<\/li>\n\n\n\n<li>Once an action is performed, the actual computation starts<\/li>\n\n\n\n<li>A DAG (directed acyclic graph) is created for the tasks<\/li>\n\n\n\n<li>The Catalyst optimizer is used to optimize queries on data frames and Spark SQL (it does not apply to plain RDD code)<\/li>\n\n\n\n<li>Laziness helps reduce the number of passes over the data<\/li>\n\n\n\n<li>Laziness in action:<\/li>\n<\/ul>\n\n\n\n<p>val c1 = collection.map(value =&gt; value + 1) \/\/ does not compute anything<\/p>\n\n\n\n<p>val c2 = c1.map(value =&gt; value + 2) \/\/ still does not compute anything<\/p>\n\n\n\n<p>c2.collect().foreach(println) \/\/ an action: the transformations are computed now<\/p>\n\n\n\n<p>Multiple transformations are combined into one:<\/p>\n\n\n\n<p>val c2 = collection.map(value =&gt; { var result = value + 1<\/p>\n\n\n\n<p>result = result + 2; result })<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"challenges-of-laziness\"><strong>Challenges of Laziness:<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Laziness poses challenges in terms of data types<\/li>\n\n\n\n<li>Because laziness defers execution, determining the type of a variable becomes challenging<\/li>\n\n\n\n<li>If we cannot determine the right type, semantic errors can slip through<\/li>\n\n\n\n<li>Running a big data program only to hit a semantic error is not fun.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"type-inference\"><strong>Type Inference:<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Type inference is the part of the compiler that determines the type from the value<\/li>\n\n\n\n<li>As all transformations are side-effect free, the type can be determined from the operation<\/li>\n\n\n\n<li>Every transformation has a specific return type<\/li>\n\n\n\n<li>Type inference relieves you from having to think about the representation for many transformations.<\/li>\n\n\n\n<li>Example:\n<ul class=\"wp-block-list\">\n<li>c3 = 
c2.count() \/\/ inferred as Long<\/li>\n\n\n\n<li>collection = [1,2,4,5] \/\/ explicit type Array<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"caching\"><strong>Caching:<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Immutable data allows you to cache data for a long time<\/li>\n\n\n\n<li>Lazy transformation allows recreating data on failure<\/li>\n\n\n\n<li>Transformations can also be saved<\/li>\n\n\n\n<li>Caching data improves execution engine performance<\/li>\n\n\n\n<li>Caching avoids many of the I\/O operations needed to read and write data from HDFS<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"loading-spark-data-objects-rdd\"><strong>Loading spark data objects (RDD)<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The data loaded into a Spark object is called an RDD<\/li>\n\n\n\n<li>RDDs are discussed in detail shortly<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"under-the-hood\"><strong>Under the hood:<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Job execution starts with loading the data from a data source (e.g. 
HDFS) into the Spark environment<\/li>\n\n\n\n<li>Data is read from the hard drives of the worker nodes and loaded into the RAM of multiple machines<\/li>\n\n\n\n<li>The data could be spread out into different files (each file could be a block in HDFS)<\/li>\n\n\n\n<li>After the computation, the final result is captured<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"partitions-and-data-locality\"><strong>Partitions and data locality:<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Loading of the data from the hard drives to the RAM of the worker nodes is based on data locality<\/li>\n\n\n\n<li>The data in the data blocks is illustrated in the block diagram below<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"rdd-properties\"><strong>RDD properties:<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>We have just seen 3 important properties of an RDD in Spark:<\/li>\n<\/ul>\n\n\n\n<p>1) They are immutable<\/p>\n\n\n\n<p>2) They are partitioned<\/p>\n\n\n\n<p>3) They are distributed and spread across multiple nodes in a cluster<\/p>\n\n\n\n<p><strong>Workflow:<\/strong><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"rdd-lazy-evaluation\"><strong>RDD Lazy evaluation:<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Let us call these objects RDDs from here on<\/li>\n\n\n\n<li>RDDs are immutable &amp; partitioned<\/li>\n\n\n\n<li>RDDs mostly <em>reside in RAM (memory)<\/em> unless the RAM is running out of space<\/li>\n<\/ul>\n\n\n\n<p>Execution starts only when an action is called:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"rdds-are-fault-tolerantresilient\"><strong>RDDs are fault-tolerant (resilient)<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>RDDs lost or corrupted during the course of execution can be reconstructed from the lineage<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Create Base 
RDD<\/strong><\/li>\n<\/ul>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Increment the data elements<\/li>\n\n\n\n<li>Filter the even numbers<\/li>\n\n\n\n<li>Pick only those divisible by 6<\/li>\n\n\n\n<li>Select only those greater than 78<\/li>\n<\/ol>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Lineage is a history of how an RDD was created from its parent RDD through a transformation<\/li>\n\n\n\n<li>The steps in the transformation are re-executed to recreate a lost RDD<\/li>\n<\/ul>\n\n\n\n<p><strong>RDD properties:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>They are Resilient Distributed Datasets<\/li>\n\n\n\n<li>They are resilient (fault tolerant) due to the lineage feature in Spark<\/li>\n\n\n\n<li>They are distributed and spread across many data nodes<\/li>\n\n\n\n<li>They are in-memory objects<\/li>\n\n\n\n<li>They are immutable<\/li>\n<\/ul>\n\n\n\n<p><strong>Spark\u2019s approach to problem solving:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Spark reads the data from the disc once initially and loads it into its memory<\/li>\n\n\n\n<li>The in-memory data objects are called RDDs in Spark<\/li>\n\n\n\n<li>Spark can read the data from HDFS, where large files are split into smaller blocks and distributed across several data nodes<\/li>\n\n\n\n<li>Data nodes are called worker nodes in the Spark ecosystem<\/li>\n\n\n\n<li>Spark\u2019s way of problem solving also involves map and reduce operations<\/li>\n\n\n\n<li>The results of a computation can be saved in memory if they are going to be re-used as part of an iterative job<\/li>\n\n\n\n<li>Saving a Spark object (RDD) in memory for future re-use is called caching<\/li>\n<\/ul>\n\n\n\n<p>Note: RDDs are not always held in RAM (memory). They have to be written to disc when the system faces a low-memory condition because too many RDDs are already in RAM. 
Hence Spark is not a completely in-memory computing framework.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"rddinspark\"><strong>3 different ways of creating an RDD in Spark:<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>By reading a big data file directly from an external file system; this is used while working on large data sets<\/li>\n\n\n\n<li>Using the parallelize API; this is usually used on small data sets<\/li>\n\n\n\n<li>Using the makeRDD API (Scala API)<\/li>\n<\/ul>\n\n\n\n<p>Example:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>The parallelize() method<\/strong><\/li>\n<\/ol>\n\n\n\n<p>from pyspark import SparkContext<\/p>\n\n\n\n<p>sc = SparkContext.getOrCreate()<\/p>\n\n\n\n<p>myrdd = sc.parallelize([1,2,3,4,5,6,7,8,9])<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Data source API (e.g. the textFile API)<\/strong><\/li>\n<\/ol>\n\n\n\n<p>myrdd = sc.textFile(\"FILE PATH\")<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"how-is-the-data-stored-in-an-rdd\"><strong>How is the data stored in an RDD?<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What is the data format inside an RDD?<\/li>\n<\/ul>\n\n\n\n<p>It is just a raw sequence of bytes. 
Especially when it is created using the textFile() or parallelize() API.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Creating RDDs using structured data (CSV data)<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code># Read the CSV file as an RDD of text lines\nlines = sc.textFile(\"sample.csv\")\n# Split each line into its comma-separated fields\nmyrdd = lines.map(lambda e: e.split(\",\"))\nmyrdd.collect()<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"drawbacks-of-rdds\"><strong>Drawbacks of RDDs:<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>They store data as a sequence of bytes<\/li>\n\n\n\n<li>This suits unstructured data manipulation<\/li>\n\n\n\n<li>Data does not have any schema<\/li>\n\n\n\n<li>There is no direct support for handling structured data<\/li>\n\n\n\n<li>The native APIs on RDDs seldom provide any support for structured data handling<\/li>\n\n\n\n<li>A special RDD format is needed to store structured data (big data)<\/li>\n\n\n\n<li>A special set of APIs is needed to manipulate and query such RDDs<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"schema-rddsdata-frames\"><strong>Schema RDDs (Data frames)<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The idea of named columns and a schema for the RDD data is borrowed from data frames<\/li>\n\n\n\n<li>Spark allows creation of data frames using specific data frame APIs<\/li>\n\n\n\n<li>An RDD can be created using the standard set of regular data set APIs, a schema can be assigned to the data, and it can then be explicitly converted into a data frame<\/li>\n\n\n\n<li>Spark data frames can be manipulated using the \u201cSpark SQL\u201d library<\/li>\n\n\n\n<li>The standard set of Spark transformation APIs can also be used on Spark data frames<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" 
id=\"creating-a-spark-data-frame\"><strong>Creating a Spark Data frame:<\/strong><\/h2>\n\n\n\n<p><strong>Consider the following CSV data saved in a text file<\/strong><\/p>\n\n\n\n<p>1,Ram,48.78,45<\/p>\n\n\n\n<p>2,Sita,12.45,40<\/p>\n\n\n\n<p>3,Bob,3.34,23<\/p>\n\n\n\n<p>4,Han,16.65,36<\/p>\n\n\n\n<p>5,Ravi,24.6,46<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from pyspark.sql import Row\n\nrdd = sc.textFile(\"sample.csv\")\ncsvrdd = rdd.map(lambda e: e.split(\",\"))\n# Python 3 has no long type, so int is used for id; sal is parsed as a float\nemp = csvrdd.map(lambda e: Row(id=int(e&#91;0]), name=e&#91;1], sal=float(e&#91;2]), age=int(e&#91;3].strip())))\n# sqlContext is provided by the PySpark shell; with a SparkSession named spark, spark.createDataFrame(emp) is the equivalent\nempdf = sqlContext.createDataFrame(emp)<\/code><\/pre>\n\n\n\n<p><strong>Dataframes in Spark:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Unlike an RDD, data is organized into named columns.<\/li>\n\n\n\n<li>Allows developers to impose a structure onto a distributed collection of data.<\/li>\n\n\n\n<li>Enables wider audiences beyond \u201cBig Data\u201d engineers to leverage the power of distributed processing.<\/li>\n\n\n\n<li>Lets Spark manage the schema and pass only the data between nodes, which is much more efficient than using Java serialization.<\/li>\n\n\n\n<li>Spark can serialize the data into off-heap storage in a binary format and then perform many transformations directly on this off-heap memory.<\/li>\n\n\n\n<li>Custom memory management\n<ul class=\"wp-block-list\">\n<li>Data is stored in off-heap memory in binary format.<\/li>\n\n\n\n<li>There is no garbage collection overhead, and expensive Java serialization is avoided.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Query optimization plan\n<ul class=\"wp-block-list\">\n<li>Catalyst Query optimizer<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"datasets-in-spark\"><strong>Datasets in Spark:<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Aims to provide the best of both worlds\n<ul class=\"wp-block-list\">\n<li>RDDs \u2013 OOP and compile-time safety<\/li>\n\n\n\n<li>Data frames \u2013 Catalyst query optimizer, custom memory 
management<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>A Dataset\u2019s advantage over a Data frame is one additional feature: Encoders.<\/li>\n\n\n\n<li>Encoders act as an interface between JVM objects and the off-heap, custom-memory binary format data.<\/li>\n\n\n\n<li>Encoders generate byte code to interact with off-heap data and provide on-demand access to individual attributes without having to deserialize an entire object.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"actions-and-transformations\"><strong>Actions and transformations:<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Transformations are operations that produce a new RDD from an existing one; they are lazy and are not executed until an action is called<\/li>\n\n\n\n<li>A Spark job is a sequence of several transformations<\/li>\n\n\n\n<li>Such a job is usually a program written in Scala or Python<\/li>\n\n\n\n<li>Actions are those operations which trigger the execution of a sequence of transformations<\/li>\n\n\n\n<li>There are more than two dozen transformations and about a dozen actions<\/li>\n\n\n\n<li>A glimpse of the actions and transformations in Spark can be found in the official Spark programming guide<\/li>\n<\/ul>\n\n\n\n<p><a href=\"https:\/\/spark.apache.org\/docs\/2.2.0\/rdd-programming-guide.html\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\" (opens in a new tab)\">https:\/\/spark.apache.org\/docs\/2.2.0\/rdd-programming-guide.html<\/a><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Most of them will be discussed in detail during the demo<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"spark-sql\"><strong>Spark SQL:<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Spark SQL is a Spark module for structured data processing<\/li>\n\n\n\n<li>It lets you query structured data inside Spark programs, using SQL or a familiar DataFrame API.<\/li>\n\n\n\n<li>Connect to any data source the same way\n<ul 
class=\"wp-block-list\">\n<li>Hive, Avro, Parquet, ORC, JSON, and JDBC.<\/li>\n\n\n\n<li>You can even join data across these sources.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Run SQL or HiveQL queries on existing warehouses.<\/li>\n\n\n\n<li>A server mode provides industry-standard JDBC and ODBC connectivity for business intelligence tools.<\/li>\n\n\n\n<li>Writing code against the RDD API in Scala can be difficult; with Spark SQL you can write plain SQL, which is converted internally to Spark API calls and optimized using the DAG and the Catalyst engine.<\/li>\n\n\n\n<li>There is no reduction in performance.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"spark-runtime-architecture\"><strong>Spark Runtime Architecture:<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"spark-runtime-architecture-driver\"><strong>Spark Runtime Architecture \u2013 Driver:<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Master node<\/li>\n\n\n\n<li>The process where the \u201cmain\u201d method runs<\/li>\n\n\n\n<li>Runs user code that\n<ul class=\"wp-block-list\">\n<li>Creates a SparkContext<\/li>\n\n\n\n<li>Performs RDD operations<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>While running, it performs two main duties\n<ul class=\"wp-block-list\">\n<li>Converting the user program into tasks\n<ul class=\"wp-block-list\">\n<li>Logical DAG of operations -&gt; physical execution plan.<\/li>\n\n\n\n<li>Optimization, e.g. pipelining map transformations together to merge them, and converting the execution graph into a set of stages.<\/li>\n\n\n\n<li>Tasks are bundled up and sent to the cluster. 
<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Scheduling tasks on executors\n<ul class=\"wp-block-list\">\n<li>Executors register themselves with the driver<\/li>\n\n\n\n<li>The driver looks at the current set of executors and tries to schedule each task in an appropriate location, based on data placement.<\/li>\n\n\n\n<li>Tasks may store cached data as a side effect of running; the driver tracks the location of this cached data and uses it to schedule future tasks that access it.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"spark-runtime-architecture-spark-context\"><strong>Spark Runtime Architecture \u2013 Spark Context:<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The driver accesses Spark functionality through the SparkContext (sc) object<\/li>\n\n\n\n<li>Represents a connection to the computing cluster<\/li>\n\n\n\n<li>Used to build RDDs<\/li>\n\n\n\n<li>Works with the cluster manager<\/li>\n\n\n\n<li>Manages executors running on worker nodes<\/li>\n\n\n\n<li>Splits jobs into parallel tasks and executes them on worker nodes<\/li>\n\n\n\n<li>Partitions RDDs and distributes them on the cluster<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"spark-runtime-architecture-executor\"><strong>Spark Runtime Architecture \u2013 Executor:<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Runs individual tasks in a given Spark job<\/li>\n\n\n\n<li>Launched once at the beginning of a Spark application<\/li>\n\n\n\n<li>Two main roles\n<ul class=\"wp-block-list\">\n<li>Runs the tasks and returns results to the driver<\/li>\n\n\n\n<li>Provides in-memory storage for RDDs that are cached by user programs, through a service called the Block Manager that lives within each executor.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"spark-runtime-architecture-cluster-manager\"><strong>Spark Runtime Architecture - Cluster Manager<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Launches Executors and sometimes the driver<\/li>\n\n\n\n<li>Allows 
Spark to run on top of different cluster managers\n<ul class=\"wp-block-list\">\n<li>YARN<\/li>\n\n\n\n<li>Mesos<\/li>\n\n\n\n<li>Spark\u2019s built-in standalone cluster manager<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Deploy modes\n<ul class=\"wp-block-list\">\n<li>Client mode<\/li>\n\n\n\n<li>Cluster mode<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"running-spark-applications-on-cluster\"><strong>Running Spark applications on a cluster:<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Submit an application using <strong><em>spark-submit<\/em><\/strong><\/li>\n\n\n\n<li><strong><em>spark-submit<\/em><\/strong> launches the driver program and invokes its main() method<\/li>\n\n\n\n<li>The driver program contacts the cluster manager to ask for resources to launch executors<\/li>\n\n\n\n<li>The cluster manager launches executors on behalf of the driver program.<\/li>\n\n\n\n<li>The driver process runs through the user application and sends work to the executors in the form of tasks.<\/li>\n\n\n\n<li>The executors run the tasks and save the results.<\/li>\n\n\n\n<li>When the driver\u2019s main() method exits or calls SparkContext.stop(), the executors are terminated and the resources are released.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"pyspark-hands-on-spark-dataframes\"><strong>PySpark Hands-on - Spark Dataframes <\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"spark-dataframe-basics\"><strong>Spark DataFrame Basics<\/strong><\/h3>\n\n\n\n<p>Spark DataFrames are the workhorse and main way of working with Spark and Python since Spark 2.0. DataFrames act as powerful versions of tables, with rows and columns, easily handling large datasets. 
The shift to DataFrames provides many advantages:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A much simpler syntax<\/li>\n\n\n\n<li>Ability to use SQL directly in the dataframe<\/li>\n\n\n\n<li>Operations are automatically distributed across RDDs<\/li>\n<\/ul>\n\n\n\n<p>If you've used R or even the pandas library with Python, you are probably already familiar with the concept of DataFrames. Spark DataFrames expand on a lot of these concepts, allowing you to transfer that knowledge easily once you understand their simple syntax. Remember that the main advantage of using Spark DataFrames over those other tools is that Spark can handle data across many RDDs: huge data sets that would never fit on a single computer. That comes at a slight cost of some \"peculiar\" syntax choices, but after this course you will feel very comfortable with all those topics!<\/p>\n\n\n\n<p>Let's get started!<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"creating-a-dataframe\"><strong>Creating a DataFrame<\/strong><\/h3>\n\n\n\n<p>First we need to import SparkSession:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from pyspark.sql import SparkSession<\/code><\/pre>\n\n\n\n<p>Then start the SparkSession:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># May take a little while on a local computer\nspark = SparkSession.builder.appName(\"Basics\").getOrCreate()<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>spark<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/image-40.png\"><img decoding=\"async\" width=\"256\" height=\"196\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/image-40.png\" alt=\"Apache Spark\" class=\"wp-image-19423\" srcset=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/image-40.png 256w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/image-40-80x60.png 80w\" 
sizes=\"(max-width: 256px) 100vw, 256px\" \/><\/figure>\n\n\n\n<p>You will first need to get the data from a file (or connect to a large distributed file system like HDFS; we'll talk about this later once we move to larger datasets on AWS EC2).<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># We'll discuss other read options later.\n# This dataset is from Spark's examples\n\n# Might be a little slow locally\ndf = spark.read.json('people.json')<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"showing-the-data\"><strong>Showing the data<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code># Note how data is missing!\ndf.show()\n\n<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture1.png\"><img decoding=\"async\" width=\"158\" height=\"128\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture1.png\" alt=\"\" class=\"wp-image-19504\"><\/figure>\n\n\n\n<pre class=\"wp-block-code\"><code>df.printSchema()<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture2.png\"><img decoding=\"async\" width=\"316\" height=\"75\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture2.png\" alt=\"\" class=\"wp-image-19505\" srcset=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture2.png 316w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture2-300x71.png 300w\" sizes=\"(max-width: 316px) 100vw, 316px\" \/><\/figure>\n\n\n\n<pre class=\"wp-block-code\"><code>df.columns<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture17.png\"><img decoding=\"async\" width=\"145\" height=\"30\" 
src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture17.png\" alt=\"\" class=\"wp-image-19523\"><\/figure>\n\n\n\n<pre class=\"wp-block-code\"><code>df.describe()<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture18.png\"><img decoding=\"async\" width=\"430\" height=\"36\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture18.png\" alt=\"\" class=\"wp-image-19524\" srcset=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture18.png 430w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture18-300x25.png 300w\" sizes=\"(max-width: 430px) 100vw, 430px\" \/><\/figure>\n\n\n\n<pre class=\"wp-block-code\"><code>df.describe().show()<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture3.png\"><img decoding=\"async\" width=\"324\" height=\"170\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture3.png\" alt=\"\" class=\"wp-image-19506\" srcset=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture3.png 324w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture3-300x157.png 300w\" sizes=\"(max-width: 324px) 100vw, 324px\" \/><\/figure>\n\n\n\n<p>Some data types make it easier to infer schema (like tabular formats such as csv which we will show later). 
<\/p>\n\n\n\n<p>However, you often have to set the schema yourself if you are dealing with a .read method that doesn't have an inferSchema option built in.<\/p>\n\n\n\n<p>Spark has all the tools you need for this; it just requires a very specific structure:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from pyspark.sql.types import StructField,StringType,IntegerType,StructType<\/code><\/pre>\n\n\n\n<p>Next we need to create the list of StructField objects, each of which takes:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>:param name: string, name of the field.<\/li>\n\n\n\n<li>:param dataType: :class:<code>DataType<\/code> of the field.<\/li>\n\n\n\n<li>:param nullable: boolean, whether the field can be null (None) or not<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>data_schema = &#91;StructField(\"age\", IntegerType(), True),StructField(\"name\", StringType(), True)]<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>final_struc = StructType(fields=data_schema)\ndf = spark.read.json('people.json', schema=final_struc)<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>df.printSchema()<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture4.png\"><img decoding=\"async\" width=\"310\" height=\"87\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture4.png\" alt=\"\" class=\"wp-image-19507\" srcset=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture4.png 310w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture4-300x84.png 300w\" sizes=\"(max-width: 310px) 100vw, 310px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"grabbing-the-data\"><strong>Grabbing the data<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>df&#91;'age']<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" 
data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture19.png\"><img decoding=\"async\" width=\"147\" height=\"28\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture19.png\" alt=\"\" class=\"wp-image-19526\"><\/figure>\n\n\n\n<pre class=\"wp-block-code\"><code>type(df&#91;'age'])<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture20.png\"><img decoding=\"async\" width=\"252\" height=\"31\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture20.png\" alt=\"\" class=\"wp-image-19527\"><\/figure>\n\n\n\n<pre class=\"wp-block-code\"><code>df.select('age')<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture21.png\"><img decoding=\"async\" width=\"186\" height=\"31\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture21.png\" alt=\"\" class=\"wp-image-19528\"><\/figure>\n\n\n\n<pre class=\"wp-block-code\"><code>type(df.select('age'))<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture22.png\"><img decoding=\"async\" width=\"273\" height=\"29\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture22.png\" alt=\"\" class=\"wp-image-19529\" srcset=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture22.png 273w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture22-265x29.png 265w\" sizes=\"(max-width: 273px) 100vw, 273px\" \/><\/figure>\n\n\n\n<pre class=\"wp-block-code\"><code>df.select('age').show()<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" 
data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture5.png\"><img decoding=\"async\" width=\"121\" height=\"131\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture5.png\" alt=\"\" class=\"wp-image-19508\"><\/figure>\n\n\n\n<pre class=\"wp-block-code\"><code># Returns list of Row objects\ndf.head(2)<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"multiple-columns\"><strong>Multiple Columns:<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>df.select(&#91;'age','name'])<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture23.png\"><img decoding=\"async\" width=\"294\" height=\"34\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture23.png\" alt=\"\" class=\"wp-image-19530\"><\/figure>\n\n\n\n<pre class=\"wp-block-code\"><code>df.select(&#91;'age','name']).show()<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture6.png\"><img decoding=\"async\" width=\"209\" height=\"149\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture6.png\" alt=\"\" class=\"wp-image-19509\" srcset=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture6.png 209w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture6-100x70.png 100w\" sizes=\"(max-width: 209px) 100vw, 209px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"creating-new-columns\"><strong>Creating new columns<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code># Adding a new column with a simple copy\ndf.withColumn('newage',df&#91;'age']).show()<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" 
data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture7.png\"><img decoding=\"async\" width=\"214\" height=\"134\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture7.png\" alt=\"\" class=\"wp-image-19510\"><\/figure>\n\n\n\n<pre class=\"wp-block-code\"><code>df.show()<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture8.png\"><img decoding=\"async\" width=\"173\" height=\"134\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture8.png\" alt=\"\" class=\"wp-image-19511\"><\/figure>\n\n\n\n<pre class=\"wp-block-code\"><code># Simple Rename\ndf.withColumnRenamed('age','supernewage').show()<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture9.png\"><img decoding=\"async\" width=\"229\" height=\"138\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture9.png\" alt=\"\" class=\"wp-image-19512\"><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"more-complicated-operations-to-create-new-columns\"><strong>More complicated operations to create new columns<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>df.withColumn('doubleage',df&#91;'age']*2).show()<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture10.png\"><img decoding=\"async\" width=\"252\" height=\"130\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture10.png\" alt=\"\" class=\"wp-image-19513\"><\/figure>\n\n\n\n<pre class=\"wp-block-code\"><code>df.withColumn('add_one_age',df&#91;'age']+1).show()<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" 
data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture11.png\"><img decoding=\"async\" width=\"249\" height=\"133\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture11.png\" alt=\"\" class=\"wp-image-19514\"><\/figure>\n\n\n\n<pre class=\"wp-block-code\"><code>df.withColumn('half_age',df&#91;'age']\/2).show()<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture12.png\"><img decoding=\"async\" width=\"225\" height=\"128\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture12.png\" alt=\"\" class=\"wp-image-19515\"><\/figure>\n\n\n\n<pre class=\"wp-block-code\"><code>df.withColumn('half_age',df&#91;'age']\/2)<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture13.png\"><img decoding=\"async\" width=\"443\" height=\"33\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture13.png\" alt=\"\" class=\"wp-image-19516\" srcset=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture13.png 443w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture13-300x22.png 300w\" sizes=\"(max-width: 443px) 100vw, 443px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"using-sql\"><strong>Using SQL<\/strong><\/h3>\n\n\n\n<p>To use SQL queries directly with the dataframe, you will need to register it to a temporary view:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Register the DataFrame as a SQL temporary view\ndf.createOrReplaceTempView(\"people\")<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>sql_results = spark.sql(\"SELECT * FROM people\")<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>sql_results<\/code><\/pre>\n\n\n<figure 
class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture16.png\"><img decoding=\"async\" width=\"277\" height=\"29\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture16.png\" alt=\"\" class=\"wp-image-19518\" srcset=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture16.png 277w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture16-265x29.png 265w\" sizes=\"(max-width: 277px) 100vw, 277px\" \/><\/figure>\n\n\n\n<pre class=\"wp-block-code\"><code>sql_results.show()<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture14.png\"><img decoding=\"async\" width=\"166\" height=\"129\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture14.png\" alt=\"\" class=\"wp-image-19519\"><\/figure>\n\n\n\n<pre class=\"wp-block-code\"><code>spark.sql(\"SELECT * FROM people WHERE age=30\").show()<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture15.png\"><img decoding=\"async\" width=\"138\" height=\"100\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture15.png\" alt=\"\" class=\"wp-image-19520\"><\/figure>\n\n\n\n<pre class=\"wp-block-code\"><code># DataFrame approach\ndf.filter(df.age == 30).show()<\/code><\/pre>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture15-1.png\"><img decoding=\"async\" width=\"138\" height=\"100\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/08\/Capture15-1.png\" alt=\"\" class=\"wp-image-19521\"><\/figure>\n\n\n\n<p>This covers the basics of 
Dataframes.<\/p>\n\n\n\n<p>Happy Learning!<\/p>\n\n\n\n<p><\/p>\n\n\n<figure class=\"wp-block-image size-large zoomable\" data-full=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/09\/June-29-banner-for-GL-big-data-analytics-2-1.png\"><a href=\"https:\/\/www.mygreatlearning.com\/academy\/learn-for-free\/courses\/mastering-big-data-analytics\" target=\"_blank\" rel=\"noreferrer noopener\"><img decoding=\"async\" width=\"1000\" height=\"242\" src=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/09\/June-29-banner-for-GL-big-data-analytics-2-1.png\" alt=\"\" class=\"wp-image-20065\" srcset=\"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/09\/June-29-banner-for-GL-big-data-analytics-2-1.png 1000w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/09\/June-29-banner-for-GL-big-data-analytics-2-1-300x73.png 300w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/09\/June-29-banner-for-GL-big-data-analytics-2-1-768x186.png 768w, https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/09\/June-29-banner-for-GL-big-data-analytics-2-1-696x168.png 696w\" sizes=\"(max-width: 1000px) 100vw, 1000px\" \/><\/a><\/figure>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Spark Intro: Spark is another parallel processing framework. 
However, just not yet another parallel processing framework, Hadoop for example, what we have been seen all this while was a very popular parallel processing framework but it had actually had a lot of shortcomings especially in the area of machine learning, Hence a lot of those [&hellip;]<\/p>\n","protected":false},"author":41,"featured_media":18250,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_uag_custom_page_level_css":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"set","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[9],"tags":[],"content_type":[],"class_list":["post-18222","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v27.3 (Yoast SEO v27.3) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>Apache Spark<\/title>\n<meta name=\"description\" content=\"This article is a detailed tutorial on Apache Spark.\" \/>\n<meta name=\"robots\" content=\"noindex, follow\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Apache Spark\" \/>\n<meta property=\"og:description\" content=\"This article is a detailed 
tutorial on Apache Spark.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.mygreatlearning.com\/blog\/apache-spark\/\" \/>\n<meta property=\"og:site_name\" content=\"Great Learning Blog: Free Resources what Matters to shape your Career!\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/GreatLearningOfficial\/\" \/>\n<meta property=\"article:published_time\" content=\"2020-08-21T12:32:39+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-09-03T09:45:19+00:00\" \/>\n<meta property=\"og:image\" content=\"http:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-04.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1000\" \/>\n\t<meta property=\"og:image:height\" content=\"700\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Great Learning Editorial Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@https:\/\/twitter.com\/Great_Learning\" \/>\n<meta name=\"twitter:site\" content=\"@Great_Learning\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Great Learning Editorial Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"26 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-spark\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-spark\\\/\"},\"author\":{\"name\":\"Great Learning Editorial Team\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/person\\\/6f993d1be4c584a335951e836f2656ad\"},\"headline\":\"Apache Spark\",\"datePublished\":\"2020-08-21T12:32:39+00:00\",\"dateModified\":\"2024-09-03T09:45:19+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-spark\\\/\"},\"wordCount\":4823,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-spark\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2020\\\/07\\\/Blog-Featured-Images-for-Articles-04.jpg\",\"articleSection\":[\"Data Science and Analytics\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-spark\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-spark\\\/\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-spark\\\/\",\"name\":\"Apache 
Spark\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-spark\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-spark\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2020\\\/07\\\/Blog-Featured-Images-for-Articles-04.jpg\",\"datePublished\":\"2020-08-21T12:32:39+00:00\",\"dateModified\":\"2024-09-03T09:45:19+00:00\",\"description\":\"This article is a detailed tutorial on Apache Spark.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-spark\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-spark\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-spark\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2020\\\/07\\\/Blog-Featured-Images-for-Articles-04.jpg\",\"contentUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2020\\\/07\\\/Blog-Featured-Images-for-Articles-04.jpg\",\"width\":1000,\"height\":700},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/apache-spark\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Blog\",\"item\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Data Science and Analytics\",\"item\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/data-science\\\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Apache 
Spark\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/\",\"name\":\"Great Learning Blog\",\"description\":\"Learn, Upskill &amp; Career Development Guide and Resources\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#organization\"},\"alternateName\":\"Great Learning\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#organization\",\"name\":\"Great Learning\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/06\\\/GL-Logo.jpg\",\"contentUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/06\\\/GL-Logo.jpg\",\"width\":900,\"height\":900,\"caption\":\"Great Learning\"},\"image\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/GreatLearningOfficial\\\/\",\"https:\\\/\\\/x.com\\\/Great_Learning\",\"https:\\\/\\\/www.instagram.com\\\/greatlearningofficial\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/school\\\/great-learning\\\/\",\"https:\\\/\\\/in.pinterest.com\\\/greatlearning12\\\/\",\"https:\\\/\\\/www.youtube.com\\\/user\\\/beaconelearning\\\/\"],\"description\":\"Great Learning is a leading global ed-tech company for professional training and higher education. 
It offers comprehensive, industry-relevant, hands-on learning programs across various business, technology, and interdisciplinary domains driving the digital economy. These programs are developed and offered in collaboration with the world's foremost academic institutions.\",\"email\":\"info@mygreatlearning.com\",\"legalName\":\"Great Learning Education Services Pvt. Ltd\",\"foundingDate\":\"2013-11-29\",\"numberOfEmployees\":{\"@type\":\"QuantitativeValue\",\"minValue\":\"1001\",\"maxValue\":\"5000\"}},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/person\\\/6f993d1be4c584a335951e836f2656ad\",\"name\":\"Great Learning Editorial Team\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/02\\\/unnamed.webp\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/02\\\/unnamed.webp\",\"contentUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/02\\\/unnamed.webp\",\"caption\":\"Great Learning Editorial Team\"},\"description\":\"The Great Learning Editorial Staff includes a dynamic team of subject matter experts, instructors, and education professionals who combine their deep industry knowledge with innovative teaching methods. 
Their mission is to provide learners with the skills and insights needed to excel in their careers, whether through upskilling, reskilling, or transitioning into new fields.\",\"sameAs\":[\"https:\\\/\\\/www.mygreatlearning.com\\\/\",\"https:\\\/\\\/in.linkedin.com\\\/school\\\/great-learning\\\/\",\"https:\\\/\\\/x.com\\\/https:\\\/\\\/twitter.com\\\/Great_Learning\",\"https:\\\/\\\/www.youtube.com\\\/channel\\\/UCObs0kLIrDjX2LLSybqNaEA\"],\"award\":[\"Best EdTech Company of the Year 2024\",\"Education Economictimes Outstanding Education\\\/Edtech Solution Provider of the Year 2024\",\"Leading E-learning Platform 2024\"],\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/author\\\/greatlearning\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Apache Spark","description":"This article is a detailed tutorial on Apache Spark.","robots":{"index":"noindex","follow":"follow"},"og_locale":"en_US","og_type":"article","og_title":"Apache Spark","og_description":"This article is a detailed tutorial on Apache Spark.","og_url":"https:\/\/www.mygreatlearning.com\/blog\/apache-spark\/","og_site_name":"Great Learning Blog: Free Resources what Matters to shape your Career!","article_publisher":"https:\/\/www.facebook.com\/GreatLearningOfficial\/","article_published_time":"2020-08-21T12:32:39+00:00","article_modified_time":"2024-09-03T09:45:19+00:00","og_image":[{"width":1000,"height":700,"url":"http:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-04.jpg","type":"image\/jpeg"}],"author":"Great Learning Editorial Team","twitter_card":"summary_large_image","twitter_creator":"@https:\/\/twitter.com\/Great_Learning","twitter_site":"@Great_Learning","twitter_misc":{"Written by":"Great Learning Editorial Team","Est. 
reading time":"26 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.mygreatlearning.com\/blog\/apache-spark\/#article","isPartOf":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/apache-spark\/"},"author":{"name":"Great Learning Editorial Team","@id":"https:\/\/www.mygreatlearning.com\/blog\/#\/schema\/person\/6f993d1be4c584a335951e836f2656ad"},"headline":"Apache Spark","datePublished":"2020-08-21T12:32:39+00:00","dateModified":"2024-09-03T09:45:19+00:00","mainEntityOfPage":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/apache-spark\/"},"wordCount":4823,"commentCount":0,"publisher":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/#organization"},"image":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/apache-spark\/#primaryimage"},"thumbnailUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-04.jpg","articleSection":["Data Science and Analytics"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.mygreatlearning.com\/blog\/apache-spark\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.mygreatlearning.com\/blog\/apache-spark\/","url":"https:\/\/www.mygreatlearning.com\/blog\/apache-spark\/","name":"Apache Spark","isPartOf":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/apache-spark\/#primaryimage"},"image":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/apache-spark\/#primaryimage"},"thumbnailUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-04.jpg","datePublished":"2020-08-21T12:32:39+00:00","dateModified":"2024-09-03T09:45:19+00:00","description":"This article is a detailed tutorial on Apache 
Spark.","breadcrumb":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/apache-spark\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.mygreatlearning.com\/blog\/apache-spark\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.mygreatlearning.com\/blog\/apache-spark\/#primaryimage","url":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-04.jpg","contentUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-04.jpg","width":1000,"height":700},{"@type":"BreadcrumbList","@id":"https:\/\/www.mygreatlearning.com\/blog\/apache-spark\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Blog","item":"https:\/\/www.mygreatlearning.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Data Science and Analytics","item":"https:\/\/www.mygreatlearning.com\/blog\/data-science\/"},{"@type":"ListItem","position":3,"name":"Apache Spark"}]},{"@type":"WebSite","@id":"https:\/\/www.mygreatlearning.com\/blog\/#website","url":"https:\/\/www.mygreatlearning.com\/blog\/","name":"Great Learning Blog","description":"Learn, Upskill &amp; Career Development Guide and Resources","publisher":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/#organization"},"alternateName":"Great Learning","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.mygreatlearning.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.mygreatlearning.com\/blog\/#organization","name":"Great 
Learning","url":"https:\/\/www.mygreatlearning.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.mygreatlearning.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/06\/GL-Logo.jpg","contentUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/06\/GL-Logo.jpg","width":900,"height":900,"caption":"Great Learning"},"image":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/GreatLearningOfficial\/","https:\/\/x.com\/Great_Learning","https:\/\/www.instagram.com\/greatlearningofficial\/","https:\/\/www.linkedin.com\/school\/great-learning\/","https:\/\/in.pinterest.com\/greatlearning12\/","https:\/\/www.youtube.com\/user\/beaconelearning\/"],"description":"Great Learning is a leading global ed-tech company for professional training and higher education. It offers comprehensive, industry-relevant, hands-on learning programs across various business, technology, and interdisciplinary domains driving the digital economy. These programs are developed and offered in collaboration with the world's foremost academic institutions.","email":"info@mygreatlearning.com","legalName":"Great Learning Education Services Pvt. 
Ltd","foundingDate":"2013-11-29","numberOfEmployees":{"@type":"QuantitativeValue","minValue":"1001","maxValue":"5000"}},{"@type":"Person","@id":"https:\/\/www.mygreatlearning.com\/blog\/#\/schema\/person\/6f993d1be4c584a335951e836f2656ad","name":"Great Learning Editorial Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/02\/unnamed.webp","url":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/02\/unnamed.webp","contentUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/02\/unnamed.webp","caption":"Great Learning Editorial Team"},"description":"The Great Learning Editorial Staff includes a dynamic team of subject matter experts, instructors, and education professionals who combine their deep industry knowledge with innovative teaching methods. Their mission is to provide learners with the skills and insights needed to excel in their careers, whether through upskilling, reskilling, or transitioning into new fields.","sameAs":["https:\/\/www.mygreatlearning.com\/","https:\/\/in.linkedin.com\/school\/great-learning\/","https:\/\/x.com\/https:\/\/twitter.com\/Great_Learning","https:\/\/www.youtube.com\/channel\/UCObs0kLIrDjX2LLSybqNaEA"],"award":["Best EdTech Company of the Year 2024","Education Economictimes Outstanding Education\/Edtech Solution Provider of the Year 2024","Leading E-learning Platform 
2024"],"url":"https:\/\/www.mygreatlearning.com\/blog\/author\/greatlearning\/"}]}},"uagb_featured_image_src":{"full":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-04.jpg",1000,700,false],"thumbnail":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-04-150x150.jpg",150,150,true],"medium":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-04-300x210.jpg",300,210,true],"medium_large":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-04-768x538.jpg",768,538,true],"large":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-04.jpg",1000,700,false],"1536x1536":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-04.jpg",1000,700,false],"2048x2048":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-04.jpg",1000,700,false],"web-stories-poster-portrait":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-04.jpg",640,448,false],"web-stories-publisher-logo":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-04.jpg",96,67,false],"web-stories-thumbnail":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2020\/07\/Blog-Featured-Images-for-Articles-04.jpg",150,105,false]},"uagb_author_info":{"display_name":"Great Learning Editorial Team","author_link":"https:\/\/www.mygreatlearning.com\/blog\/author\/greatlearning\/"},"uagb_comment_info":0,"uagb_excerpt":"Spark Intro: Spark is another parallel processing framework. 
However, it is not just another parallel processing framework. Hadoop, for example, which we have seen so far, was a very popular parallel processing framework, but it had a lot of shortcomings, especially in the area of machine learning. Hence, a lot of those&hellip;","_links":{"self":[{"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/posts\/18222","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/users\/41"}],"replies":[{"embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/comments?post=18222"}],"version-history":[{"count":34,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/posts\/18222\/revisions"}],"predecessor-version":[{"id":104704,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/posts\/18222\/revisions\/104704"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/media\/18250"}],"wp:attachment":[{"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/media?parent=18222"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/categories?post=18222"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/tags?post=18222"},{"taxonomy":"content_type","embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/content_type?post=18222"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}