{"id":36312,"date":"2021-06-09T09:39:39","date_gmt":"2021-06-09T04:09:39","guid":{"rendered":"https:\/\/www.mygreatlearning.com\/blog\/pyspark-tutorial-for-beginners\/"},"modified":"2025-01-06T19:05:29","modified_gmt":"2025-01-06T13:35:29","slug":"pyspark-tutorial-for-beginners","status":"publish","type":"post","link":"https:\/\/www.mygreatlearning.com\/blog\/pyspark-tutorial-for-beginners\/","title":{"rendered":"PySpark Tutorial : A beginner\u2019s Guide"},"content":{"rendered":"\n<p>In this guide, you'll learn what PySpark is, why it's used, who uses it, and what everybody should know before diving into PySpark, such as what Big Data, Hadoop, and MapReduce are, as well as a summary of SparkContext, SparkSession, and SQLContext. Check out the <a href=\"https:\/\/www.mygreatlearning.com\/academy\/learn-for-free\/courses\/spark-pyspark\" target=\"_blank\" rel=\"noreferrer noopener\">PySpark course <\/a>to learn PySpark modules such as spark RDDs, spark DataFrame, spark streaming and structured, spark MLlib, spark ml, Graph Frames, and the benefits of PySpark.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"introduction-to-pyspark\"><strong>Introduction to PySpark<\/strong><\/h2>\n\n\n\n<p>Pyspark is an <a href=\"https:\/\/www.mygreatlearning.com\/blog\/apache-spark\/\" target=\"_blank\" rel=\"noreferrer noopener\">Apache Spark<\/a> and Python partnership for Big Data computations. Apache Spark is an open-source cluster-computing framework for large-scale data processing written in Scala and built at UC Berkeley's AMP Lab, while Python is a high-level programming language. Spark was originally written in <a href=\"https:\/\/www.mygreatlearning.com\/blog\/scala-tutorial\/\" target=\"_blank\" rel=\"noreferrer noopener\">Scala<\/a>, and its Framework PySpark was later ported to Python through Py4J due to industry adaptation. 
Py4J is a Java library built into PySpark that lets Python interact with JVM objects dynamically; therefore, to run PySpark, you must have Java installed in addition to Python and Apache Spark.<\/p>\n\n\n\n<p>Spark programmes operate on a cluster, a collection of computers linked together to perform computations on vast amounts of data. Each computer is called a node; some are master nodes and others slave nodes. PySpark is used in distributed systems, where both the data and the computations are distributed: such systems combine the resources of many smaller computers and can provide more cores and more capacity than even a powerful single local computer. Check out How <a href=\"https:\/\/www.mygreatlearning.com\/academy\/learn-for-free\/courses\/data-analysis-using-pyspark\" target=\"_blank\" rel=\"noreferrer noopener\">Data Analysis Using Pyspark<\/a> is done.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"beginning-steps-for-pyspark\"><strong>Beginning steps for PySpark\u200a\u200a<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Connecting to a cluster is the first step in Spark. A cluster is a group of nodes at a remote location; the master node splits the data among the worker nodes, and all the worker nodes report the results of their computations on the data back to the master node. Connecting is as easy as building an object\/instance of the class SparkContext to bind to the cluster.<\/li>\n\n\n\n<li>You may use the SparkContext class to generate a SparkSession object that acts as an interface to the cluster connection. 
Creating several SparkSessions will lead to problems.<\/li>\n\n\n\n<li>pyspark.sql \u2014 the module from which the SparkSession object can be imported.<\/li>\n\n\n\n<li>SparkSession.builder.getOrCreate() \u2014 returns the current SparkSession if one exists, or creates a new one otherwise.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"how-do-we-view-tables\"><strong>How do we view Tables\u200a<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>After building the session, use the Catalog to see what data is used in the cluster.<\/li>\n\n\n\n<li>A Catalog is a SparkSession feature that lists all of the data in the cluster. It provides various methods for retrieving different pieces of metadata.<\/li>\n\n\n\n<li>spark.catalog.listTables() \u2014 this returns a list of all the tables in the catalogue.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"when-do-we-use-pyspark\"><strong>When do we use PySpark?<\/strong><\/h2>\n\n\n\n<p>PySpark is widely used in the Computer Science and <a href=\"https:\/\/www.mygreatlearning.com\/blog\/what-is-machine-learning\/\">Machine Learning<\/a> communities since many widely used data science libraries, such as <a href=\"https:\/\/www.mygreatlearning.com\/blog\/python-numpy-tutorial\/\" target=\"_blank\" rel=\"noreferrer noopener\">NumPy<\/a> and TensorFlow, are written in Python. It is also commonly used for its ability to handle large datasets quickly. Many businesses, including Walmart and Trivago, have used PySpark.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"which-data-is-big-data\"><strong>Which data is Big&nbsp;Data?<\/strong><\/h2>\n\n\n\n<p>Data on a scale of 0\u201332 GB is not considered <a href=\"https:\/\/www.mygreatlearning.com\/blog\/big-data-analytics\/\" target=\"_blank\" rel=\"noreferrer noopener\">Big Data<\/a>, as it will fit in the RAM of a local computer. But what if you have to work with a larger collection of data? 
One option is to move data from RAM to a hard disc, for example into a SQL database. A better option is a distributed system, which distributes the data across several machines\/computers.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"local-vs-distributed\"><strong>local vs distributed<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A local setup permits the use of the computing resources of a single computer.<\/li>\n\n\n\n<li>A distributed setup has access to the combined computational resources of a group of machines connected by a network.<\/li>\n\n\n\n<li>Beyond a certain point, scaling out to multiple lower-CPU machines is easier than scaling up to a single high-CPU unit. Distributed systems also have the advantage of being easily scalable: simply connect more units.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-is-hadoop\"><strong>What is&nbsp;Hadoop?<\/strong><\/h2>\n\n\n\n<p><a href=\"https:\/\/www.mygreatlearning.com\/blog\/apache-hadoop-tutorial\/\" target=\"_blank\" rel=\"noreferrer noopener\">Hadoop<\/a> is a method for distributing extremely large files across a large number of computers.<\/p>\n\n\n\n<p>It employs the Hadoop Distributed File System (HDFS). Users can work with massive quantities of data using HDFS. In addition, HDFS duplicates data blocks for fault tolerance. It then makes use of MapReduce.<\/p>\n\n\n\n<p>MapReduce helps you to perform computations on the data. By default, HDFS splits data into chunks with a maximum size of 128 MB, and each of these chunks is replicated three times. The chunks are distributed in such a way that fault tolerance is provided. Smaller chunks allow for more parallel computing during processing, and keeping several copies of a chunk helps prevent data loss due to node failure, security violations, or errors.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-is-mapreduce\">What is MapReduce?<\/h2>\n\n\n\n<p>Hadoop employs MapReduce, which enables distributed data computations. 
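<\/p>

<p>The map-and-reduce idea can be sketched in plain Python (a toy word count; this illustrates the concept only and is not Hadoop code):<\/p>

```python
from functools import reduce
from collections import Counter

def map_phase(chunk):
    # Map step: emit a partial word count for one chunk of the input.
    # In a real cluster, each chunk would be processed on a different node.
    return Counter(chunk)

def reduce_phase(left, right):
    # Reduce step: merge two partial counts key by key
    return left + right

chunks = [["spark", "hadoop", "spark"], ["hadoop", "mapreduce"]]
partials = [map_phase(c) for c in chunks]   # map runs independently per chunk
totals = reduce(reduce_phase, partials)     # reduce merges the partial results
print(totals)
```

<p>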
MapReduce is a technique for breaking a computational task down so that it runs across a distributed set of files (such as those stored in HDFS). It includes a Job Tracker as well as several Task Trackers.<\/p>\n\n\n\n<p>The Job Tracker sends code to the Task Trackers for execution.<\/p>\n\n\n\n<p>The Task Trackers run on the slave nodes; they allocate resources to the work and report on its progress back to the Job Tracker.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"spark-vs-mapreduce\"><strong>Spark vs. MapReduce<\/strong><\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Spark does not require a distributed file system, whereas MapReduce relies on the Hadoop distributed file system to store files.<\/li>\n\n\n\n<li>Spark outperforms MapReduce by up to 100 times when it comes to running operations.<\/li>\n\n\n\n<li>MapReduce writes the intermediate data to disc after each Map procedure (where the input data is processed and the mapper method produces small chunks of data) and Reduce procedure (where the output of the map stage is processed to produce a new set of output for storage in HDFS), whereas Spark keeps the majority of the data in memory after each transformation.<\/li>\n\n\n\n<li>If the memory in Spark runs out, it will spill over onto the disc.<\/li>\n\n\n\n<li>MapReduce writes the majority of the data to disc after each map and reduce operation.<\/li>\n\n\n\n<li>Spark retains the bulk of the data in memory after each transformation.<\/li>\n<\/ol>\n\n\n\n<p><em><strong>Check out <a href=\"https:\/\/www.mygreatlearning.com\/academy\/learn-for-free\/courses\/spark-basics\">Spark basics <\/a>to handle and optimize Big Data workloads.<\/strong><\/em><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"brief-description-of-apache-spark-and-pyspark\"><strong>Brief description of Apache Spark and&nbsp;PySpark<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Apache Spark is an open-source real-time processing system that analyses and computes data in real time. 
One limitation of MapReduce was its inability to perform real-time processing, which prompted the creation of Apache Spark, which is capable of handling both batch and real-time workloads. It has a cluster manager where applications can be hosted. How are we going to write Spark? \u2014 Scala (the primary language of Apache Spark), Python (PySpark), R (SparkR, sparklyr), and SQL (Spark SQL) are all languages supported by Spark; users can use their favourite libraries from any of these languages and they are ready to go!<br><\/li>\n\n\n\n<li>PySpark is an interface created by the Apache Spark Community to help Python interact with Spark. Because of PySpark, it became possible to process RDDs (Resilient Distributed Datasets) in Python. Because of its comprehensive library collection, Python is used by the vast majority of data scientists and analytics experts today, so Python's integration with Spark is a major bonus for any distributed computing enthusiast.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-is-sparksession\"><strong>What is SparkSession?<\/strong><\/h2>\n\n\n\n<p>Since Spark 2.0, SparkSession has been the portal into PySpark for working with RDDs and DataFrames. SparkContext served as the entry point before version 2.0. Spark 2.0 introduced the SparkSession class, a centralised class that contains all of the contexts that existed before the 2.0 update (SQLContext, HiveContext, etc.). SparkSession may therefore be used in place of SQLContext, HiveContext, and the other pre-2.0 contexts. Even though SparkContext was the entry point before 2.0, it has not been completely replaced by SparkSession; certain SparkContext functions are still present and used in Spark 2.0 and later. 
It should also be noted that SparkSession internally creates a SparkConf and a SparkContext based on the configuration provided to SparkSession.<\/p>\n\n\n\n<p>As previously said, SparkSession serves as the key to PySpark, and creating a SparkSession instance is the first statement you write in order to code with RDDs and DataFrames. A SparkSession is generated using the SparkSession.builder pattern.<\/p>\n\n\n\n<p><strong>Here's how to make a SparkSession:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from pyspark.sql import SparkSession\nspark = SparkSession.builder.appName('rev').getOrCreate()<\/code><\/pre>\n\n\n\n<p>builder \u2014 the builder pattern used to construct a SparkSession.<\/p>\n\n\n\n<p>getOrCreate() \u2014 returns the SparkSession if one already exists, or generates a new one.<\/p>\n\n\n\n<p>appName() \u2014 sets the name of your app (it can be anything).<\/p>\n\n\n\n<p>Some other SparkSession methods are as follows:<\/p>\n\n\n\n<p>getActiveSession() \u2014 returns the active SparkSession if one exists.<\/p>\n\n\n\n<p>version \u2014 obtains the Spark version on which the current application is running.<\/p>\n\n\n\n<p>read \u2014 returns a DataFrameReader object, which is used to read records into a DataFrame from CSV, Parquet, and other file formats.<\/p>\n\n\n\n<p>table() \u2014 returns a DataFrame, which may be a table or a view.<\/p>\n\n\n\n<p>sqlContext \u2014 the underlying SQL context.<\/p>\n\n\n\n<p>stop() \u2014 stops the underlying SparkContext.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-is-sparkcontext\"><strong>What is SparkContext?<\/strong><\/h2>\n\n\n\n<p>SparkContext is the portal through which Apache Spark functionality is accessed. Creating a SparkContext is an important early phase in any Spark driver programme. It allows the Spark application to communicate with the Spark cluster through the resource manager (YARN\/Mesos). A SparkContext cannot be generated before a SparkConf has been created. 
Our Spark driver program uses SparkConf to pass configuration parameters to the SparkContext. This was the entry point before the introduction of SparkSession.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-is-sqlcontext-about\"><strong>What is SQLContext about?&nbsp;<\/strong><\/h2>\n\n\n\n<p>The DataFrame is a more practical choice for structured data. Since the SparkContext has already been created, it is used to build the SQLContext, which must also be specified. SQLContext makes it possible to link the engine to several data sources, and it is used to enable Spark SQL's features.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from pyspark.sql import Row\nfrom pyspark.sql import SQLContext\n\n# sc is an existing SparkContext\nsql_context = SQLContext(sc)<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"how-do-we-get-started-with-pyspark\"><strong>How do we get started with&nbsp;PySpark?<\/strong><\/h2>\n\n\n\n<p>The two approaches I'll describe here are user-friendly and suitable for getting started with PySpark. Both approaches are independent of the local system, so a complex device configuration is unnecessary.<\/p>\n\n\n\n<p>The steps and necessary code snippets are mentioned below \u2014<\/p>\n\n\n\n<p><strong>Approach 1\u200a\u2014\u200aGoogle Colab<\/strong><\/p>\n\n\n\n<p>Using Google Colab is a simple and efficient process. Why are we using Colab? \u2014 Colab is based on Jupyter Notebook, an incredibly scalable platform that leverages Google Docs software. Since it runs on a Google server, we don't need to install anything locally, whether it's Spark, a deep learning model, or a machine learning model. One of Colab's most appealing features is the free GPU and TPU support! Since the GPU support is hosted on Google's cloud, there is no need for any local GPU hardware.<\/p>\n\n\n\n<p>1. 
Since Spark is written in Scala and requires the JDK to work, install the JDK.<\/p>\n\n\n\n<p>2. Download and unzip the Spark package from&nbsp;https:\/\/spark.apache.org\/downloads.html.<\/p>\n\n\n\n<p>3. Install findspark, which locates Spark and initialises the Spark environment.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Installing the JDK here\n!apt-get install openjdk-8-jdk-headless -qq\n\n# Downloading the Spark 2.4.7 package from apache.org\n!wget -q\n\n# Processing the tar file\n!tar xf spark-2.4.7-bin-hadoop2.6.tgz\n\n# Installing findspark in order to locate spark\n!pip install -q findspark<\/code><\/pre>\n\n\n\n<p>4. Set the JAVA_HOME and SPARK_HOME paths so that Java and the Spark package, respectively, can be found.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># importing os package\nimport os\n\n# setting the paths of java, spark which we downloaded before in order to get started \nos.environ&#91;\"JAVA_HOME\"] = \"\/usr\/lib\/jvm\/java-8-openjdk-amd64\"\nos.environ&#91;\"SPARK_HOME\"] = \"\/content\/spark-2.4.7-bin-hadoop2.6\"\n<\/code><\/pre>\n\n\n\n<p>5. Test the installation of all packages with the help of findspark.&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># importing findspark in order to locate spark\nimport findspark\nfindspark.init()\n\nfrom pyspark.sql import SparkSession\n# master is local since the environment is not distributed\nspark = SparkSession.builder.master(\"local&#91;*]\").getOrCreate()<\/code><\/pre>\n\n\n\n<p><strong>Approach 2\u200a\u2014\u200aDataBricks<\/strong><\/p>\n\n\n\n<p>Databricks is a company that provides AWS-based clusters with the convenience of a Notebook System already set up and the ability to easily add data: a comprehensive open data analytics platform for data engineering, big data analytics, machine learning, and data science. The same people who created Databricks also created Apache Spark, Delta Lake, MLflow, and Koalas. 
There are no installation requirements, since it comes with its own cluster and Spark setup to get started with. It has a free community version that supports a 6 GB cluster, and it also has its own file system, called DBFS.&nbsp;For advanced features such as collaboration and using more than one cluster, an upgrade is needed, but for beginners getting acquainted with PySpark, the DataBricks Community Edition is sufficient.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Begin with the DataBricks Community Edition, which is free to use and ideal for getting started. It is free unless you choose to use advanced features such as multiple clusters. Please keep in mind that the Databricks Community Edition only allows for the creation of one driver cluster.<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh4.googleusercontent.com\/V7bE_pYI4wyU_qgOqWfyLV4bhJorsX3CntmG2feTTilHCUK6xRQWahXiDzoaoH4-D7yeM_O8zhbNBNTBGATASiZWaHlJz5cVal0nYzddc48LPJ_yxgM8kyfhOCg1_NS1mWVyXvg\" alt=\"\"\/><\/figure>\n\n\n\n<p><em>Figure 1:Welcome page of DataBricks Community Edition<\/em><\/p>\n\n\n\n<p>2. Click on New Notebook, as seen in Figure 1, to get started with an experience similar to, if not better than, Jupyter notebooks.<\/p>\n\n\n\n<p>3. After you've opened the new notebook, go to the cluster attachment section at the top of the notebook, as shown in figures 2 and 3.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/l7Q15DyUCQsnQ-Hn6y5efK6jGafSBCyYusYXm-gGIWkGRmdlvGONzDIPZnNp-pncSNwNANlkJgYIAvgg2yIWnBNPio8N1EuvC_6PTt9xQJr76ZQMEBYIwNfuNqlxXrQF4KPamAQ\" alt=\"\"\/><\/figure>\n\n\n\n<p><em>Figure 2:Create Cluster<\/em><\/p>\n\n\n\n<p>Give the cluster a name, pick the runtime version, and click on Create Cluster to start the cluster. 
It takes about a minute to start the cluster.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh5.googleusercontent.com\/hnEYDTSqbS2910cEZ_f17LpMhJPbCUwqh2mdajmgPExVEizhMHvA70_GU2UIGGl0aDYeII6NDLCuhHvqBucMhNcdEJlcdFcWWP6CfZ_a5wm7wE-zi4IcNrTlktyB8N_cgawnxtI\" alt=\"\"\/><\/figure>\n\n\n\n<p><em>Figure 3:New Cluster creation window<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"components-of-pyspark\">Components of&nbsp;Pyspark<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Spark Resilient Distributed Datasets (RDDs)- <\/strong>A fundamental PySpark building block consisting of a fault-tolerant, immutable distributed collection of objects. The term \"immutable\" refers to the fact that once an RDD is created, it cannot be changed. An RDD divides its records into logical partitions that can be computed on different cluster nodes. In other words, RDDs are a collection of objects analogous to a list in Python, except that an RDD is partitioned and computed across several physical servers, also known as cluster nodes. The following are the main characteristics of a Resilient Distributed Dataset (RDD):<\/li>\n<\/ol>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Fault tolerance&nbsp;<\/li>\n\n\n\n<li>Distributed data collection<\/li>\n\n\n\n<li>Partitioned parallel operation<\/li>\n\n\n\n<li>The freedom to access a variety of data sources<\/li>\n\n\n\n<li>The operations that are done on RDDs\u200a\u2014\u200a<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Actions are operations that trigger computation and return values from an RDD.<\/li>\n\n\n\n<li>Transformations are&nbsp;lazy operations that return another RDD rather than changing the original RDD.<\/li>\n<\/ul>\n\n\n\n<p>Let\u2019s make an RDD by calling sparkContext.parallelize() with a Python list. In Python, a list is a data structure that contains a set of objects. Objects in a list are enclosed in square brackets, as in [data1, data2, data3]. 
When you generate an RDD in PySpark from data in such a list (that is, a collection held in the PySpark driver's memory), that collection is parallelised across the cluster.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pyspark\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession.builder.appName('rev').getOrCreate()\nemployee = &#91;(\"Radhika\",10), \n        (\"Kaavya\",20), \n        (\"Varnika\",30), \n        (\"Akshay\",40) \n      ]\nrdd = spark.sparkContext.parallelize(employee)<\/code><\/pre>\n\n\n\n<p>If one wants to convert an RDD to a Spark DataFrame, one can use toDF() as follows\u200a\u2014\u200a<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>dataframee = rdd.toDF()\ndataframee.printSchema()\ndataframee.show()\n<\/code><\/pre>\n\n\n\n<p>One can even define column names via the argument that the toDF() function takes, as shown below\u200a\u2014\u200a<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>emp_columns = &#91;\"emp_name\",\"emp_id\"]\ndf = rdd.toDF(emp_columns)\ndf.printSchema()\ndf.show(truncate=False)\n<\/code><\/pre>\n\n\n\n<p>Another way is to use StructType and StructField along with createDataFrame to set up the column names \u2014&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from pyspark.sql.types import StructType,StructField, StringType\nempSchema = StructType(&#91;       \n    StructField('emp_name', StringType(), True),\n    StructField('emp_id', StringType(), True)\n])\n\nemployee_df = spark.createDataFrame(data=employee, schema = empSchema)\nemployee_df.printSchema()\nemployee_df.show(truncate=False)\n<\/code><\/pre>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Spark DataFrame and SQL\u200a\u2014\u200a<\/strong> The DataFrame has a wide range of operations that are very useful when working with data. It can be created from several different data formats, by loading data from JSON, CSV, and so on, as well as from an existing RDD. 
Schemas may also be specified programmatically.<\/li>\n<\/ol>\n\n\n\n<p>Converting a Spark Dataframe to a Pandas Dataframe-<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>After querying a big dataset and aggregating it down to something manageable, a library like pandas can be used.<\/li>\n\n\n\n<li>.toPandas() \u2014 used to convert a Spark DataFrame to a pandas DataFrame.<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code># the select query to fetch attributes from the table\nq = 'SELECT name, place, count(*) as N FROM directory GROUP BY name, place'\n# .sql processes the sql query\nemp_count = spark.sql(q)\n# toPandas() converts the query results to a pandas df\npd_counts = emp_count.toPandas()\n# head() gives the starting records from the df\nprint(pd_counts.head())\n<\/code><\/pre>\n\n\n\n<p>Transitioning from Pandas to Spark\u200a\u2014\u200a<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>createDataFrame() \u2014 this method of the SparkSession class takes a pandas DataFrame and returns a Spark DataFrame. The converted data frame is not saved in the catalogue and cannot be queried using the .sql() method.<\/li>\n\n\n\n<li>createTempView() \u2014 a Spark DataFrame method for creating a temporary table and storing it in the catalogue. It accepts one argument, which is the name under which the table is to be registered.<\/li>\n\n\n\n<li>createOrReplaceTempView() \u2014 this method either builds a new temporary table or replaces an existing one.<\/li>\n<\/ul>\n\n\n\n<p>Spark DataFrames Operations\u200a\u2014\u200a<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Loading a CSV into a Spark DataFrame.<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code># Every field\u2019s data types are automatically guessed with the help of inferSchema\ndf = spark.read.csv('\/FileStore\/tables\/appl_stock.csv', inferSchema=True, header=True)<\/code><\/pre>\n\n\n\n<p>To print the schema. 
A schema is a blueprint that represents the logical view of the records: it defines how the data is organised and the relationships between the records, and it specifies the constraints that apply to the data.<\/p>\n\n\n\n<p># displays the schema<br>df.printSchema()<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>To display a Spark DataFrame in a table format, we use show().&nbsp;<\/li>\n<\/ul>\n\n\n\n<p>df.show()<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>We can also filter records from the Spark DataFrame by applying a certain condition.&nbsp;<\/li>\n<\/ul>\n\n\n\n<p>df.filter('Close &lt; 500').show()<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>We can even select only a few columns from the conditionally filtered results according to our requirements with the help of select().<\/li>\n<\/ul>\n\n\n\n<p>df.filter('Close &lt; 500').select('Open','Close').show()<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>We can use conditional operators for filtering with multiple conditions.&nbsp;<\/li>\n<\/ul>\n\n\n\n<p>~ is not, &amp; is and, | is or, etc. We use round brackets when filtering on two conditions.&nbsp;<\/p>\n\n\n\n<p>df.filter((df['Close'] &lt; 200) &amp; ~(df['Open'] &gt; 200)).show()<\/p>\n\n\n\n<p>\u2022 The collect() function is used to fetch all of the dataset's items (from all nodes) to the driver node. 
On smaller datasets, collect() can be used after operations such as filter() and groupBy().<\/p>\n\n\n\n<p>res = df.filter(df['Low'] == 197.16).collect()<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>We can even convert a Spark DataFrame\u2019s row to a dictionary using asDict().&nbsp;<\/li>\n<\/ul>\n\n\n\n<p>row = res[0]<br>row.asDict()['Volume']<\/p>\n\n\n\n<p>The groupBy() functionality on a DataFrame is used to separate related data into groups and perform aggregate functions on the grouped data.<\/p>\n\n\n\n<p>df.groupBy('Company').count().show()<\/p>\n\n\n\n<p>\u2022 orderBy() is used to sort items in ascending or descending order based on a certain attribute.<\/p>\n\n\n\n<p>df.orderBy(df['Sales'].desc()).show()<\/p>\n\n\n\n<p>\u2022 We can also use aggregate functions like min(), max(), mean(), and count() on grouped data, as shown in the code snippet below. min() returns the minimum of the applied attribute, max() returns the maximum, mean() returns the mean of the data, count() returns the number of instances, avg() returns the average, and stddev() returns the standard deviation (avg() and stddev() are imported from pyspark.sql.functions).<\/p>\n\n\n\n<p>df.groupBy('Company').mean().show()<br>df.groupBy('Company').max().show()<br>df.groupBy('Company').min().show()<br>df.groupBy('Company').count().show()<br>df.select(avg('Sales')).show()<br>df.select(stddev('Sales')).show()<\/p>\n\n\n\n<p>Dates and timestamps\u200a\u2014\u200aIf you're using PySpark for ETL, dates and times are crucial. They are supported on DataFrames and in SQL queries, and work in the same way as in regular SQL. Most of these functions accept input in the form of a Date, Timestamp, or String. 
If you use a String, it should be in a common format that can be converted to a date.<\/p>\n\n\n\n<p>from pyspark.sql.functions import (date_format, format_number, dayofyear, month, year, weekofyear, dayofmonth, hour)<\/p>\n\n\n\n<p>df.select(dayofmonth(df['Date'])).show()<br>df.select(hour(df['Date'])).show()<br>df.select(month(df['Date'])).show()<br>df.select(year(df['Date'])).show()<\/p>\n\n\n\n<p>Also Read: <a href=\"https:\/\/www.mygreatlearning.com\/blog\/apache-spark\/\" target=\"_blank\" rel=\"noreferrer noopener\">Apache Spark<\/a><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Streaming\u200a<\/strong><\/li>\n<\/ol>\n\n\n\n<p><strong>Spark Streaming<\/strong> is a Spark API extension that enables scalable, high-throughput, fault-tolerant processing of live streaming data. Data can be consumed from a variety of sources, including Kafka, Flume, and HDFS\/S3; these are open-source systems for establishing a streaming infrastructure. The data is then processed using algorithms expressed with high-level functions such as map, reduce, join, and window. 
Spark Streaming receives live input data streams and divides them into batches, which are then processed by the Spark engine to generate the final stream of results in batches, as seen in <em>figure 4.<\/em><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh5.googleusercontent.com\/cJV8x2xe1_iiKJKpjj2-Ayl9io9-rshv8gaZe4yYPQBHQhQGCu_iVTl-DzEr49wdmsbWT1uLtlBADHl8yxB12WAYsZ2GDGjvRpCcCi7vkDc60NHHfaNGUBTDQNbrpI42XmFzTFw\" alt=\"\"\/><\/figure>\n\n\n\n<p><em>Figure 4:process flow of Spark Streaming<\/em><\/p>\n\n\n\n<p>The steps for streaming are as follows:<\/p>\n\n\n\n<p>\u2022 The first step is to create a SparkContext.<\/p>\n\n\n\n<p>\u2022 The next step is to build a StreamingContext.<\/p>\n\n\n\n<p>\u2022 Establishing a socket text stream.<\/p>\n\n\n\n<p>\u2022 Reading in the lines as a 'DStream' (discretised stream).<\/p>\n\n\n\n<p>Following the acquisition of the data source, the following steps are taken:<\/p>\n\n\n\n<p>\u2022 Splitting each input line into a list of words.<\/p>\n\n\n\n<p>\u2022 Converting each word to a tuple: (word, 1).<\/p>\n\n\n\n<p>\u2022 Grouping the tuples by key (the word) and summing the second element (the 1s).<\/p>\n\n\n\n<p>\u2022 This will result in word counts such as ('word', 4).<\/p>\n\n\n\n<p><strong>Structured Streaming<\/strong>&nbsp;<\/p>\n\n\n\n<p>The central concept in Structured Streaming is to think of a live data stream as a table that is constantly being appended to. The result is a stream processing model that is quite close to a batch processing model.&nbsp;Think of the input data stream as the \"Input Table\": each data item that arrives on the stream is like a new row appended to the Input Table. The \u201cResult Table\u201d is generated by running a query on the input. Fresh rows are appended to the Input Table per trigger interval (say, every 1 second), which in turn updates the Result Table. 
Whenever the Result Table is updated, we write the changed result rows to an external sink.<\/p>\n\n\n\n<p>4. <strong>Spark Machine Learning Library (MLlib)\u200a\u2014\u200a<\/strong><\/p>\n\n\n\n<p>What exactly is machine learning? \u2014 Machine learning is a method of data analysis that automates the creation of analytical models. Machine learning, which employs algorithms that learn from data iteratively, allows computers to uncover hidden knowledge without being explicitly programmed where to look. Possible use cases include detecting hacking, search engine ranking, real-time ads on blogs, credit appraisal, next-best offers, prediction of equipment failure, new pricing techniques, detection of network intrusions, customer segmentation, text sentiment analysis, customer churn prediction, pattern and image recognition, financial simulation, email spam filtering, and recommendation systems. The MLlib in Spark is mostly designed for supervised and unsupervised learning tasks, and the bulk of its algorithms fall into those two categories.<\/p>\n\n\n\n<p>MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. One of the most important \"features\" of using MLlib is that you format the data so that it eventually has just one or two columns:<\/p>\n\n\n\n<p>Labels and Features (Supervised)<\/p>\n\n\n\n<p>Features (Unsupervised)<\/p>\n\n\n\n<p>At a high level, Spark MLlib provides the following. Common learning algorithms include classification, regression, clustering, and collaborative filtering. Featurization includes feature extraction, transformation, dimensionality reduction, and selection. Pipelines are tools for developing, analysing, and <a href=\"https:\/\/www.mygreatlearning.com\/blog\/what-is-fine-tuning\/\">fine-tuning<\/a> machine learning models. 
Pipelines are part of this infrastructure.<\/p>\n\n\n\n<p>Persistence is the ability to save and reload algorithms, models, and pipelines.<\/p>\n\n\n\n<p>Utilities include linear algebra, statistics, and data handling, among other things.<\/p>\n\n\n\n<p>High-level goals of MLlib \u2014<\/p>\n\n\n\n<p>\u2014 Practical machine learning that is scalable and fast.<\/p>\n\n\n\n<p>\u2014 Simplified development and deployment of scalable machine learning pipelines.<\/p>\n\n\n\n<p><strong>mllib.classification\u200a<\/strong>\u2014\u200a Several binary and multi-class classification techniques are available, built on algorithms such as random forests, decision trees, and gradient-boosted trees.<\/p>\n\n\n\n<p>Decision trees, logistic regression, random forests, naive Bayes, and gradient-boosted trees are examples of binary classification algorithms.<\/p>\n\n\n\n<p>Random forests, naive Bayes, logistic regression, and decision trees are examples of <a href=\"https:\/\/www.mygreatlearning.com\/blog\/multiclass-classification-explained\/\">multiclass classification algorithms<\/a>.<\/p>\n\n\n\n<p><strong>mllib.regression\u200a\u2014\u200a<\/strong> Regression analysis aims to find correlations and dependencies between variables. 
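<\/p>\n\n\n\n<p>To make the regression idea concrete, here is a minimal ordinary-least-squares fit in plain Python on hypothetical points; MLlib's regression estimators perform this kind of fitting at cluster scale:<\/p>\n\n\n\n

```python
def fit_line(xs, ys):
    # Closed-form ordinary least squares for y = slope * x + intercept
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = slope / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Points lying exactly on y = 2x + 1
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)   # 2.0 1.0
```

<p>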
The interface for working with linear regression models is similar to that for logistic regression models.<\/p>\n\n\n\n<p>Regression algorithms such as lasso, ridge regression, decision trees, random forests, and gradient-boosted trees are available.<\/p>\n\n\n\n<p><strong>mllib.clustering\u200a\u2014\u200a<\/strong> Unsupervised learning draws inferences from datasets that contain only input data and no labelled outcomes.<\/p>\n\n\n\n<p>Clustering is the process of grouping data points so that points in the same cluster are more similar to one another than to points in other clusters; in effect, it is a catalogue of objects organised by their similarity and dissimilarity. k-means is a common clustering algorithm that divides data points into a fixed number of clusters. The MLlib implementation uses kmeans||, a parallelized variant of the k-means++ initialisation method. KMeans is implemented as an Estimator and produces a KMeansModel as the base model.<\/p>\n\n\n\n<p><strong>mllib.stat\u200a\u2014\u200a<\/strong> MLlib provides summary statistics for RDDs through the Statistics package's colStats() function. 
colStats() \u2014 for each column, returns the minimum, maximum, mean, variance, number of non-zero values, and total count.<\/p>\n\n\n\n<p>Here's a quick walkthrough of a statistical summary using mllib.<\/p>\n\n\n\n<p>Creating the SparkSession and importing the required packages.<\/p>\n\n\n\n<p>#importing sparksession module<br>from pyspark.sql import SparkSession<\/p>\n\n\n\n<p>#importing statistics module from mllib.stat<br>from pyspark.mllib.stat import Statistics<br>import pandas as pd<\/p>\n\n\n\n<p>#creating spark object<br>spark = SparkSession.builder.appName('StatisticalSummary').getOrCreate()<\/p>\n\n\n\n<p>Reading the data.&nbsp;<\/p>\n\n\n\n<p># reading the spark data<br>data = spark.read.csv('\/FileStore\/tables\/Admission_Prediction.csv', header = True, inferSchema = True)<\/p>\n\n\n\n<p># displaying the data<br>data.show()<\/p>\n\n\n\n<p># printing the schema<br>data.printSchema()<\/p>\n\n\n\n<p># getting the column names<br>data.columns<\/p>\n\n\n\n<p># checking the datatype<br>type(data)<\/p>\n\n\n\n<p>Checking for null values and dealing with them.<\/p>\n\n\n\n<p># Importing necessary sql functions<br>from pyspark.sql.functions import col, count, isnan, when<\/p>\n\n\n\n<p>data.select([count(when(col(c).isNull(), c)).alias(c) for c in data.columns]).show()<\/p>\n\n\n\n<p>The Imputer estimator fills in missing values in a dataset using either the mean or the median of the columns in which the missing values are found, with the mean being the default.<\/p>\n\n\n\n<p># importing imputer from ml.feature module<br>from pyspark.ml.feature import Imputer<\/p>\n\n\n\n<p>#making imputer object<br>imputer = Imputer(inputCols=['GRE Score',<br>'TOEFL Score',<br>'University Rating'], outputCols = ['GRE Score',<br>'TOEFL Score',<br>'University Rating'])<\/p>\n\n\n\n<p># fitted model<br>model = imputer.fit(data)<br>imputed_data = model.transform(data)<\/p>\n\n\n\n<p>imputed_data.select([count(when(col(c).isNull(), c)).alias(c) for c in 
imputed_data.columns]).show()<\/p>\n\n\n\n<p>Making an RDD of the feature vectors and then converting it to a DataFrame.&nbsp;<\/p>\n\n\n\n<p>features = imputed_data.drop('Chance of Admit')<br>column_names = features.columns<br>features_rdd = features.rdd.map(lambda row: row[0:])<\/p>\n\n\n\n<p>features.show()<\/p>\n\n\n\n<p>features_rdd.toDF().show()<\/p>\n\n\n\n<p>Statistical summary of the data.<\/p>\n\n\n\n<p>summary = Statistics.colStats(features_rdd)<br>print(\"a dense vector representing each column's mean value:\\n\", summary.mean())<br>print('\\n')<br>print('column-wise variance:\\n', summary.variance())<br>print('\\n')<br>print('number of nonzeros in each column:\\n', summary.numNonzeros())<\/p>\n\n\n\n<p>Checking for correlation using the Pearson method.<\/p>\n\n\n\n<p>corr_mat=Statistics.corr(features_rdd, method=\"pearson\")<br>corr_df = pd.DataFrame(corr_mat)<br>corr_df.index, corr_df.columns = column_names, column_names<\/p>\n\n\n\n<p>5.<strong> Spark ml<\/strong><\/p>\n\n\n\n<p>spark.ml is a package, introduced in Spark 1.2, that provides a uniform set of high-level APIs to help developers build and tune practical machine learning pipelines. It was released in alpha at first, with the community invited to give feedback on how it suited real-world use cases and how it could be improved.<\/p>\n\n\n\n<p>Here's a quick overview of the building blocks of spark ml pipelines \u2014<\/p>\n\n\n\n<p>\u2022 Transformer (transform()) \u2014 a preprocessing step, such as feature extraction, that turns data into a consumable format by taking an input column and transforming it into an output column; normalising data, tokenization, and converting categorical values to numerical values are a few examples.<\/p>\n\n\n\n<p>\u2022 Estimator (fit()) \u2014 a learning algorithm that trains on (fits) data and returns a model, which is itself a kind of transformer; StringIndexer and OneHotEncoder are two examples.<\/p>\n\n\n\n<p>\u2022 Evaluator \u2014 tests the model's performance using a specific metric \u2014 ROC, RMSE, and so on. 
It helps to automate the model tuning process: the utility of each model is compared, and the best model is chosen for generating predictions.<\/p>\n\n\n\n<p>Spark.ml also has a variety of classification and regression algorithms that you can use to solve your problems.<\/p>\n\n\n\n<p>Here is a brief implementation of a logistic regression problem, with code snippets that might be of assistance.<\/p>\n\n\n\n<p><strong>Creating a spark object and importing a spark session<\/strong><\/p>\n\n\n\n<p>from pyspark.sql import SparkSession<br>spark = SparkSession.builder.appName('log_pro').getOrCreate()<\/p>\n\n\n\n<p>Reading the data and printing the schema.<\/p>\n\n\n\n<p># reading a csv file<br>df = spark.read.csv('\/FileStore\/tables\/customer_churn.csv', inferSchema=True, header=True)<\/p>\n\n\n\n<p># getting the schema of the spark dataframe<br>df.printSchema()<\/p>\n\n\n\n<p>Making a vector assembler object with two arguments: inputCols, which lists all of the feature columns, and outputCol, the name of the combined features column. A vector assembler's job is to combine raw features and features created by various transforms into a single feature vector. 
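<\/p>\n\n\n\n<p>Conceptually, the assembly step just concatenates the chosen input columns into one vector per row. A plain-Python sketch with hypothetical rows (Spark's VectorAssembler adds optimized vector types on top of this idea):<\/p>\n\n\n\n

```python
def assemble(rows, input_cols, output_col='features'):
    # Combine the named columns of each row into a single feature
    # vector, mirroring what a vector assembler produces.
    out = []
    for row in rows:
        new_row = dict(row)
        new_row[output_col] = [row[c] for c in input_cols]
        out.append(new_row)
    return out

rows = [{'Age': 42, 'Years': 7.2, 'Num_Sites': 8, 'churn': 1}]
assembled = assemble(rows, ['Age', 'Years', 'Num_Sites'])
print(assembled[0]['features'])   # [42, 7.2, 8]
```

<p>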
In general, a vector assembler sits in your workflow wherever you need to combine all of your features before training or scoring your model.<\/p>\n\n\n\n<p>#Importing vector assembler from ml.feature module<br>from pyspark.ml.feature import VectorAssembler<\/p>\n\n\n\n<p># creating assembler object<br>assembler = VectorAssembler(inputCols=['Age',<br>'Total_Purchase',<br>'Account_Manager',<br>'Years','Num_Sites'], outputCol = 'features')<\/p>\n\n\n\n<p>Getting the output from the assembler.<\/p>\n\n\n\n<p># output variable has the transformed data<br>output = assembler.transform(df)<br>fin_data = output.select('features','churn')<\/p>\n\n\n\n<p>Splitting the data into train and test sets.<\/p>\n\n\n\n<p>train, test = fin_data.randomSplit([0.7,0.3])<\/p>\n\n\n\n<p>Importing the LogisticRegression class from the ml.classification module and making the model object.<\/p>\n\n\n\n<p>from pyspark.ml.classification import LogisticRegression<\/p>\n\n\n\n<p>lr_model = LogisticRegression(labelCol='churn')<\/p>\n\n\n\n<p>Training the model.&nbsp;<\/p>\n\n\n\n<p>fitted_model = lr_model.fit(train)<\/p>\n\n\n\n<p>#getting the training summary<br>training_summary = fitted_model.summary<\/p>\n\n\n\n<p>Getting the final predictions.<\/p>\n\n\n\n<p>training_summary.predictions.describe().show()<\/p>\n\n\n\n<p>Here\u2019s my detailed Databricks notebook on Logistic Regression\u200a\u2014\u200a<\/p>\n\n\n\n<p><a href=\"https:\/\/databricks-prod-cloudfront.cloud.databricks.com\/public\/4027ec902e239c93eaaa8714f173bcfc\/8231051908381811\/1415228458837848\/5591676128474300\/latest.html\"><strong>log_reg_pro - Databricks<\/strong><br><em>Edit description<\/em>databricks-prod-cloudfront.cloud.databricks.com<\/a><\/p>\n\n\n\n<p>Spark ml also offers clustering algorithms, such as k-means.<\/p>\n\n\n\n<p>Here\u2019s my detailed clustering Databricks notebook for reference\u200a\u2014\u200a<\/p>\n\n\n\n<p><a 
href=\"https:\/\/databricks-prod-cloudfront.cloud.databricks.com\/public\/4027ec902e239c93eaaa8714f173bcfc\/8231051908381811\/3012407350223290\/5591676128474300\/latest.html\"><strong>K_means_clustering - Databricks<\/strong><br><em>Edit description<\/em>databricks-prod-cloudfront.cloud.databricks.com<\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"spark-mllib-vs-spark-ml\"><strong>Spark Mllib vs Spark ML<\/strong><\/h2>\n\n\n\n<p>6. <strong>Spark Serializers<\/strong><\/p>\n\n\n\n<p>Serialization is used to fine-tune the performance of Apache Spark. Data should be serialised when it is sent over the network, written to disc, or stored in memory. In high-cost operations, serialisation is critical.<\/p>\n\n\n\n<p>PySpark allows you to fine-tune output by using custom serializers. PySpark supports the two serializers mentioned below:<\/p>\n\n\n\n<p>MarshalSerializer \u2014 The Marshal Serializer in Python is used to serialise objects. While this serializer is faster than PickleSerializer, it only supports a subset of data types.<\/p>\n\n\n\n<p>PickleSerializer \u2014 The Pickle Serializer in Python is used to serialise objects. This serializer can handle almost any Python object, although it is likely to be slower than more sophisticated serializers.<\/p>\n\n\n\n<p>7. <strong>Spark GraphFrames<\/strong><\/p>\n\n\n\n<p>GraphFrames is an Apache Spark package that provides graphs based on DataFrames. It is compatible with high-level APIs written in Java, Python, and Scala. It aims to include GraphX features as well as extended capabilities in Python and Scala through the use of Spark DataFrames. Motif recognition, DataFrame-based serialisation, and highly expressive graph queries are among the latest features.<\/p>\n\n\n\n<p>Graph Theory and Graph Processing\u200a\u2014\u200aGraph analysis is an essential aspect of science that has a wide variety of applications. 
The basic goal of graph theory and graph processing is to define the associations between nodes and edges: vertices (nodes) represent the entities, and edges define the relationships between them. This is ideal for social network analysis and for running algorithms like PageRank to understand and weigh relationships.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"pros-of-using-pyspark\"><strong>Pros of using&nbsp;PySpark<\/strong><\/h2>\n\n\n\n<p>\u2022 PySpark is an in-memory distributed processing engine that lets you process data efficiently in a distributed manner.<\/p>\n\n\n\n<p>\u2022 Applications running on Spark can be up to 100 times faster in memory than traditional MapReduce jobs.<\/p>\n\n\n\n<p>\u2022 PySpark is well suited to data ingestion pipelines: it can process data from Hadoop HDFS, AWS S3, and a host of other file systems.<\/p>\n\n\n\n<p>\u2022 PySpark is also used to process real-time data, through Spark Streaming and Kafka.<\/p>\n\n\n\n<p>\u2022 With PySpark streaming, you can stream data from the file system as well as from a socket.<\/p>\n\n\n\n<p>\u2022 PySpark also provides machine learning and graph libraries.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"conclusion\"><strong>Conclusion<\/strong><\/h2>\n\n\n\n<p>PySpark is a great place to start with Big Data processing. You discovered in this guide that if you\u2019re familiar with a few practical programming principles like map(), filter(), and basic Python, you don\u2019t have to spend a lot of time learning upfront. You can use any Python tool you're already familiar with in your PySpark programmes, such as NumPy and Pandas. You should now be able to understand core Big Data concepts, build clear PySpark programmes, and run them on small datasets on your local computer. 
From there, you can examine more powerful Big Data deployments, such as a Spark cluster or a custom, hosted solution.<\/p>\n\n\n\n<p>Take up a free <a href=\"https:\/\/www.mygreatlearning.com\/academy\/learn-for-free\/courses\/spark-pyspark\">PySpark course<\/a> and get a course completion certificate from Great Learning. Our <a href=\"https:\/\/www.mygreatlearning.com\/pyspark\/free-courses\" target=\"_blank\" rel=\"noreferrer noopener\">PySpark courses<\/a> are designed for those who want to gain practical skills in data processing and analysis using this powerful tool. Whether you're a beginner or have some experience with Python, our courses will help you take your skills to the next level.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this guide, you'll learn what PySpark is, why it's used, who uses it, and what everybody should know before diving into PySpark, such as what Big Data, Hadoop, and MapReduce are, as well as a summary of SparkContext, SparkSession, and SQLContext. 
Check out the PySpark course to learn PySpark modules such as spark RDDs, [&hellip;]<\/p>\n","protected":false},"author":41,"featured_media":36315,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_uag_custom_page_level_css":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"set","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[25860],"tags":[36796],"content_type":[],"class_list":["post-36312","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-software","tag-python"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v27.3 (Yoast SEO v27.3) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>PySpark Tutorial : A beginner\u2019s Guide<\/title>\n<meta name=\"description\" content=\"Pyspark is an Apache Spark which is an open-source cluster-computing framework for large-scale data processing written in Scala.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" 
href=\"https:\/\/www.mygreatlearning.com\/blog\/pyspark-tutorial-for-beginners\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"PySpark Tutorial : A beginner\u2019s Guide\" \/>\n<meta property=\"og:description\" content=\"Pyspark is an Apache Spark which is an open-source cluster-computing framework for large-scale data processing written in Scala.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.mygreatlearning.com\/blog\/pyspark-tutorial-for-beginners\/\" \/>\n<meta property=\"og:site_name\" content=\"Great Learning Blog: Free Resources what Matters to shape your Career!\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/GreatLearningOfficial\/\" \/>\n<meta property=\"article:published_time\" content=\"2021-06-09T04:09:39+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-01-06T13:35:29+00:00\" \/>\n<meta property=\"og:image\" content=\"http:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2021\/06\/iStock-805183646.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1254\" \/>\n\t<meta property=\"og:image:height\" content=\"836\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Great Learning Editorial Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@https:\/\/twitter.com\/Great_Learning\" \/>\n<meta name=\"twitter:site\" content=\"@Great_Learning\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Great Learning Editorial Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"23 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/pyspark-tutorial-for-beginners\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/pyspark-tutorial-for-beginners\\\/\"},\"author\":{\"name\":\"Great Learning Editorial Team\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/person\\\/6f993d1be4c584a335951e836f2656ad\"},\"headline\":\"PySpark Tutorial : A beginner\u2019s Guide\",\"datePublished\":\"2021-06-09T04:09:39+00:00\",\"dateModified\":\"2025-01-06T13:35:29+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/pyspark-tutorial-for-beginners\\\/\"},\"wordCount\":5228,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/pyspark-tutorial-for-beginners\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2021\\\/06\\\/iStock-805183646.jpg\",\"keywords\":[\"python\"],\"articleSection\":[\"IT\\\/Software Development\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/pyspark-tutorial-for-beginners\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/pyspark-tutorial-for-beginners\\\/\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/pyspark-tutorial-for-beginners\\\/\",\"name\":\"PySpark Tutorial : A beginner\u2019s 
Guide\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/pyspark-tutorial-for-beginners\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/pyspark-tutorial-for-beginners\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2021\\\/06\\\/iStock-805183646.jpg\",\"datePublished\":\"2021-06-09T04:09:39+00:00\",\"dateModified\":\"2025-01-06T13:35:29+00:00\",\"description\":\"Pyspark is an Apache Spark which is an open-source cluster-computing framework for large-scale data processing written in Scala.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/pyspark-tutorial-for-beginners\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/pyspark-tutorial-for-beginners\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/pyspark-tutorial-for-beginners\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2021\\\/06\\\/iStock-805183646.jpg\",\"contentUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2021\\\/06\\\/iStock-805183646.jpg\",\"width\":1254,\"height\":836,\"caption\":\"Glowing source code example snippet written in the Python programming language.\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/pyspark-tutorial-for-beginners\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Blog\",\"item\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"IT\\\/Software 
Development\",\"item\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/software\\\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"PySpark Tutorial : A beginner\u2019s Guide\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/\",\"name\":\"Great Learning Blog\",\"description\":\"Learn, Upskill &amp; Career Development Guide and Resources\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#organization\"},\"alternateName\":\"Great Learning\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#organization\",\"name\":\"Great Learning\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/06\\\/GL-Logo.jpg\",\"contentUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/06\\\/GL-Logo.jpg\",\"width\":900,\"height\":900,\"caption\":\"Great 
Learning\"},\"image\":{\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/GreatLearningOfficial\\\/\",\"https:\\\/\\\/x.com\\\/Great_Learning\",\"https:\\\/\\\/www.instagram.com\\\/greatlearningofficial\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/school\\\/great-learning\\\/\",\"https:\\\/\\\/in.pinterest.com\\\/greatlearning12\\\/\",\"https:\\\/\\\/www.youtube.com\\\/user\\\/beaconelearning\\\/\"],\"description\":\"Great Learning is a leading global ed-tech company for professional training and higher education. It offers comprehensive, industry-relevant, hands-on learning programs across various business, technology, and interdisciplinary domains driving the digital economy. These programs are developed and offered in collaboration with the world's foremost academic institutions.\",\"email\":\"info@mygreatlearning.com\",\"legalName\":\"Great Learning Education Services Pvt. Ltd\",\"foundingDate\":\"2013-11-29\",\"numberOfEmployees\":{\"@type\":\"QuantitativeValue\",\"minValue\":\"1001\",\"maxValue\":\"5000\"}},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/#\\\/schema\\\/person\\\/6f993d1be4c584a335951e836f2656ad\",\"name\":\"Great Learning Editorial Team\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/02\\\/unnamed.webp\",\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/02\\\/unnamed.webp\",\"contentUrl\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/wp-content\\\/uploads\\\/2022\\\/02\\\/unnamed.webp\",\"caption\":\"Great Learning Editorial Team\"},\"description\":\"The Great Learning Editorial Staff includes a dynamic team of subject matter experts, instructors, and education professionals who combine their deep industry knowledge with innovative teaching methods. 
Their mission is to provide learners with the skills and insights needed to excel in their careers, whether through upskilling, reskilling, or transitioning into new fields.\",\"sameAs\":[\"https:\\\/\\\/www.mygreatlearning.com\\\/\",\"https:\\\/\\\/in.linkedin.com\\\/school\\\/great-learning\\\/\",\"https:\\\/\\\/x.com\\\/https:\\\/\\\/twitter.com\\\/Great_Learning\",\"https:\\\/\\\/www.youtube.com\\\/channel\\\/UCObs0kLIrDjX2LLSybqNaEA\"],\"award\":[\"Best EdTech Company of the Year 2024\",\"Education Economictimes Outstanding Education\\\/Edtech Solution Provider of the Year 2024\",\"Leading E-learning Platform 2024\"],\"url\":\"https:\\\/\\\/www.mygreatlearning.com\\\/blog\\\/author\\\/greatlearning\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"PySpark Tutorial : A beginner\u2019s Guide","description":"Pyspark is an Apache Spark which is an open-source cluster-computing framework for large-scale data processing written in Scala.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.mygreatlearning.com\/blog\/pyspark-tutorial-for-beginners\/","og_locale":"en_US","og_type":"article","og_title":"PySpark Tutorial : A beginner\u2019s Guide","og_description":"Pyspark is an Apache Spark which is an open-source cluster-computing framework for large-scale data processing written in Scala.","og_url":"https:\/\/www.mygreatlearning.com\/blog\/pyspark-tutorial-for-beginners\/","og_site_name":"Great Learning Blog: Free Resources what Matters to shape your 
Career!","article_publisher":"https:\/\/www.facebook.com\/GreatLearningOfficial\/","article_published_time":"2021-06-09T04:09:39+00:00","article_modified_time":"2025-01-06T13:35:29+00:00","og_image":[{"width":1254,"height":836,"url":"http:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2021\/06\/iStock-805183646.jpg","type":"image\/jpeg"}],"author":"Great Learning Editorial Team","twitter_card":"summary_large_image","twitter_creator":"@Great_Learning","twitter_site":"@Great_Learning","twitter_misc":{"Written by":"Great Learning Editorial Team","Est. reading time":"23 minutes"},
"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.mygreatlearning.com\/blog\/pyspark-tutorial-for-beginners\/#article","isPartOf":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/pyspark-tutorial-for-beginners\/"},"author":{"name":"Great Learning Editorial Team","@id":"https:\/\/www.mygreatlearning.com\/blog\/#\/schema\/person\/6f993d1be4c584a335951e836f2656ad"},"headline":"PySpark Tutorial : A beginner\u2019s Guide","datePublished":"2021-06-09T04:09:39+00:00","dateModified":"2025-01-06T13:35:29+00:00","mainEntityOfPage":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/pyspark-tutorial-for-beginners\/"},"wordCount":5228,"commentCount":0,"publisher":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/#organization"},"image":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/pyspark-tutorial-for-beginners\/#primaryimage"},"thumbnailUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2021\/06\/iStock-805183646.jpg","keywords":["python"],"articleSection":["IT\/Software Development"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.mygreatlearning.com\/blog\/pyspark-tutorial-for-beginners\/#respond"]}]},
{"@type":"WebPage","@id":"https:\/\/www.mygreatlearning.com\/blog\/pyspark-tutorial-for-beginners\/","url":"https:\/\/www.mygreatlearning.com\/blog\/pyspark-tutorial-for-beginners\/","name":"PySpark Tutorial : A beginner\u2019s Guide","isPartOf":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/pyspark-tutorial-for-beginners\/#primaryimage"},"image":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/pyspark-tutorial-for-beginners\/#primaryimage"},"thumbnailUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2021\/06\/iStock-805183646.jpg","datePublished":"2021-06-09T04:09:39+00:00","dateModified":"2025-01-06T13:35:29+00:00","description":"PySpark is the Python API for Apache Spark, an open-source cluster-computing framework for large-scale data processing written in Scala.","breadcrumb":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/pyspark-tutorial-for-beginners\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.mygreatlearning.com\/blog\/pyspark-tutorial-for-beginners\/"]}]},
{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.mygreatlearning.com\/blog\/pyspark-tutorial-for-beginners\/#primaryimage","url":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2021\/06\/iStock-805183646.jpg","contentUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2021\/06\/iStock-805183646.jpg","width":1254,"height":836,"caption":"Glowing source code example snippet written in the Python programming language."},
{"@type":"BreadcrumbList","@id":"https:\/\/www.mygreatlearning.com\/blog\/pyspark-tutorial-for-beginners\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Blog","item":"https:\/\/www.mygreatlearning.com\/blog\/"},{"@type":"ListItem","position":2,"name":"IT\/Software Development","item":"https:\/\/www.mygreatlearning.com\/blog\/software\/"},{"@type":"ListItem","position":3,"name":"PySpark Tutorial : A beginner\u2019s Guide"}]},
{"@type":"WebSite","@id":"https:\/\/www.mygreatlearning.com\/blog\/#website","url":"https:\/\/www.mygreatlearning.com\/blog\/","name":"Great Learning Blog","description":"Learn, Upskill &amp; Career Development Guide and Resources","publisher":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/#organization"},"alternateName":"Great Learning","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.mygreatlearning.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},
{"@type":"Organization","@id":"https:\/\/www.mygreatlearning.com\/blog\/#organization","name":"Great Learning","url":"https:\/\/www.mygreatlearning.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.mygreatlearning.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/06\/GL-Logo.jpg","contentUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/06\/GL-Logo.jpg","width":900,"height":900,"caption":"Great Learning"},"image":{"@id":"https:\/\/www.mygreatlearning.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/GreatLearningOfficial\/","https:\/\/x.com\/Great_Learning","https:\/\/www.instagram.com\/greatlearningofficial\/","https:\/\/www.linkedin.com\/school\/great-learning\/","https:\/\/in.pinterest.com\/greatlearning12\/","https:\/\/www.youtube.com\/user\/beaconelearning\/"],"description":"Great Learning is a leading global ed-tech company for professional training and higher education. It offers comprehensive, industry-relevant, hands-on learning programs across various business, technology, and interdisciplinary domains driving the digital economy. These programs are developed and offered in collaboration with the world's foremost academic institutions.","email":"info@mygreatlearning.com","legalName":"Great Learning Education Services Pvt. Ltd","foundingDate":"2013-11-29","numberOfEmployees":{"@type":"QuantitativeValue","minValue":"1001","maxValue":"5000"}},
{"@type":"Person","@id":"https:\/\/www.mygreatlearning.com\/blog\/#\/schema\/person\/6f993d1be4c584a335951e836f2656ad","name":"Great Learning Editorial Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/02\/unnamed.webp","url":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/02\/unnamed.webp","contentUrl":"https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2022\/02\/unnamed.webp","caption":"Great Learning Editorial Team"},"description":"The Great Learning Editorial Staff includes a dynamic team of subject matter experts, instructors, and education professionals who combine their deep industry knowledge with innovative teaching methods. Their mission is to provide learners with the skills and insights needed to excel in their careers, whether through upskilling, reskilling, or transitioning into new fields.","sameAs":["https:\/\/www.mygreatlearning.com\/","https:\/\/in.linkedin.com\/school\/great-learning\/","https:\/\/x.com\/Great_Learning","https:\/\/www.youtube.com\/channel\/UCObs0kLIrDjX2LLSybqNaEA"],"award":["Best EdTech Company of the Year 2024","Education Economictimes Outstanding Education\/Edtech Solution Provider of the Year 2024","Leading E-learning Platform 2024"],"url":"https:\/\/www.mygreatlearning.com\/blog\/author\/greatlearning\/"}]}},
"uagb_featured_image_src":{"full":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2021\/06\/iStock-805183646.jpg",1254,836,false],"thumbnail":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2021\/06\/iStock-805183646-150x150.jpg",150,150,true],"medium":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2021\/06\/iStock-805183646-300x200.jpg",300,200,true],"medium_large":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2021\/06\/iStock-805183646-768x512.jpg",768,512,true],"large":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2021\/06\/iStock-805183646-1024x683.jpg",1024,683,true],"1536x1536":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2021\/06\/iStock-805183646.jpg",1254,836,false],"2048x2048":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2021\/06\/iStock-805183646.jpg",1254,836,false],"web-stories-poster-portrait":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2021\/06\/iStock-805183646-640x836.jpg",640,836,true],"web-stories-publisher-logo":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2021\/06\/iStock-805183646-96x96.jpg",96,96,true],"web-stories-thumbnail":["https:\/\/www.mygreatlearning.com\/blog\/wp-content\/uploads\/2021\/06\/iStock-805183646-150x100.jpg",150,100,true]},
"uagb_author_info":{"display_name":"Great Learning Editorial Team","author_link":"https:\/\/www.mygreatlearning.com\/blog\/author\/greatlearning\/"},"uagb_comment_info":0,"uagb_excerpt":"In this guide, you'll learn what PySpark is, why it's used, who uses it, and what everybody should know before diving into PySpark, such as what Big Data, Hadoop, and MapReduce are, as well as a summary of SparkContext, SparkSession, and SQLContext. Check out the PySpark course to learn PySpark modules such as spark RDDs,&hellip;",
"_links":{"self":[{"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/posts\/36312","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/users\/41"}],"replies":[{"embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/comments?post=36312"}],"version-history":[{"count":21,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/posts\/36312\/revisions"}],"predecessor-version":[{"id":114704,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/posts\/36312\/revisions\/114704"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/media\/36315"}],"wp:attachment":[{"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/media?parent=36312"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/categories?post=36312"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/tags?post=36312"},{"taxonomy":"content_type","embeddable":true,"href":"https:\/\/www.mygreatlearning.com\/blog\/wp-json\/wp\/v2\/content_type?post=36312"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}