How to integrate with Spark?

In Kafka-Spark Streaming integration, Kafka acts as a central hub for input data streams. Spark Streaming processes those streams with the Spark engine and then either publishes the results to another Kafka topic or stores them in HDFS, databases, or dashboards.

There are two approaches to this integration:

Receiver-based approach

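The Receiver is implemented using the Kafka high-level consumer API. The data it receives from Kafka is stored in Spark executors, and jobs launched by Spark Streaming then process it. Under the default configuration this approach can lose data on failure; to ensure zero data loss, write-ahead logs must also be enabled so that received data is synchronously saved to a distributed file system such as HDFS.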
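To illustrate why a receiver needs a durable log for zero data loss, here is a minimal sketch in plain Python. It does not use Spark or Kafka: the `Receiver` class, its `memory` and `wal` lists, and the `run` helper are hypothetical stand-ins for executor memory and a write-ahead log on durable storage. In Spark itself, the write-ahead log for receivers is controlled by the `spark.streaming.receiver.writeAheadLog.enable` setting.

```python
class Receiver:
    """Toy receiver: buffers records in memory and, optionally, in a write-ahead log."""

    def __init__(self, use_wal):
        self.use_wal = use_wal
        self.memory = []  # stands in for executor memory (lost on failure)
        self.wal = []     # stands in for durable storage such as HDFS (survives failure)

    def receive(self, record):
        if self.use_wal:
            self.wal.append(record)  # persist the record before buffering it
        self.memory.append(record)

    def crash_and_recover(self):
        # In-memory buffers are lost; only what was logged can be replayed.
        self.memory = list(self.wal)
        return self.memory


def run(use_wal):
    receiver = Receiver(use_wal)
    for record in ["a", "b", "c"]:
        receiver.receive(record)
    # The failure happens before the buffered records are processed.
    return receiver.crash_and_recover()


recovered_without_wal = run(use_wal=False)
recovered_with_wal = run(use_wal=True)
print(recovered_without_wal)  # []
print(recovered_with_wal)     # ['a', 'b', 'c']
```

Without the log, records that were received but not yet processed are gone after the failure; with it, they can be replayed.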

Direct approach

The direct approach is receiver-less: instead of using receivers, it periodically queries Kafka for the latest offsets in each topic and partition, and uses them to define the offset ranges to process in each batch. This makes it easier to read data in parallel, since Spark creates as many RDD partitions as there are Kafka partitions to consume, and it achieves zero data loss without the need for write-ahead logs.

It was introduced in Spark 1.4 for the Python API and Spark 1.3 for the Java and Scala API.
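The offset bookkeeping behind the direct approach can be sketched in plain Python. This is an illustration only, not Spark's implementation: the `OffsetRange` tuple and the `plan_batch` function are hypothetical names, and the offsets are made-up sample values. For each topic partition, the batch covers everything from the last processed offset (inclusive) up to the latest offset reported by Kafka (exclusive).

```python
from typing import Dict, List, NamedTuple, Tuple

TopicPartition = Tuple[str, int]


class OffsetRange(NamedTuple):
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


def plan_batch(last_processed: Dict[TopicPartition, int],
               latest: Dict[TopicPartition, int]) -> List[OffsetRange]:
    """Return one offset range per partition that has new records."""
    ranges = []
    for (topic, partition), until in sorted(latest.items()):
        # Partitions never seen before start from offset 0 in this sketch.
        start = last_processed.get((topic, partition), 0)
        if until > start:
            ranges.append(OffsetRange(topic, partition, start, until))
    return ranges


# Sample state: partition 1 has nothing new, partition 2 is newly discovered.
processed = {("tweets", 0): 100, ("tweets", 1): 40}
latest = {("tweets", 0): 130, ("tweets", 1): 40, ("tweets", 2): 5}

ranges = plan_batch(processed, latest)
for r in ranges:
    print(r)
```

Because each range is independent, the partitions can be read in parallel, and a failed batch can be recomputed by re-reading the same ranges from Kafka.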

The main Kafka-Spark APIs are:

SparkConf API

The SparkConf class is used to set the configuration of a Spark application as key-value pairs.

StreamingContext API

It is the main entry point for Spark Streaming functionality. It wraps the connection to a Spark cluster and is used to create DStreams from input sources. One of its constructors has the signature:

public StreamingContext(String master, String appName, Duration batchDuration,
                        String sparkHome, scala.collection.Seq<String> jars,
                        scala.collection.Map<String,String> environment)

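where master is the cluster URL to connect to, appName is the name of your job, and batchDuration is the time interval at which streaming data is divided into batches. Of the remaining parameters, sparkHome is the location where Spark is installed on the cluster nodes, jars is the collection of JARs to send to the cluster, and environment holds the environment variables to set on the worker nodes.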
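The role of the batch interval can be sketched in plain Python, without Spark. The `split_into_batches` function and the sample events below are hypothetical; the sketch only shows how a fixed interval groups timestamped records into batches.

```python
from collections import defaultdict


def split_into_batches(events, batch_duration):
    """Group (timestamp_seconds, value) events into fixed-width time batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Events in [k * batch_duration, (k + 1) * batch_duration) share batch k.
        batch_index = int(timestamp // batch_duration)
        batches[batch_index].append(value)
    # This sketch skips intervals in which no event arrived.
    return [batches[i] for i in sorted(batches)]


events = [(0.5, "a"), (1.2, "b"), (2.1, "c"), (3.9, "d"), (4.0, "e")]
batches = split_into_batches(events, batch_duration=2)
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```

A shorter interval lowers latency but produces more, smaller batches; a longer one does the opposite.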

KafkaUtils API

This API connects a Kafka cluster to Spark Streaming. For the receiver-based approach, it provides the createStream method, whose signature is:

public static ReceiverInputDStream<scala.Tuple2<String,String>> createStream(
    StreamingContext ssc, String zkQuorum, String groupId,
    scala.collection.immutable.Map<String,Object> topics, StorageLevel storageLevel)

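where ssc is the StreamingContext object, zkQuorum is the ZooKeeper quorum (a comma-separated list of host:port pairs), groupId is the group id for the consumer, topics is a map from each topic name to the number of partitions of that topic to consume (each partition is consumed in its own thread), and storageLevel is the storage level used for storing the received objects.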
