There are multiple usecases where we can think of using Kafka alongside Spark for streaming realtime ETL processing involved in projects like tracking web activities, monitoring servers, detecting anomalies in Engine parts and so on. The architecture involves the source producing data which is sent to a Kafka topic & the consumer processes the data for every predefined batch interval.
Any batch processing logic would need to extract required data from the storage warehouse and depending on the amount of data, this operation would involve a lot of time. Even with HBase / ElasticSearch which allows parallel reads with respect to the region splits / shards, the time involved in reading data is considerable.
Kafka as a Storage System gives us all benefits of the fault-toleran, distributed storage and the throughput of Streaming system.
Apart from usual producer/consumer setup, Kafka also provides ways of rereading messages either from beginning of between certain offset ranges. Though the default retention period of messages in Kafka topic is 30 days, it could be altered or provided as parameter while creating new topic making it an excellent source with Historic + Realtime data.
Using Spark to process messages in Kafka topic would obviously fasten things up as each Spark executor will work on each Kafka partition. So choosing right partitioning logic for your Kafka topic is important if you want to take advantage of the parallelism Spark provides.
The Spark-Kafka integraion provides two ways to consume messages.
In our example Spark application, we would be using KafkaUtils.createRDD.
We shall consider users browsing behaviour data generated from Ecommerce website. Such behaviour data can have a large schema with users IP, browser details and much more. But to keep it simple, we shall use below schema.
Either localhost or remote machines running following services are required for executing this application.
You can refer this post for setting up the environment locally.