1. KAFKA INTRODUCTION
Apache Kafka is an open-source stream-processing software platform originally developed at LinkedIn and donated to the Apache Software Foundation.
- Independent tool; it has nothing to do with HDFS.
- Runs on its own cluster.
- Once Kafka receives messages, it stores them across multiple machines, much like HDFS does, but on its own cluster.
- Kafka acts as the receiver:
###################################################################
Requirement : Capture the stream
Expectation : The receiver must run all the time
Data processing
Normal Spark Receiver/Flume:
-----------------------------------------------------------
- A normal Spark receiver runs on commodity hardware. If it goes down, a new one is started, but nothing captures the data that arrives in the meantime.
- If the incoming data rate is too high, a normal Spark receiver gets overwhelmed (the same thing happened with Indian Railways, whose system was built on Flume).
- It runs on one machine (one JVM). We can run 2 receivers, but both receive the same data, so there are duplicate copies and more cluster resources are used.
- If the receiver is configured with a 5 sec interval, processing must finish within those 5 sec so that it can take the next message; but if processing itself takes 10 sec then ...
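To make the last point concrete, here is a minimal sketch (plain Python, not Spark code) of a single worker that gets one batch per interval and processes batches one after another. The function name `scheduling_delays` and its parameters are made up for illustration. It only shows how the waiting time grows when processing is slower than the interval.

```python
def scheduling_delays(batch_interval_s, processing_time_s, num_batches):
    """Return how long each batch waits before processing can start.

    Assumes one batch becomes ready at the end of every interval and a
    single worker processes the batches strictly in order.
    """
    delays = []
    worker_free_at = 0  # time at which the worker finishes its current batch
    for k in range(1, num_batches + 1):
        ready_at = k * batch_interval_s
        start_at = max(ready_at, worker_free_at)
        delays.append(start_at - ready_at)
        worker_free_at = start_at + processing_time_s
    return delays


# 5 sec interval, 10 sec processing: every batch waits 5 sec longer than the last.
print(scheduling_delays(5, 10, 4))
# 5 sec interval, 4 sec processing: no batch ever waits.
print(scheduling_delays(5, 4, 4))
```

With processing slower than the interval, the delay never recovers; it keeps growing for as long as data keeps arriving.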
Kafka/RabbitMQ:
-------------------------------------------
Kafka -
- It is NOT a master/slave architecture.
- For fault tolerance, it replicates messages.
- Whatever message we send is stored on the back end in binary form, as files on the file system.
- Able to handle millions of messages per sec easily (LinkedIn receives millions per sec).
- Stream processing is available (Kafka Streams, also offered through Confluent): Kafka later added its own processing engine (similar to Spark), so a little bit of processing can be done in Kafka itself.
Retention: by default it preserves the streams for 7 days, stored as arrays of bytes. A message is not removed whether or not a consumer has consumed it; it is removed only once a retention limit is reached:
- the retention time is exceeded
- the retention size is exceeded
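The two retention rules can be sketched with a toy in-memory log. This is not Kafka's real implementation (Kafka deletes whole log segment files, not single messages), and the class name `RetainedLog` and its methods are made up for illustration. It shows that reading never removes anything and that only the time or size limit does.

```python
from collections import deque


class RetainedLog:
    """Toy append-only log with time-based and size-based retention."""

    def __init__(self, retention_s, retention_bytes):
        self.retention_s = retention_s
        self.retention_bytes = retention_bytes
        self.records = deque()  # (timestamp_s, payload bytes), oldest first
        self.size_bytes = 0

    def append(self, timestamp_s, payload):
        self.records.append((timestamp_s, payload))
        self.size_bytes += len(payload)

    def read_all(self):
        # Consuming is just reading: nothing is deleted here.
        return [payload for _, payload in self.records]

    def enforce_retention(self, now_s):
        # Drop the oldest records while either limit is exceeded.
        while self.records and (
            now_s - self.records[0][0] > self.retention_s
            or self.size_bytes > self.retention_bytes
        ):
            _, payload = self.records.popleft()
            self.size_bytes -= len(payload)


SEVEN_DAYS_S = 7 * 24 * 60 * 60
log = RetainedLog(retention_s=SEVEN_DAYS_S, retention_bytes=1024)
log.append(0, b"tweet-1")
log.read_all()                               # consumed, but still stored
log.enforce_retention(now_s=SEVEN_DAYS_S + 1)  # older than 7 days: removed now
```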
Messages can be in any format (CSV, text, Avro, ...)
Header
----------
Body
###################################################################
1 Tweet (1 msg, must be converted to a byte array) -> Kafka (always receives a byte array) -> Byte array -----> Consumer - Poll (time: 100 ms, size: 100 KB)
If the message size is > 100 KB => an error is thrown, so we need to increase the allowed message size
If the message size is < 100 KB => other messages also come along with it in the same poll
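The flow above can be sketched in plain Python without a Kafka client. Everything here is illustrative: `serialize`, `deserialize`, `poll` and `MAX_POLL_BYTES` are made-up names, JSON is just one possible encoding, and the 100 ms time limit is left out so that only the size rule is shown.

```python
import json

MAX_POLL_BYTES = 100 * 1024  # the 100 KB poll size from the notes


def serialize(tweet):
    """Producer side: turn one message into a byte array."""
    return json.dumps(tweet).encode("utf-8")


def deserialize(payload):
    """Consumer side: turn the byte array back into a message."""
    return json.loads(payload.decode("utf-8"))


def poll(pending, max_bytes=MAX_POLL_BYTES):
    """Take messages from the front of `pending` until the size budget is used."""
    if pending and len(pending[0]) > max_bytes:
        # A single message larger than the limit cannot be delivered at all.
        raise ValueError("message larger than poll size; increase the limit")
    batch, used = [], 0
    while pending and used + len(pending[0]) <= max_bytes:
        message = pending.pop(0)
        batch.append(message)
        used += len(message)
    return batch


pending = [serialize({"id": i, "text": "hello"}) for i in range(3)]
batch = poll(pending)            # small messages travel together in one poll
tweets = [deserialize(m) for m in batch]
```

Small messages are packed into the same poll until the size budget is spent, while a single oversized message raises an error until the limit is raised.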