Tuesday, December 31, 2019

3. KAFKA : Sample Program1


Example 1: 

Producer - sends a million messages (Java program)
Consumer - receives a million messages (Java program)
https://www.youtube.com/watch?v=1Og9n9FJteM

########################################################################

STEP1:

https://git-scm.com/download/win
   -> Git-2.24.1.2-64-bit.exe
   -> Add C:\Program Files\Git\bin to the PATH

                       

STEP2:

Get sample kafka program from GIT:
https://github.com/mapr-demos/kafka-sample-programs
   -> cd kafka/SampleProgram/
   -> git clone https://github.com/mapr-demos/kafka-sample-programs.git
   -> //It creates a folder called kafka/SampleProgram/kafka-sample-programs

                             


STEP3:

Apache Kafka download - kafka.apache.org/downloads.html
   -> Scala 2.13  - kafka_2.13-2.4.0.tgz
   -> http://apachemirror.wuchna.com/kafka/2.4.0/kafka_2.13-2.4.0.tgz
   -> Unzip kafka_2.13-2.4.0.tgz
   -> cd kafka_2.13-2.4.0/

                         


   -> START ZOOKEEPER:
$ bin/zookeeper-server-start.sh config/zookeeper.properties

   -> START KAFKA SERVER:
$ bin/kafka-server-start.sh config/server.properties &

   (the .sh scripts need a Unix-style shell such as Git Bash; on plain Windows use the .bat equivalents under bin\windows\)

   -> CREATE TOPIC:
$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic fast-messages
$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic summary-markers

   -> Check the list of topics available in the Kafka server
$ bin/kafka-topics.sh --list --zookeeper localhost:2181
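
Topics can also be created and listed from Java via the AdminClient API instead of the shell scripts. A minimal sketch (the class name TopicSetup is only for illustration; it assumes the broker is on localhost:9092 - note the AdminClient talks to the broker, not to ZooKeeper):

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Arrays;
import java.util.Properties;

public class TopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // create both topics: 1 partition, replication factor 1
            admin.createTopics(Arrays.asList(
                    new NewTopic("fast-messages", 1, (short) 1),
                    new NewTopic("summary-markers", 1, (short) 1))).all().get();
            // print every topic the broker knows about
            admin.listTopics().names().get().forEach(System.out::println);
        }
    }
}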


    -> BUILD PRODUCER/CONSUMER program
$ cd ../kafka-sample-programs
$ mvn package   //compiles and creates the target/ folder

                           

                   
    -> Run PRODUCER program
$ target/kafka-example producer

    -> Run CONSUMER program
$ target/kafka-example consumer
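
The actual producer/consumer source is in the cloned kafka-sample-programs repo; the real program differs in details (message format, use of the summary-markers topic). As a minimal sketch of roughly what the two modes do:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class MillionMessages {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        if (args[0].equals("producer")) {
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // send a million messages to the fast-messages topic
                for (int i = 0; i < 1_000_000; i++) {
                    producer.send(new ProducerRecord<>("fast-messages", "msg-" + i));
                }
            } // close() flushes any buffered messages
        } else {
            props.put("group.id", "demo-group");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("fast-messages"));
                // keep polling and print whatever arrives
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                    for (ConsumerRecord<String, String> r : records) {
                        System.out.println(r.offset() + ": " + r.value());
                    }
                }
            }
        }
    }
}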






2. KAFKA ARCHITECTURE




export http_proxy=www-**************.com:80
export https_proxy=www-*************.com:80
wget www.google.com   //quick check that the proxy works

//download Kafka directly on the workstation
wget http://www-us.apache.org/dist/kafka/2.2.1/kafka_2.12-2.2.1.tgz

//inside kafka_2.13-2.4.0\config
zookeeper.properties     //ZooKeeper can also be downloaded and run independently
dataDir=/tmp/zookeeper   //where ZooKeeper physically keeps all znode information (default is under /tmp)
clientPort=2181          //the port ZooKeeper listens on


server.properties //one server.properties per broker
broker.id=0 //mandatory: must be unique per broker; the minimum required property
listeners=PLAINTEXT://:9092 //9092 is the port producers and consumers communicate on
log.dirs=/tmp/kafka-logs
log.retention.hours=168
log.segment.bytes=1073741824   //once a segment reaches this size, Kafka closes it and rolls over to a new segment file
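
For illustration (values are hypothetical), a second broker on the same machine would get its own copy of server.properties with at least these values changed:

broker.id=1                     //must be unique across brokers
listeners=PLAINTEXT://:9093     //a different port on the same machine
log.dirs=/tmp/kafka-logs-1      //a separate log directory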

1. KAFKA INTRODUCTION


Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache Software Foundation.


- An independent tool; nothing to do with HDFS
- Runs on its own cluster
- Once messages are received by Kafka, they are stored on multiple machines (much like HDFS stores data), but on Kafka's own cluster.



- Kafka acts as a receiver:
###################################################################
Requirement : Capturing the stream:

Expectation : The receiver must run all the time
              Data must be processed in time

Normal Spark Receiver/Flume:
-----------------------------------------------------------
- A normal Spark receiver runs on commodity hardware; if it goes down another one comes up, but in the meantime nothing is there to capture the incoming data.
- If the incoming data rate is high, a normal Spark receiver will burst (the same thing happened to Indian Railways, who built on Flume).
- Runs on one machine (JVM). We can run 2 receivers, but both receive the same data, so there are duplicate copies and more cluster resources are used.
- If the receiver is configured for every 5 sec, processing must finish within those 5 sec so it can take the next batch; but if processing itself takes 10 sec then ...

Kafka/RabbitMQ:
-------------------------------------------
Kafka - 
- It is NOT master/slave
- Fault tolerance: it replicates messages
- Whatever message we send is stored at the back end in binary on the file system
- Able to handle millions of messages per second easily (LinkedIn gets millions per second)
- Stream processing is available (Kafka Streams / Confluent): Kafka later started shipping a processing engine too (much like Spark), so a little processing can be done in Kafka itself - see the sketch after this list.
- Retention: by default Kafka preserves streams for 7 days, stored as byte arrays. A message is NOT removed when a consumer consumes it; it is removed only when retention kicks in:
  - retention time is over
  - retention size is exceeded
- Messages can be anything (csv, txt, avro, ...):
      Header
      ----------

      Body
###################################################################

1 Tweet (1 msg must be converted to a byte array)  ->  Kafka (always receives byte arrays)  ->  byte array  ->  Consumer - poll (time: 100 ms, size: 100 KB)

If the message size is > 100 KB => the poll throws an error, so we need to increase the allowed message/fetch size.
If the message size is < 100 KB => other messages are also fetched along with it.
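
As a minimal sketch, the 100 ms / 100 KB knobs above correspond to the consumer's poll timeout and its max.partition.fetch.bytes setting (the values and names below are illustrative; on recent broker versions an oversized record degrades the fetch rather than always throwing a hard error):

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SizedPollConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "sized-poll-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        // cap on data fetched per partition per request (the "100 KB" in the notes);
        // raise this (and the broker's message.max.bytes) for bigger messages
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 100 * 1024);

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("fast-messages"));
            // wait at most 100 ms for data (the "Time : 100ms" in the notes)
            ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(100));
            System.out.println("fetched " + records.count() + " records");
        }
    }
}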