Found this very good explanation of Kafka (see reference below). Thought I would repost it in case something were to happen to the original blog. Enjoy!
-----------------------------------------------------------------------------------------------------------------------------------
Asynchronous messaging is an important component of any distributed application. Producers and consumers of messages are decoupled: producers send messages to a queue or topic, and consumers consume messages from that queue or topic. The consumers do not have to be running when the message is sent, and new consumers can be added on the fly. For Java programmers, JMS was and is the popular API for programming messaging applications. ActiveMQ, RabbitMQ and MQSeries (henceforth referred to as traditional brokers) are some of the popular message brokers that are widely used. While these brokers are very popular, they do have some limitations when it comes to internet scale applications. Generally their throughput maxes out at a few tens of thousands of messages per second. Also, in many cases, the broker is a single point of failure.
A message broker is a little bit like a database. It takes a message from a producer and stores it. Later, a consumer reads the message. The concepts involved in scaling a message broker are the same as those in scaling databases. Databases are scaled by partitioning the data storage, and we have seen that applied in Hadoop, HBase, Cassandra and many other popular open source projects. Replication adds redundancy and failure tolerance.
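To make the partitioning and replication idea concrete, here is a minimal sketch in Python. It is a toy model and not Kafka's actual partitioner or replica placement: the CRC32 hash, the round-robin placement and the names `partition_for`, `replicas_for` and `broker-0`..`broker-2` are all assumptions made for illustration.

```python
import zlib
from typing import List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition using a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def replicas_for(partition: int, brokers: List[str], replication_factor: int) -> List[str]:
    """Place replicas on consecutive brokers, wrapping around the broker list.

    The first broker in the returned list plays the role of leader in this toy model.
    """
    return [brokers[(partition + i) % len(brokers)] for i in range(replication_factor)]


brokers = ["broker-0", "broker-1", "broker-2"]  # hypothetical broker names
p = partition_for("web-42", 6)                  # same key always maps to the same partition
placement = replicas_for(p, brokers, 2)         # two distinct brokers hold this partition
```

Because the hash is stable, all messages with the same key land in the same partition, and because each partition lives on more than one broker, losing a single machine does not lose the data.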
A common use case in internet companies is that log messages from thousands of servers need to be sent to other servers that do number crunching and analytics. The rate at which messages are produced and consumed is several thousand per second, much higher than in a typical enterprise application. This calls for message brokers that can handle internet scale traffic.
Apache Kafka is an open source message broker that claims to support internet scale traffic. Some key highlights of Kafka are:
- The message broker is a cluster of brokers, so there is partitioning and no single point of failure.
- Producers send messages to Topics.
- Messages in a Topic are partitioned among brokers so that you are not limited by machine size.
- For each topic partition, one broker is the leader.
  - The leader handles reads and writes.
  - Followers replicate the leader.
- For redundancy, partitions can be replicated.
- A topic is like a log file with new messages appended to the end.
- Messages are deleted after a configurable period of time, unlike other messaging systems where a message is deleted after it is consumed. A consumer can re-consume messages if necessary.
- Each consumer maintains the position in the log file where it last read.
- Point to point messaging is implemented using consumer groups. A consumer group is a set of consumers with the same group id. Within a group, each message is delivered to only one member of the group.
- Every message is delivered at least once to every consumer group. You can get publish-subscribe using multiple consumer groups.
- Ordering of messages is preserved per partition. A partition is assigned to a consumer within a consumer group. If you have the same number of partitions and consumers in a group, then each consumer is assigned one partition and will get messages from that partition in order.
- Message delivery: For a producer, once a message is committed, it will be available as long as at least one replica is available. For the consumer, by default, Kafka provides at least once delivery, which means that in case of a crash, a message could be delivered multiple times. However, with each message consumed, Kafka returns the offset in the log file. The offset can be stored with the message consumed, and in the event of a consumer crash, the consumer that takes over can start reading from the stored offset. For both producer and consumer, acknowledgement from the broker is configurable.
- Kafka uses ZooKeeper to store metadata.
- The producer API is easy to use. There are two consumer APIs.
  - The high level API is the simple API to use when you don't want to manage the read offset within the topic. ConsumerConnector is the consumer class in this API and it stores offsets in ZooKeeper.
  - What they call the Simple API is the hard to use API, meant for when you want low level control of read offsets.
- Relies on the file system for storage and caching. Caching is the file system page cache.
- O(1) reads and writes, since messages are written to the end of the log and read sequentially. Reads and writes are batched for further efficiency.
- Developed in the Scala programming language.
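The log, offset and consumer group ideas from the list above can be sketched as a toy in-memory model in Python. This is not the Kafka API: the class `TopicPartition`, its methods and the group names are hypothetical, and for brevity the read position is kept per group inside the partition object rather than by each consumer.

```python
from collections import defaultdict


class TopicPartition:
    """Toy append-only log for a single partition (illustration only)."""

    def __init__(self):
        self.log = []                     # list index doubles as the message offset
        self.offsets = defaultdict(int)   # group id -> next offset to read

    def append(self, message):
        """Append to the end of the log and return the new message's offset."""
        self.log.append(message)
        return len(self.log) - 1

    def poll(self, group_id, max_messages=10):
        """Return the next batch for a group and advance that group's position."""
        start = self.offsets[group_id]
        batch = self.log[start:start + max_messages]
        self.offsets[group_id] = start + len(batch)
        return batch

    def seek(self, group_id, offset):
        """Move a group's read position, for example to re-consume old messages."""
        self.offsets[group_id] = offset


tp = TopicPartition()
for line in ["GET /a", "GET /b", "GET /c"]:
    tp.append(line)

analytics = tp.poll("analytics")   # first group reads all three messages
audit = tp.poll("audit")           # a second group independently sees the same messages
tp.seek("analytics", 1)            # rewind the first group to offset 1
replayed = tp.poll("analytics")    # messages from offset 1 onward are delivered again
```

Reading does not remove anything from the log, which is why a second group gets the same messages (publish-subscribe) and why a group can rewind and re-consume.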
They have a good starter tutorial at http://kafka.apache.org/documentation.html#quickstart, so I will not repeat it. I will, however, write a future tutorial for Java producers and consumers.
Apache Kafka is a suitable choice for a messaging engine when:
- You have a very high volume of messages - several billion per day
- You need high throughput
- You need the broker to be highly available
- You need cross data center replication
- Your messages are logs from web servers
- Some loss of messages is tolerable
On the other hand, there are some concerns to be aware of:
- Compared to JMS, the APIs are low level and hard to use
- The APIs are not well documented. The documentation does not have javadocs
- The APIs are changing and the product is evolving
- Default delivery is at least once delivery. Once and only once delivery requires additional work from the application developer
- The application developer needs to understand lower level storage details like partitions and consumer read offsets within the partition
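As a rough idea of the "additional work" needed to get a once and only once effect on top of at least once delivery, here is a minimal Python sketch. It assumes the application saves the offset together with its result and skips any offset it has already applied. The names `OffsetTrackingStore` and `consume` are hypothetical, and the log is a plain list standing in for one partition.

```python
class OffsetTrackingStore:
    """Toy store that keeps a result (a message count) and the last applied offset together."""

    def __init__(self):
        self.count = 0
        self.last_offset = -1

    def apply(self, offset, message):
        """Apply a message once; return False if this offset was already applied."""
        if offset <= self.last_offset:
            return False          # duplicate delivery after a crash, ignore it
        self.count += 1           # the "work" done for each message
        self.last_offset = offset
        return True


def consume(log, store, start_offset):
    """Deliver every message from start_offset to the end of the log."""
    for offset in range(start_offset, len(log)):
        store.apply(offset, log[offset])


log = ["m0", "m1", "m2", "m3"]
store = OffsetTrackingStore()
consume(log[:3], store, 0)   # first consumer handles offsets 0-2, then crashes
consume(log, store, 1)       # replacement restarts from a stale offset and replays 1-2
```

Even though offsets 1 and 2 are delivered twice, the count ends up at 4 because the stored offset lets the application detect and skip the replayed messages. In a real system the result and the offset would need to be written atomically, for example in the same database transaction.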
Resource: http://khangaonkar.blogspot.com/2014/04/apache-kafka-introduction-should-i-use.html