Introduction to Kafka (Part 1)

Madhura Mehendale
5 min read · May 23, 2021

This is the first in a series of posts about basic Kafka concepts.

What is Kafka?

The definition from the documentation is as follows.

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

Basically, Kafka is a platform to transfer events across systems in real time. Kafka only takes care of transferring the events, not processing them. It acts as short-lived storage for real-time data that needs to be processed. Additionally, it maintains the ordering of the event data it gathers: events are consumed in the order they were added (much like a queue).

Why Kafka over database?

I think it will be quite obvious once we finish our discussion about Kafka functionalities that it’s quite a complicated task to build all these features over a traditional database. Let’s revisit this later.

Topics and Partitions

Events in Kafka are organized as Topics (similar to tables in a database). Topics are identified by name, so topic names must be unique. For example, in an application one Kafka topic might record user activities on a site, another might record API requests that need asynchronous processing, and so on. Kafka is widely used for communication among services within a system built on a microservices architecture.

The events in a given topic are spread across buckets known as partitions. Each topic can have one or more partitions. We will get to the number of partitions needed for a topic and the reason for having partitions at a later point when we cover consumer and consumer groups.

For a given topic, a single event is added to only one of the partitions. The producer can specify the partition to which a message should be added; if it doesn't, Kafka hashes the message key (when one is provided) so that messages with the same key always land in the same partition, and falls back to a round-robin mechanism for keyless messages. Ordering of messages is maintained only within a single partition, not across partitions.
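As a loose illustration (not the real Kafka client code), the partition choice described above can be sketched like this; the function name, the CRC hash, and the partition count are all invented for the example:

```python
# Toy producer-side partitioner: an explicit partition wins, a key is
# hashed deterministically, and keyless messages rotate round-robin.
import itertools
from zlib import crc32

NUM_PARTITIONS = 3
_round_robin = itertools.cycle(range(NUM_PARTITIONS))

def choose_partition(key=None, partition=None):
    if partition is not None:           # producer pinned a partition
        return partition
    if key is not None:                 # same key -> same partition, always
        return crc32(key.encode()) % NUM_PARTITIONS
    return next(_round_robin)           # keyless: spread evenly

print(choose_partition(partition=1))            # explicit choice
print(choose_partition(key="user-42"))          # deterministic per key
print([choose_partition() for _ in range(4)])   # round-robin: [0, 1, 2, 0]
```

Note how the key-based path guarantees that all events for, say, one user end up in one partition, which is exactly what preserves per-key ordering.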

Each message in a partition is identified by a numerical offset (similar to an index into an array). An offset is specific to a single partition of the topic, so offsets repeat across partitions.

Topics and partitions.

Kafka Brokers

A Kafka broker is basically a server running the Kafka service. Multiple Kafka brokers can exist in a single setup; this is known as a Kafka cluster. A Kafka broker is identified by an integer id. Each broker holds certain topic partitions; hence, within a cluster, the partitions of a given topic can be spread across different brokers, which avoids a single point of failure and distributes the load evenly. A broker can also hold partitions of multiple different topics.
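One simple way such a spread could look (a hedged sketch: real Kafka assignment also considers rack awareness and load, which this ignores) is to deal partitions out to brokers round-robin:

```python
# Illustrative only: map each partition of a topic to a broker by
# cycling through the broker ids, so partitions spread evenly.
def assign_partitions(broker_ids, num_partitions):
    return {p: broker_ids[p % len(broker_ids)] for p in range(num_partitions)}

print(assign_partitions([101, 102, 103], 3))
# {0: 101, 1: 102, 2: 103}
```

With three partitions and three brokers every broker gets exactly one, matching the topic A layout in the diagram below.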

Topic partitions spread across brokers.

As you can see in the above image, there are three brokers (101, 102, 103) in the Kafka cluster. Partitions of topic A (in orange) are evenly distributed across all the brokers. Topic B has only two partitions (in yellow), which belong to brokers 101 and 102. The distribution of partitions across brokers is handled by Kafka itself.

Topic Replication

Assume that in the above example, Broker 101 fails. That means two partitions are down. Just like any distributed system, Kafka has a way to recover from such failures: replication. Multiple copies of a topic partition are maintained across different brokers so that if any broker goes down, another broker holding a replica can step in and serve the data. The replication factor is greater than 1 in practice, usually 2 or 3.
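The placement rule can be sketched as well (again, only an illustration of the invariant, not Kafka's actual algorithm): with a replication factor of N, each partition's N copies must land on N distinct brokers, here chosen as consecutive brokers in a ring.

```python
# Toy replica placement: partition p's copies go to brokers
# p, p+1, ... (mod cluster size), so copies never share a broker.
def place_replicas(broker_ids, num_partitions, replication_factor):
    n = len(broker_ids)
    assert replication_factor <= n, "need at least as many brokers as replicas"
    return {
        p: [broker_ids[(p + r) % n] for r in range(replication_factor)]
        for p in range(num_partitions)
    }

placement = place_replicas([101, 102, 103], num_partitions=2,
                           replication_factor=2)
print(placement)   # {0: [101, 102], 1: [102, 103]}
```

If any single broker fails, every partition still has at least one surviving copy, which is the whole point of keeping replicas on distinct brokers.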

Topic replication.

Let’s understand how it works through an example. In the above diagram, there are two partitions of topic A, and two copies of each partition exist on different brokers. If broker 102 goes down, both partitions are still accessible through brokers 101 and 103.

Partition Leader

In the previous section, we understood how maintaining replicas of partitions helps when a broker goes down. But what about when all the brokers are up? Who serves the data for each topic partition when multiple brokers hold the same data? This is resolved through the concept of a leader for each topic partition. Each topic partition has a leader broker. When the leader broker for a partition goes down, another broker holding a replica takes its place as leader. The leader is selected through an election process, which we won’t get into here. All the brokers holding up-to-date replicas (the in-sync replicas, or ISR) keep their data synced with the leader.
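The failover behaviour can be boiled down to a minimal sketch (the real election goes through Kafka's controller and is far more involved; the function and data below are invented for illustration): the first live broker among the in-sync replicas serves as leader, and if it dies the next in-sync replica takes over.

```python
# Minimal leader-failover model: scan the ISR in preference order and
# pick the first broker that is still alive.
def current_leader(isr, live_brokers):
    for broker in isr:                 # ISR ordered with preferred leader first
        if broker in live_brokers:
            return broker
    raise RuntimeError("no in-sync replica available")

isr = [102, 101, 103]                  # replicas of one partition
print(current_leader(isr, {101, 102, 103}))   # 102 leads while alive
print(current_leader(isr, {101, 103}))        # 102 down -> 101 takes over
```

This also shows why the ISR matters: only replicas known to be in sync with the leader are safe to promote without losing data.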

Partition Leader.

In the above diagram, broker 101 is the leader for partition 0 and broker 102 is the leader for partition 1. A leader is specific to a partition; different partitions can have different leader brokers.

That’s all for now. In the subsequent posts, we will be covering more concepts about producers, consumers, consumer groups, zookeeper, and also diving into coding.

Summary

  1. Kafka is an open-source distributed system that provides a message-passing capability across systems.
  2. Events in Kafka are organized into categories called topics.
  3. Each topic is further divided into partitions.
  4. A Kafka broker is a server running the Kafka service. Multiple brokers form a Kafka cluster, and brokers are responsible for managing topic partitions.
  5. Topic replication offers fail-over capability when a broker goes down.

Resources

  1. Find the Kafka documentation here.
  2. LinkedIn course on Kafka here.
