Understanding Kafka: Essential Terms in Just 4 Minutes
Chapter 1: Introduction to Kafka
Kafka is an advanced distributed streaming platform, recognized as one of the leading tools for constructing real-time data pipelines and streaming applications. With its growing adoption among organizations, it is vital for developers and architects to grasp the fundamental terms and concepts associated with Kafka. This article serves as a thorough guide covering essential Kafka terminology, ranging from Brokers and Clusters to Consumer Groups and Data Retention. By the conclusion, you will be well-equipped to design and implement robust, scalable, and efficient Kafka systems.
Upon reading this article, you will learn how to:
- Describe Kafka's core components.
- Utilize Kafka for writing and reading event streams.
- Consume events in real-time or retrospectively.
- Outline a comprehensive example of an event streaming pipeline.
Now, let's introduce two key terms that will be frequently referenced in this discussion: brokers and topics.
A broker is a dedicated server within the Kafka cluster responsible for receiving, storing, processing, and distributing events. It manages topic partitions and retains metadata about the cluster. A topic, in contrast, is a named, append-only stream of events in Kafka: producers write events to it, and consumers read events from it. With this foundational knowledge, we can continue exploring Kafka.
A Kafka cluster is composed of one or more brokers, each a dedicated server that handles event reception, storage, processing, and distribution. The brokers are coordinated by ZooKeeper, a separate service that tracks cluster membership and metadata (newer Kafka versions can instead use the built-in KRaft consensus protocol and drop the ZooKeeper dependency). For instance, broker 0 may handle a log topic and a transaction topic, broker 1 could manage a payment topic and a GPS topic, while broker 2 takes care of a user click topic and a user search topic. Each broker hosts one or more topics.
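To see this in practice, here is a minimal sketch that uses Kafka's Java AdminClient to list the brokers in a cluster. The localhost:9092 bootstrap address is an assumption for illustration; point it at any broker in your own cluster.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

import java.util.Collection;
import java.util.Properties;

public class ListBrokers {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

        try (AdminClient admin = AdminClient.create(props)) {
            // describeCluster() reports the brokers currently in the cluster.
            Collection<Node> brokers = admin.describeCluster().nodes().get();
            for (Node broker : brokers) {
                System.out.printf("broker id=%d host=%s port=%d%n",
                        broker.id(), broker.host(), broker.port());
            }
        }
    }
}
```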
Kafka brokers play a crucial role in managing the storage of published events in topics and distributing them to subscribed consumers. The architecture incorporates partitioning and replication to boost fault tolerance and throughput, allowing events to be published and consumed concurrently across multiple brokers.
Partitioning involves segmenting a topic into various partitions for parallel event processing. Replication entails generating multiple copies of each partition and storing them across different brokers, ensuring continued access to topics even if some brokers are non-operational.
For example, consider a log topic divided into two partitions (0 and 1) and a user topic also split into two partitions (0 and 1). Each partition of these topics is replicated and stored on different brokers to enhance fault tolerance. Should a broker fail, consumers can still retrieve data from replicas located on other brokers, maintaining high availability and resilience in the Kafka cluster.
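To make that example concrete, the following sketch creates topics shaped like the ones above with the Java AdminClient: two partitions each, with every partition replicated to two brokers. The topic names and broker address are illustrative assumptions, and the cluster needs at least two brokers for a replication factor of 2 to succeed.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class CreateTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

        try (AdminClient admin = AdminClient.create(props)) {
            // Mirror the example: 2 partitions per topic, each partition
            // replicated to 2 brokers for fault tolerance.
            NewTopic logs = new NewTopic("logs", 2, (short) 2);
            NewTopic users = new NewTopic("users", 2, (short) 2);
            admin.createTopics(List.of(logs, users)).all().get(); // block until created
        }
    }
}
```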
When a message is published to a topic, it may or may not carry a key. If a key is provided, Kafka guarantees that all events sharing that key land in the same partition, which keeps related events together and preserves their order during processing. If no key is given, Kafka spreads events across partitions: older clients used a strict round-robin strategy (the first keyless event to partition 0, the second to partition 1, and so on), while newer clients default to a sticky partitioner that fills a batch for one partition before moving to the next. Either way, the load is distributed evenly across all partitions, mitigating hotspots and improving performance during event consumption. The trade-off is that related events without keys may end up in different partitions, requiring additional processing to bring them back together.
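A minimal producer sketch illustrates both cases; the topic names, key value, and broker address are assumptions for illustration:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ProduceEvents {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed event: every event with key "order-42" lands in the same
            // partition, so consumers see them in the order they were written.
            producer.send(new ProducerRecord<>("orders", "order-42", "ORDER_CREATED"));

            // Unkeyed event: the partitioner spreads these across partitions
            // (round-robin or sticky batching), with no per-key ordering guarantee.
            producer.send(new ProducerRecord<>("news", "Breaking: Kafka article published"));
        }
    }
}
```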
Situations Requiring Keys:
- Banking Transactions: Keying events by account ID keeps every event for the same account in one partition, so they are processed in order.
- E-commerce Orders: Keying by order ID groups all updates to an order (created, paid, shipped) together for accurate processing and tracking.
- IoT Sensor Data: Keying by device ID ensures readings from a given sensor arrive in sequence for correct analysis.
Situations Not Requiring Keys:
- News Articles: Articles have no per-entity ordering requirement, so they can be published without keys and spread evenly across partitions.
- Social Media Posts: Posts can likewise be distributed without keys when no ordering relationship between them matters.
- Weather Forecasts: Each forecast stands alone, so it can be published without a key.
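On the consuming side, here is a minimal sketch of a Java consumer that joins a consumer group and reads events; setting auto.offset.reset to "earliest" lets a new group replay the topic from the beginning rather than only receiving new events, covering both real-time and retrospective consumption. The group ID, topic name, and broker address are assumptions for illustration.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ConsumeEvents {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors"); // consumer group
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // replay from the start
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                // Poll for new events; each record reports which partition it came from.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```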
Chapter 2: Key Terms in Kafka
The first video, "Learn Kafka in 10 Minutes | Most Important Skill for Data Engineering," provides a concise overview of the Kafka skills most relevant to data engineers.
The second video, "Apache Kafka in 5 minutes," offers a quick introduction to the core concepts of Kafka.