Kafka: Whitepaper Review đź“Š
In today's fast-paced digital world, data flows like a raging river. It contains valuable insights that can be leveraged to improve various applications, such as ad relevance and security. The bulk of this data comes from logs that are generated by user activity, clicks, likes, sharing, errors, call stack, network usage, and other operational metrics. To unlock the potential of this data, we need a powerful tool that can handle real-time data streams.
Enter Kafka, a highly efficient and reliable distributed messaging system that simplifies the process of managing and processing logs. In this article, we will explore the basics of Kafka, its core concepts, and how it works.
What limitations of the existing messaging systems prompted the development of a new one?
Traditional enterprise messaging systems have existed for a long time. However, there were a few reasons why they were not a good fit for low-latency log processing:
- Low throughput: It was not possible to explicitly batch multiple messages into a single request, implying a full TCP/IP cycle for each message, which reduced the throughput.
- Weak distributed support: The systems did not put much effort into partitioning of messages for the distribution of data storage into multiple machines.
- Immediate consumption assumption: The near consumption assumption in existing messaging models added a restriction to the queue size of the non-consumed messages.
- Push model: Certain special log aggregators were also built to handle the aforementioned limitations, but these worked on a push-based mechanism. In order for the consumer applications to retrieve messages at the maximum rate a pull-based model is more effective.
Kafka Architecture
The Kafka’s architecture consists of the following main components:
- Producers: These applications publish data streams to Kafka topics. Think of them as news reporters, constantly feeding the data highway with fresh updates.
- Consumers: These applications subscribe to specific topics and process the incoming data stream. Imagine them as data analysts, eagerly grabbing and interpreting the information whizzing by.
- Topics: These are named channels within Kafka, organizing data by category. Think of them as news channels, each broadcasting a specific type of information.
- Partitions: Each topic is further divided into partitions, horizontally scaling the data across multiple servers. Imagine each lane on the data highway having multiple segments to handle heavy traffic.
- Brokers: These are the servers that run Kafka, storing and managing data, and coordinating communication between producers and consumers. Think of them as the traffic controllers, ensuring smooth flow and preventing collisions.
Kafka supports a point-to-point delivery model in which multiple consumers jointly consume a single copy of all messages on a topic. At the same time, it also allows the publish/subscribe model in which multiple consumers each retrieve their own copy of data.
Kafka implements offset-based addressing for messages within partitions, eliminating the need for an identifier-to-address mapping mechanism. Sequential consumption by consumers aligns offset acknowledgement with guaranteed delivery of preceding messages within that partition. Because consumers process messages sequentially, acknowledging a particular offset implies they’ve received all messages prior to that point within the same partition.
Kafka Design Decisions
Certain unconventional design decisions were made while building Kafka. Here are a few of them which explain it’s state-of-the-art technology.
- Caching: Kafka explicitly avoids caching messages in the memory layer of Kafka. Instead, it relies on the underlying file system’s page cache. This improves the performance by ensuring the overhead in garbage collection is reduced.
- SendFile API: A typical approach to sending bytes from a local file to a remote socket involves four steps: storage media -> OS page cache -> application buffer -> kernel buffer -> socket. Using the Linux-based sendfile API which directly transfers data from the file channel to the socket channel reducing 2 copying steps.
- Stateless Broker: In order to reduce the design complexity of the Broker, Kafka maintained the information about the messages consumed by the consumer on the consumer side and not the broker side. To determine the retentivity of a message, i.e. when to delete a message, Kafka uses a simple time-based SLA (which is normally seven days). This design proved much easier to implement in a pull-based model.
- Distributed Coordination: Kafka does not employ a “master-slave” architecture. Instead, in order for the consumers to coordinate with each other, it utilizes a highly available consensus service, Zookeeper. The Zookeeper manages the addition, and removal of brokers and consumers and also maintains a balance when such events happen. This removes complexity from the system since there is no need for preparation for failover mechanisms in case the master fails.
Conclusion
Sure, you’ve grasped the core concepts — producers, consumers, topics, partitions, brokers — all dancing in a well-coordinated data ballet. But Kafka is more than just technical jargon; it’s a paradigm shift in how we interact with information.
The design decisions, though unconventional, have been tailored to specific application needs at LinkedIn. By focusing on log processing applications, Kafka achieves much higher throughput than conventional messaging systems.
References
https://medium.com/@chkamalsingh/kafka-white-paper-594d6ab2fb6b