July 14, 20207 min read
Authors: Kamil Karczmarczyk - Node.js Team Leader
Every modern business revolves around data. That’s precisely why a number of technologies, platforms, and frameworks have emerged to support advanced data management over the years. One such solution is Apache Kafka, a distributed streaming platform that’s designed for high-speed, real-time data processing.
Up to date, Kafka has already seen large adoption at thousands of companies worldwide. We took a closer look at it to understand what’s so special about it, and how it can be used by different businesses out there.
In this article, we will answer the following questions:
What is Apache Kafka?
How Apache Kafka works?
Why (not) Apache Kafka?
What is Apache Kafka used for?
Who uses Apache Kafka?
Let’s dive right in.
Essentially, Kafka is an open-source, distributed streaming platform that enables storing, reading, and analysing data. It might not sound like much at first, but it’s actually a powerful tool capable of handling billions of events a day and still operating quickly, mostly due to its distributed nature.
Before it was passed to the community and Apache Foundation in 2011, Kafka was originally created at LinkedIn, to track the behaviour of its users and build connections between them. Back in the days, it was meant to solve a real problem that LinkedIn developers struggled with: low-latency intake of large event data. There was a clear need for real-time processing, but there were no solutions that could really accommodate it.
That’s how Kafka came into life. Since then, it came a long way and evolved to a full-fledged distributed publish-subscribe messaging system, which provides the backbone for building robust apps.
Basically, a streaming platform like Kafka has three core capabilities: publishing & subscribing to streams of records, storing these streams of records in a fault-tolerant way, and process them as they occur.
With Kafka specifically, applications (in this case producers) publish messages (records) that arrive at a Kafka node (broker). Each record (that consists of a key, a value, and a timestamp) is processed by so-called consumers, and stored in categories called topics that make up Kafka clusters.
As topics can grow in size, though, they get split into smaller partitions for better performance and scalability. All messages inside each partition are ordered in the immutable sequence they came in and continually appended to a structured commit log.
What’s more, the records in partitions are each assigned a sequential ID number (offset) that uniquely identifies each record within the partition. Partition data is also replicated across multiple brokers to preserve all information in case one broker dies.
What’s also interesting is that Kafka doesn’t care about what records are being consumed. The Kafka cluster durably persists all published records using a configurable retention period. It’s consumers themselves that poll Kafka for new messages and say what records they want to read. After the set period, however, the record is discarded to free up space.
As for the APIs, there are five key components in Kafka:
The Producer API that allows the app to publish a stream of records to topics (one or more, for that matter),
The Consumer API that enables the app to subscribe to topics and process the relevant stream of records,
The Streams API that facilitates the effective transformation of the input streams to output streams,
The Connector API that makes it possible to build and run reusable producers or consumers that connect Kafka topics to existing applications or data systems,
The Admin API that permits managing and inspecting topics, brokers, and other Kafka objects.
As a result, Kafka is able to ingest and quickly move large amounts of data, facilitating communication between various, even loosely connected elements of IT systems.
There are a few reasons why Kafka seems to be growing in popularity. First of all, given the massive amount of data that’s being produced and consumed across different services, apps and devices, many businesses can benefit from event-driven architecture nowadays. A distributed streaming platform with durable storage like Kafka is said to be the cleanest way to achieve such an architecture.
Kafka also happens to be reliable and fast - mostly due to low latency message delivery, sequential I/O, zero-copy principle, as well as efficient data batching & compression. All this makes Kafka a proper alternative to traditional messaging brokers.
On the other hand, however, Kafka is just complicated. To start with, you have to plan and calculate a proper number of brokers, topics and partitions. Kafka cluster rebalancing, on the other hand, can also impact the performance of both producers and consumers (and thus, pause data processing). Speaking of data - old records can easily get deleted too soon (with high production and low consumption, for example) to save disk space. It can easily be overkill if you don’t actually need Kafka’s features.
According to Kafka’s core development team, there are a few key use cases that it’s designed for, including:
website activity tracking,
operational metrics, and,
Whenever there’s a need for building real-time streaming apps that need to process or react to “chunks” of data, or reliably transfer data between systems or applications - Kafka comes to the rescue.
It’s one of the reasons why Kafka works well with banking and financial applications, where transactions have to be processed in a specific order. The same applies to transport & logistics, as well as retail - especially when there are IoT sensors involved. Within these industries, there’s often a need for constant monitoring, real-time & asynchronous applications (i.e. inventory checks), advanced analytics and system integration, just to name a few.
In fact, any business that wants to leverage data analytics and complex tool integration (for example between CRM, POS and e-commerce applications) can benefit from Kafka. It’s precisely where it fits well into the equation.
It shouldn’t come as a surprise that Kafka still forms a core part of LinkedIn’s infrastructure. It’s used mostly for activity tracking, message exchanges, and metric gathering, but the list of use cases doesn’t end here. Most of the data communication between different services within LinkedIn environment utilises Kafka to some extent.
For the time being, LinkedIn admits to maintaining more than 100 Kafka clusters with over 4,000 brokers, that serve 100,000 topics and millions of partitions. The total number of messages handled by LinkedIn’s Kafka deployments, on the other hand, already surpassed 7 trillion per day.
Even though no other service uses Kafka at LinkedIn’s scale, plenty of other applications, companies, and projects take advantage of it. At Uber, for example, many processes are modelled with Kafka Streams - including customer & driver matching and ETA calculations. Netflix also embraced the multi-cluster Kafka architecture for stream processing and now seamlessly handles trillions of messages each day.
At Future Mind, we had a chance to implement Kafka as well. For Fleet Connect, a vehicle tracking & fleet management system, we monitor the location of each vehicle (along with a few other parameters) thanks to dedicated devices inside. The data collected by these devices find their way to IoT Gateway, which decodes the messages and sends them further to Kafka. From there, they end up in IoT Collector where data processing & analysis actually take place.
Interestingly, the trackers inside the vehicles send the messages one by one and in order of occurrence - but only after the previous one has been accepted and stored properly, so that we don’t lose any relevant data. For this reason, though, we have to “take over” all these messages fast, in order not to overload the trackers and process data in real time.
In this case, horizontal scaling with the use of classic brokers, messaging systems and “traditional” queuing wouldn’t be enough due to the specificity of the project and the need to analyse data in real-time. With Kafka, however, we can divide the input stream into partitions based on the vehicle’s ID and use multiple brokers which ensure nearly limitless horizontal scalability, data backup in the case of failover, high operational speed, continuity, and real-time data processing in a specific order.
1. Consider whether you actually need Kafka. If:
You don’t need to process thousands of messages per second
There’s no need to keep the specific order in which data should be processed
A potential loss of some records in the case of failover would not very problematic
Then you most likely don’t need Kafka.
2. Think the system components through.
How many brokers do you need (how distributed the system should be, in other words), how many ZooKeeper instances, how many servers and in what locations?
What topics and partitions do you plan to have - these can be changed afterwards, but the change itself is quite troublesome
How many different producers and consumers will be using the system, and in how many instances (within one consumer group) each of them will be parallelised
3. Give the whole configuration some thought, especially:
What is be the best choice for a partition key, and what’s the partitioning strategy behind it?
How to choose the best strategy for batching messages (both in the case of producers and consumers) in order to keep the balance between the processing/delivery speed and delays?
What should be the retry policy and timeouts for delivered messages from producers to Kafka brokers?
4. Test out your setup and check whether:
If you disable one of the brokers, would the remaining ones take over the partitions handled by it (in other words, would partition rebalance be triggered)?
If you disable one of the consumer instances, would the remaining ones within the same consumer group start processing the messages from the inactive consumer partitions?
Can you handle the retry policy and committing offsets, and you haven’t lost any message or processed any duplicated ones along the way?
5. Monitor Kafka with the use of available tools and check that:
You don’t have any partitions that are not used
None of the consumers have too high watermark
All brokers are healthy
All brokers and consumers are well-balanced
In case you encounter any problems once you’re at it, you can always give us a shout. We’d be happy to help!