7 Reasons to Choose Apache Pulsar over Apache Kafka

When you set out to build the best messaging infrastructure service, the first step is to pick the right underlying messaging technology. There are lots of choices out there, from various open-source projects like RabbitMQ, ActiveMQ, and NATS to proprietary solutions such as IBM MQ or Red Hat AMQ. And, of course, there is Apache Kafka, which is almost synonymous with streaming. But we didn’t go with Apache Kafka; we went with Apache Pulsar.

So why did we build our messaging service using Apache Pulsar? Here are the top seven reasons why we chose Apache Pulsar over Apache Kafka.

1. Streaming and queuing come together

Apache Pulsar is like two products in one. Not only can it handle high-rate, real-time use cases like Kafka, but it also supports standard message queuing patterns, such as competing consumers, failover subscriptions, and easy message fan-out. Apache Pulsar automatically keeps track of each subscription's read position in the topic and stores that information in its high-performance distributed ledger, Apache BookKeeper.

Unlike Kafka, Apache Pulsar can handle many of the use cases of a traditional queuing system, like RabbitMQ. So instead of running two systems — one for real-time streaming and one for queuing — you do both with Pulsar. It’s a two-for-one deal, and those are always good.
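To make the two-in-one idea concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not the Pulsar client API: the `Topic` class, the `drain` method, and the subscription and consumer names are all hypothetical. It shows the topic remembering a read position per subscription, consumers on one subscription competing for messages, and a second subscription receiving its own full copy (fan-out).

```python
from collections import defaultdict
from itertools import cycle


class Topic:
    """Toy in-memory topic: an append-only list plus one cursor per subscription."""

    def __init__(self):
        self.messages = []
        # Subscription name -> index of the next unread message.
        self.cursors = defaultdict(int)

    def publish(self, message):
        self.messages.append(message)

    def drain(self, subscription, consumers):
        """Hand every unread message to the subscription's consumers, round-robin."""
        delivered = defaultdict(list)
        turn = cycle(consumers)
        while self.cursors[subscription] < len(self.messages):
            message = self.messages[self.cursors[subscription]]
            delivered[next(turn)].append(message)
            # The topic, not the client, remembers how far this subscription has read.
            self.cursors[subscription] += 1
        return dict(delivered)


topic = Topic()
for i in range(4):
    topic.publish(f"order-{i}")

# Queuing: two workers compete for messages on a single subscription.
workers = topic.drain("billing", ["worker-a", "worker-b"])
# -> {"worker-a": ["order-0", "order-2"], "worker-b": ["order-1", "order-3"]}

# Fan-out: a second subscription has its own cursor, so it sees every message.
audit = topic.drain("audit", ["auditor"])
# -> {"auditor": ["order-0", "order-1", "order-2", "order-3"]}
```

Draining "billing" a second time returns nothing, because its cursor already sits at the end of the topic. A real broker would also handle acknowledgements, redelivery, and persistence of the cursor, all of which this sketch leaves out.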

2. Partitions, but not necessarily partitions

If you use Kafka, you know about partitions. All topics are partitioned in Kafka. Partitioning is important because it increases throughput. By spreading the work across partitions and therefore multiple brokers, the message rate that a single topic can handle goes up. But what if you have some topics that don’t need high rates? In these simple cases, wouldn’t it be nice to not have to worry about partitions and the API and management complexity that comes along with them?

Well, with Apache Pulsar it can be that simple. If you just need a topic, then use a topic. You don’t have to specify the number of partitions or think about how many consumers the topic might have. Pulsar subscriptions allow you to add as many consumers as you want on a topic, with Pulsar keeping track of it all. If your consuming application can’t keep up, you just use a shared subscription to distribute the load between multiple consumers.

And if you really do need the performance of a partitioned topic, you can do that, too. Pulsar has partitioned topics if you need them — but only if you need them.
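For readers who have not worked with partitions, the following sketch shows the basic mechanism behind the throughput gain: each message key is hashed to pick a partition, so the work for one topic is spread over several partitions (and therefore brokers). This is an illustration only. The `pick_partition` helper is hypothetical, and CRC32 is used here simply because it ships with Python; real Kafka and Pulsar clients use their own hash functions.

```python
import zlib
from collections import Counter


def pick_partition(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index with a stable hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Spread 1,000 hypothetical message keys over a 4-partition topic.
keys = [f"user-{i}" for i in range(1000)]
load = Counter(pick_partition(key, 4) for key in keys)

# The same key always lands on the same partition, which preserves per-key order.
same_partition = pick_partition("user-42", 4) == pick_partition("user-42", 4)

# A topic with a single partition is the degenerate case: everything goes to 0.
single = {pick_partition(key, 1) for key in keys}
```

The cost of this scheme is exactly what the section describes: someone has to choose the partition count up front and keep the number of consumers in line with it. A non-partitioned topic sidesteps that decision.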

3. Logs are good, distributed ledgers are better

The Kafka team deserves credit for the insight that a log is a great abstraction for a real-time data exchange system. Because logs are append-only, data can be written to them quickly, and because the data in a log is sequential, it can be extracted quickly in the order that it was written. Sequential reading and writing is fast; random access is not. Persistent storage interactions are a bottleneck in any system that offers data guarantees, and the log abstraction makes this about as efficient as possible.

Simple logs are great. But they can get you into trouble when they get large. Fitting a log on a single server becomes a challenge. What happens if it gets full and you need to scale out? And what happens if the server storing the log fails and needs to be recreated from a replica?

Copying a large log from one server to another, while efficient, can still take a long time. If your system is trying to do this while keeping up with real-time data, this can be quite a challenge. Check out “Adding a New Broker Results in Terrible Performance” in the blog post “Stories from the Front: Lessons Learned from Supporting Apache Kafka” for some color on this.

Apache Pulsar avoids the problem of copying large logs by breaking the log into segments. It distributes those segments across multiple servers while the data is being written by using Apache BookKeeper as its storage layer. This means that the log is never stored on a single server, so a single server is never a bottleneck. Failure scenarios are easier to deal with and scaling out is a snap. Just add another server. No rebalancing needed.
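The segment idea can be sketched in a few lines of Python. This is a simplified toy model under stated assumptions, not how BookKeeper is implemented: the `SegmentedLog` class, the fixed segment size, and the "fewest segments wins" placement rule are all hypothetical, and replication of each segment across several servers is left out entirely. The point it illustrates is that adding a server moves no existing data; the new server simply starts receiving new segments.

```python
class SegmentedLog:
    """Toy log split into fixed-size segments, each placed on one storage node."""

    def __init__(self, nodes, segment_size):
        self.nodes = list(nodes)
        self.segment_size = segment_size
        self.segments = []  # list of (node, entries) in write order

    def append(self, entry):
        if not self.segments or len(self.segments[-1][1]) >= self.segment_size:
            # Roll over: start a new segment on the node holding the fewest segments.
            counts = {node: 0 for node in self.nodes}
            for node, _ in self.segments:
                counts[node] += 1
            target = min(self.nodes, key=lambda node: counts[node])
            self.segments.append((target, []))
        self.segments[-1][1].append(entry)

    def add_node(self, node):
        # Scaling out: no existing segment is copied or moved.
        self.nodes.append(node)

    def placement(self):
        return [node for node, _ in self.segments]

    def read_all(self):
        # Reading walks the segments in order, so the log is still sequential.
        return [entry for _, entries in self.segments for entry in entries]


log = SegmentedLog(["bookie-1", "bookie-2"], segment_size=2)
for i in range(4):
    log.append(i)
before = log.placement()   # ["bookie-1", "bookie-2"]

log.add_node("bookie-3")
for i in range(4, 6):
    log.append(i)
after = log.placement()    # ["bookie-1", "bookie-2", "bookie-3"]
```

Compare this with a log that lives whole on one server: bringing a replacement server online there means copying the entire log before it can do useful work, which is the slow path described above.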