Rapid Reads News

HOMEcorporatetechentertainmentresearchmiscwellnessathletics

Deployment Strategies for Apache Kafka Cluster Types


Deployment Strategies for Apache Kafka Cluster Types

Join the DZone community and get the full member experience.

Join For Free

Organizations start their data streaming adoption with a single Apache Kafka cluster to deploy the first use cases. The need for group-wide data governance and security but different SLAs, latency, and infrastructure requirements introduce new Kafka clusters. Multiple Kafka clusters are the norm, not an exception. Use cases include hybrid integration, aggregation, migration, and disaster recovery. This blog post explores real-world success stories and cluster strategies for different Kafka deployments across industries.

Apache Kafka is an open-source, distributed event streaming platform designed for high-throughput, low-latency data processing. It allows you to publish, subscribe to, store, and process streams of records in real time.

Kafka is a popular choice for building real-time data pipelines and streaming applications. The Kafka protocol became the de facto standard for event streaming across various frameworks, solutions, and cloud services. It supports operational and analytical workloads with features like persistent storage, scalability, and fault tolerance. Kafka includes components like Kafka Connect for integration and Kafka Streams for stream processing, making it a versatile tool for various data-driven use cases.

While Kafka is famous for real-time use cases, many projects leverage the data streaming platform for data consistency across the entire enterprise architecture, including databases, data lakes, legacy systems, Open APIs, and cloud-native applications.

Kafka is a distributed system. A production setup usually requires at least four brokers. Hence, most people automatically assume that all you need is a single distributed cluster you scale up when you add throughput and use cases. This is not wrong in the beginning. But...

One Kafka cluster is not the right answer for every use case. Various characteristics influence the architecture of a Kafka cluster:

Related topics like on-premise vs. public cloud, regional vs. global, and many other requirements also affect the Kafka architecture.

A single Kafka cluster is often the right starting point for your data streaming journey. It can onboard multiple use cases from different business domains and process gigabytes per second (if operated and scaled the right way).

However, depending on your project requirements, you need an enterprise architecture with multiple Kafka clusters. Here are a few common examples:

These options can be combined. For instance, a single broker at the edge typically replicates some curated data to a remote data center. Hybrid clusters have different architectures depending on how they are bridged: connections over the public Internet, private link, VPC peering, transit gateway, etc.

Having seen the development of Confluent Cloud over the years, I underestimated how much engineering time needs to be spent on security and connectivity. However, missing security bridges are the main blockers to the adoption of a Kafka cloud service. So, there is no way around providing various security bridges between Kafka clusters beyond just the public internet.

There are even use cases where organizations need to replicate data from the data center to the cloud, but the cloud service is not allowed to initiate the connection. Confluent built a specific feature, "source-initiated link," for such security requirements where the source (i.e., the on-premise Kafka cluster) always initiates the connection - even though the cloud Kafka clusters are consuming the data:

As you can see, it gets complex quickly. Find the right experts to help you from the beginning, not after you have already deployed the first clusters and applications.

A long time ago, I already described in a detailed presentation the architecture patterns for distributed, hybrid, edge, and global Apache Kafka deployments. Look at that slide deck and video recording for more details about the deployment options and trade-offs.

RPO and RTO are two critical KPIs you need to discuss before deciding on a Kafka cluster strategy:

While people often start with the goal of RPO = 0 and RTO = 0, they quickly realize how hard (but not impossible) it is to get this. You need to decide how much data you can lose in a disaster. You need a disaster recovery plan if disaster strikes. The legal and compliance teams will have to tell you if it is okay to lose a few data sets in case of disaster or not. These and many other challenges need to be discussed when evaluating your Kafka cluster strategy.

The replication between Kafka clusters with tools like MIrrorMaker or Cluster Linking is asynchronous and RPO > 0. Only a stretched Kafka cluster provides RPO = 0.

Most deployments with multiple Kafka clusters use asynchronous replication across data centers or clouds via tools like MirrorMaker or Confluent Cluster Linking. This is good enough for most use cases. But in case of a disaster, you lose a few messages. The RPO is > 0.

A stretched Kafka cluster deploys Kafka brokers of one single cluster across three data centers. The replication is synchronous (as this is how Kafka replicates data within one cluster) and guarantees zero data loss (RPO = 0) - even in the case of a disaster!

Why shouldn't you always do stretched clusters?

To be clear, in the public cloud, a region usually has three data centers (= availability zones). Hence, in the cloud, it depends on your SLAs if one cloud region counts as a stretched cluster or not. Most SaaS Kafka offerings deploy in a stretched cluster here.

However, many compliance scenarios do not see a Kafka cluster in one cloud region as good enough for guaranteeing SLAs and business continuity if a disaster strikes.

Confluent built a dedicated product to solve (some of) these challenges: Multi-Region Clusters (MRC). It provides capabilities to do synchronous and asynchronous replication within a stretched Kafka cluster.

For example, in a financial services scenario, MRC replicates low-volume critical transactions synchronously but high-volume logs asynchronously:

More details about stretched Kafka clusters vs. active-active / active-passive replication between two Kafka clusters in my global Kafka presentation.

The above sections explain why you need to consider different Kafka architectures depending on your project requirements. Self-managed Kafka clusters can be configured the way you need. In the public cloud, fully managed offerings look different (the same way as any other fully managed SaaS). Pricing is different because SaaS vendors need to configure reasonable limits. The vendor has to provide specific SLAs.

The data streaming landscape includes various Kafka cloud offerings. Here is an example of Confluent's current cloud offerings, including multi-tenant and dedicated environments with different SLAs, security features, and cost models.

Make sure to evaluate and understand the various cluster types from different vendors available in the public cloud, including TCO, provided uptime SLAs, replication costs across regions or cloud providers, and so on. The gaps and limitations are often intentionally hidden in the details.

For instance, if you use Amazon Managed Streaming for Apache Kafka (MSK), you should be aware that the terms and conditions state that "The service commitment does not apply to any unavailability, suspension or termination ... caused by the underlying Apache Kafka or Apache Zookeeper engine software that leads to request failures."

However, pricing and support SLAs are just one critical piece of comparison. There are lots of "build vs. buy" decisions you have to make as part of evaluating a data streaming platform.

Apache Kafka added Tiered Storage to separate compute and storage. The capability enables more scalable, reliable, and cost-efficient enterprise architectures. Tiered Storage for Kafka enables a new Kafka cluster type: Storing Petabytes of data in the Kafka commit log in a cost-efficient way (like in your data lake) with timestamps and guaranteed ordering to travel back in time for re-processing historical data. KOR Financial is a nice example of using Apache Kafka as a database for long-term persistence.

Kafka enables a Shift Left Architecture to store data only once for operational and analytical datasets:

With this in mind, think again about the use cases I described above for multiple Kafka clusters. Should you still replicate data in batch at rest in the database, data lake, or lakehouse from one data center or cloud region to another? No. You should synchronize data in real-time, store the data once (usually in an object store like Amazon S3), and then connect all analytical engines like Snowflake, Databricks, Amazon Athena, Google Cloud BigQuery, and so on to this standard table format.

Most organizations have multiple Kafka clusters. This section explores four success stories across different industries:

PayPal is a digital payment platform that allows users to send and receive money online securely and conveniently around the world in real time. This requires a scalable, secure, and compliant Kafka infrastructure.

During the 2022 Black Friday, Kafka traffic volume peaked at about 1.3 trillion messages daily. At present, PayPal has 85+ Kafka clusters, and every holiday season they flex up their Kafka infrastructure to handle the traffic surge. The Kafka platform continues to seamlessly scale to support this traffic growth without any impact on their business.

Today, PayPal's Kafka fleet consists of over 1,500 brokers that host over 20,000 topics. The events are replicated among the clusters, offering 99.99% availability.

Kafka cluster deployments are separated into different security zones within a data center:

The Kafka clusters are deployed across these security zones, based on data classification and business requirements. Real-time replication with tools such as MirrorMaker (in this example, running on Kafka Connect infrastructure) or Confluent Cluster Linking (using a simpler and less error-prone approach directly using the Kafka protocol for replication) is used to mirror the data across the data centers, which helps with disaster recovery and to achieve inter-security zone communication.

JioCinema is a rapidly growing video streaming platform in India. The telco OTT service is known for its expansive content offerings, including live sports like the Indian Premier League (IPL) for cricket, a newly launched Anime Hub, and comprehensive plans for covering major events like the Paris 2024 Olympics.

The data architecture leverages Apache Kafka, Flink, and Spark for data processing, as presented at Kafka Summit India 2024 in Bangalore:

Data streaming plays a pivotal role in various use cases to transform user experiences and content delivery. Over ten million messages per second enhance analytics, user insights, and content delivery mechanisms.

JioCinema's use cases include:

Kushal Khandelwal, Head of Data Platform, Analytics, and Consumption at JioCinema, explained that not all data is equal and the priorities and SLAs differ per use case:

Data streaming is a journey. Like many other organizations worldwide, JioCinema started with one large Kafka cluster using 1000+ Kafka Topics and 100,000+ Kafka Partitions for various use cases. Over time, a separation of concerns regarding use cases and SLAs developed into multiple Kafka clusters:

The success story of JioCinema shows the common evolution of a data streaming organization. Let's now explore another example where two very different Kafka clusters were deployed from the beginning for one use case.

The car manufacturer Audi provides connected cars featuring advanced technology that integrates internet connectivity and intelligent systems. Audi's cars enable real-time navigation, remote diagnostics, and enhanced in-car entertainment. These vehicles are equipped with Audi Connect services. Features include emergency calls, online traffic information, and integration with smart home devices, to enhance convenience and safety for drivers.

Audi presented its connected car architecture in the keynote of the 2018 Kafka Summit. The Audi enterprise architecture relies on two Kafka clusters with very different SLAs and use cases.

The Data Ingestion Kafka cluster is very critical. It needs to run 24/7 at scale. It provides last-mile connectivity to millions of cars using Kafka and MQTT. Backchannels from the IT side to the vehicle help with service communication and over-the-air updates (OTA).

ACDC Cloud is the analytics Kafka cluster of Audi's connected car architecture. The cluster is the foundation of many analytical workloads, which process enormous volumes of IoT and log data at scale with batch processing frameworks like Apache Spark.

This architecture was already presented in 2018. Audi's slogan, "Progress through Technology," shows how the company applied new technology for innovation long before most car manufacturers deployed similar scenarios. All sensor data from connected cars is processed in real time and stored for historical analysis and reporting.

New Relic is a cloud-based observability platform that provides real-time performance monitoring and analytics for applications and infrastructure to customers around the world.

Andrew Hartnett, VP of Software Engineering at New Relic, explains how data streaming is crucial for the entire business model of New Relic:

"Kafka is our central nervous system. It is a part of everything that we do. Most services across 110 different engineering teams with hundreds of services touch Kafka in some way, shape, or form in our company, so it really is mission-critical. What we were looking for is the ability to grow, and Confluent Cloud provided that."

New Relic ingested up to 7 billion data points per minute and is on track to ingest 2.5 exabytes of data in 2023. As New Relic expands its multi-cloud strategies, teams will use Confluent Cloud for a single pane of glass view across all environments.

"New Relic is multi-cloud. We want to be where our customers are. We want to be in those same environments, in those same regions, and we wanted to have our Kafka there with us." says Artnett in a Confluent case study.

Event-driven architectures and stream processing have existed for decades. The adoption grows with open-source frameworks like Apache Kafka and Flink in combination with fully managed cloud services. More and more organizations struggle with their Kafka scale. Enterprise-wide data governance, center of excellence, automation of deployment and operations, and enterprise architecture best practices help to successfully provide data streaming with multiple Kafka clusters for independent or collaborating business domains.

Multiple Kafka clusters are the norm, not an exception. Use cases like hybrid integration, disaster recovery, migration, or aggregation enable real-time data streaming everywhere with the needed SLAs.

Previous articleNext article

POPULAR CATEGORY

corporate

3763

tech

3917

entertainment

4597

research

2120

misc

4668

wellness

3758

athletics

4691