Smooth Scaling and Uninterrupted Processing with Apache Kafka ft. Sophie Blee-Goldman

Streaming Audio: Apache Kafka® & Real-Time Data

תוכן מסופק על ידי Confluent, founded by the original creators of Apache Kafka® and Founded by the original creators of Apache Kafka®. כל תוכן הפודקאסטים כולל פרקים, גרפיקה ותיאורי פודקאסטים מועלים ומסופקים ישירות על ידי Confluent, founded by the original creators of Apache Kafka® and Founded by the original creators of Apache Kafka® או שותף פלטפורמת הפודקאסט שלהם. אם אתה מאמין שמישהו משתמש ביצירה שלך המוגנת בזכויות יוצרים ללא רשותך, אתה יכול לעקוב אחר התהליך המתואר כאן https://he.player.fm/legal.

4+ y ago 50:33

MP3•בית הפרקים

Availability in Kafka Streams is hard, especially in the face of any changes. Any change to topic metadata or group membership triggers a rebalance. But Kafka Streams struggles even after this stop-the-world rebalance has finished. According to Apache Kafka® Committer and Confluent Software Engineer Sophie Blee-Goldman, this is because a Streams app will generally have some state associated with a given partition, and to move this state from one consumer instance to another requires rebuilding this state from a special backing topic called a changelog, the source of truth for a partition’s state.

Restoring the changelog can take hours, and until the state is ready, Streams can’t do any further processing on that partition. Furthermore, it can’t serve any requests for local state until the local state is “caught up” with the changelog. So scaling out your Streams application results in pretty significant downtime—which is a bummer, especially if the reason for scaling out in the first place was to handle a particularly heavy workload.

To solve the stop-the-world rebalance, we have to find a way to safely assign partitions so we can be confident that they’ve been revoked from their previous owner before being given to a new consumer. To solve the scaling out problem in Kafka Streams, we go a step further. When you add a new instance to your Streams application, we won’t immediately assign any stateful partitions to it. Instead, we’ll leave them assigned to their current owner to continue processing and serving queries as usual. During this time, the new instance will start to “warm up” the local state in the background; it starts consuming from the changelog and building up the local state. We then follow a similar pattern as in cooperative rebalancing, and issue a follow-up rebalance.

In KIP-441, we call these probing rebalances. Every so often (i.e., 10 minutes by default), we trigger a rebalance. In the member’s subscription metadata that it sends to the group leader, each member encodes the current status of its local state. We use the changelog lag as a measure of how “caught up” a partition is. During a rebalance, only instances that are completely caught up are allowed to own stateful tasks; everything else must first warm up the state. So long as there is some task still warming up on a node, we will “probe” with rebalances until it’s ready.

EPISODE LINKS

265 פרקים

#Tech #Tech News #News #Confluent #Event Stream Processing #Data #Event Driven Architecture #Open Source #Data In Motion #Kafka Cloud Native #Data Mesh #Data Pipeline #Serverless Kafka #Podcasting Education #Confluent, original creators of Apache Kafka® #original creators of Apache Kafka® #Apache Kafka® #Cloud IT #Real Time