Capacity Planning Your Apache Kafka Cluster
How do you plan Apache Kafka® capacity and Kafka Streams sizing for optimal performance?
When Jason Bell (Principal Engineer at Dataworks and founder of Synthetica Data) begins to plan a Kafka cluster, he starts with a deep inspection of the customer's data itself, determining both its volume and its contents: Is it JSON, plain text, or images? He then determines whether Kafka is a good fit for the project overall, a decision he bases on volume, the desired architecture, and potential cost.
Next, the cluster is conceived in terms of some rule-of-thumb numbers. For example, Jason's minimum number of brokers for a cluster is three or four, which gives him a leader, a follower, and at least one backup. A ZooKeeper quorum is likewise a set of three. For other elements he works with pairs, an active and a standby; this applies to Kafka Connect and Schema Registry. Finally, there is Prometheus monitoring and Grafana alerting to add. Jason points out that these numbers differ for multi-data-center architectures.
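As a rough illustration of that three-broker rule of thumb (a minimal sketch, not Jason's actual setup; the topic name, partition count, and broker addresses are assumptions), a topic created with replication factor 3 gives every partition one leader and two follower replicas, using kafka-python's admin client:

```python
# A minimal sketch: a topic on a three-broker cluster where each partition
# has one leader and two follower replicas (replication factor 3).
# Topic name, partition count, and broker addresses are illustrative.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(
    bootstrap_servers="broker1:9092,broker2:9092,broker3:9092"
)

topic = NewTopic(
    name="orders",            # hypothetical topic
    num_partitions=12,        # spread load across all three brokers
    replication_factor=3,     # leader + follower + one more backup
    topic_configs={
        # With acks=all on the producer, requiring 2 in-sync replicas
        # means acknowledged writes survive the loss of one broker.
        "min.insync.replicas": "2",
    },
)
admin.create_topics([topic])
```

The min.insync.replicas setting is what makes the "at least one backup" guarantee operational rather than aspirational.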
Jason never assumes that everyone knows how Kafka works, because some software teams include specialists working on a producer or a consumer who don't work directly with Kafka itself. They may not know how to adequately measure their Kafka volume themselves, so he often begins the collaborative process of graphing message volumes. He considers, for example, how many messages arrive daily and whether there is a peak time. Each industry is different: some focus on daily batch data (banking), while others field enormous amounts of continuous data (IoT data streaming from cars).
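One simple way to start that graphing exercise, as a hedged sketch (broker address, topic name, and sampling interval are assumptions): sample each partition's end offset at a fixed interval with kafka-python, since the delta between samples is the message rate, which can then be plotted over a day to reveal peaks:

```python
# A sketch of measuring message volume by sampling end offsets over time.
# Broker address, topic, and interval are illustrative assumptions.
import time
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="broker1:9092")
partitions = [
    TopicPartition("orders", p)
    for p in consumer.partitions_for_topic("orders")
]

def total_end_offset():
    # Sum of end offsets across partitions = total messages produced so far.
    return sum(consumer.end_offsets(partitions).values())

previous = total_end_offset()
while True:
    time.sleep(60)  # sample once a minute
    current = total_end_offset()
    print(f"{time.strftime('%H:%M')} ~{(current - previous) / 60:.0f} msgs/sec")
    previous = current
```

Feeding these samples into a chart over 24 hours is usually enough to answer the two questions Jason starts with: how many messages per day, and when is the peak.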
Extensive testing is necessary to ensure that the data patterns are adequately accommodated. Jason sets up a short-lived system that is identical to the main system. He finds that teams usually have not adequately tested across domain boundaries or across the network. Developers tend to think in terms of numbers of messages, but not in terms of overall network traffic or how many consumers they will actually need. Latency must also be considered: if the compression setting on the producer's side doesn't match the one on the consumer's side, for example, latency will increase.
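To make the network point concrete, here is a back-of-envelope calculation (all input figures are hypothetical, not from the episode): the traffic a cluster carries is not just the raw producer throughput, because replication and consumer fan-out each multiply it:

```python
# Back-of-envelope estimate of cluster network traffic. All inputs are
# hypothetical; the point is that replication and consumer fan-out multiply
# the raw producer throughput.
msgs_per_sec = 50_000          # peak producer rate
avg_msg_bytes = 1_024          # average message size on the wire
replication_factor = 3         # each write is copied to 2 followers
consumer_groups = 4            # each group reads the full stream

produce_bps = msgs_per_sec * avg_msg_bytes
replication_bps = produce_bps * (replication_factor - 1)
consume_bps = produce_bps * consumer_groups

total_mbps = (produce_bps + replication_bps + consume_bps) * 8 / 1e6
print(f"~{total_mbps:,.0f} Mbit/s across the cluster at peak")
# ~2,867 Mbit/s, versus only ~410 Mbit/s of raw produce traffic alone.
```

A team that has only load-tested the producer side can easily miss that the cluster moves roughly seven times that volume once replication and consumers are counted.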
Kafka Connect sink connectors require special consideration when Jason is establishing a cluster. Failure strategies need to be well thought out, including retries and how to deal with the potentially large number of messages that can accumulate in a dead letter queue. He suggests that more attention should generally be paid to the Kafka Connect elements of a cluster, something that can often be addressed with bash scripts.
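As a hedged illustration of such a failure strategy (the connector name, class, and Connect URL are assumptions, though the errors.* properties are standard Kafka Connect sink-connector settings), a sink connector can be registered over the Connect REST API with retries and a dead letter queue configured up front:

```python
# A sketch of registering a sink connector with an explicit failure strategy:
# bounded retries, error tolerance, and a dead letter queue. Connector name,
# class, and URLs are illustrative assumptions.
import requests

connector = {
    "name": "orders-jdbc-sink",  # hypothetical connector
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "topics": "orders",
        "connection.url": "jdbc:postgresql://db:5432/orders",
        # Retry transient failures for up to 5 minutes before giving up.
        "errors.retry.timeout": "300000",
        "errors.retry.delay.max.ms": "10000",
        # Don't kill the task on a bad record; route it to a DLQ instead.
        "errors.tolerance": "all",
        "errors.deadletterqueue.topic.name": "orders-dlq",
        "errors.deadletterqueue.context.headers.enable": "true",
    },
}
resp = requests.post("http://connect:8083/connectors", json=connector)
resp.raise_for_status()
```

Watching the depth of the DLQ topic, the kind of job Jason notes can be handled with a short script, then tells you when the failure strategy is actually being exercised rather than silently dropping records.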
Finally, Kris and Jason discuss Jason's preference for Kafka Streams over ksqlDB from a network perspective.
EPISODE LINKS
Chapters
1. Intro (00:00:00)
2. Kafka Cluster Capacity Planning—where to begin (00:10:12)
3. Put it in the cloud (00:16:55)
4. Three is the magic number: Kafka leader and 2 followers (00:18:17)
5. Multi-data center architectures (00:23:09)
6. Extensive testing is necessary (00:25:21)
7. Kafka message volumes (00:28:46)
8. Kafka Connect sink connectors (00:36:56)
9. Kafka Streams vs ksqlDB for DevOps (00:51:08)
10. It's a wrap! (00:59:56)