Connecting Azure Cosmos DB with Apache Kafka - Better Together ft. Ryan CrawCour
Manage episode 424666815 series 2510642
When building solutions in Microsoft Azure, it is not uncommon to come across customers who are deeply entrenched in the Apache Kafka® ecosystem and want to keep expanding within it. Figuring out how to connect Azure first-party services to that ecosystem is therefore of the utmost importance.
Ryan CrawCour is a Microsoft engineer who has been working on all things data and analytics for the past 10+ years, including building out services like Azure Cosmos DB, which is used by millions of people around the globe. More recently, Ryan has taken a customer-facing role where he gets to help customers build the best solutions possible using Microsoft Azure’s cloud platform and development tools.
In one case, Ryan helped a customer leverage their existing Kafka investments and persist event messages in a durable, managed database in Azure. They chose Azure Cosmos DB, a fully managed, distributed NoSQL database service, but two questions remained: how to feed events from their Kafka infrastructure into Azure Cosmos DB, and how to get changes from the database back into their Kafka topics.
Although integration is in his blood, Ryan confesses that he is relatively new to the world of Kafka and has learned to adjust to what he finds in his customers’ environments. Oftentimes this is Kafka, and for many good reasons, customers don’t want to change this core part of their solution infrastructure. This has led him to embrace Kafka and the ecosystem around it, enabling him to better serve customers.
He’s been closely tracking the development and progress of Kafka Connect. To him, it is the natural step from Kafka as a messaging infrastructure to Kafka as a key pillar in an integration scenario. Kafka Connect can be thought of as a piece of middleware that can be used to connect a variety of systems to Kafka in a bidirectional manner. This means getting data from Kafka into your downstream systems, often databases, and also taking changes that occur in these systems and publishing them back to Kafka where other systems can then react.
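The bidirectional pattern described above can be sketched as two small connector configurations submitted to a Kafka Connect worker's REST API. This is a minimal illustration, not a working deployment: the worker URL, connector names, and topic are placeholders, and the `connector.class` values are deliberately left blank because they depend on which sink and source you deploy.

```python
import json
import urllib.request

# Kafka Connect is configured declaratively: each connector is a JSON
# document POSTed to the Connect worker's REST API. The URL, names, and
# topic below are illustrative placeholders.
CONNECT_URL = "http://localhost:8083/connectors"

# Sink direction: Kafka topic -> downstream system (often a database).
sink_config = {
    "name": "orders-db-sink",
    "config": {
        "connector.class": "...",  # the sink connector implementation to use
        "topics": "orders",        # which Kafka topics to drain
        "tasks.max": "2",          # parallelism across Connect workers
    },
}

# Source direction: changes in an external system -> Kafka topics,
# where other systems can then react to them.
source_config = {
    "name": "orders-db-source",
    "config": {
        "connector.class": "...",  # the source connector implementation
        "tasks.max": "1",
    },
}

def register(config: dict) -> urllib.request.Request:
    """Build the REST request that registers a connector with the worker."""
    return urllib.request.Request(
        CONNECT_URL,
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# urllib.request.urlopen(register(sink_config))  # requires a running worker
```

The point of the sketch is that both directions use the same declarative mechanism; only the connector class and its properties change.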
One day, a customer asked him how to connect Azure Cosmos DB to Kafka. There wasn’t a connector at the time, so he helped build two with the Confluent team: a sink connector, where data flows from Kafka topics into Azure Cosmos DB, as well as a source connector, where Azure Cosmos DB is the source of data pushing changes that occur in the database into Kafka topics.
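As a rough idea of what the sink direction looks like, here is a sketch of a Cosmos DB sink connector configuration. The property names follow the kafka-connect-cosmosdb GitHub repository's examples at the time of writing, but treat them as assumptions and check the repo's README for the current names; the endpoint, key, database, and container values are placeholders.

```python
import json

# Sketch of an Azure Cosmos DB *sink* connector configuration.
# Account endpoint, key, database, and container values are placeholders.
cosmos_sink = {
    "name": "cosmosdb-sink",
    "config": {
        "connector.class":
            "com.azure.cosmos.kafka.connect.sink.CosmosDBSinkConnector",
        "tasks.max": "1",
        "topics": "hotels",
        # Cosmos DB account credentials (placeholders).
        "connect.cosmos.connection.endpoint":
            "https://<account>.documents.azure.com:443/",
        "connect.cosmos.master.key": "<account-key>",
        "connect.cosmos.databasename": "kafkadata",
        # Maps each Kafka topic to a Cosmos DB container: "topic#container".
        "connect.cosmos.containers.topicmap": "hotels#kafka",
    },
}

# The source connector mirrors this: it reads the database's change feed
# and publishes each change into the mapped Kafka topic.
print(json.dumps(cosmos_sink, indent=2))
```

The topic-to-container map is the piece doing the real work here: it tells the connector which Kafka topic feeds which Cosmos DB container, and the source connector uses the same mapping in reverse.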
EPISODE LINKS
- Integrating Azure and Confluent: Ingesting Data to Azure Cosmos DB through Apache Kafka
- Download the Azure Cosmos DB Connector (Source and Sink)
- Join the Confluent Community
- GitHub: Kafka Connect for Azure Cosmos DB
- Watch the video version of this podcast
- Learn more with Kafka tutorials, resources, and guides at Confluent Developer
- Live demo: Kafka streaming in 10 minutes on Confluent Cloud
- Use 60PDCAST to get an additional $60 of free Confluent Cloud usage (details)
265 episodes
ALL EPISODES
- Apache Kafka 3.5 - Kafka Core, Connect, Streams, & Client Updates 11:25
- How to use Data Contracts for Long-Term Schema Management 57:28
- How to use Python with Apache Kafka 31:57
- Next-Gen Data Modeling, Integrity, and Governance with YODA 55:55
- Migrate Your Kafka Cluster with Minimal Downtime 1:01:30
- Real-Time Data Transformation and Analytics with dbt Labs 43:41
- What is the Future of Streaming Data? 41:29
- What can Apache Kafka Developers learn from Online Gaming? 55:32
- How to use OpenTelemetry to Trace and Monitor Apache Kafka Systems 50:01
- What is Data Democratization and Why is it Important? 47:27
- Git for Data: Managing Data like Code with lakeFS 30:42
- Using Kafka-Leader-Election to Improve Scalability and Performance 51:06
- Real-Time Machine Learning and Smarter AI with Data Streaming 38:56
- The Present and Future of Stream Processing 31:19
- Top 6 Worst Apache Kafka JIRA Bugs 1:10:58
- Learn How Stream-Processing Works The Simplest Way Possible 31:29
- Building and Designing Events and Event Streams with Apache Kafka 53:06
- Rethinking Apache Kafka Security and Account Management 41:23
- Real-time Threat Detection Using Machine Learning and Apache Kafka 29:18
- Improving Apache Kafka Scalability and Elasticity with Tiered Storage 29:32
- Decoupling with Event-Driven Architecture 38:38
- If Streaming Is the Answer, Why Are We Still Doing Batch? 43:58
- Security for Real-Time Data Stream Processing with Confluent Cloud 48:33
- Running Apache Kafka in Production 58:44
- Build a Real Time AI Data Platform with Apache Kafka 37:18
- Optimizing Apache JVMs for Apache Kafka 1:11:42
- Application Data Streaming with Apache Kafka and Swim 39:10
- International Podcast Day - Apache Kafka Edition | Streaming Audio Special 1:02:22
- Real-Time Stream Processing, Monitoring, and Analytics With Apache Kafka 34:07
- Reddit Sentiment Analysis with Apache Kafka-Based Microservices 35:23
- Capacity Planning Your Apache Kafka Cluster 1:01:54
- Streaming Real-Time Sporting Analytics for World Table Tennis 34:29
- Real-Time Event Distribution with Data Mesh 48:59
- Apache Kafka Security Best Practices 39:10
- What Could Go Wrong with a Kafka JDBC Connector? 41:10
- Apache Kafka Networking with Confluent Cloud 37:22
- Event-Driven Systems and Agile Operations 53:22
- Streaming Analytics and Real-Time Signal Processing with Apache Kafka 1:06:33
- Blockchain Data Integration with Apache Kafka 50:59
- Automating Multi-Cloud Apache Kafka Cluster Rollouts 48:29
- Common Apache Kafka Mistakes to Avoid 1:09:43
- Tips For Writing Abstracts and Speaking at Conferences 48:56
- How I Became a Developer Advocate 29:48
- Data Mesh Architecture: A Modern Distributed Data Model 48:42
- Flink vs Kafka Streams/ksqlDB: Comparing Stream Processing Tools 55:55
- Practical Data Pipeline: Build a Plant Monitoring System with ksqlDB 33:56
- Scaling Apache Kafka Clusters on Confluent Cloud ft. Ajit Yagaty and Aashish Kohli 49:07
- Streaming Analytics on 50M Events Per Day with Confluent Cloud at Picnic 34:41
- Optimizing Apache Kafka's Internals with Its Co-Creator Jun Rao 48:54
- Using Event-Driven Design with Apache Kafka Streaming Applications ft. Bobby Calderwood 51:09
- Monitoring Extreme-Scale Apache Kafka Using eBPF at New Relic 38:25
- Confluent Platform 7.1: New Features + Updates 10:01
- Scaling an Apache Kafka Based Architecture at Therapie Clinic 1:10:56
- Bridging Frontend and Backend with GraphQL and Apache Kafka ft. Gerard Klijs 23:13
- Building Real-Time Data Governance at Scale with Apache Kafka ft. Tushar Thole 42:58
- Handling 2 Million Apache Kafka Messages Per Second at Honeycomb 41:36
- Serverless Stream Processing with Apache Kafka ft. Bill Bejeck 42:23
- The Evolution of Apache Kafka: From In-House Infrastructure to Managed Cloud Service ft. Jay Kreps 46:32
- Intro to Event Sourcing with Apache Kafka ft. Anna McDonald 30:14
- Expanding Apache Kafka Multi-Tenancy for Cloud-Native Systems ft. Anna Povzner and Anastasia Vela 31:01
- Optimizing Cloud-Native Apache Kafka Performance ft. Alok Nikhil and Adithya Chandra 30:40
- From Batch to Real-Time: Tips for Streaming Data Pipelines with Apache Kafka ft. Danica Fine 29:50
- Real-Time Change Data Capture and Data Integration with Apache Kafka and Qlik 34:51
- Modernizing Banking Architectures with Apache Kafka ft. Fotios Filacouris 34:59
- Running Hundreds of Stream Processing Applications with Apache Kafka at Wise 31:08