103. Chaos Engineering

Code[ish]

תוכן מסופק על ידי Salesforce Engineering. כל תוכן הפודקאסטים כולל פרקים, גרפיקה ותיאורי פודקאסטים מועלים ומסופקים ישירות על ידי Salesforce Engineering או שותף פלטפורמת הפודקאסט שלהם. אם אתה מאמין שמישהו משתמש ביצירה שלך המוגנת בזכויות יוצרים ללא רשותך, אתה יכול לעקוב אחר התהליך המתואר כאן https://he.player.fm/legal.

5y ago

MP3

סדרה בארכיון ("עדכון לא פעיל" status)

When? This feed was archived on December 12, 2020 08:26 (5y ago). Last successful fetch was on March 02, 2025 02:12 (7M ago)

Why? עדכון לא פעיל status. השרתים שלנו לא הצליחו לאחזר פודקאסט חוקי לזמן ממושך.

What now? You might be able to find a more up-to-date version using the search function. This series will no longer be checked for updates. If you believe this to be in error, please check if the publisher's feed link below is valid and contact support to request the feed be restored or if you have any other concerns about this.

Rick Newman interviews Mikolaj Pawlikowski, who recently wrote a book called "Chaos Engineering: Crash test your applications." The theory behind chaos engineering is to "break things on purpose" in your operational flow. You want to deliberately inject failures that might occur in production ahead of time, in order to anticipate them, and thus implement workarounds and corrections. Typically, this practice is often used for large, distributed systems, because of the many points of failure, but it can be useful in any architecture.

One of the obstacles to embracing chaos engineering is finding high level approval from other teammates, or even managers. Even after the feature is a complete and the unit tests are passing, it can be difficult to convince someone that some resiliency work needs to continue, because there's no visible or tangible benefit to preparing for a disaster. Mikolaj suggests that people clearly lay out to other colleagues the ways a system can fail, and the impact it can have on the application or business. Rather than try to fear monger, it can be useful to point to other companies' availability issues as words of caution for their teams to embrace. Mikolaj also says that chaos engineering doesn't need to focus solely on complicated problems like race conditions across distributed systems. Often, there's enough low hanging fruit, such as disk space running out or an API timing out, that can be useful to consider fixing.

The chaos engineering mindset can also extend beyond pure software. If you think about people working across different timezones as a distributed system, you can also optimize for failures in communication before they occur. Everyone works at a different pace, and communication issues can be analogous to a network loss. Rather than fix miscommunications after they occur, establishing shared practices (like writing down every meeting, or setting up playbooks) can go a long way to ensuring that everyone will be able to do their best under changing circumstances.

Links from this episode

Mikolaj's book is called Chaos Engineering: Crash test your applications -- get a 40% discount using the code podish19
powerfulseal is a testing tool for Kubernetes clusters
Mikolaj distributes the Chaos Engineering Newsletter
Conf42 is a conference focusing on high-level computer science
ChaosConf is the world’s largest Chaos Engineering event
Awesome Chaos Engineering is a curated list of Chaos Engineering resources

131 פרקים

#Programming #Software Development #Tech #Chris Castles