114. Beyond Root Cause Analysis in Complex Systems
Archived series ("Inactive feed" status)
When? This feed was archived on December 12, 2020 08:26.
Why? Inactive feed status: our servers were unable to retrieve a valid podcast feed for an extended period.
What now? You might be able to find a more up-to-date version using the search function. This series will no longer be checked for updates.
In this episode of Codeish, Marcus Blankenship, a Senior Engineering Manager at Salesforce, is joined by Robert Blumen, a Lead DevOps Engineer at Salesforce.
During their discussion, they take a deep dive into the theories that underpin human error and complex system failures and offer fresh perspectives on improving complex systems.
Root cause analysis is the method of analyzing a failure after it occurs in an attempt to identify the cause. This method looks at the fundamental reasons that a failure occurs, particularly digging into issues such as processes, systems, designs, and chains of events. Complex system failures usually begin when a single component of the system fails, requiring nearby "nodes" (or other components in the system network) to take up the workload or obligation of the failed component.
Complex system breakdowns are not limited to IT. They also occur in medicine, industrial accidents, shipping, and aeronautics. As Robert puts it: "In the case of IT, [systems breakdowns] mean people can't check their email, or can't obtain services from a business. In other fields, maybe the patient dies, a ship capsizes, a plane crashes."
The 5 WHYs
The 5 WHYs root cause analysis is about truly getting to the bottom of a problem by asking “why” five levels deep. Using this method often uncovers an unexpected internal or process-related problem.
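The mechanics of the method can be sketched as a chain of question-and-answer pairs, where the deepest answer is the candidate root cause. The incident and answers below are invented purely for illustration; they are not from the episode.

```python
# A hypothetical five-whys chain for an invented outage.
five_whys = [
    ("Why did the site go down?", "The database ran out of connections."),
    ("Why did it run out of connections?", "A deploy leaked connections."),
    ("Why did the deploy leak connections?", "A pool was not closed on error."),
    ("Why was the pool not closed?", "The error path skipped cleanup."),
    ("Why did the error path skip cleanup?", "No review checklist covers resource cleanup."),
]

def walk_five_whys(chain):
    """Print each why/answer pair; return the deepest answer as the candidate root cause."""
    for depth, (question, answer) in enumerate(chain, start=1):
        print(f"{depth}. {question} -> {answer}")
    return chain[-1][1]

root = walk_five_whys(five_whys)
```

Note how the chain drifts from a technical symptom toward a process gap, which is exactly the "unexpected internal or process-related problem" the method tends to surface.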
Accident investigation can involve both simple and complex systems. Robert explains, "Simple systems are like five dominoes that have a knock-on effect. By comparison, complex systems have a large number of heterogeneous pieces, and the interaction between the pieces is also quite complex. If you have N pieces, you could have N squared connections between them in an IT system."
He further explains, "You can lose a server, but if you're properly configured to have retries, your next level upstream should be able to find a different service. That's a pretty complex interaction that you've set up to avoid an outage."
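As a rough sketch of that retry-and-failover interaction, the caller works through a list of replicas and only surfaces an error if every one fails. The replica names and the `flaky_call` stand-in below are invented for illustration, not a real client library.

```python
def flaky_call(replica):
    """Stand-in for a network request; pretend 'server-a' is down."""
    if replica == "server-a":
        raise ConnectionError(f"{replica} unreachable")
    return f"response from {replica}"

def call_with_failover(replicas, attempt=flaky_call):
    """Try each replica in turn; raise only if all of them fail."""
    last_error = None
    for replica in replicas:
        try:
            return attempt(replica)
        except ConnectionError as err:
            last_error = err  # record and fall through to the next replica
    raise last_error  # every replica failed: the outage surfaces

print(call_with_failover(["server-a", "server-b"]))  # fails over to server-b
```

The point of the sketch is that a single lost server produces no outage at all, which is precisely why, when an outage does happen, more than one thing has usually gone wrong.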
In the case of a complex system, generally, there is not a single root cause for the failure. Instead, it's a combination of emergent properties that manifest themselves as the result of various system components working together, not as a property of any individual component.
An example of this is the worst airline disaster in history. Two 747s were flying to Gran Canaria airport. However, the airport was closed because a bomb had exploded there, and the planes were rerouted to Tenerife. The airport in Tenerife was unaccustomed to handling 747s. Inadequate radar and fog compounded a series of human errors, such as misheard commands. One plane began its takeoff roll while the other was still taxiing on the same runway, and the two collided.
Robert talks about Dr. Richard Cook, who wrote about the dual role of operators. "The dual role is the need to preserve the operation of the system and the health of the business. Everything an operator does is with those two objectives in mind." Operators must take calculated risks to preserve output, but this is rarely recognized or appreciated.
Another property of complex systems is that they run in a perpetually partially broken state. You don't necessarily discover this until an outage occurs; only through the post-mortem process do you realize something had already failed. Humans are imperfect beings and naturally prone to error, and whenever we are given responsibilities, there is always the chance of error.
What's a more useful way of thinking about the causes of failures in a complex system?
Robert gives the example of a tree structure or acyclic graph with a single node at one edge representing the outage or incident.
If you step back one layer, you might not ask what is the cause, but rather what were contributing causes? In this manner, you might find multiple contributing factors that interconnect as more nodes grow. With this understanding, you can then look at the system and say, "Well, where are the things that we want to fix?" It’s important to remember that if you find 15 contributing factors, you are not obligated to fix all 15; only three or four of them may be important. Furthermore, it may not be cost-effective to fix everything.
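That tree-of-contributing-causes structure can be sketched as a small graph walk: the incident is a node whose parents are contributing factors, each of which may have contributing factors of its own. All of the factor names below are invented for the example.

```python
# Hypothetical contributing-cause graph: incident -> its contributing factors.
contributing = {
    "outage": ["db overload", "slow failover"],
    "db overload": ["missing index", "traffic spike"],
    "slow failover": ["stale runbook"],
}

def all_factors(graph, node, seen=None):
    """Collect every contributing factor reachable from the incident node."""
    seen = set() if seen is None else seen
    for parent in graph.get(node, []):
        if parent not in seen:
            seen.add(parent)
            all_factors(graph, parent, seen)
    return seen

found = all_factors(contributing, "outage")
```

Even this toy graph yields five contributing factors from a single incident node, which is why the question shifts from "what is the cause?" to "which of these do we want to fix?"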
One approach is to take all of the identified contributing factors, rank them by some combination of their impact and costs, then decide which are the most important.
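A minimal sketch of that ranking step, with invented factors and scores (a real team would estimate impact and cost for its own system), might look like:

```python
# Hypothetical contributing factors, scored by estimated impact and fix cost.
factors = [
    {"name": "missing index", "impact": 8, "cost": 2},
    {"name": "stale runbook", "impact": 5, "cost": 1},
    {"name": "traffic spike", "impact": 3, "cost": 9},
]

def prioritize(factors, top_n=2):
    """Return the top_n factors with the best impact-to-cost ratio."""
    ranked = sorted(factors, key=lambda f: f["impact"] / f["cost"], reverse=True)
    return [f["name"] for f in ranked[:top_n]]

print(prioritize(factors))  # the cheap, high-impact fixes come first
```

The ratio is only one possible scoring; the point is that the output is a short, deliberate list of fixes rather than an obligation to address everything found.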
What is some advice for people who want to stop thinking about their system in terms of simple systems and start thinking about them in terms of complex systems?
Robert Blumen suggests recognizing that you may have a cognitive bias toward focusing on the people who made decisions rather than on the context in which those decisions were made.
What was the context that that person was facing at the time? Did they have enough information to make a good decision? Are we putting people in impossible situations where they don't have the right information? Was there adequate monitoring? If this was a known problem, was there a runbook?
What are ways to improve the human environment so that the operator can make better decisions if the same set of factors occurs again?