I recently interviewed Rafal Los, Chief Security Evangelist at HP, about his somewhat controversial position that stability is bad for business. He was drawing from an analogy in the financial world. You can watch that interview at the end of this post for more context.
Stability has been the goal of most of the IT initiatives over my career. I cut my teeth designing IP voice networks: you can’t let a client’s phone system go down and expect to keep that client for very long. More recently, as Director of Technology at Lexblog, a significant portion of my time was dedicated to maintaining and increasing the stability of the overall publishing infrastructure. Both of those situations require a high degree of stability so that people can use technology the way they expect it to work, but they also require constant testing to prevent future instability.
The more I think about what Rafal Los said, the more I tend to agree that too much stability is bad for business in general and IT infrastructure specifically. He points to complacency as one of the key enemies of resiliency. I agree. When I joined Lexblog, things appeared to be working correctly. And things worked great, until they didn’t. It took a significant outage and a couple of days without sleep to find out what the previous IT leadership hadn’t done. We ultimately solved the problem and implemented a solution to prevent it from happening again, but the chaos at the time was extremely nerve-racking.
Networks are Alive
If those of us in IT thought about our networks and infrastructure as complex living organisms, much like the human bodies we all walk around in, we would be more accepting of how powerful a lack of stability can be in creating a stronger organism. To run with this analogy, we go through life exposing our bodies to instability. We get blisters when friction damages our skin. We are exposed to diseases that temporarily disrupt normal body function.
As we recover, we develop calluses to protect against future skin damage and we create antibodies that help protect against future infection. In other words, our bodies program in the recovery mechanisms that make our organic network more resilient.
The Chaos Monkey in Nature
If we are properly testing our networks, using something along the lines of Netflix’s Chaos Monkey approach, we are destabilizing the network on a regular basis. When you intentionally create a flaw in the system, you find out what happens. Script an automated recovery process and you prevent the flaw you found this time from impacting the overall system in the future. That scripted patch is your antibody against future outages.
For a human example of the chaos monkey, look at the cultures where parents intentionally expose kids to chickenpox. While there is a vaccine, some percentage of the population believes that getting chickenpox naturally gives better long-term protection. Sure, the short-term discomfort may be unpleasant for the child, but with proper attention to recovery, the child’s immune system comes out protected against future exposure.
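To make the inject-then-recover loop concrete, here is a minimal sketch in Python. It is a toy simulation, not Netflix's Chaos Monkey: the Service class, the service names, and the function names are all made up for illustration, and "restarting" a service just flips a flag.

```python
import random


class Service:
    """Hypothetical stand-in for a real service or host."""

    def __init__(self, name):
        self.name = name
        self.healthy = True


def inject_failure(services, rng):
    """Chaos step: take one randomly chosen healthy service down."""
    victim = rng.choice([s for s in services if s.healthy])
    victim.healthy = False
    return victim


def recover(services):
    """Scripted recovery (the 'antibody'): restart anything unhealthy."""
    restarted = []
    for service in services:
        if not service.healthy:
            service.healthy = True  # a real script would restart or fail over here
            restarted.append(service.name)
    return restarted


def run_chaos_round(services, rng):
    """Break one thing on purpose, then check that recovery catches it."""
    victim = inject_failure(services, rng)
    restarted = recover(services)
    return victim.name, restarted


if __name__ == "__main__":
    pool = [Service(name) for name in ("web", "db", "cache")]
    victim, restarted = run_chaos_round(pool, random.Random())
    print(f"took down {victim}, recovery restarted {restarted}")
```

In a real environment the interesting part is whatever recover() fails to catch. Each gap you find becomes the next thing to script.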
Vaccines Are StackOverflow
Most network and infrastructure problems were encountered by someone else already. By looking at the path followed by others in the past, we can prevent future catastrophe. But you have to be looking for the solution. Assuming your network is already immune is foolish.
In the human organism, this is just like getting immunized. We’ve seen a similar attack vector in other human bodies and created a solution to protect against it. By administering that solution, we know that getting exposed to germs will result in our bodies being able to defend against illness.
Pulling Cables or Remapping the Brain
Another example from human experience would be the brain’s ability to remap functions following damage to the system. There are numerous cases where someone lost functionality in a portion of the brain and ultimately relearned that function in other parts of the brain. The brain re-cables after part of the cabling goes bad. I’m not suggesting any of us would ever intentionally damage our own brains (too many pints at the pub aside), but we live in an unpredictable world where chaos can occur at any time.
Pulling a random network cable might be analogous to the way our brains remap functions. When you pull a network cable and suddenly lose communication with a portion of the network, you know you need to rethink your redundancy. By creating a more fault-tolerant topology, you make sure that the next time that cable gets pulled, your network continues to function as expected.
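You can rehearse the cable pull on paper before trying it in the rack. The sketch below assumes the topology can be described as a list of device names and a list of links between them (the names are invented for illustration). It removes one link at a time and reports every link whose loss would cut part of the network off.

```python
from collections import deque


def is_connected(nodes, links):
    """Return True if every node can reach every other node over the links."""
    nodes = set(nodes)
    if not nodes:
        return True
    neighbors = {node: set() for node in nodes}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first walk from an arbitrary device
        current = queue.popleft()
        for nxt in neighbors[current] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen == nodes


def critical_links(nodes, links):
    """Return each link whose removal alone would partition the network."""
    return [
        link
        for i, link in enumerate(links)
        if not is_connected(nodes, links[:i] + links[i + 1:])
    ]


if __name__ == "__main__":
    # Hypothetical four-device network with a single uplink to "edge".
    devices = ["core", "sw1", "sw2", "edge"]
    cables = [("core", "sw1"), ("core", "sw2"), ("sw1", "sw2"), ("sw2", "edge")]
    print(critical_links(devices, cables))
    # Adding a second path to "edge" removes the single point of failure.
    print(critical_links(devices, cables + [("sw1", "edge")]))
```

This only models single cable failures. It says nothing about a switch dying or two links failing together, which is exactly why the physical test is still worth doing.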
The Risks to a Living Organism
As Rafal Los points out, there are four primary risks to a living organism. I already mentioned complacency as one of them. We can grow lazy, overconsume, and do permanent damage to the system, even when we think nothing can break. Our bodies become resistant to change when we gain weight, form bad eating habits, and stop putting in the constant effort required to keep them healthy. Rigidity sets in for the lifelong smoker or the guy who eats a double cheeseburger every day for lunch. And if we assume too much stability, it becomes nearly impossible to recover.
Network infrastructure is no different, whether you’re looking at hardware, software, cabling, or anything else. If you don’t constantly test its resiliency and resist the temptation to become complacent, eventually you’ll wind up in a situation that is nearly impossible to recover from. Just because it doesn’t seem broken right now doesn’t mean it won’t be tomorrow.
Watch the interview with Rafal Los here and share your thoughts on Stability vs. Resiliency.