Blog
-
10 mins
read
If you are also wondering how to shift left Resilience and Chaos Engineering to Developers, you are reading the right article.
Wow, what a ride. We are now in our third year and have learned a lot. We have been heads-down building the product; we interviewed many people and talked to many users. And we have continuously improved our product and put all our knowledge about Resilience and Chaos Engineering into it.
However, we had one crucial insight on our way: Chaos Engineering is great, but without additional tooling, it's primarily valuable for experienced engineers who understand their systems very well. Today, I'm very proud to introduce you to an important milestone for our product. But before we dive deeper into that, maybe a slight step back. What trends do we currently see in the market that influenced our product decisions in the last months?
Everything gets digital - Not only COVID-19 but also the ongoing globalization leads to more and more digital products that make life easier for everyone.
Sometimes you feel that there can't be enough developers to implement all the products and digital services with their features.
This naturally leads to more and more complexity in distributed systems. By now, no human brain can survey and control and which forces us to deal with the topic of service reliability.
Everything will shift even more in the direction of the cloud. We are now seeing this trend in Dev Tools as well. No developer works completely locally anymore.
We are empowering developers to build matters more than ever before. So we have to focus on Developer Experience (DX) to build tools that help them in their day2day workflow creating products that matter. So we have to improve the overall performance of delivery capabilities and eliminate the case that reliability work sometimes stays in conflict to increase the velocity.
Attention to non-functional requirements (security, maintainability, fault tolerance, reliability, performance, scalability) becomes even more critical as complexity increases. Still, developers don't have time or don't see value in optimizing these requirements.
Who are the Developers?
Who are the developers we see as responsible for service reliability and resilience? The answer is simple: Every developer but different about their role and concrete competencies. They have one thing in common: They hate production incidents.
But we need to keep in mind that a junior developer may not create a resilience design and enforce it themselves. They may be informed about issues during the software delivery process when a problem was found in their code related to potential problematic software resilience design. A DevOps Engineer will be more likely to think about the specific challenges since he has an overview of the entire workflow. The SRE has the final responsibility for the reliability of the production system and will also be interested in having consistent and automated processes that significantly improve the stability of the software.
But the answer is also that the topic is very complex, and methods like Chaos and Resilience Engineering are not immediately familiar to everyone. That's why we built Steadybit, and our goal is to support each of the developers in the best possible way. We also believe that people are essential in creating an authentic Culture of Resilience.
We love to automate things to concentrate on our daily business and not get distracted by daily production incidents. However, we learned that the best way to avoid that is to start very early in the process ("Shift Left").
What's their problem?
We've also learned: A developer has no time to do chaos experiments. This learning does not mean that chaos experiments are flawed, but it is always about limiting the cognitive load for a developer. Patterns such as Retry, Circuit Breaker, and Rolling Update are known. Still, they are often only implemented in later stages of the software dev cycle or even only when a failure occurs or an incident happens.
As a result, failures and bugs in individual subcomponents often lead to cascading effects that could be avoided. Steadybit provides an answer for that. We make any existing resilience issues transparent at an early stage and accompany the developer during the solution, and, if desired, also fix these issues. The goal is always to create as little or no further cognitive load for the developer but to give them the focus on his actual task: Velocity - The implementation of features.
As Ben already mentioned in his post: "Moving from experiments to expectations means that many complicated facets of Chaos Engineering can be avoided, and Resilience Engineering be democratized."
Expert knowledge is now no longer required to operate systems stable and resiliently.
Resilience as Code
We believe that it makes more sense to start with the expectation and not break things. And of course: The developer is responsible for the applications and takes care of them. I'm even more pleased to present the first milestone of our new feature, "Resilience Policies" to you today. We are pursuing three fundamental goals with this feature:
Start with predefined policies in minutes.
Codify all the things.
Customize it to your needs once you have learned how it works.
Setup your environment
Of course, we use our CLI:
The connection to the Steadybit platform is made quickly:
If you don't have an account yet, sign up quickly here: https://www.steadybit.com/get-started/...
Create Service Definition
As we continuously improve every part of our product, some things have changed since the last post. For example, you can now directly define the policies that should apply to your service.
This selection then results in a service definition with two resilience policies:
The basic model is that one policy always consists of 1:n tasks. Let's have a deeper look at the policy recovery-pod:
A task always represents a ready-to-use experiment or a weak spot validation that needs to be executed. However, the "experiment" can be more like a test with fixed input and output states. An essential part of this is continuously checking whether the experiment or the test was successful. For the task recovery-of-single-container, the affected pod should be replaced within a given time - ending with the Kubernetes Deployment back in a ready state. If this is not the case, the task would be declared failed. Steadybit will check that internally with the following experiment (minimalistic example - see here for complete code):
In the end, it's all about continuously measuring the MTTR of a single service: How long does it take you to get everything fully operational again?
Apply Service Definition
The CLI will synchronize the service definition with the platform in the next step. You can think of it like you're doing it usually in the context of Kubernetes Manifests (kubectl apply):
The following command executes the defined tasks resulting from the policies and displays a report in the console.
As an option, you can directly jump into the platform UI:
Here you can see all assigned tasks of the fashion service on the right sidebar:
So, 40% of the assigned tasks are already completed in this case. But especially the recovery-pod task seems to be failing. Let's have a deeper look at that issue:
Some fashion pods are not ready again within 2minutes (default waiting time). Therefore we found a potential issue because a long recovery time can probably lead to reliability problems, especially when deploying a few times a day. Maybe this predefined policy does not suit your needs? Let's move on to the next chapter.
Define your own policies
Developers love customizing things to their own needs. We feel you. Thus, all policies and tasks that steadybit currently brings are publicly stored in this GitHub Repository and can be adapted or customized to your own needs. Feel free to submit a PR on it to benefit other people.
Run it continuously
And of course, since we as developers always want to automate everything to make our daily work more accessible and reduce all avoidable cognitive load, we also offer various GitHub Actions to facilitate interaction with the policies and tasks:
Future
There is so much more to come. Unfortunately, it doesn't all fit in one blog post. I promise you can look forward to many more features to improve your service reliability. But we are also still learning, and that's why we need you and your feedback! We would be thrilled if you try it out and contact us. Don't worry; you don't need to go through a sales call ;-) Just you and the awesome folks from Steadybit. We are also developers and feel you.
We can solve the problems that hurt by working together in the community. So we are excited to build the best resilience product in the world with you!
Start now: https://www.steadybit.com/get-started/