The first time I saw the dashboard, I almost couldn't believe it. Thousands of errors in the logs every hour. Of course, this was aggregated across services and thousands of devices and users. But errors are supposed to be rare, right? This indicated the opposite.
I had known the dashboard existed for a while, but I didn't have access to it. I knew error handling and logging were a general concern for the team; I had seen plenty of both while debugging different issues in the system. The dashboard finally showed the magnitude of the problem. Some of the team had had access to it for weeks, so the next thing I couldn't believe was the team's reaction, or really, the lack of one. Thousands of errors every hour were normal. This was the baseline the team had come to accept as the definition of stable, whether we were aware of it or not.
How could this be? Even if those errors weren't really errors, the volume showed a lack of care for both logging and for handling those scenarios. And if they were real errors, why weren't we doing something about them? The non-response to a constant baseline of thousands of errors was an odd thing to sit with. Maybe the team did feel the same way, but they weren't acting like it. After a few minutes with this new information, one question kept coming to mind: how the hell did we get here? The good news is that I found the answers; the bad news is that the answers revealed some core problems in our software development process. Let me take you through a few of the root causes.
The system's failures were not part of the design.
Since you control the code, you get to define the behaviour of the software, and that includes defining what happens when a failure mode is encountered. One of the most obvious contributors to this error rate was a lack of care and definition around how the system should behave when it fails to do its job. When designing both new features and feature updates, there were hardly ever definitions of error states and error behaviour. The product team didn't want to define acceptance criteria for user stories or requirements, and the engineering team didn't always fill the gap; they dealt only with the error states they happened to think of while implementing the requested changes. In some cases, neither team took care to understand what the user would want to happen in a specific failure scenario. We weren't designing the unhappy execution paths, which meant we weren't formally testing all of them either. Error states are going to happen, and they should be handled and logged appropriately. That starts with a proper design, so that these events become rare.
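To make this concrete, here is a minimal sketch of what designing the unhappy paths up front can look like. None of this is from the actual system; the `SyncFailure` modes and `handle_sync_failure` function are hypothetical. The point is that every failure mode is enumerated at design time and mapped to an agreed-upon behaviour, instead of being discovered in production.

```python
from enum import Enum
import logging

logger = logging.getLogger("orders")

class SyncFailure(Enum):
    """Hypothetical failure modes, enumerated at design time."""
    NETWORK_TIMEOUT = "network_timeout"            # transient: safe to retry
    INVALID_PAYLOAD = "invalid_payload"            # permanent: needs user action
    UPSTREAM_UNAVAILABLE = "upstream_unavailable"  # transient: queue for later

def handle_sync_failure(order_id: str, failure: SyncFailure) -> str:
    """Each designed failure mode maps to an explicit behaviour."""
    if failure is SyncFailure.NETWORK_TIMEOUT:
        logger.warning("order %s: sync timed out, scheduling retry", order_id)
        return "retry"
    if failure is SyncFailure.INVALID_PAYLOAD:
        logger.error("order %s: payload rejected, notifying user", order_id)
        return "notify_user"
    if failure is SyncFailure.UPSTREAM_UNAVAILABLE:
        logger.warning("order %s: upstream down, queueing for later", order_id)
        return "queue"
    raise AssertionError(f"unhandled failure mode: {failure}")
```

Notice that only the genuinely unexpected state raises; anticipated, transient failures log at warning level and follow a defined recovery path, which is exactly what keeps an error dashboard meaningful.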
The system was built through pressure and patchwork.
The second major cause of the error rate was the general approach of patching the system. Of course, a lot of this can be justified. When the system is failing and production incidents are common, you don't have time for proper refactoring. You start writing the fix, test it quickly, and deploy the patch to production so customers and internal team members stop yelling. But if you do this long enough, the system drifts from what your business needs to grow to what your business needs to not die. It ends up looking like one of those patchwork quilts you might find at a local farmers market: it serves the purpose, but the design is inconsistent, and it makes adding a new feature more and more difficult as time goes on. A lot of this happens because the focus is on velocity. Engineering teams are usually under a lot of pressure to deliver the feature that was promised to that big customer months ago. When you are moving fast and patching as you go, you aren't thinking ahead about error states and handling. You are letting your customers catch the errors live and then reacting to them.
The system had no consistent logging framework.
Logging is one of the most important tools for ensuring your team can understand a system's behaviour. You can't always control how users interact with your system, and you can't plan ahead for every edge case. Even for testing, you need proper logging so that you can trace the behaviour and determine whether it was expected. The inconsistency in logging was a big contributor to the error rate I was seeing. Some of the errors were caused by misconfigurations. Some were caused by missing error handling logic. Some were labelled errors but shouldn't have been. Some were duplicates, and some were errors spawned in multiple services by a failure somewhere else in the system. None of them could be easily traced. Even for the correctly identified errors, the lack of logging structure meant that each one had to be traced manually to be confirmed. Just think about the cost of that across thousands of error messages. Refactoring the system onto a consistent logging framework would not only have prevented many of these errors, it would have made troubleshooting much more efficient and effective.
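Here is a minimal sketch of what a consistent framework can look like; the field names and the `checkout` service are illustrative, not from the system described. Every service emits the same structured fields, and a correlation ID ties together all the messages spawned by a single failure, so tracing no longer has to be manual.

```python
import json
import logging
import uuid

class StructuredFormatter(logging.Formatter):
    """Emit every log line as JSON with the same fields across all services."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "service": getattr(record, "service", "unknown"),
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(StructuredFormatter())
logger = logging.getLogger("checkout")  # illustrative service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The same correlation ID travels with the request through every service,
# so one dashboard query can group every message from a single failure.
correlation_id = str(uuid.uuid4())
logger.error(
    "payment provider rejected the charge",
    extra={"service": "checkout", "correlation_id": correlation_id},
)
```

With a shared formatter like this, a duplicate error in three services shows up as three lines with one correlation ID, not three unrelated mysteries.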
We cannot build software systems and care only about the happy path. Of course, the happy path is critically important; that is where customers receive the value from the system. But there are going to be customers who are one error away from switching to your competitors. Now, we may never achieve an error rate of zero, and the system may need to fail. In safety-critical software, failures still occur, even if they aren't called failures. Those domains force engineers to think about failure modes before they write a line of code: the failure is predictable and handled in a way that lets the system keep operating and eventually self-heal. You might not work in a safety-critical domain, but you can still adopt the same mindset and design for failures appropriately, instead of reacting to them when your customers call.
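Here is a small sketch of that mindset in ordinary, non-safety-critical code; the `fetch_live_price` call and its numbers are invented for illustration. The failure mode is anticipated at design time, bounded with retries, and given a defined degraded behaviour instead of an unhandled exception.

```python
import logging
import time

logger = logging.getLogger("pricing")

def fetch_live_price(sku: str) -> float:
    """Stand-in for a call to a flaky upstream service (always fails here)."""
    raise TimeoutError("upstream did not respond")

def price_with_fallback(sku: str, cached_price: float, retries: int = 3) -> float:
    """Anticipate the failure: retry with backoff, then degrade predictably."""
    for attempt in range(retries):
        try:
            return fetch_live_price(sku)
        except TimeoutError:
            logger.warning("price fetch for %s failed (attempt %d)", sku, attempt + 1)
            time.sleep(2 ** attempt)  # simple exponential backoff
    # Designed degraded mode: serve the cached price instead of failing the request.
    logger.warning("serving cached price for %s after %d attempts", sku, retries)
    return cached_price

print(price_with_fallback("SKU-123", cached_price=19.99))  # falls back to 19.99
```

The system keeps operating, the degraded path is logged deliberately, and when the upstream recovers, the next call self-heals without anyone paging the on-call engineer.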
Don't let your error rate keep climbing. Don't let your team treat all of the errors as noise and watch only for deviations from some baseline. If your error rate is non-zero, you should know exactly why those errors are happening. Anticipate the failure modes from the beginning of your design and you'll have a lot less explaining to do to your customers.