Nobody Approved This Change, But It Changed Everything

My team was drowning in testing. Not for lack of skill or effort, but for lack of time. We had just cleared a monumental backlog of releases. As a manager, I was proud of the team's ability to prioritize, work hard, and find bugs before they reached production. However, clearing the backlog just created a new problem.

Now that the backlog had been cleared, the pressure to release more frequently increased. And that is what the organization started doing. Releases were smaller, and there was more emphasis on velocity. But it was paired with an emphasis on quality. So, my team was not relieved at all. In fact, their expected workload just kept growing. We still found lots of bugs, but to keep the release cadence short, they had to be fixed right away and the release retested each time. Part of that release testing was smoke testing the system after every deployment to each of the four environments in our pipeline to production.

It was becoming a task that every team member started to dread. It took dedicated focus to complete quickly. It was the same set of test cases every time. Often the release engineers were waiting on the answer, so that added pressure. When test cases were failing, I would be interrupted to dig into it and find the root cause. It was causing a lot of pressure, frustration, and distraction.

I knew this was a growing pain for my team, and that it was affecting everyone involved in the release. I also knew it was a symptom of a change that was long overdue: automated testing. And the lack of automated testing was a symptom of an even bigger concern: a poor culture of software quality. I kept having this intuition that if we could get automated testing going, release quality would improve. We could track testing metrics and correlate them to release outcomes, giving us data to show why we needed to improve our processes with regard to quality. But I just couldn't get any more help to get started.

I asked leadership for more resources. My team was at capacity just trying to keep up with release testing, investigating issues, and planning testing for the upcoming features. But I didn't get any more resources. I was told we didn't have the budget, and all the other teams were overbooked on their roadmaps.

I'll be honest, I was really pissed off. As the release gatekeeper, these issues required a lot of my attention. And I could feel my team's frustration growing. I could even feel them getting frustrated on my behalf, over what I was going through. It was that sensation you get when you can feel your team members starting to search for new roles. If you're an engineering leader, you'll likely know this feeling all too well. I was even doing some of the smoke testing myself to lighten the load on the team, and it was just making things worse because I didn't have the time to be doing it.

One morning, I got to my desk and sat down. I decided that I had to do something different. I needed automated testing if we were ever going to scale this platform. I needed people to care more about the quality of their code. I needed leadership to understand what we were dealing with, so they could justify making a change. As I sat there anxiously waiting for another Slack message about smoke testing, it clicked. I needed numbers that I could put a dollar value on. If I could put a dollar amount on the cost of poor quality, it could be compared to other priorities, and it would most likely show that the cost of the current culture towards software quality was much higher than the cost of not building that next feature.

This was a big idea, and my next thought was "MVP". What was the minimum viable product that I could build to get the data I needed and make a tangible impact? The answer was obvious: automating smoke testing.

Each run of the smoke testing suite took about 30 minutes to complete manually. For each release, which we aimed to do weekly, it had to be performed four times, once after deploying to each environment. That's 2 hours a week. And if we found any bugs while testing the release, we would need to repeat the testing for one or more environments. We often found bugs, so that was at least another 2 hours each week. Add to that production issues, which were happening weekly: another 2 hours minimum of smoke testing, sometimes a lot more.

When I averaged it, my team and I were spending around 12 hours a week performing smoke tests. I knew that with the right solution, we could cut that down to a fraction of the time, make it quick for any engineer to run, and reduce the variation that comes with manual testing. That time savings I could put a number on.
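For anyone who wants to run the same numbers for their own team, here is a rough sketch of that arithmetic in Python. The 30-minute suite, the four environments, and the 2-hour minimums come from the figures above; the function names and the hourly rate are placeholders I made up for illustration, so plug in your own.

```python
# Back-of-the-envelope model of weekly manual smoke testing effort.
# Suite length and environment count are the figures from the story;
# the hourly rate used below is a hypothetical placeholder.

SUITE_MINUTES = 30   # one manual run of the smoke suite
ENVIRONMENTS = 4     # environments in the pipeline to production


def baseline_hours_per_release(suite_minutes: int = SUITE_MINUTES,
                               environments: int = ENVIRONMENTS) -> float:
    """Hours spent running the suite once per environment."""
    return suite_minutes * environments / 60


def minimum_weekly_hours(retest_hours: float = 2.0,
                         incident_hours: float = 2.0) -> float:
    """Floor for a weekly release: baseline runs plus the minimum
    re-testing after bug fixes and after production issues."""
    return baseline_hours_per_release() + retest_hours + incident_hours


def weekly_cost(hours: float, hourly_rate: float) -> float:
    """Direct labor cost for a given number of testing hours."""
    return hours * hourly_rate


if __name__ == "__main__":
    observed_average_hours = 12.0      # the averaged figure from the story
    hypothetical_hourly_rate = 100.0   # assumption: replace with a loaded rate

    print("Baseline hours per release:", baseline_hours_per_release())
    print("Minimum hours per week:", minimum_weekly_hours())
    print("Weekly cost at the observed average:",
          weekly_cost(observed_average_hours, hypothetical_hourly_rate))
```

The floor works out to 6 hours a week, while the observed average was about double that, which is what "sometimes a lot more" looks like once it is measured. This only counts direct testing time; it does not count the interruptions or the release engineers waiting on an answer.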

So, I began implementing a framework and test cases between meetings, work planning, and responding to production incidents. After a couple of weeks, it started to take shape. I had the foundation and a few test cases completed. I showed my team, and they loved it and wanted to help. So, I got my team to join in writing code in between their other tasks too. They wanted this change just as much as I did.

When it was done, I documented it and demoed it. I showed the organization the power of automating our testing, and got the buy-in to deploy it as part of our release process, instead of manual smoke testing. I trained the release captains on how to run it, and they gave me feedback on optimizing it. After improving it based on feedback and running it for a few releases, I got the release captains to adopt it and take ownership over it. They started investigating the failed test cases themselves. They even used those investigations to educate their teams, and the broader organization. I could feel the shift starting to happen.

After several release cycles, we worked out the kinks and the team was confident in the results. We now had fully automated smoke testing! This small project cut the smoke test suite from 30 minutes run manually to under 3 minutes fully automated. That's a 90% reduction and a savings of a few thousand dollars a week in terms of direct testing efforts. It was an awesome accomplishment, but that was only part of the benefit. Now, my team could stop performing smoke testing manually, and spend that time doing other important work. Release engineers did not have to view my team as the bottleneck in their release process anymore. The organization started to feel what I had been saying to them about automated testing.
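The reduction itself is simple to check. Below is a small Python sketch of the calculation, using the 30-minute and 3-minute run times from above. The function names are mine, and the weekly figure rests on a simplifying assumption: that the roughly 12 hours a week of manual smoke testing shrinks in the same proportion as a single run, ignoring any time spent investigating failures.

```python
# Sketch of the time-savings arithmetic. Run times come from the story;
# scaling the weekly total proportionally is a simplifying assumption.


def percent_reduction(before_minutes: float, after_minutes: float) -> float:
    """Percentage drop in run time from before to after."""
    if before_minutes <= 0:
        raise ValueError("before_minutes must be positive")
    return (before_minutes - after_minutes) * 100 / before_minutes


def weekly_hours_saved(manual_hours_per_week: float,
                       before_minutes: float,
                       after_minutes: float) -> float:
    """Hours freed up per week if every run shrinks by the same ratio."""
    if before_minutes <= 0:
        raise ValueError("before_minutes must be positive")
    return manual_hours_per_week * (before_minutes - after_minutes) / before_minutes


if __name__ == "__main__":
    print("Reduction per run (%):", percent_reduction(30, 3))
    print("Hours saved per week:", weekly_hours_saved(12, 30, 3))
```

Since the automated run finished in under 3 minutes, 90% is the conservative end of the estimate. Multiply the hours saved by whatever hourly cost your organization uses to turn it into a dollar figure.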

For me, it was a big win, and a big relief. I took a risk, worked extra hard for a couple months, and it paid off. This project was the catalyst for a series of changes I made and pushed for that led to a 95% reduction of production incidents. It was the beginning of the culture shift in software quality.

The green light isn't coming, and your team can't wait.

I didn't get permission to build this automated smoke testing project. I didn't get extra time, extra headcount, or a line in the budget. I got told no, and then I started anyway. I could feel the mounting pressure in my team, and I needed to release some of it.

That's the part nobody talks about. The risk wasn't technical. It was the decision to stop waiting for the organization to care as much as I did. To stop making the case and start making the solution. It was the decision to bet on my own read of the situation when nobody else was ready or willing.

It could have failed. In fact, the test suite did fail a few times before we got it right. But if I had waited for the green light, I would have lost my team. Releases would have continued to be onerous, and the organization would've been stuck with the same software quality culture.

If you're an engineering leader reading this, you probably already know what needs to change. You've probably already made the case and been told it's not the right time. Every week you wait, you're losing engineers, trust, and time you can't get back. Waiting for the green light is more expensive than you think. Start building.