Have you ever been under pressure to deliver something quickly to support the sales team? That was me last week. Actually, let me paint a more accurate picture of the situation:

Me as the sales leader: We need a lead magnet that is valuable, unique, and powered by AI.

Me as the business owner: No problem. Engineering will get that done for next week.

Me as the engineer: Um… okay?

For those of you who don't know, I own and operate Reliably, a consulting firm dedicated to helping businesses build and deploy reliable software systems. I was looking to build a new system that functioned as a lead magnet and delivered value to our prospects before they even got on a call with me.

As an entrepreneur, I have a million more things to do than I have time for, so I needed a quick win here. The requirements list was small, as you want for a prototype: it needed to deliver a self-assessment, it needed to be embedded within reliably.ca, and it needed to be powered by AI. I had already laid out the assessment manually, so it should have been straightforward, right? Wrong.

I deliberately took the approach that many non-engineers and engineers alike have been taking: using AI to build the prototype. Let me share how this experience reinforced what I keep telling the community: the reliability of AI systems is inversely proportional to the scope of the task you ask them to complete.

I'll start with the general task. The system needed to provide the user with the ability to perform a cost assessment. I wanted them to be able to get a ballpark of the cost of unreliable software systems. As an engineering leader, I have a good idea of what the numbers look like. But from talking to other engineering leaders, I don't think many of them have tried to put a dollar value on it. Having those numbers is critical if you want to make decisions about quality engineering initiatives and gain the support of leadership to pursue them. The assessment should feel like a conversation, hence the use of AI tools. The user should be able to opt in to receive a report via email by providing their email address at the end.

Before I could get to the AI part, I needed to lay out the foundations of the system. Remember, this is all under the guise of "quick win." So I decided to use as much existing tooling as possible. I designed the report structure that would contain the self-assessment answers and cost calculation metrics, and set up a Supabase table to hold it for me. I built an email template so users could have a copy of the assessment results, and set up a Resend account to handle that. Finally, I built a simple frontend interface for both the chat portion and the cost calculation using the Next.js framework. The skeleton was ready. Now all that was left to do was write a prompt and connect the chat interface to the Anthropic API.
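To make the skeleton concrete, here is a sketch of the kind of record that might sit in that Supabase table. The actual schema isn't shown in this post, so every field name here is a guess at what an assessment record could hold:

```typescript
// Hypothetical shape for an assessment report row (the real
// Supabase schema is not public; all field names are assumptions).
interface AssessmentReport {
  id: string;                        // row id
  email: string | null;              // only set if the user opts in
  answers: Record<string, string>;   // raw self-assessment answers
  metrics: Record<string, number>;   // metrics extracted from answers
  estimatedAnnualCost: number;       // computed unreliability cost
  createdAt: string;                 // ISO timestamp
}

// Example row, purely illustrative.
const example: AssessmentReport = {
  id: "r_001",
  email: null,
  answers: { incidents: "about four a month" },
  metrics: { incidentsPerMonth: 4 },
  estimatedAnnualCost: 48000,
  createdAt: new Date(0).toISOString(),
};
```

Keeping the answers, extracted metrics, and computed cost in one row means the email report can be rendered from a single record.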

Ask a question. Extract metrics. Calculate costs. Respond.

I started out how any developer might start when designing an agent for a specific task. I wrote out a prompt that defined the task, the self-assessment questions, extraction and estimation parameters, examples for calculating metrics, default values to fall back on, rules to constrain responses to the user, output formatting, and other relevant details of the task. The prompt instructed the system to ask a question, extract the metrics, use the metrics to calculate unreliability costs, and then respond to the user. Then repeat until the questions were exhausted, and finish with a summary.

It didn't take long in testing for the system to reveal that it could not handle such a task. The AI would forget to ask the next question, add extra questions, and ask questions from which it could not extract the desired metrics. OK, so questions were out. That needed to be handled in code, not English.
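Moving question sequencing out of the prompt can look something like this: the question order lives in application state, and the model never decides what to ask next. The question text and metric names below are illustrative, not the actual assessment:

```typescript
// Sketch: deterministic question sequencing in code. The model only
// handles extraction and the reply for the current question; it
// never chooses what gets asked. Questions here are made up.
interface AssessmentQuestion {
  id: string;   // the metric this answer should yield
  text: string; // what the user is asked
}

const QUESTIONS: AssessmentQuestion[] = [
  { id: "incidentsPerMonth", text: "Roughly how many production incidents do you see per month?" },
  { id: "hoursPerIncident", text: "On average, how many engineer-hours does one incident consume?" },
  { id: "blendedHourlyRate", text: "What is a rough blended hourly rate for your engineers?" },
];

// Given how many answers we already have, return the next question,
// or null when the assessment is complete.
function nextQuestion(answeredCount: number): AssessmentQuestion | null {
  return answeredCount < QUESTIONS.length ? QUESTIONS[answeredCount] : null;
}
```

Because the sequence is plain code, it can never skip, invent, or reorder questions, which is exactly the failure mode the prompt kept producing.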

Extract metrics. Calculate costs. Respond.

The next issue was that the cost calculations had too much variance. At first, I thought it was due to the extraction step. I tried refining my cost models and defining more parameters to extract, in an attempt to reduce the rate of erroneous extractions. Then I logged the metric extraction and realized it was fairly consistent at pulling out the desired metrics; it was just not great at interpreting and performing the cost calculations. After sinking an hour of testing into this, I decided to move the calculation logic into the system code so that the calculations were consistent. That allowed me to refine the extraction process a little further, until it reached the desired behaviour.
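The shift is simple in code: once the model has extracted numeric metrics, the math runs deterministically. This three-term model is a stand-in, not the actual cost model behind the assessment:

```typescript
// Sketch: cost math in code, not in the prompt. Same inputs always
// produce the same number. The formula is illustrative only.
interface ExtractedMetrics {
  incidentsPerMonth: number;
  hoursPerIncident: number;
  blendedHourlyRate: number;
}

// Annual firefighting cost: incidents/month x hours x rate x 12.
function annualIncidentCost(m: ExtractedMetrics): number {
  return m.incidentsPerMonth * m.hoursPerIncident * m.blendedHourlyRate * 12;
}
```

With the arithmetic out of the prompt, any remaining variance had to come from extraction, which made the problem much easier to isolate and tune.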

Extract metrics. Respond (nicely).

The prompt now had a refined job: extract the metrics from each answer and respond to the user. I played with some default values and found a format that made extraction fairly predictable across a range of answers to the assessment questions. The responses, however, varied widely in tone, and it was not always the tone you want when you're trying to impress a prospect. With a few changes, it started to sound more like what I was looking for: empathy and tough love. The version running today uses a much narrower prompt than what I thought was reasonable at the beginning.
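One way to wire up the "default values to fall back on" is to have the narrow prompt return a small JSON object and let code validate it, substituting defaults for anything missing or unusable. The field names and default numbers here are assumptions for illustration:

```typescript
// Sketch: validate model-extracted metrics in code, falling back to
// defaults when the output is missing, malformed, or out of range.
// Field names and defaults are illustrative.
const DEFAULTS = { incidentsPerMonth: 2, hoursPerIncident: 8 };

type Metrics = typeof DEFAULTS;

function parseMetrics(modelOutput: string): Metrics {
  let parsed: Partial<Record<keyof Metrics, unknown>> = {};
  try {
    parsed = JSON.parse(modelOutput);
  } catch {
    // Unparseable output: fall through to defaults.
  }
  const result: Metrics = { ...DEFAULTS };
  for (const key of Object.keys(DEFAULTS) as (keyof Metrics)[]) {
    const value = parsed[key];
    // Only accept finite, non-negative numbers.
    if (typeof value === "number" && Number.isFinite(value) && value >= 0) {
      result[key] = value;
    }
  }
  return result;
}
```

The model is only ever asked to produce a tiny structured payload, and everything downstream of that payload is deterministic.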

Prompt scope is inversely related to reliability.

Of course, take this for what it is: a single anecdote. But my experience showed that the more functionality you cram into a single prompt, the lower the chance of getting what you expect. This is exactly the reasoning behind building pipelines for AI-integrated systems: each block does one very specific task, and the results are chained together. For my purposes, "extract metrics" and "respond" were all that could be handled in a single prompt while maintaining the desired estimation confidence.

I hope you take something from this experience about developing tools with AI systems. Take care in your prompting.

But just as importantly, I now have a tool to show you the cost of unreliability, and it is powered by AI. Understanding this cost can be a game changer for your organization.

Go take the assessment and find out what unreliability is costing you: reliably.ca