I've worked with AI tools enough by now to understand that the same input doesn't produce the same output. Of course, we can all imagine why that would be a problem, but that wasn't the purpose of this experiment. The purpose was to explore the differences in code generated in response to the same prompt.
Last week, I was explaining to an old colleague an AI experiment I did while working at ChargeLab. Essentially, I asked ChatGPT to extract all of the API interfaces from a code base, in hopes of producing a consistent and accurate list as the code base evolved. This would have reduced manual effort substantially and allowed me to write software to calculate interface test coverage metrics. However, a repeated-measures experiment showed accuracy of at most 86% compared to a manually extracted list of interfaces, and as low as 25% across 20 runs. It opened my eyes to the non-deterministic nature of AI tooling.
Unfortunately, I was not able to keep the data from that experiment, so this week I decided to run a new one. I created a prompt to generate a RESTful GET request handler that fetches a list of users. It's something I have written a bunch of times before, you know, because every system has users and they need to be fetched. I wrote the prompt naturally, the way you might expect any developer to write it during their daily work. Not too short, but not so detailed that you might as well write the code yourself. Next, I prompted Claude Sonnet 4.6 five times consecutively in different chat windows with the 'memory' function turned off. It generated five code samples, and I analyzed them to see what was the same and what was different. Let's take a look at the results.
Experiment Prompt: Create a REST GET handler function that fetches a list of users in Java using the Spring framework. There are two query parameters to consider. First, if the 'name' parameter is provided, the returned list should include users where the 'name' string is either part of the first or last name, or an exact match. Second, if the 'companyId' parameter is provided, the list should only contain users who belong to the company with the UUID specified in the request. The handler should contain logic to return specific HTTP status codes for any failure modes.
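Before getting into the results, here is roughly the shape of handler the prompt asks for. This is a minimal sketch of my own; the route, class, and method names are guesses for illustration, not part of the prompt or taken from any generated sample, and the User and UserService types are assumed to exist elsewhere.
@RestController
@RequestMapping("/users")
public class UserController {

    private final UserService userService;

    public UserController(UserService userService) {
        this.userService = userService;
    }

    // Both query parameters are optional; 'companyId' is a UUID per the prompt.
    @GetMapping
    public ResponseEntity<List<User>> getUsers(
            @RequestParam(required = false) String name,
            @RequestParam(required = false) UUID companyId) {
        return ResponseEntity.ok(userService.getUsers(name, companyId));
    }
}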
One of the first things I checked across the different samples from Claude was what the REST request handler returned for a successful fetch from the database. The first code sample returned 200 OK and a list of users, even if the list was empty. The other four samples returned 204 No Content if the user list was empty and 200 OK if there were one or more users in the list. A meaningful inconsistency, but before I dive deeper, it would only be right to mention the positive aspects.
Sample 1
List<User> users = userService.getUsers(name, companyId);
return ResponseEntity.ok(users);
Samples 2–5
if (users.isEmpty()) {
    return ResponseEntity.noContent().build(); // 204
}
return ResponseEntity.ok(users); // 200
Claude did write all five samples in Java using the Spring framework (controller, service, and repository). All of them returned 500 Internal Server Error for any general exceptions that were thrown. In all cases, it validated that the 'companyId' parameter existed in the database and returned 404 Not Found if it did not. Lastly, in all samples, both the 'name' and 'companyId' parameters were treated as optional, and case-insensitive pattern matching was used to filter on the 'name' parameter. Not bad, but not great either.
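Condensed into one snippet, the failure handling shared by all five samples looked roughly like the sketch below. The repository name and error messages are my own stand-ins, not quotes from any sample.
try {
    // Every sample checked that the supplied companyId exists and returned 404 otherwise.
    if (companyId != null && !companyRepository.existsById(companyId)) {
        return ResponseEntity.status(HttpStatus.NOT_FOUND)
            .body("Company not found: " + companyId);
    }
    List<User> users = userService.getUsers(name, companyId);
    return ResponseEntity.ok(users); // success status for an empty list varied, as shown above
} catch (Exception e) {
    // Every sample mapped unexpected exceptions to 500 Internal Server Error.
    return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
        .body("Unexpected error while fetching users");
}
Now, let's get back to the differences.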
The big difference is the one I mentioned at the beginning. The happy path contract was not consistent across these samples. Sample 1 returned 200 OK for an empty list, while the others returned 204 No Content for an empty list. Review from a developer would likely catch this, but if we let this deploy without review, the consumer of this contract would break depending on which version it was expecting. There was another inconsistency in the responses across the samples, and it had to do with error handling. Sample 1 and Sample 3 returned strings containing the error message, while the other samples returned JSON objects to represent the error message. Again, the consumer of these responses would behave unexpectedly depending on which version it expects to receive.
Sample 1
return ResponseEntity
    .status(HttpStatus.BAD_REQUEST)
    .body("Query parameter 'name' must not be blank if provided.");
Sample 4
return ResponseEntity
    .status(HttpStatus.BAD_REQUEST)
    .body(Map.of("error", "'name' parameter must not be blank if provided."));
Let's look at a few more differences. First, Sample 2 and Sample 5 did not properly validate the 'name' parameter. Sample 5 did do a null check as part of the filtering logic, but no error was returned if it was null. In addition, Sample 3 added a validation to ensure the 'name' parameter was less than 100 characters. Contrast this with the 'companyId' parameter, which received the same null check and existence check in every sample. This demonstrates variability in what the model determined needed to be validated.
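For illustration, the kind of 'name' validation that varied across samples looks roughly like this sketch. The message wording is mine, and only the comments note which samples included each check.
if (name != null) {
    if (name.isBlank()) {
        // Samples 2 and 5 had no equivalent of this blank check.
        return ResponseEntity.status(HttpStatus.BAD_REQUEST)
            .body("'name' parameter must not be blank if provided.");
    }
    if (name.length() >= 100) {
        // Only Sample 3 enforced a length limit on 'name'.
        return ResponseEntity.status(HttpStatus.BAD_REQUEST)
            .body("'name' parameter must be less than 100 characters.");
    }
}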
In terms of filtering on the 'name' parameter, there were a couple of differences. Sample 4 was the only one to check for an exact match of the first name or the last name. Sample 5 was the only one that filtered for matches against the first and last name individually, as well as against the concatenated first and last name. Even the logic used to filter on the 'name' and 'companyId' parameters varied. Samples 1 and 4 used JPA Specifications, while the other three samples used custom JPQL queries to filter the user fetch.
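To show what that structural split looks like, here are rough sketches of both approaches. Neither is copied from the generated code, and the entity and field names (firstName, lastName, company.id) are assumptions.
Specification approach (roughly Samples 1 and 4)
public static Specification<User> nameContains(String name) {
    // Case-insensitive match against first or last name; no-op when 'name' is absent.
    return (root, query, cb) -> name == null
        ? cb.conjunction()
        : cb.or(
            cb.like(cb.lower(root.get("firstName")), "%" + name.toLowerCase() + "%"),
            cb.like(cb.lower(root.get("lastName")), "%" + name.toLowerCase() + "%"));
}
JPQL approach (roughly Samples 2, 3, and 5)
@Query("""
    SELECT u FROM User u
    WHERE (:name IS NULL
           OR LOWER(u.firstName) LIKE LOWER(CONCAT('%', :name, '%'))
           OR LOWER(u.lastName) LIKE LOWER(CONCAT('%', :name, '%')))
      AND (:companyId IS NULL OR u.company.id = :companyId)
    """)
List<User> findUsers(@Param("name") String name, @Param("companyId") UUID companyId);
Both get the job done, but they produce different repository-layer code and subtly different matching behavior, which is exactly the kind of drift that is hard to spot without review.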
Finally, a very interesting difference: Sample 2 was the only sample that considered authorization at all. It caught an 'AccessDeniedException' and returned 403 Forbidden if the exception was thrown. You might assume the AI system would treat authorization as a consideration for any system, but it showed up in only one of the five samples.
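Sample 2's authorization handling looked roughly like this; the surrounding structure is my reconstruction rather than a copy, and I am assuming the exception is Spring Security's AccessDeniedException.
try {
    List<User> users = userService.getUsers(name, companyId);
    if (users.isEmpty()) {
        return ResponseEntity.noContent().build(); // 204
    }
    return ResponseEntity.ok(users); // 200
} catch (AccessDeniedException e) {
    // Only Sample 2 mapped an authorization failure to 403 Forbidden.
    return ResponseEntity.status(HttpStatus.FORBIDDEN).build();
}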
I performed this experiment to show you the potential risks of having AI agents generate code and deploy it directly to production systems. The results showed that even for this simple example, the generated code can vary in meaningful and disruptive ways. Response contracts, parameter validation, and even filtering logic were not consistent. Deploying these samples could leave users with an unreliable system, which is the opposite of what they need.
Modern AI tools give you great power to generate code at rates never experienced before. With that power comes a great responsibility: maintain proper specifications and rigorous code review so you can continue delivering reliable systems.
The code samples for this experiment can be found on GitHub here: https://github.com/tdesplen/RestHandlerComparison