In short: After a year embedded in a product team using AI tools in production, the “agents ship features while you sleep” pitch is mostly bullshit.
What worked: making a legacy codebase accessible and easier to refactor, cutting bug triaging to minutes, letting non-engineers ship internal tools. What didn’t: untested PRs, uncontrolled API spend, and an agentic way to throw code over the fence.
AI amplifies whatever direction already exists.
The engagement
A year ago, I was brought in as a product consultant for a company with 8 developers maintaining 3 applications. The brief: translate the CEO’s AI vision into a measurable product roadmap and help organize a development team where work was getting thrown over the fence and nobody was owning outcomes.
The CEO wanted every tool the team touched — ticketing, code reviews, analytics, data pipelines, performance monitoring — to be AI-enabled. All AI requests went through an AI gateway for monitoring and rate limiting. The company was also starting to use AI to generate core business value: categorizing thousands of publicly available financial and contractual documents so enterprise customers could rapidly access information that was previously scattered across PDFs. Work that used to require armies of human reviewers.
I was skeptical. I’ve been building software for 20 years. I’d seen the hype cycles.
Where the tools delivered
Within weeks, the AI tooling started proving itself in four areas that matter to anyone running a product team.
Codebase knowledge became self-serve. The company’s 15-year-old software had layers of technical debt, undocumented decisions, and code written by developers who left years ago. By connecting an AI agent to the codebase and a production database replica via MCP, anyone could ask domain-specific questions using the product’s own language and get answers in seconds instead of waiting for the one person who might remember. Without the source code, the agent guesses. With the code, it understands your domain.
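To make that concrete, here is roughly what the wiring looks like: a small MCP server exposing one read-only query tool over the replica. This is a minimal sketch using the official Python MCP SDK; the server name, connection string, and guardrails are illustrative, not the team’s actual setup.

```python
# A read-only SQL tool exposed over MCP, sketched with the official
# Python MCP SDK (FastMCP). Server name, DSN, and limits are illustrative.
import psycopg
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("replica-db")  # illustrative server name

REPLICA_DSN = "postgresql://readonly@replica.internal/app"  # hypothetical replica

@mcp.tool()
def run_query(sql: str) -> str:
    """Run a read-only SELECT against the production replica."""
    # Guardrail: refuse anything that isn't a plain SELECT.
    if not sql.lstrip().lower().startswith("select"):
        return "Only SELECT statements are allowed."
    with psycopg.connect(REPLICA_DSN) as conn:
        rows = conn.execute(sql).fetchmany(100)  # cap what the agent can pull
        return "\n".join(str(row) for row in rows)

if __name__ == "__main__":
    mcp.run()  # stdio transport; register with `claude mcp add replica-db ...`
```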
Bug triaging dropped from hours to minutes. Connecting the agent to the codebase and a database MCP server meant we could ask natural-language questions about live production data and get to root causes in minutes. In one case, the agent identified an unordered database query as the source of a display bug in under 7 minutes — including the time I spent challenging its first (wrong) conclusion. In another, connecting the agent to Sentry, Posthog, and a production replica let me diagnose a 32-second p95 response time, confirm the feature causing it had seen 2 clicks in 30 days, and ship a fix that dropped that p95 to 600ms.
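For readers wondering what an “unordered database query” bug looks like, here is a hypothetical reconstruction with invented table and column names; the actual code was different, but the class of bug is exactly this.

```python
import psycopg

# Hypothetical reconstruction. Without ORDER BY, the database is free to
# return rows in any order, so the UI list could silently reshuffle after
# a vacuum, an index change, or a new query plan.
with psycopg.connect("postgresql://readonly@replica.internal/app") as conn:
    doc_id = 42  # illustrative

    # Before: display order left to chance.
    buggy = conn.execute(
        "SELECT id, label FROM line_items WHERE doc_id = %s", (doc_id,)
    ).fetchall()

    # After: the intended order stated explicitly.
    fixed = conn.execute(
        "SELECT id, label FROM line_items WHERE doc_id = %s ORDER BY position, id",
        (doc_id,),
    ).fetchall()
```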
Non-engineers started shipping internal tools. I coached the business analysts to use Claude Code. They built internal dashboards, candidate list generators, and data quality monitoring tools. These weren’t toy projects — they were real applications that had been sitting in the backlog for months waiting for dev time. A BA with zero coding experience built a dashboard in an afternoon that would have taken the dev team a week.
Agentic coding made refactoring viable. The classification pipeline that was core to the business had hardcoded prompts, no evals, and no feedback loop. AI-driven TDD — writing the test, letting the agent implement, reviewing and iterating — made it possible to rebuild the entire pipeline as an ensemble of heuristics, ML, and a multi-step LLM chain. The kind of refactoring that would have been deprioritized indefinitely became a few weeks of focused work.
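The mechanics of AI-driven TDD are worth spelling out: the human writes a failing test that encodes the domain knowledge, the agent implements until it passes, the human reviews and iterates. A sketch of what one such test might look like; `classify_document`, the `pipeline` module, and the category labels are invented for illustration.

```python
# Hypothetical test written *before* asking the agent to implement.
# The human encodes the domain knowledge; the agent codes until it passes.
import pytest
from pipeline import classify_document  # hypothetical module under test

@pytest.mark.parametrize("text, expected", [
    ("Master Services Agreement between ...", "contract"),
    ("Quarterly report for the fiscal year ...", "financial_report"),
    ("", "other"),  # the edge case an agent tends to miss unprompted
])
def test_classify_document(text, expected):
    assert classify_document(text) == expected
```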
Where the same tools backfired
Not all of this went well.
When triaging takes minutes instead of hours, the temptation is to fix everything. That backfires. I introduced a bug prioritization framework specifically because the team’s new speed made it easy to push fixes for P3 edge cases while P1 issues waited for someone to define “done”.
The same accessibility that enabled the BAs also created new failure modes. A product manager who hadn’t set up the app locally spent hours having Claude Code build a feature, then pushed a PR he’d never run. CI broke. Neither he nor the agent could figure out why, so we had to pull a developer off their own work to untangle a chain of nonsensical mini-commits he’d pushed trying to fix issues he’d only discovered in the pipeline.
It happened more than once — PRs thrown over the fence with no local testing, no review of what the agent actually produced.
The exact pattern I’d been brought in to fix, just wearing a new hat.
In another case, a BA was iterating on an internal app using the Claude API. No evals, no way to know if each iteration was better or worse than the last. Just gut feeling. The app never shipped. But the admin in charge of billing got an alert: the BA had spent $900 worth of tokens in one day. That one prompted the company to introduce the cost controls and rate limiting I’d been advocating for.
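For what it’s worth, the control itself is simple; the hard part was deciding to have one. Here is a sketch of the kind of per-user daily budget check a gateway can apply before forwarding a request. The prices and cap are illustrative, and off-the-shelf AI gateways offer this out of the box.

```python
# Sketch of a per-user daily budget guard at the gateway. Prices and the
# cap are illustrative; real gateways make these configurable per key.
from collections import defaultdict
from datetime import date

PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}  # assumed USD per million tokens
DAILY_BUDGET_USD = 50.00                            # assumed per-user cap

_spend: dict[tuple[str, date], float] = defaultdict(float)

def record_usage(user: str, input_tokens: int, output_tokens: int) -> None:
    """Accumulate the cost of a completed request against today's total."""
    cost = (input_tokens * PRICE_PER_MTOK["input"]
            + output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000
    _spend[(user, date.today())] += cost

def check_budget(user: str) -> None:
    """Call before forwarding a request; refuse once the daily cap is hit."""
    if _spend[(user, date.today())] >= DAILY_BUDGET_USD:
        raise RuntimeError(f"{user} hit the ${DAILY_BUDGET_USD:.0f}/day budget")
```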
The business impact
Categorizing 20,000 contracts would have taken 2 months of manual work. With AI + human review, it took 3 weeks. This wasn’t agentic work; it was LLM calls integrated into the app flow.
That ratio is why the team restructured around AI-assisted workflows.
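To be clear about what “integrated into the app flow” means: one LLM call per document, inside the application, with anything uncertain routed to a human reviewer. A rough sketch using the Anthropic SDK; the prompt, categories, model id, and routing rule are illustrative, not the production pipeline.

```python
# Rough sketch of the non-agentic pattern: one LLM call inside the app,
# with off-list or uncertain answers routed to a human review queue.
# The prompt, categories, model id, and routing rule are all illustrative.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
CATEGORIES = ["contract", "financial_report", "other"]  # invented labels

def categorize(text: str) -> tuple[str, bool]:
    """Return (category, needs_human_review)."""
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model id
        max_tokens=20,
        messages=[{
            "role": "user",
            "content": f"Categorize this document as one of {CATEGORIES}. "
                       f"Reply with the category only, or UNSURE.\n\n{text[:4000]}",
        }],
    )
    answer = msg.content[0].text.strip()
    return answer, answer not in CATEGORIES  # anything off-list goes to a human
```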
That said, there were rough edges. For some of this work, measurements and evals were absent, which made the speed gains hard to quantify as ROI. Moving fast felt productive but “it felt faster” isn’t a number you can put in front of a stakeholder. Developers were running evals against LLM APIs with no cost estimate upfront, burning through budget to measure performance without knowing what the measurement itself would cost. Those gaps are why I put together a set of standards for shipping LLM features.
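The first of those standards is almost embarrassingly simple: estimate what an eval run will cost before launching it. A back-of-envelope helper; the token counts and per-million-token prices are illustrative.

```python
# Back-of-envelope answer to "what will measuring this cost?", asked
# before the eval runs. All prices and token counts are illustrative.
def estimate_eval_cost(
    n_examples: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    usd_per_mtok_in: float = 3.00,    # assumed input price per million tokens
    usd_per_mtok_out: float = 15.00,  # assumed output price per million tokens
) -> float:
    per_example = (avg_input_tokens * usd_per_mtok_in
                   + avg_output_tokens * usd_per_mtok_out) / 1_000_000
    return n_examples * per_example

# e.g. 20,000 documents at ~2,000 tokens in and ~200 out:
print(f"${estimate_eval_cost(20_000, 2_000, 200):,.2f}")  # -> $180.00
```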
The autonomy problem
Here’s where I diverge from the hype.
There’s a lot of marketing around giving an AI agent a Linear ticket, walking away, grabbing coffee, and coming back to a perfect PR ready to merge.
I call bullshit.
In practice, what happens is:
- The agent makes assumptions about requirements that were never specified
- It misses edge cases that are obvious to someone with domain knowledge
- It produces code that works but doesn’t fit the architecture
- It can’t answer the fundamental question: “Is this what the user actually needs?”
The agentic flow where you completely disconnect from the work? That’s a fad. That’s FOMO-driven marketing.
What actually works: Using AI as a pair programmer who types really fast but needs course correction and some guidance.
A key part of the job now is to understand the product, question the agent, notice when something is missing from the plan, investigate, talk with business stakeholders, understand their needs, update the plan, and iterate.
Yeah, that sounds like business as usual. It should. High-performing teams have always worked this way — tight collaboration between product and engineering. The difference is that without it, AI tools actively make things worse. The PM pushing untested PRs and the BA burning $900 on an app that never shipped weren’t tool failures. They were process failures that AI made cheaper to commit.
What changed in the team
Here’s the uncomfortable part.
The CEO had already started pushing the team toward AI tools before I arrived. Over the engagement, the team went from 8 developers to 3. One per application. Those were his calls, not mine. But working alongside the team daily, I saw what separated the developers who thrived from those who didn’t.
Some developers multiplied their output. Others waited to be “fed a story with all the details” and struggled when the expectation shifted to “understand the product and take initiative with AI assistance”.
The AI didn’t replace five developers. It removed the buffer that let some developers avoid engaging with the product. When the workflow shifted from “wait for a detailed spec, implement it, hand it back” to “understand the problem, use AI to move fast, own the outcome,” some people didn’t have the skills that workflow demands. That’s not an AI story; it’s a team composition story that AI accelerated.
The three who remained aren’t necessarily the most senior or the fastest coders from the original eight. They’re the ones who:
- Communicate well with leadership about product decisions
- Ask good questions of both AI tools and stakeholders
- Understand when to trust the AI and when to dig deeper
- Care about the craft, not just completing tickets
With 3 AI-enabled developers, the team is shipping faster than it did with 8.
The bottleneck was never typing speed. It was always understanding what to build and why.
How my role shifted
After course-correcting a couple of projects and introducing better practices, the engagement changed shape. By the end, I was doing less advising and more building: writing code with AI assistance, setting up MCP integrations, pairing with the remaining developers directly.
That shift says something about where AI tools push you. The line between “product consultant” and “hands-on contributor” gets blurry when the tools let you move between thinking about what to build and actually building it in the same afternoon.
What I took away
What this engagement solidified for me is that a clear business vision and strategy need to be in place for development teams to thrive with these AI tools.
The AI doesn’t replace the need for direction — it amplifies whatever direction already exists. When the CEO had a clear vision and the team understood the problem, AI multiplied output. When goals were vague or ownership was unclear, AI just produced more code that missed the point faster.
The classification pipeline was the clearest example of both sides. When we had evals, defined metrics, and a domain expert in the loop, we took recall from 68% to 95% without increasing LLM spend. When the same team tried to ship LLM features without any of that structure — no test datasets, no performance thresholds, no feedback loops — they burned weeks on prompt tweaks that introduced downstream breakages nobody could measure. Same tools, same models, same team. The difference was whether anyone had defined what “working” meant before writing the first prompt.
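Defining what “working” meant wasn’t sophisticated, either. In spirit it was a frozen labeled dataset, a recall function, and a threshold that had to hold before anything shipped. A minimal sketch, with invented names and paths:

```python
# Minimal eval harness: a frozen labeled dataset (one JSON object per
# line) and a recall threshold the pipeline must clear before shipping.
# `classify_document` and the dataset path are invented for illustration.
import json
from pipeline import classify_document  # hypothetical pipeline entry point

def recall(dataset_path: str, category: str) -> float:
    """Fraction of truly-`category` examples the pipeline catches."""
    with open(dataset_path) as f:
        examples = [json.loads(line) for line in f]
    relevant = [ex for ex in examples if ex["label"] == category]
    hits = sum(1 for ex in relevant if classify_document(ex["text"]) == category)
    return hits / len(relevant)

assert recall("evals/contracts.jsonl", "contract") >= 0.95, "recall regression"
```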
The developers who made it weren’t the fastest coders. They were the ones who understood the product, asked good questions, and knew when the AI was wrong.