Why most AI pilots in advice firms never make it to production

Three months ago, someone booked a discovery call with me and opened with this: “We ran a pilot last year. Everyone said it was impressive. Nobody uses it.” I’ve now heard a version of that sentence more times than I can count. The demo worked. The technology was fine. The project is dead anyway.

This is not a technology problem. It is a sequencing problem, and it is almost entirely predictable once you know what to look for.

The pattern that kills pilots

Here is what typically happens. A financial advice firm decides it wants to use AI. Someone senior champions it. A working group forms. Before any automation is built, the firm begins investing in the things that feel responsible: a governance framework, a data infrastructure review, an approval workflow, a vendor assessment process. Months pass. When the actual automation work finally starts, it has to satisfy a committee that was built to evaluate something much larger than what’s being proposed. The pilot gets built in a sandbox. Real users aren’t involved until late. The workflow it’s meant to improve carries on as normal while the pilot runs in parallel.

Then the pilot ends, the committee reviews it, the demo looks fine, and nobody makes the leap to change the actual workflow. The project stalls. Someone quietly archives the Confluence page.

The root cause is straightforward: the pilot was built next to reality rather than inside it. A proof of value is not the same thing as a proof of survival. Something can look genuinely promising in principle and still collapse when it meets the actual operational system with real users, real edge cases, and real time pressure.

The governance framework will not save a pilot that was never embedded in a real workflow. You need proof of survival, not proof of promise.

Early in my career at Computacenter, a mentor taught me that a project’s true test isn’t the launch, it’s whether people still use it six months later. He called it “stickiness.” No amount of executive sponsorship can force people to keep using something that doesn’t actually help them. In advice firms, where advisers are already drowning in tools, that bar is even higher. If your AI doesn’t make their day genuinely easier, they’ll abandon it the moment the project sponsor stops checking in.

What the FCA and Bank of England are watching

The regulatory environment is not making this easier. The FCA and Bank of England joint statement from earlier this month was a clear signal that AI governance for UK-regulated firms is moving from voluntary best practice toward something firmer. The framing, notably, was that frontier AI models may be creating cyber vulnerabilities faster than they solve problems. That is not a reason to stop, but it is a reason to be precise about what you are building and why.

The firms most at risk are those who built governance frameworks without production use cases to govern. You end up with infrastructure that exists to manage risk from automation that doesn’t exist yet, and when the automation does arrive, the approval overhead is so heavy that nobody can move quickly enough to learn anything useful.

The three-part test for a first production plan

The single most underrated decision in any automation programme is choosing the first thing to automate. Get this wrong and you can spend months learning nothing, which is one of the most common failures.

A first production plan needs to pass three tests.

First: narrow enough to finish in six to eight weeks. If the scope requires longer than that, you have already built in the conditions for stalling. The timeline is not arbitrary. It is short enough that the firm can’t reorganise around the pilot. Real decisions have to get made with incomplete information, which is exactly what production looks like.

Second: important enough that the firm would notice if it disappeared. This is the test most firms skip, and it is the reason pilots die. If the automation handles something the firm cares about, someone owns the outcome. If it handles something peripheral, nobody does. Suitability letter drafting, post-meeting note generation, client data reconciliation before reviews: these are things the firm notices. A dashboard that nobody checks is not.

Third: embedded in the real workflow, not built beside it. The pilot cannot run in a sandbox while the normal process continues unchanged. Real users have to use it. Real data has to flow through it. Real operational pressure has to test it. This is the difference between a demo and a production system, and you cannot shortcut it.

Where regulated firms should start

For financial advice firms specifically, the L1/L2/L3 framework is a useful place to begin. Level 1 (education) covers things you can fix with a prompt or a template inside a tool you already pay for. Level 2 (integration) covers connecting two or three tools that already exist in your stack using something like Make or Power Automate. Level 3 (custom build) is real engineering and should wait until you have proved something works at Level 1 or 2 first.

Most firms that come to me thinking they need Level 3 actually need Level 1 or 2. The discipline is always to ask whether a simpler version of the fix already exists before commissioning a build.

For an advice firm, a sensible Level 2 first slice might look like this: a workflow that takes a completed fact-find, pulls the relevant client data from the back-office system, and drafts the suitability letter structure for a paraplanner to review and complete. Not a fully automated letter. A structured draft that removes forty minutes of blank-page work per client. Completable in six weeks. Noticeable if it disappeared. Embedded in the actual paraplanning workflow from day one.
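To make the shape of that slice concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a real integration: the names FactFind, DraftLetter, fetch_client_record and draft_suitability_letter are all hypothetical, the back-office lookup is a stand-in, and a real fact-find carries far more detail. The point is the design: the draft fills in what is already known and leaves the recommendation for a person.

```python
from dataclasses import dataclass, field


@dataclass
class FactFind:
    # Hypothetical fields, chosen only for illustration.
    client_name: str
    objectives: list[str]
    attitude_to_risk: str


@dataclass
class DraftLetter:
    sections: dict[str, str] = field(default_factory=dict)
    # Assumption: every draft goes to a paraplanner before it goes anywhere else.
    needs_review: bool = True


def fetch_client_record(client_name: str) -> dict[str, str]:
    # Stand-in for the back-office lookup. In practice this step would be
    # handled by whatever integration the firm already has in its stack.
    return {"existing_plans": "TO BE CONFIRMED"}


def draft_suitability_letter(fact_find: FactFind) -> DraftLetter:
    record = fetch_client_record(fact_find.client_name)
    draft = DraftLetter()
    draft.sections["Your objectives"] = "\n".join(
        f"- {objective}" for objective in fact_find.objectives
    )
    draft.sections["Your attitude to risk"] = fact_find.attitude_to_risk
    draft.sections["Your existing arrangements"] = record["existing_plans"]
    # The recommendation is deliberately left for the paraplanner.
    draft.sections["Our recommendation"] = "[PARAPLANNER TO COMPLETE]"
    return draft
```

Nothing here writes advice. It removes the blank page and hands a structured draft to the person who owns the outcome.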

The governance conversation becomes much easier once something like that is running. You are no longer managing hypothetical risk. You are managing a real system with real outputs, which gives you something concrete to audit, refine, and expand.

The contrarian point worth sitting with

There is a piece of evidence that should give any firm pause. Google Search, arguably the most battle-tested AI deployment on earth, still fails in ways that are hard to predict and difficult to catch. If that system has reliability gaps, the honest conclusion is that internal enterprise document retrieval and AI-assisted advice workflows will too. The answer is not to stop, but to instrument early. Build in review steps. Track where the automation produces outputs that need correcting. That data is the foundation of a governance framework that actually fits your firm, rather than one that was written before you knew what you were governing.
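Tracking corrections does not need much machinery. As a rough sketch, assuming each reviewed draft section is logged with a flag for whether the reviewer had to change it (ReviewRecord and correction_rates are hypothetical names, not part of any tool), the correction rate per section can be computed like this:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ReviewRecord:
    section: str     # which part of the output was reviewed
    corrected: bool  # True if the reviewer had to change it


def correction_rates(records: list[ReviewRecord]) -> dict[str, float]:
    # Share of reviewed outputs that needed correcting, per section.
    reviewed = Counter()
    corrected = Counter()
    for record in records:
        reviewed[record.section] += 1
        if record.corrected:
            corrected[record.section] += 1
    return {section: corrected[section] / reviewed[section] for section in reviewed}
```

A section with a persistently high rate is where the review step, the prompt, or the scope needs attention first.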

The pilot-to-production gap is not inevitable. It is a sequencing failure, and sequencing failures are fixable. If your firm has a stalled pilot, or is about to start one, a discovery call is a reasonable place to start.
