Essay · 7 min read
How to build a distinctive, useful AI agent — without burning a million pounds
If you run a business — a startup, a scale-up, an established company with a product team — you have probably been in a meeting in the last six months about building an AI agent. A customer-facing assistant. A specialist co-pilot. A branded voice in your app or your dashboard or your support flow.
You have probably also seen the quotes. £200,000 from a consultancy for a “discovery and pilot.” £600,000 for a custom-built agent over nine months. £1.2m for a full agent platform integration. The numbers are remarkable, and most of what they buy is not.
This piece is a practical guide to building an AI agent that is genuinely worth using — distinctive, useful, on-brand, and trusted — for a fraction of what the industry is currently charging. It is not a polemic. It is the order of operations we use, the decisions that matter, and the specific pitfalls to avoid.
Start with the job, not the agent
The first decision is the most important and the most commonly fudged.
What is the one job this agent does?
Not “be a customer service assistant.” Not “help users with anything.” A single, specific, measurable job. Help customers find the right size and fit before they buy. Resolve refund requests in under two minutes. Help a clinician summarise a patient’s history at the start of an appointment. Help a finance team query the data warehouse in plain English.
A specific job has three properties: it happens often, it has measurable success, and it has a current cost (a human doing it, a customer giving up, a transaction failing). If your candidate job lacks any of those, find a better one.
Companies that ship general-purpose agents on day one almost always end up with generic agents that do nothing well. Companies that pick one valuable job, ship it brilliantly, and extend later end up with agents users actively prefer to the alternatives. The discipline to start narrow is the most valuable discipline in this category.
If you cannot describe the job in one sentence, do not start the project. Spend another week deciding what the job is.
Write the conversations before you build anything
Once you have the job, write what the agent will say. Twenty representative exchanges. Real sentences. Real word choices. Real decisions about tone, structure, length and signal.
This is the phase most teams skip, and it is the phase that determines whether the agent will be any good.
Write the happy path: the user asks for the thing, the agent does it well. Write the awkward paths: the user asks for something adjacent, the agent has to clarify. Write the failure paths: the agent doesn’t know, can’t help, has to hand off to a human. Write the trust paths: the agent is about to do something irreversible and needs to confirm. Write the tone paths: the agent has to deliver bad news, or apologise, or push back.
Each exchange should be written, edited, and tested by reading it aloud. If it sounds like a generic LLM, rewrite it. If it sounds like your brand, you’re getting somewhere.
A useful test: take ten of the exchanges, strip the avatar and the colour scheme, and show them to someone who knows your brand. Can they tell which company built this agent? If yes, your conversation design is doing its job. If no, you have written generic AI in your colour palette and the rest of the project will not save you.
This phase takes one to two weeks for a focused job. It can be done by one or two senior conversation designers working closely with the team that owns the use case. It costs almost nothing in technical infrastructure. And it is the single highest-leverage piece of work in the entire project.
Design the fallbacks and escalations explicitly
The happy path is the smallest part of an agent. The fallbacks and escalations are the largest.
What does the agent say when it doesn’t know? What does it say when it’s unsure? What does it say when the user asks for something it can’t do? When does it hand off to a human, and how does it phrase the handoff? What does it surface to the human when it does? What does the user see while the handoff happens?
These patterns will run a thousand times more often than the happy path, and they are where trust is built or destroyed. Most agents fail in production not because the happy path is wrong but because the fallback is bad — generic, evasive, defensive, or quietly hallucinating its way through territory it doesn’t actually know.
A good agent has a small, well-designed set of fallback patterns that are recognisably part of the same product and the same brand as the happy path. They are not apologetic boilerplate. They sound like the same agent in a moment of honesty.
Design these alongside the happy path, not after. They are not edge cases. They are the bulk of real usage.
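One lightweight way to keep fallbacks explicit is to treat them as a small set of named, reviewable templates rather than ad hoc prompt text, so every "moment of honesty" has been written and edited like the happy path. A minimal sketch in Python — the pattern names and copy here are illustrative placeholders, not recommended wording:

```python
# Fallback patterns as a small, named, reviewable set.
# Names and copy are placeholders; the real copy comes from conversation design.
FALLBACKS = {
    "dont_know": (
        "I don't have a reliable answer to that, and I'd rather not guess. "
        "I can connect you with someone who does."
    ),
    "out_of_scope": (
        "That's outside what I can help with here. For {topic}, "
        "the best route is {route}."
    ),
    "handoff": (
        "I'm passing this to a person now, along with what we've covered, "
        "so you won't have to repeat yourself."
    ),
    "confirm_irreversible": (
        "Just to confirm before I do this: {action}. "
        "This can't be undone. Shall I go ahead?"
    ),
}

def fallback(name: str, **slots: str) -> str:
    """Return the named fallback pattern, filled with any slot values."""
    try:
        return FALLBACKS[name].format(**slots)
    except KeyError as exc:
        raise ValueError(f"unknown fallback or missing slot: {exc}") from exc
```

Keeping the set this small and this visible makes it easy to review every fallback against the brand, and makes a missing pattern an error rather than a silent improvisation.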
Pick a strong foundation model and build a thin layer
Once the conversation design is sound, the engineering is mostly unremarkable.
Use a major foundation model. Claude, GPT, Gemini — pick one based on the specific characteristics your job needs (Claude for nuance and tone, GPT for breadth, Gemini for cost at scale, broadly speaking, though this changes month to month). Build the smallest amount of bespoke infrastructure you can.
If your agent needs your data, use standard RAG patterns. There are now mature libraries and managed services that handle most of this. If it needs to take actions, use function calling and a small set of well-typed tools. If it needs orchestration across multiple steps, use one of the established frameworks rather than inventing your own.
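A "small set of well-typed tools" can be as plain as ordinary functions behind a dispatch layer that validates arguments before anything executes. A sketch, assuming a hypothetical refund-status tool; the wire format for advertising tool schemas to the model varies by provider, so only the dispatch side is shown:

```python
from typing import Any, Callable

# Hypothetical tool: a real implementation would call your own systems.
def get_refund_status(order_id: str) -> dict:
    """Look up a refund by order id (stubbed for illustration)."""
    return {"order_id": order_id, "status": "processing", "eta_days": 2}

# Registry of tools the model may call, with the expected argument types.
TOOLS: dict[str, tuple[Callable[..., Any], dict[str, type]]] = {
    "get_refund_status": (get_refund_status, {"order_id": str}),
}

def dispatch(name: str, args: dict[str, Any]) -> Any:
    """Validate and execute a model-requested tool call."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    fn, schema = TOOLS[name]
    for param, expected in schema.items():
        if not isinstance(args.get(param), expected):
            raise TypeError(f"{name}: {param} must be {expected.__name__}")
    return fn(**args)
```

The validation step matters: model-generated arguments are untrusted input, and rejecting a malformed call loudly is far safer than letting it reach a system that does real work.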
The temptation, especially inside larger companies with serious engineering budgets, is to build bespoke. Bespoke routing. Bespoke memory. Bespoke retrieval. In almost every case this is rebuilding components that already exist, less reliably, at higher cost. The work that makes your agent distinctive happens in the prompts, the conversation design, and the fallback patterns — not in the substrate.
A reasonable rule: if the technical work feels novel, you are probably about to overspend. If it feels boring and well-trodden, you are probably on the right track.
Build a real evaluation harness
Most agent evaluation is vibes-based. Someone on the team types into the agent, sees what comes back, says “feels good” or “that’s weird,” and moves on. This is fine for the first day. It is not fine for production.
Build a held-out test set of representative exchanges — a hundred to two hundred is enough to start. Score each one against criteria you agree on in advance: accuracy, tone match, trust signals, hand-off quality, refusal quality, brand fit. Run the eval every time you change the prompt, the model, the data, or the tools.
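The shape of such a harness is simple enough to sketch. Here the agent is stubbed and the scoring is keyword-based for brevity; in practice each criterion would have its own scorer (often another model grading against a rubric), but the loop and the regression check look the same:

```python
# A held-out test set: each case pairs an input with per-criterion checks.
# The cases and the stubbed agent below are illustrative only.
EVAL_SET = [
    {"input": "Where is my refund?",
     "must_include": ["refund"], "must_not_include": ["as an ai"]},
    {"input": "Can you delete my account?",
     "must_include": ["confirm"], "must_not_include": []},
]

def stub_agent(user_message: str) -> str:
    """Stand-in for the real agent; replace with a live call."""
    if "refund" in user_message.lower():
        return "Your refund is on its way. It usually lands within 2 days."
    return "That can't be undone, so I need you to confirm first."

def run_eval(agent, cases) -> float:
    """Return the pass rate across the held-out test set."""
    passed = 0
    for case in cases:
        reply = agent(case["input"]).lower()
        ok = (all(kw in reply for kw in case["must_include"])
              and not any(kw in reply for kw in case["must_not_include"]))
        passed += ok
    return passed / len(cases)

# Treat any drop against the last recorded pass rate as a failing build.
BASELINE = 1.0
rate = run_eval(stub_agent, EVAL_SET)
assert rate >= BASELINE, f"regression: {rate:.0%} < {BASELINE:.0%}"
```

Wiring this into CI so it runs on every prompt, model, data or tool change is what turns "feels good" into an actual record of behaviour.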
When something regresses, treat it as a bug. When something improves, understand why. Over time the eval becomes the single most valuable piece of internal documentation about how your agent should behave.
This is the piece of work most internal teams underinvest in, and the piece that separates agents that get better over time from agents that drift quietly into mediocrity.
Ship narrow, watch closely, extend deliberately
Launch the agent on a small surface, to a small audience, doing the one job.
Watch what happens. Look at the conversations users are actually having. Compare them to the conversations you designed. Look at the moments where the agent failed and the moments where users gave up. Look at the language users use that the agent doesn’t expect.
Iterate. Re-run the evaluation. Extend the conversation design. Tighten the fallbacks. Improve the worst patterns first.
Only once the narrow agent is genuinely loved should you extend its scope. Extension is much easier than reimagination. The companies that try to launch a do-everything assistant on day one almost always have to throw it away and start over. The companies that ship one job brilliantly and extend can ride that foundation for years.
What this should cost
A focused, single-use-case agent built well — conversation design, prompt engineering, integration with a foundation model, RAG where needed, function calling where needed, full evaluation harness, deployment — should cost between £40,000 and £150,000 end to end, depending on the integration complexity and the surface it lives in.
If you are quoted £400,000 or £700,000 for a “discovery and pilot,” ask carefully where the money is going. If most of it is engineering, you are paying to rebuild components that already exist. If most of it is “research and stakeholder workshops,” you are paying for theatre. If little of it is conversation design, you are paying for the wrong thing entirely.
A useful rule of thumb: in a well-budgeted agent project, between 40% and 60% of the spend should go to conversation design, prompt engineering, and evaluation — the actual writing and judgment work that makes the agent distinctive. If your proposal allocates 10% to “tone of voice” and 70% to engineering, the proposal is wrong.
How Fieldwork builds agents
This is the work we do.
A typical agent engagement at Fieldwork takes between eight and sixteen weeks, end to end. Senior conversation design, prompt engineering and evaluation run in parallel with senior integration, infrastructure and deployment. The cost is between £40,000 and £150,000 depending on scope.
The structure of the engagement mirrors the order above. We start by helping you decide on the single job. We write the conversation design before any code is written. We design the fallbacks and escalations explicitly. We use a major foundation model and build a thin technical layer around it. We build the evaluation harness alongside the agent itself. We ship narrow, watch closely, and help you extend deliberately.
We are not an AI consultancy. We are a product studio that knows how to design agents because we ship them. Instinct, the dog enrichment app we built in three months, includes a Claude-powered personalisation layer designed and shipped using exactly this method. The same approach scales up: it works for customer-facing assistants on consumer products, internal co-pilots on enterprise tools, and specialist agents inside vertical software.
If you are sitting on a £400,000 quote for an agent project and wondering whether it really has to cost that much, it doesn’t. If you are inside a company that has been told an agent will take nine months and need it in three, it can be done. If you have an idea for an agent and don’t know where to start, the first step is a Field Study — two to three weeks, a clear recommendation on the job, the conversation design, and the build path.
The point
A good AI agent is not a function of model choice or infrastructure spend. It is a function of three things, in this order: choosing the right job, designing the conversations well, and shipping narrow before extending. Get those right and the rest is unremarkable engineering. Get those wrong and no amount of engineering will save you.
If you are about to commission an agent project, ask three questions before you sign anything.
What is the one job this agent does, and how is success measured?
Who is doing the conversation design, and how much of the budget is going to that work?
How will we know it’s getting better — what’s the evaluation plan?
If the answers are unsatisfying, the project is unlikely to produce an agent worth using. Those three questions are most of the work.