The Prompt Is the Easy Part
What lives below the waterline

You can build an impressive AI demo in an afternoon. Turning it into something your security team, your finance team, and your on-call rotation can all live with is a different kind of work entirely.

June 18, 2026 · 8 min read

There has never been a better time to build an AI demo. Wire a prompt to a model API, point it at a few documents, and you can have something that looks like magic before lunch. That is real, and it is genuinely useful. It is also where most AI efforts quietly go wrong, because the demo gets mistaken for the product.

By now the pattern is familiar. A team ships a striking prototype, everyone is excited, and then months later the thing is still a prototype, or it limped into production and slowly fell over. The common story is not that the model was bad. The models are good and getting better every month. What breaks is everything around the model: the parts that make it safe, affordable, reliable, and able to improve. That work does not fit in an afternoon, and you cannot vibe-code your way through it.

[Image: a single iceberg at dusk, with a small warm-lit tip above the waterline and an enormous mass of ice descending into deep navy water below]

The cleanest way to picture it is an iceberg. The prompt is the tip above the waterline. Identity, routing, evaluation, drift detection, autoscaling, audit trails, cost controls, guardrails, tracing, and failover are the enormous mass beneath it. This is a tour of what lives below the waterline, and why it matters more than the part everyone sees.

Above the waterline: the prompt. The part everyone demos, and an afternoon's work.
Below the waterline: everything that makes it production-ready. Identity, routing, evaluation, drift detection, autoscaling, audit trails, cost controls, guardrails, tracing, and failover.

It is still an application

Before any of the AI-specific concerns, an AI feature is still software that has to run somewhere. It needs real hosting, separate environments, a deployment pipeline, and secrets management with proper key rotation. None of this is novel, and all of it is still required.

It runs somewhere

Hosting, separate environments, a deployment pipeline, secrets and key rotation. Old problems, still mandatory.

It knows who is asking

SSO through your IdP, OIDC or SAML, MFA, real sessions, role-based access. The identity at the door is the one you carry all the way to the action.

It has to scale

Spiky load on expensive GPUs. Autoscaling that absorbs bursts without idle spend, plus high availability, regional failover, and a recovery plan you have tested.

It also needs to know who is using it. Authentication is the front door: single sign-on through your identity provider over OIDC or SAML, multi-factor authentication, real session handling, and role-based access at the application layer. That matters on its own, and it matters even more later, because the identity you establish at the door is the identity you will have to carry all the way through to the actions an agent eventually takes.

Then there is scale. AI workloads are spiky and the compute behind them is expensive, especially when you are running your own models on GPUs. You need autoscaling that absorbs bursts without leaving costly capacity idle. And once a team depends on the system, an outage becomes a business event, so you also need high availability, regional failover, and a recovery plan you have actually tested.

Data, context, and knowledge

A base model is fluent and, about your business, completely ignorant. It has read the internet and none of your data. Every bit of value comes from connecting it to your systems, your context, and your institutional knowledge, and that connection is where most of the engineering actually goes.

Connect to your data

Databases, the warehouse, SaaS tools, documents. Query live or index a copy? How is it kept fresh? And it must respect the permissions already on it.

Assemble context

The right information at the right moment, not more information. Precise, grounded context cuts hallucination. Stuffing the window raises cost and lowers quality at once.

A durable knowledge layer

A maintained knowledge base with provenance, so any answer traces to its source. Keep it fresh and permission-aware, or it gives confident, outdated answers.

Connecting to your data means reaching into structured databases, your warehouse, your SaaS tools, and your documents. Each source raises the same awkward questions. Do you query it live or index a copy? How do you keep it fresh? How do you respect the permissions that already exist on it, so the assistant never surfaces something the person asking was not allowed to see? Connectors are easy to demo and hard to operate.

Assembling context is the next skill, and it is subtler than it looks. The goal is to get the right information in front of the model at the right moment, which is not the same as getting more information in front of it. Precise, grounded context reduces hallucination, while stuffing the window raises cost and lowers quality at the same time.
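To make the permission and context points concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Chunk` type, the group-based permission model, the relevance scores (taken as given from some retriever), and a character budget standing in for a token budget. It filters out anything the asker may not see, then keeps the most relevant chunks that fit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str                 # provenance, so an answer can be traced to its origin
    allowed_groups: frozenset   # groups permitted to read the source document
    score: float                # relevance score from a retriever (assumed given)


def assemble_context(chunks, user_groups, max_chars):
    """Return the most relevant chunks the user may see, within a size budget."""
    # Permission filter first: never rank or include what the asker cannot read.
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    visible.sort(key=lambda c: c.score, reverse=True)

    selected, used = [], 0
    for chunk in visible:
        if used + len(chunk.text) > max_chars:
            continue  # skip anything that would overflow the budget
        selected.append(chunk)
        used += len(chunk.text)
    return selected


chunks = [
    Chunk("Q3 revenue was flat.", "finance/q3.pdf", frozenset({"finance"}), 0.9),
    Chunk("VPN setup steps.", "it/vpn.md", frozenset({"everyone"}), 0.7),
    Chunk("Holiday calendar.", "hr/holidays.md", frozenset({"everyone"}), 0.4),
]

context = assemble_context(chunks, user_groups=frozenset({"everyone"}), max_chars=20)
print([c.source for c in context])   # ['it/vpn.md']
```

The finance chunk never reaches the model for this user, and the budget keeps the lower-scoring chunk out even though the user is allowed to read it.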

You cannot eyeball quality

Traditional software is largely deterministic, so you can test it by asserting that a given input produces the expected output. AI systems are not: the same prompt can produce different outputs from one run to the next. A change that helps one case can quietly break ten others, and "it looked fine in three tries" is not a quality bar anyone should ship behind.

The shape of it

An evaluation harness is the regression suite for AI. A curated set of representative cases with known-good answers, scored on every change - with a strong model grading accuracy, groundedness, and tone, calibrated against human judgment so you can trust it.

From there, changes flow through evaluation gates in your pipeline, then a canary or shadow deployment, then a real comparison against live traffic, with one-click rollback when something regresses. Prompts and model versions get pinned and versioned, so you always know exactly what is running in production.

STAGE 1

Eval gates

Every change is scored against the curated set before it can move.

STAGE 2

Canary / shadow

Ship to a slice, or run alongside without serving, and watch.

STAGE 3

Compare live

Measure the new version against real traffic, not a hunch.

STAGE 4

One-click rollback

Pinned, versioned prompts and models mean instant reversal.
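The first stage can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a full harness: `EvalCase`, `run_eval`, and `gate` are hypothetical names, the scorer is a toy exact-match check standing in for a calibrated model-based grader, and `stub_candidate` stands in for a pinned prompt and model version.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str   # the known-good answer for this case


def exact_match(answer: str, expected: str) -> float:
    """Toy scorer; a real harness might use a calibrated model-based grader."""
    return 1.0 if answer.strip().lower() == expected.strip().lower() else 0.0


def run_eval(candidate: Callable[[str], str], cases, scorer=exact_match) -> float:
    """Average score of the candidate over the curated cases."""
    scores = [scorer(candidate(case.prompt), case.expected) for case in cases]
    return sum(scores) / len(scores)


def gate(candidate_score: float, baseline_score: float, tolerance: float = 0.02) -> bool:
    """Let the change through only if it does not regress beyond the tolerance."""
    return candidate_score >= baseline_score - tolerance


CASES = [
    EvalCase("capital of France?", "Paris"),
    EvalCase("2 + 2?", "4"),
    EvalCase("opposite of hot?", "cold"),
    EvalCase("largest planet?", "Jupiter"),
]


def stub_candidate(prompt: str) -> str:
    # Stand-in for a pinned prompt + model version; it gets one case wrong.
    answers = {
        "capital of France?": "paris",
        "2 + 2?": "4",
        "opposite of hot?": "warm",
        "largest planet?": "Jupiter",
    }
    return answers[prompt]


score = run_eval(stub_candidate, CASES)
print(score, gate(score, baseline_score=0.90))   # 0.75 False
```

In a pipeline, a `False` from `gate` would fail the build, so a regression is blocked before any canary traffic sees it.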

The model layer should not be hardwired

Calling one provider directly from your application code feels simplest, and it is a decision you will regret. Models change constantly. Prices move, providers have outages, and a better or cheaper option ships next month. If every one of those events forces an incident or a rewrite, you have built a fragile system.

[Diagram: your application calls one internal interface, the gateway (route · retry · cache), which sits in front of a frontier model, a small / cheap model, and self-hosted open weights. The gateway handles retries, fallback, caching, rate limits, spend tracking, and smart routing.]

A gateway between your application and the models is the seam that buys back your freedom. One internal interface, many models behind it, with retries, fallback, caching, rate limits, and spend tracking handled in one place. It also lets you route intelligently: not every request deserves the most expensive model, and a large share of traffic can go to smaller or open models with no loss in quality. And when compliance or data residency rules forbid sending data to a third party at all, the same interface lets you self-host open-weight models without rewriting everything that depends on them.
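As a rough sketch of that seam, the Python below puts retries, fallback, and caching behind a single `complete` call. The `Gateway` class, `ProviderError`, and the two provider functions are hypothetical stand-ins, not any vendor's API. The cache is a naive exact-match dictionary, and rate limits, spend tracking, and smart routing are left out for brevity.

```python
import time


class ProviderError(Exception):
    """Raised by a provider callable when a request fails."""


class Gateway:
    """One internal interface in front of several interchangeable models."""

    def __init__(self, providers, max_retries=2, backoff_seconds=0.5):
        self.providers = providers          # ordered (name, callable) pairs, preferred first
        self.max_retries = max_retries
        self.backoff_seconds = backoff_seconds
        self.cache = {}                     # exact-match cache: prompt -> answer
        self.calls = {name: 0 for name, _ in providers}

    def complete(self, prompt):
        if prompt in self.cache:
            return self.cache[prompt]
        for name, call in self.providers:
            for attempt in range(self.max_retries + 1):
                self.calls[name] += 1
                try:
                    answer = call(prompt)
                except ProviderError:
                    if attempt < self.max_retries:
                        # Exponential backoff between retries of the same provider.
                        time.sleep(self.backoff_seconds * 2 ** attempt)
                    continue
                self.cache[prompt] = answer
                return answer
            # Retries exhausted: fall through to the next provider in the list.
        raise ProviderError("all providers failed")


def frontier_model(prompt):
    raise ProviderError("simulated outage")   # stand-in for a provider that is down


def small_model(prompt):
    return f"small-model answer to: {prompt}"


gateway = Gateway([("frontier", frontier_model), ("small", small_model)], backoff_seconds=0)
print(gateway.complete("Summarize this ticket"))
print(gateway.complete("Summarize this ticket"))   # served from the cache
print(gateway.calls)   # {'frontier': 3, 'small': 1}
```

The application only ever sees `complete`. Swapping, adding, or reordering providers is a change to the list passed to the gateway, not to the code that depends on it.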

Workflows, agents, and the difference between them

There is a spectrum from a fixed pipeline of model calls to a fully autonomous agent that plans its own steps. Autonomy buys flexibility and costs you predictability, money, and debuggability. The discipline is to reach for the least autonomy that solves the problem. A surprising number of production wins are plain workflows, not agents.

Fixed workflow: predictable, cheap, easy to debug
Autonomous agent: flexible, pricier, harder to trace
The discipline

Reach for the least autonomy that solves the problem. And any agent that loops needs hard limits: a cap on steps, clear termination conditions, safe retries, and tool calls that are idempotent, so a repeat does not double-charge a customer.

When you do build agents, they are only as useful as the tools they can call, which means well-described tools with typed schemas, a registry of what exists, and increasingly a shared protocol to connect them. Each agent also holds working memory while it runs, the context window and the state it carries from step to step, which is a different thing from the durable knowledge layer described earlier. One is the library the system consults; the other is the short-term memory at the desk while it works.
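The hard limits described above can be sketched as a loop with a step cap, an explicit termination condition, and a tool that treats a repeated call as a no-op. All names here (`run_agent`, `PaymentTool`, `StepLimitExceeded`, the `planner` function standing in for a model) are hypothetical and chosen only for illustration.

```python
class StepLimitExceeded(Exception):
    """Raised when the agent uses up its step budget without finishing."""


class PaymentTool:
    """Hypothetical tool: a repeat with the same idempotency key is a no-op."""

    def __init__(self):
        self.processed = {}

    def charge(self, idempotency_key, customer, amount_cents):
        if idempotency_key in self.processed:
            return self.processed[idempotency_key]   # replay: return the original receipt
        receipt = {"customer": customer, "amount_cents": amount_cents}
        self.processed[idempotency_key] = receipt
        return receipt


def run_agent(plan_next_step, tools, max_steps=5):
    """Run until the planner returns None, or fail once the step cap is hit."""
    history = []
    for _ in range(max_steps):
        action = plan_next_step(history)
        if action is None:              # termination condition
            return history
        tool_name, kwargs = action
        history.append((tool_name, tools[tool_name](**kwargs)))
    raise StepLimitExceeded(f"stopped after {max_steps} steps")


payments = PaymentTool()


def planner(history):
    # Stand-in for a model: it retries the same charge once, then stops.
    if len(history) < 2:
        return ("charge", {"idempotency_key": "order-42", "customer": "c1", "amount_cents": 500})
    return None


history = run_agent(planner, {"charge": payments.charge})
print(len(history), len(payments.processed))   # 2 1
```

The agent called the tool twice, but the customer was charged once. A planner that never returns `None` would hit `StepLimitExceeded` instead of looping indefinitely.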

Agents act on their own now

The moment an agent calls a tool, queries a system, or moves money, it is acting on someone's behalf, and the old model of a single service account with broad rights does not survive contact with that reality. The user's identity and permissions have to flow through every hop, carried as scoped, short-lived credentials rather than swapped for a master key at the first step.

Person → Application → Agent → Downstream tool

scoped, short-lived credentials at every hop · never swapped for a master key

The rule

An agent should never be able to do something the person it is acting for could not. And you have to be able to prove, after the fact, who it acted for, what it was permitted to do, and what it actually did.

That is authorization plus an immutable audit trail, designed in from the start rather than reconstructed after an incident. Around all of it sit guardrails against prompt injection and data leakage, the governance that lets your legal and compliance teams say yes, and the unglamorous safety basics: sandboxed tool execution, least-privilege permissions, human approval on the riskiest actions, and a way to stop a misbehaving agent instantly. Plan the off switch before you need it.

Guardrails

Against prompt injection and data leakage, with sandboxed tool execution and least privilege.

Human approval

On the riskiest actions - and a way to stop a misbehaving agent instantly. Plan the off switch first.
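One way to picture the rule is a credential that can only ever narrow the user's permissions, checked and logged at every tool call. The sketch below is an assumption-heavy illustration: `ScopedCredential`, `mint_credential`, `call_tool`, and the scope strings are invented for the example, and the in-memory list stands in for what would need to be an append-only, tamper-evident audit store.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopedCredential:
    subject: str          # the person the agent is acting for
    scopes: frozenset     # actions this credential allows
    expires_at: float     # Unix timestamp


def mint_credential(user, user_permissions, requested_scopes, ttl_seconds=300, now=None):
    """Grant only scopes the user already holds, and only for a short time."""
    now = time.time() if now is None else now
    granted = frozenset(requested_scopes) & frozenset(user_permissions)
    return ScopedCredential(user, granted, now + ttl_seconds)


def call_tool(credential, action, audit_log, now=None):
    """Check the credential, record the decision, and only then act."""
    now = time.time() if now is None else now
    allowed = action in credential.scopes and now < credential.expires_at
    # Denied attempts are logged too, so the trail shows what was tried.
    audit_log.append({"subject": credential.subject, "action": action,
                      "allowed": allowed, "at": now})
    if not allowed:
        raise PermissionError(f"{credential.subject} may not {action}")
    return f"{action} done for {credential.subject}"


audit_log = []
cred = mint_credential("alice", {"read:invoices"}, {"read:invoices", "refund:payment"})
print(call_tool(cred, "read:invoices", audit_log))
try:
    call_tool(cred, "refund:payment", audit_log)
except PermissionError as err:
    print("denied:", err)
print(len(audit_log))   # 2: both the allowed and the denied attempt are recorded
```

Because the granted scopes are an intersection with the user's own permissions, the agent cannot end up with a right the person never had, whatever it asks for.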

Cost is an architecture decision

With usage-based pricing, cost scales with adoption, and a runaway agent loop can get expensive fast. You need per-team and per-feature cost attribution, budgets and alerts that fire before the invoice does, and a way to show each team what it is consuming.

Route easy traffic

Send the large share of simple requests to cheaper or open models, with no loss in quality.

Cache aggressively

Stop paying twice for the same answer. The biggest, dullest lever on the bill.

Cap runaways

Step limits and budgets that fire before the invoice does, not after.

The good news: the biggest levers on the bill - routing easy traffic to cheaper models, caching aggressively, and capping runaways - all live at the gateway you already built. Controlling spend is an architecture problem, not a spreadsheet exercise.
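A minimal sketch of attribution and budgets, assuming the gateway reports token counts for each request. The `SpendTracker` class is hypothetical, the prices are made up for the example, and a real system would persist spend rather than hold it in memory.

```python
from collections import defaultdict


class BudgetExceeded(Exception):
    """Raised when a request would push a team past its budget."""


class SpendTracker:
    def __init__(self, budgets_usd, alert_fraction=0.8):
        self.budgets_usd = budgets_usd          # team -> budget for the period
        self.alert_fraction = alert_fraction    # alert once spend reaches this share
        self.spend = defaultdict(float)         # (team, feature) -> dollars spent
        self.alerts = []

    def team_total(self, team):
        return sum(cost for (t, _), cost in self.spend.items() if t == team)

    def record(self, team, feature, input_tokens, output_tokens,
               usd_per_1k_in, usd_per_1k_out):
        """Attribute one request's cost, or refuse it if the budget would be exceeded."""
        cost = input_tokens / 1000 * usd_per_1k_in + output_tokens / 1000 * usd_per_1k_out
        projected = self.team_total(team) + cost
        budget = self.budgets_usd[team]
        if projected > budget:
            raise BudgetExceeded(f"{team} would reach ${projected:.2f} of ${budget:.2f}")
        self.spend[(team, feature)] += cost
        if projected >= budget * self.alert_fraction:
            self.alerts.append((team, projected))   # fires before the invoice does
        return cost


tracker = SpendTracker({"support": 6.0})
# Prices below are invented for the example, in dollars per 1,000 tokens.
tracker.record("support", "ticket-summary", 2000, 1000, usd_per_1k_in=0.5, usd_per_1k_out=1.5)
tracker.record("support", "reply-draft", 2000, 1000, usd_per_1k_in=0.5, usd_per_1k_out=1.5)
print(tracker.team_total("support"), tracker.alerts)   # 5.0 [('support', 5.0)]
```

Keying spend by team and feature is what makes the per-team view possible, and a third identical request here would raise `BudgetExceeded` rather than silently run up the bill.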

Day Two is where AI dies

Launch is the start of the work, not the end. Most AI fails after it ships, because nobody built the means to see it, measure it, or improve it. A single agent request fans out into many steps, so you need distributed tracing across each one to debug a slow or wrong answer. You roll those traces up into live operational views of latency, errors, cost, and quality.
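To show what a trace of one fanned-out request looks like, here is a toy tracer built only on the Python standard library. It is a sketch, not a substitute for a real tracing library: the `Tracer` class and span field names are assumptions, and the stack-based parent tracking only works for single-threaded, synchronous code.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    def __init__(self):
        self.trace_id = uuid.uuid4().hex   # one id shared by every span in the request
        self.spans = []                    # finished spans, in completion order
        self._stack = []                   # currently open spans

    @contextmanager
    def span(self, name, **attributes):
        record = {
            "trace_id": self.trace_id,
            "span_id": uuid.uuid4().hex,
            "parent_id": self._stack[-1]["span_id"] if self._stack else None,
            "name": name,
            "attributes": attributes,
        }
        self._stack.append(record)
        start = time.perf_counter()
        try:
            yield record
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.spans.append(record)


tracer = Tracer()
with tracer.span("agent_request", user="alice"):
    with tracer.span("retrieve", source="wiki"):
        pass   # a retrieval call would go here
    with tracer.span("model_call", model="small"):
        pass   # a model call would go here

for s in tracer.spans:
    print(s["name"], "child" if s["parent_id"] else "root", f'{s["duration_ms"]:.3f} ms')
```

Each step records its parent and its duration, which is the minimum needed to see which hop made a request slow and to roll spans up into latency and error views.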

SEE

Distributed tracing

One request fans out into many steps. Trace each to debug a slow or wrong answer.

MEASURE

Operational views

Latency, errors, cost, and quality, rolled up live.

WATCH

Drift detection

Real inputs wander from what you tested; providers update models underneath you.

IMPROVE

Close the loop

Capture feedback, fold it back into evals, prompts, retrieval, and tuning.

feedback folds back into the evaluation sets · the system compounds instead of decaying

You watch for drift, because real-world inputs wander away from what you tested and providers update models underneath you. And you close the loop by capturing user feedback, folding it back into your evaluation sets, and feeding it into prompts, retrieval, and tuning, so the system compounds instead of decaying. Then you treat the whole thing like the production system it is, with alerting, on-call, and runbooks.

Someone owns it at 2am, or it dies.
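The drift watch described above can start as something very small. The sketch below assumes you already produce a per-request quality score (for example from the same grader the evaluation harness uses) and simply compares a rolling average against the baseline measured at release. `DriftMonitor`, the window size, and the threshold are illustrative assumptions, not recommended values.

```python
from collections import deque
from statistics import mean


class DriftMonitor:
    def __init__(self, baseline_mean, window=50, max_drop=0.05):
        self.baseline_mean = baseline_mean   # quality measured at release time
        self.max_drop = max_drop             # tolerated drop before alerting
        self.recent = deque(maxlen=window)   # rolling window of live scores

    def observe(self, score):
        """Record one live quality score; return True if drift is suspected."""
        self.recent.append(score)
        if len(self.recent) < self.recent.maxlen:
            return False   # not enough data yet
        return mean(self.recent) < self.baseline_mean - self.max_drop


monitor = DriftMonitor(baseline_mean=0.9, window=4, max_drop=0.05)
for live_score in [0.9, 0.9, 0.9, 0.9, 0.5]:
    drifted = monitor.observe(live_score)
print(drifted)   # True: the rolling mean fell to about 0.8, below the 0.85 floor
```

A `True` here is the kind of signal that should page whoever owns the system, and the low-scoring requests behind it are natural candidates for new evaluation cases.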

The steering wheel

Here is where this is heading. More of this stack collapses into a prompt every year, and autonomy keeps improving. We are simply not there yet. There remains a long tail of scenarios that today's models and coding assistants handle badly or unpredictably. Self-driving cars are the same story: good enough to impress, not yet good enough to remove the wheel.

A production AI platform is that steering wheel. It lets the system drive itself where it is capable and lets your team take over and adjust by hand the moment it hits the long tail. That manual control is not a sign of immaturity. It is what makes shipping today possible.

It is no longer build versus buy

For years the choice was framed as build or buy, and both answers were bad. Build the whole stack yourself and you spend a year on undifferentiated plumbing before you ship anything that matters. Buy a narrow point tool and you still have to stitch the rest of the stack together by hand.

Build it all

A year on plumbing

You spend a year on undifferentiated infrastructure before you ship anything that is actually your business.

Buy a point tool

Stitch the rest by hand

A narrow tool solves one box. You still assemble the gateway, evals, tracing, and governance yourself.

There is a third option. Start on a platform that already provides these components - the gateway, the evaluation harness, tracing, governance, identity propagation, and the rest - and it carries you roughly eighty percent of the way. Your team then spends its time on the twenty percent that is actually your business: the use cases, the data, and the judgment that make your product yours.

~80%
The undifferentiated platform work that the right foundation already carries - the plumbing
~20%
Your differentiators - the use cases, the data, the judgment that make it yours

None of this removes the discipline. People and process still decide whether any of it pays off, which is why we order it people, process, platform, with the platform last. The platform exists to take the undifferentiated weight off your team so they can spend their attention where it counts. The prompt was always the easy part. Production is the discipline, and it is the part worth getting right.
