There has never been a better time to build an AI demo. Wire a prompt to a model API, point it at a few documents, and you can have something that looks like magic before lunch. That is real, and it is genuinely useful. It is also where most AI efforts quietly go wrong, because the demo gets mistaken for the product.
By now the pattern is familiar. A team ships a striking prototype, everyone is excited, and then months later the thing is still a prototype, or it limped into production and slowly fell over. The common story is not that the model was bad. The models are good and getting better every month. What breaks is everything around the model: the parts that make it safe, affordable, reliable, and able to improve. That work does not fit in an afternoon, and you cannot vibe-code your way through it.
The cleanest way to picture it is an iceberg. The prompt is the tip above the waterline. Identity, routing, evaluation, drift detection, autoscaling, audit trails, cost controls, guardrails, tracing, and failover are the enormous mass beneath it. This is a tour of what lives below the waterline, and why it matters more than the part everyone sees.
It is still an application
Before any of the AI-specific concerns, an AI feature is still software that has to run somewhere. It needs real hosting, separate environments, a deployment pipeline, and secrets management with proper key rotation. None of this is novel, and all of it is still required.
It runs somewhere
Hosting, separate environments, a deployment pipeline, secrets and key rotation. Old problems, still mandatory.
It knows who is asking
SSO through your IdP, OIDC or SAML, MFA, real sessions, role-based access. The identity at the door is the one you carry all the way to the action.
It has to scale
Spiky load on expensive GPUs. Autoscaling that absorbs bursts without idle spend, plus high availability, regional failover, and a recovery plan you have tested.
It also needs to know who is using it. Authentication is the front door: single sign-on through your identity provider, OIDC or SAML, multi-factor, real session handling, and role-based access at the application layer. That matters on its own, and it matters even more later, because the identity you establish at the door is the identity you will have to carry all the way through to the actions an agent eventually takes.
Then there is scale. AI workloads are spiky and the compute behind them is expensive, especially when you are running your own models on GPUs. You need autoscaling that absorbs bursts without leaving costly capacity idle, and once a team depends on the system, an outage becomes a business event.
Data, context, and knowledge
A base model is fluent and, about your business, completely ignorant. It has read the internet and none of your data. Every bit of value comes from connecting it to your systems, your context, and your institutional knowledge, and that connection is where most of the engineering actually goes.
Connect to your data
Databases, the warehouse, SaaS tools, documents. Query live or index a copy? How is it kept fresh? And it must respect the permissions already on it.
Assemble context
The right information at the right moment, not more information. Precise, grounded context cuts hallucination. Stuffing the window raises cost and lowers quality at once.
A durable knowledge layer
A maintained knowledge base with provenance, so any answer traces to its source. Keep it fresh and permission-aware, or it gives confident, outdated answers.
Connecting to your data means reaching into structured databases, your warehouse, your SaaS tools, and your documents. Each source raises the same awkward questions. Do you query it live or index a copy? How do you keep it fresh? How do you respect the permissions that already exist on it, so the assistant never surfaces something the person asking was not allowed to see? Connectors are easy to demo and hard to operate. Assembling context is the next skill, and it is subtler than it looks. The goal is to get the right information in front of the model at the right moment, which is not the same as getting more information in front of it.
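The permission question can be made concrete with a small sketch. It assumes each indexed document carries a copy of the groups allowed to read it, and it uses a toy keyword score in place of real retrieval. `Document`, `retrieve_for_user`, and the sample index are hypothetical names for illustration, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    # Assumption: ACL groups are copied from the source system at index time.
    allowed_groups: frozenset[str]


def retrieve_for_user(query: str, user_groups: set[str],
                      index: list[Document], limit: int = 3) -> list[Document]:
    """Return up to `limit` matching documents the user is allowed to read."""
    terms = query.lower().split()
    # Filter on permissions first, so a restricted document is never ranked or returned.
    visible = [d for d in index if d.allowed_groups & user_groups]
    # Toy relevance score: count occurrences of each query term.
    scored = [(sum(d.text.lower().count(t) for t in terms), d) for d in visible]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [d for _, d in ranked[:limit]]


# Hypothetical sample index.
INDEX = [
    Document("hr-1", "Salary bands for engineering roles", frozenset({"hr"})),
    Document("eng-1", "Engineering onboarding guide and roles", frozenset({"eng", "hr"})),
]

engineer_view = retrieve_for_user("engineering roles", {"eng"}, INDEX)
hr_view = retrieve_for_user("engineering roles", {"hr"}, INDEX)
```

Filtering before ranking is the design choice that matters here: the model never sees text the asker could not open directly. A copied ACL can go stale, which is one reason the freshness question above applies to permissions as well as content.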
You cannot eyeball quality
Traditional software is largely deterministic, so you can test it by checking that the same input gives the same output. AI systems are not: the same prompt can produce different answers from one run to the next. A change that helps one case can quietly break ten others, and "it looked fine in three tries" is not a quality bar anyone should ship behind.
An evaluation harness is the regression suite for AI: a curated set of representative cases with known-good answers, scored on every change. A strong model can grade accuracy, groundedness, and tone, provided its scores are calibrated against human judgment so you can trust them.
From there, changes flow through evaluation gates in your pipeline, then a canary or shadow deployment, then a real comparison against live traffic, with one-click rollback when something regresses. Prompts and model versions get pinned and versioned, so you always know exactly what is running in production.
Eval gates
Every change is scored against the curated set before it can move.
Canary / shadow
Ship to a slice, or run alongside without serving, and watch.
Compare live
Measure the new version against real traffic, not a hunch.
One-click rollback
Pinned, versioned prompts and models mean instant reversal.
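A minimal sketch of an evaluation gate, under simplifying assumptions: the grader is a toy exact-match function standing in for a calibrated model grader, and the two "systems" are hard-coded lookups. `EvalCase`, `run_eval`, `gate`, and the 0.02 tolerance are illustrative choices, not a prescribed design.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str


def exact_match(answer: str, expected: str) -> float:
    """Toy grader: 1.0 on a case-insensitive match, else 0.0."""
    return 1.0 if answer.strip().lower() == expected.strip().lower() else 0.0


def run_eval(system: Callable[[str], str], cases: list[EvalCase],
             grader: Callable[[str, str], float] = exact_match) -> float:
    """Mean grader score of `system` over the curated cases."""
    if not cases:
        raise ValueError("evaluation set is empty")
    return sum(grader(system(c.prompt), c.expected) for c in cases) / len(cases)


def gate(candidate_score: float, baseline_score: float,
         max_regression: float = 0.02) -> bool:
    """Pass only if the candidate does not regress beyond the tolerance."""
    return candidate_score >= baseline_score - max_regression


# Hypothetical curated set and two stand-in systems.
CASES = [EvalCase("capital of France?", "Paris"), EvalCase("2 + 2?", "4")]


def baseline_system(prompt: str) -> str:
    return {"capital of France?": "Paris", "2 + 2?": "4"}.get(prompt, "")


def candidate_system(prompt: str) -> str:
    return {"capital of France?": "paris", "2 + 2?": "5"}.get(prompt, "")


baseline_score = run_eval(baseline_system, CASES)
candidate_score = run_eval(candidate_system, CASES)
may_ship = gate(candidate_score, baseline_score)
```

In a pipeline, a failing `gate` would block the change before any canary or shadow stage. A real grader would replace `exact_match` while keeping the same interface.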
The model layer should not be hardwired
Calling one provider directly from your application code feels simplest, and it is a decision you will regret. Models change constantly. Prices move, providers have outages, and a better or cheaper option ships next month. If every one of those events forces an incident or a rewrite, you have built a fragile system.
A gateway between your application and the models is the seam that buys back your freedom. One internal interface, many models behind it, with retries, fallback, caching, rate limits, and spend tracking handled in one place. It also lets you route intelligently: not every request deserves the most expensive model, and a large share of traffic can often go to smaller or open models with no measurable loss in quality, which your evaluation set can confirm. And when compliance or data residency rules forbid sending data to a third party at all, the same interface lets you self-host open-weight models without rewriting everything that depends on them.
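The gateway idea can be sketched in a few lines. This is a minimal illustration, assuming each backend is just a callable that takes a prompt and returns text. `ModelGateway`, the tier names, and the two stand-in backends are hypothetical, and rate limits and spend tracking are left out to keep it short.

```python
import time
from typing import Callable, Optional

ModelFn = Callable[[str], str]


class ModelGateway:
    """One internal interface; routing, retries, fallback, and caching behind it."""

    def __init__(self, routes: dict[str, list[ModelFn]],
                 retries: int = 2, backoff_s: float = 0.0):
        self.routes = routes        # tier -> backends in order of preference
        self.retries = retries      # extra attempts per backend
        self.backoff_s = backoff_s  # base delay for exponential backoff
        self.cache: dict[tuple[str, str], str] = {}

    def complete(self, prompt: str, tier: str = "small") -> str:
        key = (tier, prompt)
        if key in self.cache:       # do not pay twice for the same answer
            return self.cache[key]
        last_error: Optional[Exception] = None
        for backend in self.routes[tier]:
            for attempt in range(self.retries + 1):
                try:
                    answer = backend(prompt)
                    self.cache[key] = answer
                    return answer
                except Exception as exc:  # sketch only: real code would catch narrower errors
                    last_error = exc
                    time.sleep(self.backoff_s * (2 ** attempt))
        raise RuntimeError(f"all backends failed for tier {tier!r}") from last_error


# Hypothetical backends: a provider that is down and a working fallback.
calls = {"primary": 0, "fallback": 0}


def flaky_primary(prompt: str) -> str:
    calls["primary"] += 1
    raise TimeoutError("simulated provider outage")


def fallback_model(prompt: str) -> str:
    calls["fallback"] += 1
    return f"fallback answer to: {prompt}"


gateway = ModelGateway({"small": [flaky_primary, fallback_model]}, retries=1)
first = gateway.complete("summarize this ticket")
second = gateway.complete("summarize this ticket")  # served from the cache
```

Application code only ever calls `gateway.complete`, so swapping a provider or adding a self-hosted model becomes a change to `routes` rather than a rewrite.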
Workflows, agents, and the difference between them
There is a spectrum from a fixed pipeline of model calls to a fully autonomous agent that plans its own steps. Autonomy buys flexibility and costs you predictability, money, and debuggability. The discipline is to reach for the least autonomy that solves the problem. A surprising number of production wins are plain workflows, not agents.
Any agent that loops also needs hard limits: a cap on steps, clear termination conditions, safe retries, and tool calls that are idempotent, so a repeat does not double-charge a customer.
When you do build agents, they are only as useful as the tools they can call, which means well-described tools with typed schemas, a registry of what exists, and increasingly a shared protocol to connect them. Each agent also holds working memory while it runs, the context window and the state it carries from step to step, which is a different thing from the durable knowledge layer described earlier. One is the library the system consults; the other is the short-term memory at the desk while it works.
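The hard limits and idempotent tool calls described above can be sketched as follows. This is an illustration under assumptions: the planner is a plain function rather than a model, and `IdempotentTool`, `run_agent`, and `charge_customer` are hypothetical names. A production system would persist idempotency keys in a durable store rather than an in-memory dict.

```python
from typing import Callable, Optional


class IdempotentTool:
    """Wraps a side-effecting function so a repeated key replays the first result."""

    def __init__(self, fn: Callable[..., str]):
        self.fn = fn
        self.results: dict[str, str] = {}

    def call(self, idempotency_key: str, **kwargs) -> str:
        if idempotency_key not in self.results:
            self.results[idempotency_key] = self.fn(**kwargs)
        return self.results[idempotency_key]


def run_agent(plan_next: Callable[[list[str]], Optional[dict]],
              tool: IdempotentTool, max_steps: int = 5) -> list[str]:
    """Run until the planner returns None (done) or the step cap is hit."""
    history: list[str] = []
    for _ in range(max_steps):
        action = plan_next(history)
        if action is None:  # explicit termination condition
            return history
        history.append(tool.call(action["key"], **action["args"]))
    raise RuntimeError(f"agent exceeded {max_steps} steps without finishing")


# Hypothetical side effect: every real charge is recorded here.
charges: list[int] = []


def charge_customer(amount_cents: int) -> str:
    charges.append(amount_cents)
    return f"charged {amount_cents}"


def retrying_planner(history: list[str]) -> Optional[dict]:
    # Issues the same charge twice, simulating a retry, then stops.
    if len(history) < 2:
        return {"key": "order-42-charge", "args": {"amount_cents": 1999}}
    return None


history = run_agent(retrying_planner, IdempotentTool(charge_customer))
```

The planner asked for the charge twice, but the underlying function ran once. A planner that never returns `None` hits the step cap and raises instead of looping forever.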
Agents act on their own now
The moment an agent calls a tool, queries a system, or moves money, it is acting on someone's behalf, and the old model of a single service account with broad rights does not survive contact with that reality. The user's identity and permissions have to flow through every hop, carried as scoped, short-lived credentials rather than swapped for a master key at the first step.
An agent should never be able to do something the person it is acting for could not. And you have to be able to prove, after the fact, who it acted for, what it was permitted to do, and what it actually did.
That is authorization plus an immutable audit trail, designed in from the start rather than reconstructed after an incident. Around all of it sit guardrails against prompt injection and data leakage, the governance that lets your legal and compliance teams say yes, and the unglamorous safety basics: sandboxed tool execution, least-privilege permissions, human approval on the riskiest actions, and a way to stop a misbehaving agent instantly. Plan the off switch before you need it.
Guardrails
Against prompt injection and data leakage, with sandboxed tool execution and least privilege.
Human approval
On the riskiest actions - and a way to stop a misbehaving agent instantly. Plan the off switch first.
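One way to picture scoped credentials plus an audit trail is the sketch below. It assumes a credential is a user, a set of scopes, and an expiry, and it makes the log tamper-evident by chaining SHA-256 hashes. That is weaker than a truly immutable store, which would need write-once storage behind it. `ScopedCredential`, `AuditLog`, `authorize`, and the scope strings are hypothetical.

```python
import hashlib
import json
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScopedCredential:
    user: str
    scopes: frozenset[str]
    expires_at: float  # Unix timestamp


class AuditLog:
    """Append-only log; each entry stores the hash of the previous entry."""

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def append(self, record: dict) -> None:
        prev = self.entries[-1]["hash"] if self.entries else ""
        body = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256((prev + body).encode()).hexdigest()
        self.entries.append({"record": record, "prev": prev, "hash": digest})

    def verify(self) -> bool:
        """Recompute the chain; any edited entry breaks it."""
        prev = ""
        for entry in self.entries:
            body = json.dumps(entry["record"], sort_keys=True)
            expected = hashlib.sha256((prev + body).encode()).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True


def authorize(cred: ScopedCredential, action: str, log: AuditLog,
              now: Optional[float] = None) -> bool:
    """Allow only unexpired, in-scope actions, and record every decision."""
    now = time.time() if now is None else now
    allowed = now < cred.expires_at and action in cred.scopes
    log.append({"user": cred.user, "action": action, "allowed": allowed, "at": now})
    return allowed


log = AuditLog()
cred = ScopedCredential("alice", frozenset({"tickets:read"}), expires_at=1000.0)
read_ok = authorize(cred, "tickets:read", log, now=500.0)
refund_ok = authorize(cred, "payments:refund", log, now=500.0)
expired_ok = authorize(cred, "tickets:read", log, now=2000.0)
```

Denied attempts are logged alongside allowed ones, so the record answers all three questions: who the agent acted for, what it was permitted to do, and what it tried to do.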
Cost is an architecture decision
With usage-based pricing, cost scales with adoption, and a runaway agent loop can get expensive fast. You need per-team and per-feature cost attribution, budgets and alerts that fire before the invoice does, and a way to show each team what it is consuming.
Route easy traffic
Send the large share of simple requests to cheaper or open models, often with no measurable loss in quality.
Cache aggressively
Stop paying twice for the same answer. The biggest, dullest lever on the bill.
Cap runaways
Step limits and budgets that fire before the invoice does, not after.
The good news: the biggest levers on the bill - routing easy traffic to cheaper models, caching aggressively, and capping runaways - all live at the gateway you already built. Controlling spend is an architecture problem, not a spreadsheet exercise.
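Per-team attribution, early alerts, and a hard cap can be sketched like this. It assumes costs arrive as dollar amounts per call, and the 80 percent alert threshold is an arbitrary illustrative choice. `SpendTracker` and `BudgetExceeded` are hypothetical names.

```python
from collections import defaultdict


class BudgetExceeded(Exception):
    """Raised when a call would push a team past its budget."""


class SpendTracker:
    def __init__(self, budgets_usd: dict[str, float], alert_fraction: float = 0.8):
        self.budgets = budgets_usd
        self.alert_fraction = alert_fraction
        # Spend is attributed to (team, feature) pairs.
        self.spend: dict[tuple[str, str], float] = defaultdict(float)
        self.alerts: list[str] = []

    def team_total(self, team: str) -> float:
        return sum(cost for (t, _), cost in self.spend.items() if t == team)

    def record(self, team: str, feature: str, cost_usd: float) -> None:
        budget = self.budgets[team]
        projected = self.team_total(team) + cost_usd
        if projected > budget:
            # Hard cap: refuse the call rather than discover it on the invoice.
            raise BudgetExceeded(f"{team} would reach ${projected:.2f} of ${budget:.2f}")
        self.spend[(team, feature)] += cost_usd
        if projected >= self.alert_fraction * budget and team not in self.alerts:
            self.alerts.append(team)  # alert once, before the cap is reached


tracker = SpendTracker({"support": 100.0})
tracker.record("support", "ticket-summary", 50.0)
alerts_after_first = list(tracker.alerts)
tracker.record("support", "reply-draft", 35.0)
alerts_after_second = list(tracker.alerts)
try:
    tracker.record("support", "reply-draft", 20.0)
    blocked = False
except BudgetExceeded:
    blocked = True
```

Because every model call already passes through the gateway, that is the natural place to call `record`, and the same `spend` dictionary gives each team a view of what it is consuming by feature.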
Day Two is where AI dies
Launch is the start of the work, not the end. Most AI fails after it ships, because nobody built the means to see it, measure it, or improve it. A single agent request fans out into many steps, so you need distributed tracing across each one to debug a slow or wrong answer. You roll those traces up into live operational views of latency, errors, cost, and quality.
Distributed tracing
One request fans out into many steps. Trace each to debug a slow or wrong answer.
Operational views
Latency, errors, cost, and quality, rolled up live.
Drift detection
Real inputs wander from what you tested; providers update models underneath you.
Close the loop
Capture feedback, fold it back into evals, prompts, retrieval, and tuning.
You watch for drift, because real-world inputs wander away from what you tested and providers update models underneath you. And you close the loop by capturing user feedback, folding it back into your evaluation sets, and feeding it into prompts, retrieval, and tuning, so the system compounds instead of decaying. Then you treat the whole thing like the production system it is, with alerting, on-call, and runbooks.
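Drift in what users ask can be checked with a simple statistic. The sketch below uses the population stability index (PSI) over categorical labels, which is one common choice and not the only one. It assumes each request has already been tagged with an intent label by some upstream step. The labels and the 0.25 alert threshold, a widely used rule of thumb, are illustrative assumptions.

```python
import math
from collections import Counter


def population_stability_index(baseline: list[str], live: list[str],
                               eps: float = 1e-6) -> float:
    """PSI over categorical labels; 0.0 means identical distributions."""
    if not baseline or not live:
        raise ValueError("both samples must be non-empty")
    b_counts, l_counts = Counter(baseline), Counter(live)
    psi = 0.0
    for category in set(baseline) | set(live):
        # Floor each share at eps so unseen categories do not divide by zero.
        b = max(b_counts[category] / len(baseline), eps)
        l = max(l_counts[category] / len(live), eps)
        psi += (l - b) * math.log(l / b)
    return psi


# Hypothetical intent labels: what was tested versus what production now sees.
tested = ["billing"] * 80 + ["refund"] * 20
this_week = ["billing"] * 40 + ["refund"] * 60

psi_same = population_stability_index(tested, tested)
psi_shifted = population_stability_index(tested, this_week)
drift_alert = psi_shifted > 0.25  # rule-of-thumb threshold, an assumption
```

This only watches the input mix. Drift caused by a provider updating a model underneath you shows up in outputs instead, which is why the evaluation set should also be re-run on a schedule against the live model.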
“Someone owns it at 2am, or it dies.”
The steering wheel
Here is where this is heading. More of this stack collapses into a prompt every year, and autonomy keeps improving. We are simply not there yet. There remains a long tail of scenarios that today's models and coding assistants handle badly or unpredictably. Self-driving cars are the same story: good enough to impress, not yet good enough to remove the wheel.
A production AI platform is that steering wheel. It lets the system drive itself where it is capable and lets your team take over and adjust by hand the moment it hits the long tail. That manual control is not a sign of immaturity. It is what makes shipping today possible.
It is no longer build versus buy
For years the choice was framed as build or buy, and both answers were bad. Build the whole stack yourself and you spend a year on undifferentiated plumbing before you ship anything that matters. Buy a narrow point tool and you still have to stitch the rest of the stack together by hand.
A year on plumbing
You spend a year on undifferentiated infrastructure before you ship anything that is actually your business.
Stitch the rest by hand
A narrow tool solves one box. You still assemble the gateway, evals, tracing, and governance yourself.
There is a third option. Start on a platform that already provides these components - the gateway, the evaluation harness, tracing, governance, identity propagation, and the rest - and it carries you roughly eighty percent of the way. Your team then spends its time on the twenty percent that is actually your business: the use cases, the data, and the judgment that make your product yours.
None of this removes the discipline. People and process still decide whether any of it pays off, which is why we order it people, process, platform, with the platform last. The platform exists to take the undifferentiated weight off your team so they can spend their attention where it counts. The prompt was always the easy part. Production is the discipline, and it is the part worth getting right.
Building something and not sure how much of this you have covered?
We offer a free expert report on your AI idea or roadmap, reviewed by our Chief AI Officer. It is the lowest-risk way to find out where the gaps are before they become incidents.