Giving Agents a Memory of Places, Not Just Words

A dark warehouse aisle where the far corner dissolves into a dense cloud of translucent electric-blue points - a 3D reconstruction of a physical space held in an agent's memory and re-rendered from an angle no camera occupied

Ask a text agent what it remembers and it will hand you a transcript. Ask it what the back corner of the warehouse looked like, the corner it glanced at once on the way to somewhere else, and you get nothing. The observation happened, the moment passed, and the only record is whatever tokens the model happened to emit at the time. If those tokens missed the thing you are now asking about, the memory is gone. Not buried. Gone.

That gap is the quiet ceiling on every agent that has to operate in or reason about a physical place. A warehouse, a job site, a retail floor, a building an inspection robot is walking, a room a home robot lives in. We have spent two years teaching agents to remember words. The harder problem, and the one that is starting to break open in the research, is teaching them to remember space.

Gaussian splatting is turning into the answer. Not as a graphics trick, though that is where it came from, but as a memory substrate. A way for an agent to hold a place in a form it can re-observe later, from viewpoints it never physically occupied. This is not one lab's bet. Over the last six months a cluster of independent groups has landed on the same idea from different directions, which is usually the sign that a capability is about to leave the research phase. A March 2026 paper called GSMem [1] makes the case most cleanly, so it is worth walking through in detail, because the architecture it implies is exactly the kind of thing that looks magical in a demo and quietly falls over on day two unless you build the unglamorous parts right. We will come back to the others, and to why they point the same way.

What an agent's memory usually is, and why it fails

Today an embodied or spatially aware agent remembers its environment in one of two ways, and both throw away most of what it saw.

Object-centric

The scene graph

A perception model detects objects and writes them as structured nodes: a chair here, a pallet there, a fire extinguisher on that wall. Compact and easy to query. But the moment you reduce a rich scene to a list of labels, everything the detector missed is simply not in memory. A single detection failure becomes an irrecoverable memory omission. The raw scene was discarded the instant it became nouns.

View-based

The snapshot scrapbook

Instead of labels, the agent keeps egocentric keyframes captured along its path. This preserves more detail, but it is sparse and locked to wherever the camera happened to be pointing. If the agent caught a shelf at a bad angle, occluded, half in shadow, that bad angle is the memory. There is no way to lean in and look from the side.

Both paradigms share a deeper flaw. They lack what the paper calls post-hoc re-observability. A human who walks through a room and is later asked about a detail can mentally re-enter the space and look again from a new angle. Current agents cannot. They are locked into the observations they made the first time, at the moment they made them. Miss it then, miss it forever.

What splatting changes

3D Gaussian splatting represents a scene not as labels and not as snapshots, but as a dense cloud of millions of tiny translucent blobs, each with a position, a shape, a color, and an opacity. Composite them from a given camera angle and you get a photorealistic image. The important property for our purposes is that this representation is continuous and renderable. Once you have built the field, you can render the scene from any viewpoint, including viewpoints the agent never actually visited.
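To make "composite them from a given camera angle" concrete, here is a minimal sketch of front-to-back alpha compositing along a single camera ray. It is a deliberate simplification: a real splatting renderer projects each 3D Gaussian to a 2D footprint and derives a per-pixel alpha from its opacity and falloff, whereas this toy assumes each blob already arrives as a (depth, color, alpha) sample. The function name composite_ray is illustrative, not from any library.

```python
from typing import List, Tuple

Color = Tuple[float, float, float]
Sample = Tuple[float, Color, float]  # (depth along the ray, RGB color, alpha in [0, 1])


def composite_ray(samples: List[Sample]) -> Color:
    """Blend the samples hit by one ray, nearest first."""
    out = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed by nearer blobs
    for _, color, alpha in sorted(samples, key=lambda s: s[0]):
        weight = transmittance * alpha  # this blob's blending weight
        for ch in range(3):
            out[ch] += weight * color[ch]
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:  # effectively opaque, nothing behind is visible
            break
    return (out[0], out[1], out[2])


# A half-transparent red blob in front of an opaque blue one.
pixel = composite_ray([
    (2.0, (0.0, 0.0, 1.0), 1.0),
    (1.0, (1.0, 0.0, 0.0), 0.5),
])
print(pixel)
```

Because the blend depends only on which blobs a ray passes through and in what order, nothing ties the image to a camera position that was actually visited.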

Spatial Recollection

Re-observing a place from an angle you never stood at

Sparse real observations build a continuous field that renders any viewpoint - including new ones.

That single property is what turns splatting from a rendering technique into a memory. The agent explores a space and continuously builds up a Gaussian field as it goes. Later, asked a question about that space, it does not dig through a scene graph hoping the right label exists, and it does not flip through snapshots hoping one caught the right angle. It renders a fresh view of the relevant region, choosing an angle that actually answers the question, and hands that synthesized image to a vision-language model to reason over. The agent re-observes a place it has already left, from a perspective it never stood at. The paper calls this spatial recollection, which is the right name for it. It is the closest thing an agent has had to the human ability to revisit a memory and look again.

How GSMem actually works

The mechanism matters here, because the production implications fall directly out of it. There are four moving parts.

The GSMem Loop

Four moving parts

Map the space, lay meaning over the geometry, retrieve and render an answer, decide where to look next.

First, mapping. As the agent moves, it takes RGB-D frames, color plus depth, and incrementally builds the Gaussian field. It does not optimize every frame; it selects keyframes when the view has changed enough to be worth it, using optical flow to measure that change, and keeps a sliding window of recent keyframes to keep the update stable and bounded. This is the persistent spatial memory, and it is growing continuously as the agent explores.
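A rough sketch of that keyframe policy, under stated assumptions: the optical-flow magnitude per frame is computed elsewhere and passed in as a single number, change since the last keyframe is approximated by summing those magnitudes, and the threshold and window size are made-up values. The class name KeyframeSelector is hypothetical and is not GSMem's code.

```python
from collections import deque


class KeyframeSelector:
    """Promote a frame to keyframe once enough view change has accumulated."""

    def __init__(self, flow_threshold: float = 8.0, window_size: int = 5):
        self.flow_threshold = flow_threshold    # assumed units: mean pixels of flow
        self.window = deque(maxlen=window_size)  # sliding window of recent keyframes
        self._accumulated = 0.0                  # flow summed since the last keyframe

    def offer(self, frame_id: int, mean_flow_px: float) -> bool:
        """Return True if this frame was accepted as a keyframe."""
        self._accumulated += mean_flow_px
        if not self.window or self._accumulated >= self.flow_threshold:
            self.window.append(frame_id)  # the oldest keyframe drops out when full
            self._accumulated = 0.0
            return True
        return False


selector = KeyframeSelector(flow_threshold=10.0, window_size=2)
flows = [0.0, 4.0, 4.0, 4.0, 12.0, 1.0, 11.0]
accepted = [i for i, f in enumerate(flows) if selector.offer(i, f)]
print(accepted, list(selector.window))
```

The bounded deque is what keeps each optimization step's cost constant no matter how long the agent has been exploring.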

Second, a language field laid over the geometry. Pure geometry tells you where surfaces are, not what they mean. So each Gaussian also carries a compressed language embedding, derived from a vision-language feature extractor and folded onto the 3D points as the agent maps. The clever, production-relevant detail is that the authors do this without a separate training pass. The same blending weights used to render the scene forward are reused in reverse to push 2D semantic features back onto the 3D Gaussians, and the embeddings are squeezed from 768 dimensions down to 32 to keep the memory footprint sane. The result is a scene you can query by meaning, in real time, without an offline optimization step.
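The "forward weights reused in reverse" idea can be sketched as a weighted average. Assume the renderer can report, for every pixel, which Gaussians contributed and with what blending weight; each Gaussian then receives the weight-averaged 2D feature of the pixels it helped draw. The data layout and the name lift_features are assumptions for illustration, the features are two-dimensional to keep the example readable, and the 768-to-32 compression step is left out because the text above does not say how it is done.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def lift_features(
    pixel_features: List[List[float]],
    pixel_contribs: List[List[Tuple[int, float]]],
    dim: int,
) -> Dict[int, List[float]]:
    """Push per-pixel features onto Gaussians using their rendering weights.

    pixel_contribs[i] lists (gaussian_id, blending_weight) for pixel i.
    """
    sums: Dict[int, List[float]] = defaultdict(lambda: [0.0] * dim)
    totals: Dict[int, float] = defaultdict(float)
    for feature, contribs in zip(pixel_features, pixel_contribs):
        for gaussian_id, weight in contribs:
            acc = sums[gaussian_id]
            for k in range(dim):
                acc[k] += weight * feature[k]
            totals[gaussian_id] += weight
    # Normalize so each Gaussian holds a weighted mean, not a weighted sum.
    return {
        gid: [v / totals[gid] for v in acc]
        for gid, acc in sums.items()
        if totals[gid] > 0.0
    }


features = [[1.0, 0.0], [0.0, 1.0]]        # one 2-D feature per pixel
contribs = [[(0, 0.75), (1, 0.25)], [(1, 0.5)]]
print(lift_features(features, contribs, dim=2))
```

No gradient step appears anywhere in that loop, which is the point: the semantics ride along with mapping instead of needing their own optimization pass.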

Third, retrieval and rendering. When a question arrives, the agent localizes the region of interest two ways at once. It asks the scene graph for relevant objects, and in parallel it queries the language field for Gaussians whose embeddings match the meaning of the query. These two paths cover for each other. If the object detector never labeled the target, the semantic field can still find the region, because the raw appearance was never thrown away. Once a region is localized, the agent samples a ring of candidate camera poses around it, scores them for visibility and framing and rendering quality, and picks the best one. Then it renders that optimal view and feeds it to the VLM. This is the re-observation step, the part that the older architectures structurally cannot do.
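The viewpoint-selection step can be sketched in a few lines. This is an illustration under assumptions, not the paper's scoring function: the ring is a horizontal circle at a fixed height above the target, the three quality terms are supplied by a caller-provided evaluate function, and the weights are arbitrary. All names here are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

Vec3 = Tuple[float, float, float]
Pose = Dict[str, Vec3]


def ring_poses(center: Vec3, radius: float, height: float, count: int) -> List[Pose]:
    """Candidate cameras evenly spaced on a circle, all looking at the center."""
    cx, cy, cz = center
    poses = []
    for i in range(count):
        theta = 2.0 * math.pi * i / count
        eye = (cx + radius * math.cos(theta), cy + radius * math.sin(theta), cz + height)
        poses.append({"eye": eye, "look_at": center})
    return poses


def pose_score(visibility: float, framing: float, quality: float,
               weights: Tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Weighted sum of three terms in [0, 1]; the weights are assumed values."""
    return weights[0] * visibility + weights[1] * framing + weights[2] * quality


def best_pose(poses: List[Pose],
              evaluate: Callable[[Pose], Tuple[float, float, float]]) -> Pose:
    """Pick the candidate with the highest combined score."""
    return max(poses, key=lambda p: pose_score(*evaluate(p)))


center, radius = (2.0, 3.0, 1.0), 1.5


def toy_evaluate(pose: Pose) -> Tuple[float, float, float]:
    # Toy stand-in: pretend views from smaller x are less occluded.
    visibility = (center[0] - pose["eye"][0] + radius) / (2.0 * radius)
    return visibility, 0.5, 0.5


print(best_pose(ring_poses(center, radius, height=0.5, count=8), toy_evaluate))
```

In a real system each call to evaluate implies at least a partial render, so the count argument is where the rendering budget gets spent.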

Fourth, exploration. When the current views do not contain enough to answer, the agent has to decide where to go next. It blends two signals: a semantic score from the VLM for how likely a direction is to reveal something relevant, and a geometric information-gain score that measures how much a new vantage would reduce uncertainty in the Gaussian field. Task-aware when it has a lead, coverage-driven when it does not.
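One way to picture that blend is below. The weighting rule is our assumption, not the paper's: the strongest semantic score among the candidates sets how much semantics counts, so a confident lead makes the choice task-aware and the absence of one lets information gain dominate. Both scores are assumed to be normalized to [0, 1].

```python
from typing import List, Tuple

Candidate = Tuple[str, float, float]  # (name, semantic score, information gain)


def choose_direction(candidates: List[Candidate]) -> str:
    """Blend semantic relevance and geometric information gain."""
    lead = max(semantic for _, semantic, _ in candidates)  # how strong is the best lead?

    def blended(candidate: Candidate) -> float:
        _, semantic, info_gain = candidate
        return lead * semantic + (1.0 - lead) * info_gain

    return max(candidates, key=blended)[0]


# Strong semantic lead: follow it even though the hall would reveal more geometry.
print(choose_direction([("door", 0.9, 0.2), ("hall", 0.1, 0.8)]))
# No lead: fall back to coverage.
print(choose_direction([("door", 0.1, 0.2), ("hall", 0.05, 0.8)]))
```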

The numbers

On an active embodied question-answering benchmark, GSMem scored higher than the object-graph and snapshot-based approaches it was compared against, and the gap widened in the lifelong navigation setting, where the agent accumulates memory across many episodes and has to retrieve things it saw long ago. Persistent, re-renderable memory helps most exactly where you would expect it to: over long horizons, where the older approaches have had the most time to lose information.

Why this is a Strongly kind of problem

If you have read our work on LLM memory and on modern retrieval, this will feel familiar in shape. We have argued before that memory and retrieval are where agents quietly succeed or fail in production. Spatial memory is the same argument with one more dimension. And the moment you try to take an architecture like this out of a benchmark and into a real deployment, the demo-day magic gives way to a set of engineering questions that decide whether it survives contact with Tuesday at 2am.

A spatial-memory agent is, underneath, a retrieval system whose corpus is a living 3D reconstruction. Everything we know about running retrieval in production applies, plus a set of new constraints that come specifically from the fact that the memory is geometric, continuous, and growing.

It grows without bound

Millions of Gaussians per room, accumulating for as long as the agent patrols. Footprint management is a first-class design constraint, not an afterthought.

Rendering is in the loop

Retrieval now means running a GPU render mid-decision, sometimes several times a step. A rendering budget sits right next to the token budget.

It captures the real world

A photorealistic, re-renderable record of a physical place and whoever was in it. That is a governance and privacy problem from day one.

Retrieval is now geometry

A bad retrieval can render a confident, photorealistic view of a place the field reconstructed poorly. A spatial hallucination is harder to catch than a text one.

The memory grows without bound

A scene graph is small. A snapshot scrapbook is bounded by how many frames you keep. A Gaussian field is neither. It grows as the agent explores, and in a lifelong deployment, a robot that patrols the same facility every day for a year, it grows without a natural stopping point. Millions of Gaussians per room is normal. The GSMem authors compress embeddings aggressively and degrade the color representation deliberately to hold the footprint down, and that is the tell: footprint management is not an afterthought here, it is a first-class design constraint.

In production this becomes a real set of decisions. What do you keep at full fidelity and what do you compress? When a region has not been queried in months, does it get evicted, archived to cold storage, or downsampled in place? When the physical space changes, shelves move, walls get repainted, a machine is replaced, how does the memory get updated rather than just accumulating a contradictory pile of old and new Gaussians for the same location? None of this shows up in a single-session demo. All of it shows up in month three of an actual deployment. This is day-two work, and it is the kind of thing we build for from the start rather than bolting on once the field has already ballooned. It is also live research: TGSFormer [3] exists specifically to merge overlapping Gaussians and regulate density so the memory stays compact as the scene grows, which is the eviction-and-compaction problem by another name.
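For a feel of what compaction means mechanically, here is a deliberately crude stand-in: bucket Gaussians into a voxel grid and merge everything that lands in the same cell. This is not TGSFormer's method, which the text above only summarizes. The voxel size, the opacity-weighted centroid, and the rule for combining opacities are all assumptions, and real Gaussians also carry shape and color that a merge would have to reconcile.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]
Gaussian = Tuple[Vec3, float]  # (position, opacity); shape and color omitted


def compact(gaussians: List[Gaussian], voxel: float = 0.5) -> List[Gaussian]:
    """Merge Gaussians that share a voxel into one representative."""
    cells: Dict[Tuple[int, int, int], List[Gaussian]] = defaultdict(list)
    for position, opacity in gaussians:
        key = tuple(math.floor(c / voxel) for c in position)
        cells[key].append((position, opacity))

    merged = []
    for members in cells.values():
        total = sum(opacity for _, opacity in members)
        if total <= 0.0:
            continue  # fully transparent cell, nothing worth keeping
        center = tuple(
            sum(pos[k] * opacity for pos, opacity in members) / total for k in range(3)
        )
        passthrough = 1.0
        for _, opacity in members:
            passthrough *= 1.0 - opacity  # light that gets past every member
        merged.append((center, 1.0 - passthrough))
    return merged


field = [((0.1, 0.1, 0.1), 0.5), ((0.3, 0.1, 0.1), 0.5), ((2.2, 2.2, 2.2), 0.8)]
print(compact(field))
```

Run on a schedule, something like this also gives you a natural place to resolve old-versus-new conflicts, for example by weighting recent observations more heavily before merging.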

Rendering sits inside the agent's reasoning loop

In a text agent, retrieval is a database lookup. In this architecture, retrieval includes rendering a novel view, which means running a GPU workload in the middle of the agent's decision loop, sometimes several times per step as it scores candidate viewpoints. Our inference series went deep on exactly this kind of problem: the difference between a model that works in a notebook and a system that serves under real latency and cost constraints. Splatting puts a rendering budget right next to the token budget, and the two compete for the same accelerators.

The questions that follow are concrete. How many candidate viewpoints can you afford to score before the agent's response time becomes unusable? Can you cache rendered views for regions that get queried repeatedly, the spatial equivalent of a prefix cache? When the agent is reasoning about a place it is not currently standing in, where does that rendering run, on the robot, at the edge, or in the cloud, and what does the round trip cost you? These are serving and economics questions, and they are the ones that determine whether a spatially aware agent is a product or a research demo.
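The caching question has a simple first answer worth sketching: quantize the camera pose into bins and keep a small LRU cache of rendered views per bin. Everything here is an assumption for illustration, including the (x, y, z, yaw) pose format, the bin sizes, and the RenderCache name. The render function is whatever your renderer exposes.

```python
from collections import OrderedDict
from typing import Any, Callable, Tuple

Pose = Tuple[float, float, float, float]  # (x, y, z, yaw in degrees)


class RenderCache:
    """LRU cache of rendered views, keyed by a quantized camera pose."""

    def __init__(self, render_fn: Callable[[Pose], Any], capacity: int = 64,
                 pos_step: float = 0.25, yaw_step: float = 15.0):
        self.render_fn = render_fn
        self.capacity = capacity
        self.pos_step = pos_step
        self.yaw_step = yaw_step
        self.hits = 0
        self._store: "OrderedDict[Tuple[int, int, int, int], Any]" = OrderedDict()

    def _key(self, pose: Pose) -> Tuple[int, int, int, int]:
        x, y, z, yaw = pose
        yaw_bins = round(360.0 / self.yaw_step)
        return (
            round(x / self.pos_step),
            round(y / self.pos_step),
            round(z / self.pos_step),
            round((yaw % 360.0) / self.yaw_step) % yaw_bins,  # 359 and 1 share a bin
        )

    def get(self, pose: Pose) -> Any:
        key = self._key(pose)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        image = self.render_fn(pose)
        self._store[key] = image
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used view
        return image

    def invalidate(self) -> None:
        """Call when the underlying field changes; stale views are worse than slow ones."""
        self._store.clear()


calls = []
cache = RenderCache(lambda pose: calls.append(pose) or f"image@{pose}")
cache.get((1.0, 2.0, 0.5, 90.0))
cache.get((1.05, 2.0, 0.5, 92.0))  # close enough to reuse the first render
print(len(calls), cache.hits)
```

The invalidate method is the part that bites in practice: unlike a prefix cache over static text, this cache sits on top of a memory that keeps changing underneath it.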

The reconstruction is a capture of the real world

Here is the part that is easy to skip and expensive to skip. A Gaussian field of a real place is a photorealistic, re-renderable reconstruction of that place. If the place is a hospital ward, an office, a customer's home, a secure facility, then the agent's memory is a high-fidelity recording of a physical environment, including whatever and whoever happened to be in frame while it mapped.

The memory can be rendered from angles no camera ever actually occupied, which is powerful for reasoning and uncomfortable for privacy, because it means the system can produce views of a space that were never literally photographed.

That has immediate governance consequences. Who is allowed to query that memory? Can a later request render a view of a part of the building the requester has no right to see? When someone asks for their data to be deleted, what does deletion even mean when their appearance is diffused across millions of Gaussians rather than stored in a row you can drop? If the agent is acting on what it reconstructs, you need the same things we argue for with any production agent: a record of what was captured, what was rendered, what was queried and by whom, and guardrails on what the spatial memory is allowed to be used for. A reconstruction of a real place that anyone can re-render from any angle is exactly the kind of capability that needs governance designed in, not discovered later.
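A minimal sketch of what "governance designed in" could look like at the render call, with heavy simplifications: zones are axis-aligned floor rectangles, permissions are a plain mapping from user to zones, and only the point the camera looks at is checked. A real system would have to check everything visible in the rendered frustum, not just the target. All names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional, Set, Tuple

Point = Tuple[float, float]
Box = Tuple[Point, Point]  # ((xmin, ymin), (xmax, ymax)) on the floor plan


@dataclass
class GovernedSpatialMemory:
    zones: Dict[str, Box]
    grants: Dict[str, Set[str]]  # user -> zones they may view
    audit_log: List[dict] = field(default_factory=list)

    def zone_of(self, point: Point) -> Optional[str]:
        x, y = point
        for name, ((x0, y0), (x1, y1)) in self.zones.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                return name
        return None

    def authorize_render(self, user: str, look_at: Point) -> bool:
        """Deny by default, and log every request whether or not it is allowed."""
        zone = self.zone_of(look_at)
        allowed = zone is not None and zone in self.grants.get(user, set())
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "look_at": look_at,
            "zone": zone,
            "allowed": allowed,
        })
        return allowed


memory = GovernedSpatialMemory(
    zones={"loading_dock": ((0.0, 0.0), (10.0, 10.0)), "server_room": ((10.0, 0.0), (15.0, 5.0))},
    grants={"ops_lead": {"loading_dock"}},
)
print(memory.authorize_render("ops_lead", (3.0, 4.0)))   # inside a granted zone
print(memory.authorize_render("ops_lead", (12.0, 2.0)))  # server room, not granted
print(len(memory.audit_log))
```

Deletion is the harder half and this sketch does not touch it. The log at least records which regions were rendered for whom, which is the minimum you need before a deletion request can even be scoped.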

Retrieval quality is now a geometry problem too

With text RAG, a bad retrieval gives the model the wrong passage. With spatial memory, a bad retrieval can give the VLM a rendered view that is geometrically plausible and semantically wrong, an angle that looks fine but misses the occluded thing that actually answers the question, or a region the field reconstructed poorly because the agent only ever saw it once from far away. The agent can render a confident, photorealistic image of a place it does not actually have good data for. That is a spatial hallucination, and it is harder to catch than a text one because it looks real.

Handling this in production means treating the field's own uncertainty as a first-class signal, knowing which regions are well-observed and which are thin, and being willing to say the memory is not good enough here rather than rendering a confident guess. The research already carries the ingredients for this; the field's opacity and the information-gain math give you a handle on where the reconstruction is trustworthy. Wiring that uncertainty through to the agent's behavior, so it explores or defers instead of confabulating, is the production discipline that separates a system you can trust from one that is occasionally and invisibly wrong.
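Wiring uncertainty through to behavior can start as a plain gate in front of the renderer. The sketch below assumes the system tracks, per region, how many distinct views observed it and the mean opacity of its Gaussians. Both signals and both thresholds are illustrative stand-ins for whatever uncertainty measure your field actually exposes.

```python
from typing import Dict


def decide(region: Dict[str, float], can_explore: bool,
           min_views: int = 3, min_opacity: float = 0.6) -> str:
    """Choose between answering from memory, gathering more data, and declining."""
    well_observed = (
        region["views"] >= min_views and region["mean_opacity"] >= min_opacity
    )
    if well_observed:
        return "render_and_answer"
    # Thin data: go look again if the agent can, otherwise say so explicitly.
    return "explore" if can_explore else "defer"


print(decide({"views": 12, "mean_opacity": 0.85}, can_explore=True))
print(decide({"views": 1, "mean_opacity": 0.30}, can_explore=True))
print(decide({"views": 1, "mean_opacity": 0.30}, can_explore=False))
```

The "defer" branch is the one most demos leave out, and it is the one that keeps a photorealistic guess from being presented as an observation.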

Why this is a direction, not a single paper

It would be easy to read GSMem as one clever result and move on. The reason to take it more seriously is that it is not alone. Several independent groups, working separately, have arrived at variations of the same architecture within months of each other, and the differences between them map almost exactly onto the production problems above.

[2]

GaussExplorer

Uses semantic 3D Gaussians as a compact episodic memory and adds a "VLM-as-judge" step that evaluates rendered views to pick the most informative one before reasoning. Same core idea, different mechanism for the viewpoint problem, which tells you viewpoint selection is real and contested.

Jan 2026

[3]

TGSFormer

Goes straight at the unbounded-memory problem. Maintains a persistent, compact Gaussian memory for large scenes and explicitly merges overlapping primitives to regulate density. The eviction-and-compaction problem as a first-class research target.

Nov 2025

[4]

SAGE-3D

Attacks the gap between a splat field you look at and one a robot can act in. Makes Gaussians physically executable, with collision geometry and object-level semantics, so "go to the red chair next to the white bookshelf" can be grounded and executed, not merely rendered.

Oct 2025

[5]

Embodied Gaussians

Closes the loop entirely: a robot keeps a live, correctable world model from Gaussians plus physics particles, renders it, compares the render to its actual cameras, and corrects the model in real time at 30 Hz from three cameras. Spatial memory turned into a running world model.

Realtime

The infrastructure is converging on the same point. NVIDIA has been pushing splatting through its robotics stack, with NuRec neural reconstruction inside Isaac Sim and Gaussian fields serving as the geometric backbone for its Cosmos world models [6]. When the research and the platform vendors arrive at the same representation in the same year, from opposite ends, that is the pattern that precedes adoption rather than the pattern of a passing trend.

Read together, the trajectory is clear. Spatial memory for agents is moving through four stages, and the major infrastructure is already laid down underneath.

Stage 1

Represent

Hold a place re-renderably. GSMem, GaussExplorer.

Stage 2

Scale

Keep it compact over a lifetime. TGSFormer.

Stage 3

Act

Move and stay synced with a changing world. SAGE-3D, Embodied Gaussians.

Stage 4

Infrastructure

Platform backbone already laid. NVIDIA NuRec, Cosmos.

Those are exactly the stages a capability passes through on its way to production.

What this looks like in practice

The applications are not exotic. In every one of these, the value is not the pretty reconstruction. It is the agent's ability to go back and look again at something it has already left behind.

Warehouse

Answer where the misrouted pallet ended up by re-observing aisles it walked hours ago.

Inspection

When asked after the fact whether it saw corrosion near a valve, render the view and check, rather than shrugging because nobody labeled it at the time.

Facilities

Hold a persistent, queryable model of a building that anyone authorized can interrogate.

Retail

Understand the floor as a space, not a database of SKUs.

That ability is real and it is close. The teams that turn it into something dependable will be the ones who treat the splat field as what it is: a growing, expensive, sensitive, sometimes-wrong memory that has to be governed, served, and maintained like any other production system, only with geometry layered on top. The demo is the easy part. Making an agent remember a place well enough to bet on what it tells you, on day two and day two hundred, is the work. That is what we do.
