Building The Invisible City, Part 3: How event-driven AI agents create spatial interfaces that don’t exist until you need them
This is the third in a series of posts on my distributed, A2A-based agents as a service mesh within apps, built with Gemini. See part 1 and part 2 for a technical deep dive.
“Dad! I see a glowing blue pipe under the fire hydrant. Can we add kittens down there?”
That was my daughter’s first reaction to seeing our street in Augmented Reality. Not the three AI agents working in concert. Not the Gemini model generating a spatially-aware overlay. Kittens. She saw boring pavement overlaid with information, watched a hidden world appear underneath, and immediately wanted to populate it with collectibles. Actually, I think that’s a fun idea.
Remember the skateboard that started all this? The one that went down the storm drain last summer? She finally got her X-ray vision. But what she’s seeing now is bigger than what I imagined when I climbed down that manhole cover. She’s seeing a canvas.
And that changed how I think about AR.
From projecting data to processing reality
Back in 2014 I flew up to Google’s Base Camp in NYC for the Glass pre-launch hackathon. That’s where I got the Pioneer crystal that still sits on my office shelf next to the original Glass hardware. I built LynxFit, an AR fitness app that floated your workout stats into your peripheral vision. We thought we were pioneering the future. We were actually discovering its limits.
Here’s what I took away from those Glass days, and it took me years to articulate it: we treated AR as a display problem. Show notifications. Project metrics. Cram a stripped-down smartphone UI onto a tiny prism and call it spatial computing. I’d demo LynxFit to runners and they’d get excited about seeing their pace floating mid-air, but we never got digital artifacts to truly anchor spatially on the real world. We were so busy putting information IN the world that we never stopped to ask what the world was telling US.
Fast forward to last year’s Android XR announcement.
In this world of AI Agents, the concept of building apps for XR is dying. The job is no longer building apps for screens. The job is orchestrating AI agents that understand the physical geometry, context, and intent of wherever you’re standing.
I’ve been predicting this for over a decade; check out my SpeakerDeck talks on XR. We went from projecting data to processing reality. That’s the whole shift.
The event that changed everything
Adding mobile AR to The Invisible City was the unlock. When bounding boxes started locking onto a real fire hydrant and tracking as I walked around it, something clicked. This was ambient intelligence responding to the world.
Here’s what actually runs in live mode right now:
- Camera moves. Motion event fires at 30fps from the phone.
- Surface Eye auto-detects. New markers stream in over WebSocket.
- Pattern Oracle notices the anomaly. “That gas line is only 18 inches deep. That’s shallow.”
- Depth Renderer paints a red warning overlay.
- Gemini Live speaks up. “Gas line predicted at current position.”
Nobody said “hey AI, analyze this.” The system sees, understands, and tells you. Autonomously.
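The pipeline above can be sketched as a plain asyncio loop. `Marker`, `surface_eye_detect`, and `pattern_oracle_anomalies` are illustrative stand-ins for the real agents, and the 24-inch threshold is an assumption for the demo, not the system’s actual rule:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Marker:
    utility: str       # "gas", "water", "telecom"...
    depth_inches: int

# Hypothetical stand-ins for the agents; not the Invisible City implementation.
def surface_eye_detect(frame: bytes) -> list[Marker]:
    # In the real system this streams markers over WebSocket at 30fps.
    return [Marker("gas", 18)]

def pattern_oracle_anomalies(markers: list[Marker]) -> list[str]:
    # Flag anything shallower than an assumed ~24" typical burial depth.
    return [f"{m.utility} line is only {m.depth_inches} inches deep. That's shallow."
            for m in markers if m.utility == "gas" and m.depth_inches < 24]

async def live_loop(frames) -> list[str]:
    alerts = []
    for frame in frames:                      # camera motion events
        await asyncio.sleep(0)                # yield to the event loop between frames
        markers = surface_eye_detect(frame)   # Surface Eye auto-detects
        for warning in pattern_oracle_anomalies(markers):
            alerts.append(warning)            # Gemini Live would speak this aloud
    return alerts

alerts = asyncio.run(live_loop([b"frame-0"]))
print(alerts[0])  # → "gas line is only 18 inches deep. That's shallow."
```

The point of the shape is that nothing in the loop waits for a user prompt; frames arrive, agents react.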
I tested this walking down my street. The moment I crossed a utility marker cluster, the voice kicked in with something close to: *three utilities converging here, water main at standard depth, telecom conduit appears to cross above it, unusual configuration.* (That one’s illustrative of what the system narrates, not a verbatim log. The real transcripts are messier.) The point stands. I didn’t ask. It just knew this was worth mentioning.
That’s a different kind of AR than anything I built on Glass.
Making streets transparent
Here’s the moment developers lean in. We tell Gemini to make the street transparent, and it understands what that means spatially.
```python
config = types.GenerateContentConfig(
    response_modalities=["IMAGE", "TEXT"],
)

response = await client.aio.models.generate_content(
    model="gemini-3.1-flash-image-preview",
    contents=[
        types.Part.from_bytes(data=image_data, mime_type="image/jpeg"),
        prompt,
    ],
    config=config,
)
```
That config line is doing a lot of work. We’re asking Gemini to generate an image AND explain its reasoning in the same call. It’s creating the visualization, not just classifying pixels.
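To make that dual-modality response concrete, here’s one way the two kinds of parts could be separated once the call returns. The classes below are local mocks shaped like the SDK’s response parts (`text`, `inline_data`), not a live API call:

```python
from dataclasses import dataclass
from typing import Optional

# Local mocks approximating the response part shape; not the google-genai objects.
@dataclass
class InlineData:
    data: bytes
    mime_type: str

@dataclass
class Part:
    text: Optional[str] = None
    inline_data: Optional[InlineData] = None

def split_modalities(parts: list[Part]) -> tuple[list[bytes], list[str]]:
    """Separate generated images from the model's textual reasoning."""
    images = [p.inline_data.data for p in parts if p.inline_data]
    texts = [p.text for p in parts if p.text]
    return images, texts

parts = [
    Part(text="Pavement made semi-transparent; utilities rendered below."),
    Part(inline_data=InlineData(b"...png bytes...", "image/png")),
]
images, texts = split_modalities(parts)
```

One response, one image to composite into the AR view, and one explanation you can surface or log.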
The prompt is where the behavior comes from:
```python
"transparent": """
CRITICAL: You must PRESERVE the original image exactly, only modifying the
ground surface to be semi-transparent.

Instructions:
1. Keep all buildings, cars, people EXACTLY as they are
2. Apply "frosted glass" effect ONLY to pavement/road
3. Show underground utilities glowing beneath this surface
4. Use APWA colors for the lines (Red: Electric, Blue: Water…)
5. Do NOT add any text, labels, or UI elements
""",
```
Gemini figures out what’s ground versus what isn’t, applies transparency selectively, embeds utilities so they look like they’re IN the earth rather than floating on top, and preserves perspective while revealing depth. Remember in post one when Surface Eye kept confusing manhole covers with frisbees? That same model family now understands material properties and 3D space well enough to make asphalt selectively see-through while leaving every car and pedestrian untouched.
Pattern Oracle’s 350-foot spacing math from post two? It renders as glowing pipes at the correct depth beneath that transparent layer. The three agents finally show up together in one frame.
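As a back-of-the-napkin illustration of depth placement (not the actual renderer, which works from Gemini’s generated imagery), here’s how a predicted pipe depth could map to a screen position under the transparent layer, using the APWA color convention the prompt already references:

```python
# APWA uniform color code for utility marking (Red: electric, Yellow: gas,
# Orange: telecom, Blue: potable water).
APWA_COLORS = {"electric": "red", "gas": "yellow", "telecom": "orange", "water": "blue"}

def pipe_overlay(surface_y: int, depth_ft: float, px_per_ft: float, utility: str) -> dict:
    """Toy mapping: anchor pixel + depth * scale = where the pipe draws on screen.
    px_per_ft would come from camera pose estimation in a real renderer."""
    return {
        "y": round(surface_y + depth_ft * px_per_ft),   # screen y grows downward
        "color": APWA_COLORS.get(utility, "gray"),       # gray = unrecognized utility
    }

overlay = pipe_overlay(surface_y=400, depth_ft=3.0, px_per_ft=25.0, utility="water")
# {'y': 475, 'color': 'blue'}
```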
Generative UI: interfaces that exist only when you need them
My daughter’s kitten request pointed at something I hadn’t put words around yet.
In traditional software, developers hard-code every button, menu, and dashboard ahead of time. The UI waits for you. It’s always there whether you need it or not.
In our AR system the interface is ephemeral. It doesn’t exist until you need it. When she asked for kittens, she wasn’t requesting a feature in a backlog. She was describing an interface that should generate itself on demand.
Here’s a simplified version of how we branch context. The real system is more complex and the context classifier is itself a model call. This is the illustrative skeleton:
```python
# Simplified illustration, not the full implementation
async def generate_contextual_interface(
    user_context: str,        # "construction_worker" | "child" | "city_planner"
    live_markers: list,       # streaming in from Surface Eye
    current_inference: dict,  # from Pattern Oracle
):
    if user_context == "child":
        prompt = f"""
        Create an AR treasure hunt interface:
        - Hide virtual objects near {len(live_markers)} detected markers
        - Make them glow and pulse to attract attention
        - Add particle effects when discovered
        """
    elif user_context == "construction_worker":
        prompt = f"""
        Create utility safety overlay:
        - Highlight shallow gas lines in red
        - Show required dig clearances
        - Display 811 call status for this location
        """
    return await gemini.generate_spatial_ui(prompt, current_scene)
```
Same physical street, completely different generated reality. I’ve tested both. The construction worker sees depth measurements and safety zones. My daughter sees glowing collectibles near the hydrants.
Now extend that to head-mounted displays. You’re wearing Android XR glasses, you look at a street-level junction box, and the AI doesn’t just identify it. Based on gaze duration (are you studying it?), your role (electrician on call?), and context (power outage reported on this block?), it generates a custom interface: a diagnostic panel with voltage readings, arrows pointing to the main breaker, step-by-step reset instructions positioned spatially, a voice asking whether it should dial the utility for you.
The interface doesn’t exist until the moment you need it. Look away and it’s gone. No menus to close. No windows to minimize.
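A minimal sketch of that lifecycle, with an assumed dwell threshold and a string standing in for the generated panel; the class, threshold, and method names are all invented for illustration:

```python
DWELL_THRESHOLD_S = 1.5  # assumed "are you studying it?" gaze dwell time

class EphemeralUI:
    """Generate an interface on sustained gaze; discard it on look-away."""

    def __init__(self):
        self.panel = None
        self._gaze_start = None

    def on_gaze(self, target: str, now: float):
        if self._gaze_start is None:
            self._gaze_start = now       # gaze just landed on the target
        elif self.panel is None and now - self._gaze_start >= DWELL_THRESHOLD_S:
            # Stand-in for the Gemini call that would generate the spatial UI.
            self.panel = f"diagnostic panel for {target}"

    def on_gaze_lost(self):
        self.panel = None                # look away and it's gone
        self._gaze_start = None

ui = EphemeralUI()
ui.on_gaze("junction box", now=0.0)
ui.on_gaze("junction box", now=2.0)     # dwell exceeded: panel exists
panel_while_looking = ui.panel
ui.on_gaze_lost()                       # no menus to close; panel is simply gone
```

No window manager, no close button: the interface’s lifetime is the gaze itself.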
This is the thing I couldn’t do in 2014. On Glass we had to pre-build every possible interface. Now Gemini generates them on demand.
What I shipped (and what surprised us)
Here’s what actually happens when you pull up the app today.
Live camera mode with real-time narration. Hit “Start Live” and your camera feed comes up with a scan line effect. Surface Eye processes 30fps video, streaming markers over WebSocket. Gemini Live is listening the whole time and narrates what it sees: *I see a fire hydrant at twelve o’clock about ten feet ahead, blue spray paint indicates water main below, orange marks suggest telecom crossing at this intersection.* You didn’t ask. It assumed you’d want to know.
One honest caveat here. Gemini 2.5 Flash Native Audio Preview’s tool calling during Live API sessions is unreliable, failing roughly once in every 20 calls. That’s GitHub issue #843 if you want to follow along. With an assist from Gemini CLI, I implemented a parallel 15-second backend timer that catches dropped tool invocations and replays them. Not sexy. Absolutely necessary. Production AR lives and dies on these “boring” reliability patches.
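The watchdog pattern is simple to sketch: race the model against a timer, and if the expected tool call never arrives, replay it from our side. Only the 15-second window comes from the text; `wait_for_model` and `replay` are hypothetical stand-ins:

```python
import asyncio

async def with_tool_call_watchdog(expected_tool, invoke_fn, wait_for_model,
                                  timeout=15.0):
    """If the Live session delivers the tool call in time, use it; otherwise
    invoke the tool ourselves so the session never silently stalls."""
    try:
        return await asyncio.wait_for(wait_for_model(), timeout=timeout)
    except asyncio.TimeoutError:
        # Model dropped the invocation; run the tool and feed the result back.
        return await invoke_fn(expected_tool)

async def demo():
    async def never_arrives():
        await asyncio.sleep(3600)   # simulate a dropped tool call

    async def replay(tool):
        return f"replayed {tool}"

    # Tiny timeout so the demo runs instantly; production would use 15s.
    return await with_tool_call_watchdog("surface_eye_analysis", replay,
                                         never_arrives, timeout=0.05)

result = asyncio.run(demo())
# → "replayed surface_eye_analysis"
```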
Four visualization styles, each rendered from the same camera frame:
Transparent. Pavement turns into frosted glass. Blue water pipes glow beneath, red electric lines pulse. Everything above ground stays exactly as shot.
X-Ray. Whole scene goes dark blueprint. Only the infrastructure glows. Looks a little like Blade Runner, generated live from a phone.
Cutaway. My favorite. Like somebody took a massive saw to the earth and sliced a cross-section. You see soil layers, gravel beds, pipes at their actual depths. Construction crews get it immediately.
AR Overlay. The mode built for walking around. Holographic boxes lock onto real markers, glowing lines connect underground paths, distances float in mid-air. All of it updating at 30fps.
And because I promised honesty: early on, Gemini would generate a beautiful transparent street… with a random giraffe standing in it. Or it would make the road transparent AND the cars transparent. One particularly special output I can only describe as “pipes having an existential crisis in the void.” So we built a deterministic fallback path:
```python
if not visualization_base64:
    logger.info("Falling back to deterministic compositing")
    # Gemini failed? No problem. Draw it ourselves.
    visualization_base64 = await create_visualization(
        surface_image_base64,
        surface_analysis,
        network_inference,
        style=style,
    )
```
When the model misbehaves we composite the visualization with traditional graphics. The user never sees the failure. This hybrid pattern, AI when it works and deterministic when it doesn’t, is what separates a demo from something people actually walk around with.
Anti-chatbot: event-driven agents
Let me be direct about something. This is not a chatbot with a camera strapped on.
I have notoriously ranted about chatbots being a lazy interface (though they were useful for showcasing what LLMs could do back in 2022). They all work the same way: you ask, they answer. Even the fancy ones wait for you to upload an image and type a question.
Our agents are event-driven. They respond to the world, not to prompts. Here’s the actual WebSocket handler from live mode:
```javascript
onToolCall: ({ name, args }) => {
  if (name === "surface_eye_analysis") {
    // Agents detected something, no user prompt needed
    const markers = args.markers;
    updateAROverlay(markers);
    if (hasAnomaly(markers)) {
      speakWarning("Unusual configuration detected");
    }
  }
}
```
The user isn’t asking “what do you see?” The agents are continuously processing, inferring, and alerting.
Where this goes
My daughter wants kitten mode. Her playful idea is pointing at something serious.
What I’m building right now, not someday: multi-user AR sessions where two phones pointed at the same street see the same generated overlays, backed by shared spatial state from the same agent cluster. Voice-first interaction through Gemini Live, so the interaction model is plain conversation. On-device Pattern Oracle inference for head-mounted displays, because cloud round-trips kill presence; the loop has to close in under 100ms.
The part that made me rethink the whole premise. We keep framing AR as “adding to reality.” What if we’ve had it backwards the whole time? What if reality is just the default UI, and AR lets us generate better ones on top of it?
A street is a UI for transportation. A wall is a UI for spatial division. A door is a UI for access control. Architects and engineers designed those interfaces decades or centuries ago and we’ve been stuck with them. Now we can generate new interfaces over the old ones in real-time, personalized to whoever’s looking.
A surgeon sees vitals floating over organs, inferred from skin color change and chest movement rather than from wired sensors. A chef sees timers over pots and temperature gradients over pans, generated from the ingredients and the tools on the counter. My daughter sees a treasure hunt where virtual kittens hide near real fire hydrants, and the game generates differently each time depending on weather, time of day, and which markers Surface Eye finds. The game isn’t something we built. It’s something the system builds, on the fly, against the physical world as the canvas.
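That “generates differently each time” idea can be made concrete with a seeded spawner: same conditions, same hunt for everyone in the session; new weather or hour, new layout. Every name here is invented for illustration:

```python
import random

def spawn_kittens(marker_ids: list[str], weather: str, hour: int,
                  count: int = 3) -> list[str]:
    """Toy deterministic 'kitten mode': seed the hunt with the day's conditions
    so the same street produces a different, but shared, layout each session."""
    rng = random.Random(f"{weather}-{hour}-{','.join(marker_ids)}")
    return sorted(rng.sample(marker_ids, k=min(count, len(marker_ids))))

markers = ["hydrant-1", "manhole-2", "valve-3", "pole-4"]  # from Surface Eye
hunt = spawn_kittens(markers, weather="rain", hour=16)
# Three hiding spots, reproducible for every player in the same session.
```

Because the seed is shared state, two kids on the same street hunt the same kittens, which is exactly the multi-user property the agent cluster already provides for utility overlays.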
The stack
This runs on:
- Gemini Live API for voice-first AR interaction (DeepMind)
- Vertex AI Agent Engine orchestrating the three agents at scale
- Google Cloud Run for low-latency edge deployment
- And, in the future, Android XR as the spatial computing foundation
The same Gemini model that writes poetry is generating spatially-aware AR overlays. The same Agent Engine that powers chatbots is orchestrating real-time vision systems. We’ve moved from APIs that process to agents that perceive.
Your turn
The Invisible City started with a skateboard in a storm drain. It became a system that makes streets transparent. That’s the beginning, not the end.
My daughter wants to hide virtual kittens in the pipes for other kids to find. She’s already designing power-ups and debating whether kittens should glow different colors based on which utility they’re near. (They should.)
Construction crews don’t just want to see what’s safe to dig. They want the AI to warn them before they pick up a shovel: based on your proximity to the marked gas line and current wind conditions, approach from the north.
City planners don’t just want dashboards of infrastructure capacity. They want patterns humans miss: this neighborhood’s water usage peaks 30 minutes earlier than surrounding areas, pipe stress indicators suggest upgrading this section before the predicted failure in 18 months.
So here’s my real question. What will you build when the interface generates itself based on who’s looking? What’s your kitten mode… the playful, practical, or profound overlay you’d put on top of reality if the model would just do the drawing for you?
The code is real. The platform is here. The models are ready. My daughter already has a feature list.
Let’s build something different… something that matters.
---
What comes next in the series will be determined by what I get access to in May, but I hope it’s my journey deploying this on real AI glasses. Stay tuned.
Noble is a Google Developer Expert for AI/ML and a Glass Pioneer from the 2014 NYC Base Camp hackathon, currently obsessed with what happens when AI agents can see, reason about, and generate spatial interfaces in real-time. Probably testing AR features on his street right now, much to his neighbors’ confusion.
Posts one and two in the series: Building a Vision-Powered Infrastructure Detection Agent with Gemini 3 · Building an A2A Discoverable Reasoning Agent with Domain Knowledge
Follow the journey: YouTube: Noble Ackerson
Google Glass Projected Data. Gemini Generates Reality was originally published in Google Developer Experts on Medium.
