
The Embedding Race Looks Like a Tie. It Isn't.

Google didn't invent multimodal embeddings. Amazon got there first. So what exactly did Google ship? An architectural argument — and it's one worth understanding.


A few months ago, Amazon announced Nova Multimodal Embeddings. It maps text, images, video, and audio into a single vector space. One model, one API, unified retrieval across modalities.

Three months later, Google announced Gemini Embedding 2. It maps text, images, video, and audio into a single vector space. One model, one API, unified retrieval across modalities.

Google called it "our first natively multimodal embedding model."

Amazon had already called theirs "the first unified embedding model" that does exactly the same thing.

So which is it? Did Google ship something meaningfully new, or is this a badge race dressed up in launch copy?

The answer depends on one word Google keeps using. Natively.

Two Roads to the Same Destination

Imagine you want to build a library where books, photographs, films, and recordings all sit on the same shelf — findable by the same search, comparable to each other.

You have two ways to build it.

Option one: hire a specialist for each medium. A librarian for books. A curator for photos. A film archivist for video. A sound engineer for audio. Each one catalogs their collection in their own system. When someone searches, you query all four systems separately, collect the results, and stitch them together at the end. This is called late fusion. The modalities never actually meet — they just get reconciled after the fact.

Option two: build one unified cataloging system from scratch, designed from day one to understand all four mediums simultaneously. Every item — regardless of whether it's a book, a photo, a film, or a recording — gets understood through the same lens and placed in the same index. This is what "natively multimodal" means.

Most multimodal embedding models today — including Amazon Nova and Voyage — use some version of the first approach. Train strong individual encoders, project them into a shared mathematical space, align them. It works. The results sit in the same vector space. You can compare them.
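The late-fusion pipeline can be sketched in a few lines. Everything below is a stand-in: the projection matrices are random rather than trained, and the encoder outputs are random feature vectors. The shape of the pipeline is the point: separate encoders, separate projections, comparability only after the fact.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality encoders. Real systems use a trained text
# encoder and a trained image encoder; random matrices stand in here
# purely to show the structure.
TEXT_DIM, IMAGE_DIM, SHARED_DIM = 512, 768, 256
text_projection = rng.standard_normal((TEXT_DIM, SHARED_DIM))
image_projection = rng.standard_normal((IMAGE_DIM, SHARED_DIM))

def normalize(v):
    return v / np.linalg.norm(v)

def embed_text(text_features):
    # Encoder output, then a learned projection into the shared space.
    return normalize(text_features @ text_projection)

def embed_image(image_features):
    # A *different* encoder and projection, aligned during training so
    # that matching pairs land close together in the shared space.
    return normalize(image_features @ image_projection)

# The modalities never meet inside a model; they only become
# comparable after projection, via cosine similarity.
text_vec = embed_text(rng.standard_normal(TEXT_DIM))
image_vec = embed_image(rng.standard_normal(IMAGE_DIM))
similarity = float(text_vec @ image_vec)
print(round(similarity, 3))
```

Notice that all cross-modal understanding lives in the alignment of the two projections; nothing in the pipeline ever processes a text and an image jointly.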

Gemini Embedding 2 claims to use the second approach. One transformer backbone. All modalities processed through the same architecture simultaneously. The model doesn't learn to translate between modalities — it learns to understand them together.

That's an architectural claim worth taking seriously. But also one worth interrogating.

We've Seen This Movie Before

Two years ago, the same debate played out in large language models.

Early multimodal LLMs took the late fusion approach — a strong language model, a strong vision encoder, a connector layer in the middle. GPT-4V worked this way initially. The results were impressive but had a ceiling. The model was good at describing images and answering questions about them. It struggled with tasks that required deep reasoning across modalities — understanding the relationship between what was said and what was shown, not just each in isolation.

Then natively multimodal models emerged — trained from the beginning on interleaved text and images, not bolted together after. The improvement on cross-modal reasoning tasks was significant. Not because the individual modalities got better, but because the model developed genuine understanding of how they relate.

The pattern: late fusion gets you most of the way there, fast. Native architecture gets you the rest, eventually — but the "rest" turns out to matter more than expected on the hardest tasks.

Gemini Embedding 2's benchmark numbers hint at the same dynamic. On straightforward text-to-image retrieval, the gap between Google and competitors is real but not dramatic. On text-to-video retrieval — the task that requires understanding motion, timing, and semantic content simultaneously — the gap widens considerably. Google scores 68.8. Amazon scores 60.3. That's exactly where you'd expect a natively multimodal architecture to pull ahead.

The Thing That Actually Doesn't Exist Yet Elsewhere

One capability in Gemini Embedding 2 has no direct equivalent in any competing model right now: interleaved input.

You can pass an image and a text description together in a single request, and get back one embedding that captures the relationship between both — not two embeddings that get averaged or combined downstream. The model understands the image in the context of the text. A product photo alongside a query gets processed as a unified intent, not as two separate retrievals that get reconciled later.

This matters for a specific class of applications — ones where the meaning only emerges from the combination. A user's voice note and the screenshot they took at the same moment. A video frame and the timestamp label. A diagram and its caption. Previously you'd retrieve each separately and hope the ranking held. Now the combination is the unit of retrieval.
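A minimal sketch of the difference, using stand-in vectors rather than a real model. The `embed_interleaved` function below is hypothetical, not Gemini's API, and its element-wise interaction is a toy; the contrast it illustrates is the real one: a stitched combination applies a fixed, content-blind rule, while a joint embedding lets one input shape the meaning of the other.

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize(v):
    return v / np.linalg.norm(v)

# Stand-in embeddings for one image and one text query.
image_vec = normalize(rng.standard_normal(256))
text_vec = normalize(rng.standard_normal(256))

# Late-fusion workaround: embed separately, then combine downstream.
# The combination rule (here: a mean) is fixed in advance and blind to
# content. It cannot express how the text changes what the image means.
stitched = normalize((image_vec + text_vec) / 2)

# Interleaved input (sketch): one request, one embedding. A real
# natively multimodal model computes this jointly inside the network;
# the element-wise product here is only a placeholder for "the inputs
# interact before a single vector comes out."
def embed_interleaved(image, text):
    return normalize(image * text)

joint = embed_interleaved(image_vec, text_vec)
print(joint.shape)  # one vector for the pair, not two to reconcile
```

The design consequence is downstream: with interleaved input, the combined intent is indexed and retrieved as a single point, so there is no ranking step where two separate result lists have to be merged.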

Whether that unlocks applications that weren't possible before, or just makes existing ones cleaner — that's the open question.

What's Not Resolved Yet

Native architecture is a legitimate technical distinction. It doesn't automatically mean better outcomes in production.

Late fusion models have years of iteration, real-world testing, and infrastructure investment behind them. Google's model is in public preview; the benchmark numbers are Google's own, and independent production benchmarks are still thin. And the re-indexing cost to migrate is non-trivial — you have to re-embed your entire corpus, because the new vector space is incompatible with the old one.
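A back-of-envelope sketch of that migration cost. Every number below is an illustrative assumption — corpus size, average document length, and the per-token price are all made up — but the structure of the estimate carries over to any real corpus.

```python
# Back-of-envelope re-indexing cost. All figures are illustrative
# assumptions, not quoted prices.
corpus_docs = 10_000_000          # documents to re-embed
tokens_per_doc = 800              # assumed average length
price_per_million_tokens = 0.10   # hypothetical embedding price, USD

total_tokens = corpus_docs * tokens_per_doc
cost_usd = total_tokens / 1_000_000 * price_per_million_tokens
print(f"{total_tokens:,} tokens -> ${cost_usd:,.0f}")
# 10M docs x 800 tokens = 8B tokens, roughly $800 at these assumptions,
# before storage, pipeline time, and the cost of running two indexes
# in parallel during the cutover.
```

The API bill is usually the small part; the operational cost of double-writing and validating a second index during migration tends to dominate.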

And there's a version of this story where the architectural difference doesn't matter much for most use cases. If you're doing text-to-image product search, any of these models probably works fine. The native advantage shows up at the edges — complex cross-modal reasoning, audio fidelity, tasks where the relationship between modalities is the point.

The Question Worth Sitting With

In infrastructure, architecture compounds. The choices that look like implementation details in year one become the constraints or the advantages in year three.

The late fusion vs. native multimodal debate in embeddings is young. We don't have enough production evidence to call it. But we have the LLM precedent, we have the benchmark shape — widening gaps on harder tasks — and we have one capability (interleaved input) that the stitched approach genuinely can't replicate.

Google didn't invent unified multimodal embeddings. Amazon got there first commercially. What Google shipped is a different architectural bet on how to get there — and an early signal that the bet might be paying off on the tasks that are hardest to fake.

Whether that matters to you depends on what you're building. If your retrieval needs are mostly text, or text-plus-image in simple combinations, the architecture story is interesting but not urgent. If you're building something where the meaning lives in the relationship between modalities — where a transcript isn't good enough, where context requires combination — the architectural choice is worth thinking about more carefully than most people currently are.

The destination looks the same. The road is different. In infrastructure, the road has a way of mattering more than it seems upfront.
