Gemini Embedding 2: Multimodal RAG Over Your Images

Multimodal RAG Without an OCR Pipeline

For years, putting a chart, a scanned invoice, or a product photo into a retrieval system meant gluing together an OCR step, a captioning model, a vision-only embedding stack, and a separate text store. Each piece added latency, cost, and one more thing to break. Google just collapsed that whole pipeline into a single managed tool: the Gemini API File Search tool now runs on Gemini Embedding 2, the first natively multimodal embedding model in the Gemini family. It maps text, images, video, audio, and documents into one shared vector space, so a question typed in plain English can pull back the right slide image or the right paragraph from the same store.

The appeal for developers is that it removes the most tedious part of building RAG over real documents. You no longer OCR a chart to make it searchable. You drop the PNG into a File Search store, and the model embeds the pixels directly. Retrieval works on the image itself rather than on a lossy text approximation of it. On top of that, Google made storage and query-time embeddings free, so the only thing you pay for is the one-time indexing pass.

In this guide you will build a working multimodal RAG service from scratch in Python: create a multimodal store, index a mix of text and image assets, ask grounded questions, read back citations so you can verify every answer, filter by metadata, and force the model to return clean JSON. Everything here is checked against the official Gemini API docs (Interactions API, currently in Beta).

Prerequisites

Python 3.9 or newer.
A Gemini API key from Google AI Studio (aistudio.google.com/apikey). The free tier gives you a 1 GB File Search store, which is plenty for this tutorial.
The Google Gen AI SDK, version 2.0.0 or newer. The Interactions API and gemini-embedding-2 require the new SDK.
Basic familiarity with what an embedding is. You do not need to manage vectors yourself; File Search is a fully managed RAG service.

pip install -U google-genai
export GEMINI_API_KEY="your-key-here"

The SDK reads GEMINI_API_KEY from the environment automatically, so genai.Client() needs no arguments.

Step 1: Create a multimodal File Search store

A File Search store is a persistent container for your document embeddings. Raw files you upload through the Files API are deleted after 48 hours, but anything imported into a store lives until you delete it. The one decision that makes the store multimodal is the embedding_model: set it to models/gemini-embedding-2 and the store can embed images and text in the same vector space.

from google import genai

client = genai.Client()

store = client.file_search_stores.create(
    config={
        "display_name": "product-kb",
        "embedding_model": "models/gemini-embedding-2",  # the multimodal model
    }
)

print(store.name)
# -> fileSearchStores/product-kb-7f3a9c12

Hold on to store.name. It is a globally scoped identifier you will pass to every query. If you forget it, you can list your stores with client.file_search_stores.list().
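
If you lose the identifier, a small lookup over the list call can recover it. The sketch below assumes each listed store exposes name and display_name attributes, mirroring the fields used in the create call; find_store_name is a hypothetical helper, not part of the SDK, and it is exercised here with stand-in objects rather than a live client.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def find_store_name(stores: Iterable, display_name: str) -> Optional[str]:
    """Return the resource name of the first store whose display_name matches."""
    for s in stores:
        if getattr(s, "display_name", None) == display_name:
            return s.name
    return None


# Stand-in objects for illustration only. With a live client you would pass
# client.file_search_stores.list() instead (attribute names are an assumption).
fake_stores = [
    SimpleNamespace(name="fileSearchStores/support-kb-1a2b", display_name="support-kb"),
    SimpleNamespace(name="fileSearchStores/product-kb-7f3a9c12", display_name="product-kb"),
]
print(find_store_name(fake_stores, "product-kb"))
# -> fileSearchStores/product-kb-7f3a9c12
```

The helper returns the first match only, so if you create several stores with the same display name you will want to disambiguate some other way.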

Step 2: Index a mix of text and images

Now feed the store. The point of Embedding 2 is that you treat an image file exactly like a text file: upload it and import it. There is no captioning or OCR step. Below we index a Markdown spec sheet and two images, a revenue chart and a product photo, into the same store. The upload_to_file_search_store call uploads and indexes in one shot, returning a long-running operation you poll until it finishes.

import time

def index_file(path, display_name, metadata=None):
    cfg = {"display_name": display_name}
    if metadata:
        cfg["custom_metadata"] = metadata
    op = client.file_search_stores.upload_to_file_search_store(
        file=path,
        file_search_store_name=store.name,
        config=cfg,
    )
    while not op.done:
        time.sleep(5)
        op = client.operations.get(op)
    print(f"indexed: {display_name}")

# A text document
index_file("specs/widget_pro.md", "widget_pro_specs",
           metadata=[{"key": "doc_type", "string_value": "spec"}])

# Images, embedded directly from pixels, no OCR
index_file("assets/q1_revenue.png", "q1_revenue_chart",
           metadata=[{"key": "doc_type", "string_value": "chart"}])
index_file("assets/widget_pro.jpg", "widget_pro_photo",
           metadata=[{"key": "doc_type", "string_value": "photo"}])

Behind the scenes each file is chunked, embedded with Gemini Embedding 2, and stored. The custom metadata (here a simple doc_type tag) rides along with each chunk so you can filter later and so it shows up in citations. You only pay for this indexing pass, at $0.15 per 1M tokens; storage and query-time embeddings are free.
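
To get a feel for what the indexing pass costs, the arithmetic can be sketched in a few lines. The price constant comes from the figure quoted above; the token counts are made-up inputs, since this guide does not cover how a given file (especially an image) is tokenized.

```python
PRICE_PER_MILLION_TOKENS_USD = 0.15  # indexing price quoted in this guide


def indexing_cost_usd(token_count: int, runs: int = 1) -> float:
    """Cost of embedding `token_count` tokens, repeated `runs` times."""
    if token_count < 0 or runs < 0:
        raise ValueError("token_count and runs must be non-negative")
    return token_count / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD * runs


# Hypothetical corpus: 4 million tokens, indexed once vs. re-indexed 20 times.
print(f"one pass:  ${indexing_cost_usd(4_000_000):.2f}")
print(f"20 passes: ${indexing_cost_usd(4_000_000, runs=20):.2f}")
# -> one pass:  $0.60
# -> 20 passes: $12.00
```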

Step 3: Ask grounded questions across modalities

Querying is a normal interactions.create call with the file_search tool attached. You do not embed the query yourself; the tool does it, runs semantic search over the store, and feeds the top chunks to the model as context. Crucially, a text question can retrieve an image chunk, because both live in the same embedding space.

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input="What was the Q1 revenue trend, and how heavy is the Widget Pro?",
    tools=[{
        "type": "file_search",
        "file_search_store_names": [store.name],
    }],
)

# The final model_output step holds the answer text
print(interaction.steps[-1].content[0].text)

Example output (illustrative):

Q1 revenue rose steadily month over month, from about $1.2M in
January to roughly $1.9M in March, an upward trend of ~58% across
the quarter. The Widget Pro weighs 340 g.

The revenue figure was read from the chart image, and the weight came from the Markdown spec, in a single answer. That cross-modal join is the whole reason this feature is interesting: one store, one query, two source types.
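
Indexing steps[-1].content[0] works when the last step is a model_output with a single text block, but it is brittle if the shape varies. A more defensive reader might look like the sketch below. It relies only on the field names already used in this guide (steps, type, content, text); final_text is a hypothetical helper, and the response object is a stand-in built for illustration.

```python
from types import SimpleNamespace as NS


def final_text(interaction) -> str:
    """Join the text blocks of the last model_output step, or return ""."""
    for step in reversed(interaction.steps):
        if getattr(step, "type", None) != "model_output":
            continue
        parts = [
            block.text
            for block in (step.content or [])
            if getattr(block, "type", None) == "text" and getattr(block, "text", None)
        ]
        return "".join(parts)
    return ""


# Stand-in response. "some_tool_step" is a made-up step type used only to show
# that non-output steps are skipped.
fake = NS(steps=[
    NS(type="some_tool_step", content=None),
    NS(type="model_output", content=[
        NS(type="text", text="The Widget Pro "),
        NS(type="text", text="weighs 340 g."),
    ]),
])
print(final_text(fake))
# -> The Widget Pro weighs 340 g.
```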

Step 4: Read citations so you can trust the answer

An ungrounded answer is a liability. File Search attaches citations to the response so you can show users (and your QA process) exactly which file each claim came from. Citations live in the annotations field of each content block inside the model_output step.

for step in interaction.steps:
    if step.type != "model_output":
        continue
    for block in step.content:
        if block.type == "text" and block.annotations:
            print("\nSources:")
            for ann in block.annotations:
                if ann.type == "file_citation":
                    print(f"  - {ann.file_name}: {ann.source}")

Example output (illustrative):

Sources:
  - q1_revenue_chart: chunk showing monthly revenue bars Jan-Mar
  - widget_pro_specs: "Net weight: 340 g"

Each annotation also carries any custom metadata you attached at index time. If you stored a page number, a source URL, or an author, you can surface it directly in your UI for click-through verification.
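
For a UI or a QA log you usually want the citations as plain data rather than printed lines. The sketch below collects them into a de-duplicated list, using the same annotation fields as the loop above (type, file_name, source); collect_citations is a hypothetical helper and the response object is a stand-in.

```python
from types import SimpleNamespace as NS


def collect_citations(interaction) -> list:
    """Return unique {file_name, source} dicts in first-seen order."""
    seen, out = set(), []
    for step in interaction.steps:
        if getattr(step, "type", None) != "model_output":
            continue
        for block in step.content or []:
            for ann in getattr(block, "annotations", None) or []:
                if getattr(ann, "type", None) != "file_citation":
                    continue
                key = (ann.file_name, ann.source)
                if key not in seen:
                    seen.add(key)
                    out.append({"file_name": ann.file_name, "source": ann.source})
    return out


# Stand-in response with one duplicated citation.
cite = NS(type="file_citation", file_name="widget_pro_specs", source="Net weight: 340 g")
fake = NS(steps=[NS(type="model_output", content=[
    NS(type="text", text="...", annotations=[
        NS(type="file_citation", file_name="q1_revenue_chart", source="monthly revenue bars"),
        cite,
        cite,
    ]),
])])
for c in collect_citations(fake):
    print(c["file_name"], "|", c["source"])
# -> q1_revenue_chart | monthly revenue bars
# -> widget_pro_specs | Net weight: 340 g
```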

Step 5: Filter by metadata to scope the search

Once a store grows, you rarely want to search everything. The metadata_filter field restricts retrieval to chunks whose metadata matches a filter expression (the syntax follows AIP-160 list-filter rules). Here we answer a question using only the image assets, ignoring the text spec entirely.

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input="Summarize what the visuals show.",
    tools=[{
        "type": "file_search",
        "file_search_store_names": [store.name],
        "metadata_filter": 'doc_type = "chart" OR doc_type = "photo"',
    }],
)
print(interaction.steps[-1].content[0].text)

Metadata filtering is the cheapest way to improve retrieval quality. Tag documents by tenant, product line, language, or recency, and a single physical store can safely back many logical views of the data.
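
Filter strings built by hand are easy to get subtly wrong once the values come from user input. A tiny builder like the sketch below keeps them consistent with the key = "value" form shown above. The helper names eq and any_of are hypothetical, and because escaping rules are not covered in this guide, the sketch rejects values containing quotes or backslashes instead of guessing how to escape them.

```python
def eq(key: str, value: str) -> str:
    """Build one `key = "value"` clause."""
    if not key.isidentifier():
        raise ValueError(f"unsupported metadata key: {key!r}")
    if '"' in value or "\\" in value:
        # Refuse rather than guess at escaping rules.
        raise ValueError(f"unsupported character in value: {value!r}")
    return f'{key} = "{value}"'


def any_of(key: str, values) -> str:
    """OR together one clause per value."""
    clauses = [eq(key, v) for v in values]
    if not clauses:
        raise ValueError("at least one value is required")
    return " OR ".join(clauses)


print(any_of("doc_type", ["chart", "photo"]))
# -> doc_type = "chart" OR doc_type = "photo"
```

The printed string is the same filter used in the query above, so it can be passed straight to metadata_filter.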

Step 6: Force structured JSON output

Starting with the Gemini 3 family, you can combine File Search with structured outputs, so the grounded answer comes back as schema-validated JSON instead of prose. This is what you want when the RAG result feeds another system rather than a human.

from pydantic import BaseModel, Field

class ProductFact(BaseModel):
    name: str = Field(description="Product name")
    weight_grams: int = Field(description="Net weight in grams")
    q1_revenue_trend: str = Field(description="Short description of Q1 trend")

interaction = client.interactions.create(
    model="gemini-3-flash-preview",
    input="Extract the product name, weight, and Q1 revenue trend.",
    tools=[{
        "type": "file_search",
        "file_search_store_names": [store.name],
    }],
    response_format={
        "type": "text",
        "mime_type": "application/json",
        "schema": ProductFact.model_json_schema(),
    },
)

fact = ProductFact.model_validate_json(interaction.steps[-1].content[0].text)
print(fact)
# e.g. name='Widget Pro' weight_grams=340 q1_revenue_trend='Up ~58% Jan-Mar'

You now have grounded, cited, typed data extracted from a mixed-media knowledge base, with no vector database to run and no OCR step to maintain.
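
If the system consuming this JSON cannot depend on pydantic, the same check can be approximated with the standard library. The sketch below mirrors the ProductFact fields from the example above; ProductFactRecord and parse_product_fact are hypothetical names, and the sample string stands in for the model's response text.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductFactRecord:
    name: str
    weight_grams: int
    q1_revenue_trend: str


def parse_product_fact(raw: str) -> ProductFactRecord:
    """Parse and type-check a JSON object; raise ValueError on any mismatch."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    expected = {"name": str, "weight_grams": int, "q1_revenue_trend": str}
    for field, typ in expected.items():
        value = data.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, typ) or isinstance(value, bool):
            raise ValueError(f"field {field!r} is missing or not {typ.__name__}")
    return ProductFactRecord(**{f: data[f] for f in expected})


# Stand-in for interaction.steps[-1].content[0].text
sample = '{"name": "Widget Pro", "weight_grams": 340, "q1_revenue_trend": "Up ~58% Jan-Mar"}'
fact = parse_product_fact(sample)
print(fact.name, fact.weight_grams)
```

Raising a single ValueError for every failure mode keeps the calling code simple: one except clause decides whether to retry the query or surface the error.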

Worked example: a support bot over screenshots

Put it together with a realistic case. A SaaS support team has hundreds of help-center articles (Markdown) plus a folder of annotated UI screenshots that show where buttons live. Historically the screenshots were dead weight for search. With a multimodal store they become first-class answers.

import glob, os, time
from google import genai

client = genai.Client()
store = client.file_search_stores.create(config={
    "display_name": "support-kb",
    "embedding_model": "models/gemini-embedding-2",
})

def index(path, name, doc_type):
    # Upload and index one file, tagging it with its doc_type
    op = client.file_search_stores.upload_to_file_search_store(
        file=path, file_search_store_name=store.name,
        config={"display_name": name,
                "custom_metadata": [{"key": "doc_type", "string_value": doc_type}]},
    )
    # Poll the long-running operation until indexing finishes
    while not op.done:
        time.sleep(5)
        op = client.operations.get(op)

# os.path.basename keeps display names correct on Windows paths too
for md in glob.glob("articles/*.md"):
    index(md, os.path.basename(md), "article")
for img in glob.glob("screens/*.png"):
    index(img, os.path.basename(img), "screenshot")

q = client.interactions.create(
    model="gemini-3-flash-preview",
    input="Where do I change my billing email? Point to the exact screen.",
    tools=[{"type": "file_search", "file_search_store_names": [store.name]}],
)
print(q.steps[-1].content[0].text)

The model can retrieve the billing-settings screenshot whose annotated region matches the intent, and quote the article that describes the flow, citing both. The same store powers text-only questions, image-only questions, and the mixed ones in between, which is exactly the messy reality of a real knowledge base.
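
When you index hundreds of files in a loop like this, an unbounded while not op.done poll can hang the whole run if one operation never completes. A bounded version might look like the sketch below. wait_for is a hypothetical helper, not an SDK function; it takes the refresh call as a parameter, so with a live client you would pass client.operations.get, while the demo uses a stand-in operation.

```python
import time
from types import SimpleNamespace as NS


def wait_for(op, refresh, timeout_s=600.0, interval_s=5.0,
             sleep=time.sleep, clock=time.monotonic):
    """Poll `refresh(op)` until op.done is true or timeout_s elapses."""
    deadline = clock() + timeout_s
    while not op.done:
        if clock() >= deadline:
            raise TimeoutError(f"operation not done after {timeout_s} s")
        sleep(interval_s)
        op = refresh(op)
    return op


# Stand-in operation that reports done on the third refresh.
calls = {"n": 0}


def fake_refresh(op):
    calls["n"] += 1
    return NS(done=calls["n"] >= 3)


done_op = wait_for(NS(done=False), fake_refresh, interval_s=0, sleep=lambda s: None)
print(done_op.done, calls["n"])
# -> True 3
```

Injecting sleep and clock keeps the helper testable without real waiting; in production code the defaults are what you want.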

Common pitfalls and gotchas

1. File Search cannot be combined with Google Search or URL Context. As of now the File Search tool is mutually exclusive with the other built-in grounding tools in a single call. If you need both web grounding and your private docs, run two calls and merge, or route by intent. It can be combined with custom function calling on Gemini 3 models.

2. The raw File object expires; the store does not. When you use the Files API to upload then import, the temporary File object is deleted after 48 hours. The embeddings inside the store persist indefinitely until you delete them. Do not write code that depends on the raw file still existing the next day.

3. Leave temperature at the default of 1.0. Gemini 3 reasoning models are tuned for temperature 1.0. Dropping it to force determinism can cause looping or degraded answers, especially on multi-step reasoning over retrieved chunks. Resist the old habit of setting temperature low.

4. Indexing is billed per token, and a store is larger than its inputs. You are charged $0.15 per 1M tokens at index time, and the billing notes state that store size is roughly 3x the input size once embeddings are added. Re-indexing the same file repeatedly during development quietly adds up. Index once, then iterate on queries.

5. Tune chunking for long documents. By default files are auto-chunked. For dense or unusually structured text, pass a chunking_config with max_tokens_per_chunk and max_overlap_tokens to control recall versus precision. Too-large chunks dilute relevance; too-small chunks lose context.

6. Keep stores under ~20 GB for latency. A single store can hold up to your tier limit (1 GB free, up to 1 TB on Tier 3), but Google recommends staying under 20 GB per store for optimal retrieval latency. Shard large corpora across multiple stores and query the relevant one.

7. Mind the model list. File Search is supported on Gemini 3.1 Pro, 3.1 Flash-Lite, 3 Flash, and the 2.5 Pro / 2.5 Flash-Lite models, but not in the Live API. Pick a supported model or the tool call will fail.
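
One way to act on pitfall 4 is to keep a local manifest of content hashes and skip files that have not changed since the last run. The sketch below is one possible approach, not something File Search provides; file_digest and needs_indexing are hypothetical helpers, and the demo uses a temporary file in place of a real asset.

```python
import hashlib
import os
import tempfile


def file_digest(path: str) -> str:
    """SHA-256 of the file contents, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def needs_indexing(paths, manifest: dict) -> list:
    """Return (path, digest) pairs whose contents differ from the manifest."""
    pending = []
    for p in paths:
        digest = file_digest(p)
        if manifest.get(p) != digest:
            pending.append((p, digest))
    return pending


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "spec.md")
    with open(path, "w") as f:
        f.write("Net weight: 340 g\n")

    manifest = {}
    first = needs_indexing([path], manifest)   # new file: index it
    manifest.update(first)                     # record only after a successful upload
    second = needs_indexing([path], manifest)  # unchanged: skip
    with open(path, "a") as f:
        f.write("Colour: blue\n")
    third = needs_indexing([path], manifest)   # edited: index again

print(len(first), len(second), len(third))
# -> 1 0 1
```

In a real script you would persist the manifest between runs, for example as a JSON file, and a changed file would also need its old document removed from the store, which is not shown here.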

Quick reference

Task	Call / setting
Make a store multimodal	embedding_model = "models/gemini-embedding-2"
Upload + index in one step	client.file_search_stores.upload_to_file_search_store(...)
Import an existing uploaded file	client.file_search_stores.import_file(...)
Query with retrieval	tools=[{"type":"file_search","file_search_store_names":[...]}]
Scope the search	metadata_filter='doc_type = "chart"'
Read citations	block.annotations -> type == 'file_citation'
Typed JSON output	response_format with mime_type application/json + schema
Indexing price	$0.15 / 1M tokens (storage + query embeddings free)
Per-file size limit	100 MB
Cannot combine with	Google Search, URL Context (single call)

Next steps

Swap the model to gemini-3.1-pro-preview for harder reasoning over retrieved evidence, or keep gemini-3-flash-preview for cost and speed.
Add richer metadata (tenant ID, language, last-updated date) and build per-tenant logical views over one physical store with metadata_filter.
Index PDFs and slide decks directly; File Search handles dozens of document types alongside images.
Compare against a self-hosted vector DB on your own corpus to decide where managed RAG wins and where you still want control.
Read the official File Search and Gemini Embedding 2 docs for the full API surface, including document-level list/get/delete management.

The takeaway: multimodal retrieval used to be a systems-integration project. With Gemini Embedding 2 and File Search it is a few API calls, and your images finally carry their own weight in search.