Multimodal File Search in the Gemini API: Native Visual Retrieval and Traceable RAG in Practice

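Gemini File Search is the retrieval-augmented generation (RAG) tool built into the Gemini API. The May 5, 2026 update added three key capabilities: multimodal retrieval (images and text in a single index), custom metadata filtering, and page-level citations. These are not simply bolted on. They rest on a change to the embedding model underneath the indexing pipeline, from gemini-embedding-001 to gemini-embedding-2, which lets image data bypass the traditional OCR step and take part in semantic similarity search directly as native visual features.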

Choosing an Embedding Model and the Indexing Pipeline

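A File Search Store is the container that persists document embeddings, roughly a managed vector database. The embedding model is set through the embedding_model parameter when the Store is created: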

from google import genai
client = genai.Client()

file_search_store = client.file_search_stores.create(
    config={
        "display_name": "product-catalog",
        "embedding_model": "models/gemini-embedding-2"  # key parameter
    }
)

The choice of embedding model determines what retrieval can do later:

Embedding model	Use case	Image support
`gemini-embedding-001` (default)	Text-only workloads, cost first	No
`gemini-embedding-2`	Multimodal retrieval (documents + images)	Yes

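Important constraint: embedding_model is set when the Store is created and cannot be changed afterwards. If it is omitted, the Store defaults to gemini-embedding-001, and that setting is permanent.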
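Because the choice cannot be undone, it can help to make it explicit in code rather than rely on the default. The following is a minimal sketch: `build_store_config` and its `multimodal` flag are hypothetical names, and the `models/` prefix on the 001 identifier is assumed by analogy with the multimodal one shown above.

```python
# Hypothetical helper: make the irreversible embedding model choice explicit.
# Assumption: the text-only model takes the same "models/" prefix.
TEXT_ONLY_MODEL = "models/gemini-embedding-001"
MULTIMODAL_MODEL = "models/gemini-embedding-2"


def build_store_config(display_name: str, multimodal: bool) -> dict:
    """Return a config dict suitable for file_search_stores.create(config=...)."""
    if not display_name:
        raise ValueError("display_name must not be empty")
    model = MULTIMODAL_MODEL if multimodal else TEXT_ONLY_MODEL
    return {"display_name": display_name, "embedding_model": model}


config = build_store_config("product-catalog", multimodal=True)
print(config["embedding_model"])  # models/gemini-embedding-2
```

Forcing callers to pass `multimodal` means nobody ends up on the text-only default by accident.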

For indexing, the shortest path is the upload_to_file_search_store method, which uploads and indexes in a single call:

import time

# Upload a PDF document (with embedded images)
op = client.file_search_stores.upload_to_file_search_store(
    file_search_store_name=file_search_store.name,
    file="product_catalog.pdf",
    config={"display_name": "Product Catalog"}
)

# Poll until indexing completes
while not op.done:
    time.sleep(5)
    op = client.operations.get(op)

# Upload product images directly
for image_file in ["sneaker_red.png", "sneaker_blue.jpeg", "sneaker_white.png"]:
    img_op = client.file_search_stores.upload_to_file_search_store(
        file_search_store_name=file_search_store.name,
        file=image_file,
        config={"display_name": image_file}
    )
    while not img_op.done:
        time.sleep(5)
        img_op = client.operations.get(img_op)

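The current version does not support audio or video; it handles PDFs and images (PNG, JPEG, and similar formats). With gemini-embedding-2, images embedded inside a PDF are also embedded natively alongside the text, with no preprocessing required.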
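One way to act on this format limit is to screen files before uploading them. This is a rough sketch only: `split_by_support` is a hypothetical helper, and the extension set is inferred from the formats named in the text, not taken from an official list.

```python
from pathlib import Path

# Assumption: extension set inferred from the formats named in the text.
ASSUMED_INDEXABLE = {".pdf", ".png", ".jpg", ".jpeg"}


def split_by_support(paths):
    """Partition paths into (indexable, skipped) by file extension."""
    indexable, skipped = [], []
    for p in paths:
        is_ok = Path(p).suffix.lower() in ASSUMED_INDEXABLE
        (indexable if is_ok else skipped).append(p)
    return indexable, skipped


ok, skipped = split_by_support(
    ["catalog.pdf", "demo.mp4", "sneaker_red.PNG", "intro.wav"]
)
print(ok)       # ['catalog.pdf', 'sneaker_red.PNG']
print(skipped)  # ['demo.mp4', 'intro.wav']
```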

Custom Metadata and Query-Time Filtering

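Custom metadata attaches key-value tags to a document (for example department: Legal or status: Final). At query time, the metadata_filter parameter pre-filters the candidate set, which cuts noise before semantic retrieval runs: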

# Attach metadata at upload time
op = client.file_search_stores.upload_to_file_search_store(
    file_search_store_name=file_search_store.name,
    file="shoes_collection.pdf",
    config={
        "display_name": "Spring 2026 Shoes",
        "custom_metadata": [
            {"key": "category", "string_value": "footwear"},
            {"key": "season", "string_value": "spring-2026"},
            {"key": "price_tier", "numeric_value": 2}
        ]
    }
)

# Apply a metadata filter at query time
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents="Do you have blue spring shoes?",
    config={
        "tools": [{
            "file_search": {
                "file_search_store_names": [file_search_store.name],
                "metadata_filter": 'category="footwear" AND season="spring-2026"',
            }
        }]
    }
)

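Metadata filters use a SQL-like expression syntax that supports AND, OR, and numeric comparison operators. In large enterprise document stores, metadata filtering is the main lever for controlling retrieval precision and reducing the amount of semantic search work.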
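Filter strings are easy to get wrong when assembled by hand. The sketch below builds the AND-joined equality form used in the query above from a plain dict. `build_metadata_filter` is a hypothetical helper, and writing numeric conditions as `key=value` without quotes is an assumption about the syntax, not something the source confirms.

```python
def build_metadata_filter(conditions: dict) -> str:
    """Join equality conditions with AND; strings are quoted, numbers are not."""
    parts = []
    for key, value in conditions.items():
        if isinstance(value, bool) or not isinstance(value, (str, int, float)):
            raise TypeError(f"unsupported value for {key!r}: {value!r}")
        if isinstance(value, str):
            if '"' in value:
                # Escaping rules are not covered here, so reject instead of guessing.
                raise ValueError(f"double quote not allowed in value for {key!r}")
            parts.append(f'{key}="{value}"')
        else:
            # Assumption: numeric equality is written without quotes.
            parts.append(f"{key}={value}")
    return " AND ".join(parts)


print(build_metadata_filter({"category": "footwear", "season": "spring-2026"}))
# category="footwear" AND season="spring-2026"
```

Rejecting values that contain a double quote is a deliberately conservative choice that avoids guessing at the escaping rules.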

Page-Level Citations

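Citations are the third capability in this update, and the one that makes answers verifiable by users. Along with the generated response, File Search returns grounding_metadata, which includes the exact page number for each retrieved text chunk: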

grounding = response.candidates[0].grounding_metadata

for chunk in grounding.grounding_chunks:
    ctx = chunk.retrieved_context
    if ctx.media_id:
        # Image citation: download the original image by media_id
        print(f"Cited image: {ctx.title}")
        print(f"   Media ID: {ctx.media_id}")

        blob = client.file_search_stores.download_media(
            media_id=ctx.media_id
        )
        with open(f"cited_{ctx.title}.png", "wb") as f:
            f.write(blob)
    else:
        # Text citation, with the exact page number
        print(f"Cited text: {ctx.title}")
        if ctx.page_number:
            print(f"   Page: {ctx.page_number}")
        print(f"   {ctx.text[:200]}...")

# Map each claim in the response to its sources
for support in grounding.grounding_supports:
    print(f"Claim: '{support.segment.text}'")
    print(f"  Grounded in chunks: {support.grounding_chunk_indices}")

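This means every answer a user sees can be traced back to a specific location in the source document. In settings that demand strict traceability, such as legal document review, insurance claims, and medical reports, this capability can decide whether a product makes it into production.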
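To show these citations to end users, the retrieved contexts can be turned into short reference strings. This is a sketch under stated assumptions: `format_citation` is a hypothetical helper, the `SimpleNamespace` objects are stand-ins that mimic only the attributes read in the loop above (`title`, `page_number`, `media_id`), and the sample values are made up.

```python
from types import SimpleNamespace


def format_citation(ctx) -> str:
    """Render one retrieved context as a short, human-readable citation."""
    if getattr(ctx, "media_id", None):
        return f"{ctx.title} (image, media_id={ctx.media_id})"
    page = getattr(ctx, "page_number", None)
    if page:
        return f"{ctx.title}, p. {page}"
    return str(ctx.title)


# Stand-in objects with made-up values; real ones come from grounding_chunks.
chunks = [
    SimpleNamespace(title="Product Catalog", page_number=12, media_id=None),
    SimpleNamespace(title="sneaker_blue.jpeg", page_number=None, media_id="media-123"),
]
for i, ctx in enumerate(chunks, start=1):
    print(f"[{i}] {format_citation(ctx)}")
# [1] Product Catalog, p. 12
# [2] sneaker_blue.jpeg (image, media_id=media-123)
```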

Parameters for Estimating Batch Throughput

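In production deployments, indexing throughput is a core metric. Given the SDK's polling model for asynchronous operations, the following configuration is a reasonable starting point for a batch upload flow: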

import concurrent.futures
import time

def upload_with_retry(file_path, display_name, max_retries=3):
    """Upload one file and wait for indexing, retrying on failure."""
    for attempt in range(max_retries):
        try:
            op = client.file_search_stores.upload_to_file_search_store(
                file_search_store_name=file_search_store.name,
                file=file_path,
                config={"display_name": display_name}
            )
            while not op.done:
                time.sleep(5)
                op = client.operations.get(op)
            # A finished operation may still have failed; surface that as an error
            if getattr(op, "error", None):
                raise RuntimeError(f"indexing failed: {op.error}")
            return {"file": file_path, "status": "success"}
        except Exception as e:
            if attempt == max_retries - 1:
                return {"file": file_path, "status": "error", "error": str(e)}
            time.sleep(2 ** attempt)  # exponential backoff

# Bounded concurrency (at most 5 workers is suggested, to avoid API rate limiting)
files = [
    ("doc1.pdf", "Document 1"),
    ("img1.png", "Image 1"),
    # ...
]

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(lambda f: upload_with_retry(f[0], f[1]), files))

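Suggested parameters: a 5-second polling interval (a shorter interval adds API calls, a longer one slows throughput); concurrency of at most 5; and no more than about 10,000 files per Store to keep retrieval latency reasonable.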
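These parameters allow a back-of-the-envelope estimate of batch wall-clock time. The sketch below assumes every file takes the same time to index and that completion is only noticed at the next poll, so per-file latency rounds up to a multiple of the polling interval. `estimate_batch_seconds` is a hypothetical helper, and the 12-second average is an illustrative input, not a measured figure.

```python
import math


def estimate_batch_seconds(n_files, avg_index_seconds, workers=5, poll_interval=5.0):
    """Rough wall-clock estimate for a batch upload with bounded concurrency."""
    if n_files <= 0:
        return 0.0
    # Completion is observed at the next poll, so round up to a poll multiple.
    observed = math.ceil(avg_index_seconds / poll_interval) * poll_interval
    # With a fixed worker pool, files are processed in successive waves.
    waves = math.ceil(n_files / workers)
    return waves * observed


n = 100
total = estimate_batch_seconds(n_files=n, avg_index_seconds=12)
print(total)           # 300.0
print(n * 60 / total)  # 20.0 files per minute
```

The rounding step is why the polling interval matters: in this example a 12-second job is observed as taking 15 seconds.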

Chunking Strategy and Structured Output

For very long documents, chunking_config controls how the document is split:

operation = client.file_search_stores.upload_to_file_search_store(
    file_search_store_name=file_search_store.name,
    file="long_document.pdf",
    config={
        "display_name": "Technical Manual",
        "chunking_config": {
            "white_space_config": {
                "max_tokens_per_chunk": 200,
                "max_overlap_tokens": 20
            }
        }
    }
)

A max_tokens_per_chunk between 150 and 300, with max_overlap_tokens at 10% to 15% of that value, is a reasonable balance between retrieval completeness and precision.
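Deriving the overlap from the chunk size keeps the two values in the suggested proportion when one of them changes. A minimal sketch follows: `make_chunking_config` and `overlap_ratio` are hypothetical names, and the dict shape simply mirrors the chunking_config shown above.

```python
def make_chunking_config(max_tokens: int = 200, overlap_ratio: float = 0.10) -> dict:
    """Build a chunking_config whose overlap is a fraction of the chunk size."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = round(max_tokens * overlap_ratio)
    return {
        "white_space_config": {
            "max_tokens_per_chunk": max_tokens,
            "max_overlap_tokens": overlap,
        }
    }


print(make_chunking_config(200, 0.10)["white_space_config"]["max_overlap_tokens"])  # 20
print(make_chunking_config(300, 0.15)["white_space_config"]["max_overlap_tokens"])  # 45
```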

Combined with the structured output support in Gemini 3 models, File Search can return structured data that conforms to a JSON Schema:

from pydantic import BaseModel, Field

class ProductMatch(BaseModel):
    name: str = Field(description="Product name")
    description: str = Field(description="Brief product description")
    confidence: str = Field(description="Match confidence level")

response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents="Find products similar to a red running shoe",
    config={
        "tools": [{
            "file_search": {
                "file_search_store_names": [file_search_store.name]
            }
        }],
        "response_mime_type": "application/json",
        "response_schema": ProductMatch.model_json_schema()
    }
)

This lets RAG output feed downstream business systems directly, without custom text parsing.
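On the consuming side, the JSON text still has to be loaded into a typed object before it is handed on. The sketch below uses only the standard library. `ProductMatchRecord` and `parse_product_match` are hypothetical names, and the sample payload is made up to stand in for `response.text`.

```python
import json
from dataclasses import dataclass


@dataclass
class ProductMatchRecord:
    name: str
    description: str
    confidence: str


def parse_product_match(raw_json: str) -> ProductMatchRecord:
    """Parse a JSON object; raises KeyError if a required field is missing."""
    data = json.loads(raw_json)
    return ProductMatchRecord(
        name=data["name"],
        description=data["description"],
        confidence=data["confidence"],
    )


# Made-up payload standing in for response.text.
sample = '{"name": "Red Runner", "description": "Lightweight running shoe", "confidence": "high"}'
match = parse_product_match(sample)
print(match.name, match.confidence)  # Red Runner high
```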

Pricing Model and Cost Control

File Search is priced as a fully managed service:

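Indexing embeddings: billed at the gemini-embedding-2 embedding model rate (see the official pricing page)
Storage: free
Query-time embeddings: free
Retrieved tokens: billed as regular context tokens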

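Cost is therefore concentrated in the initial indexing phase; at query time you pay only for the input and output tokens actually used. For document stores that are stable in size (not frequently updated), lifetime cost can be considerably lower than that of a self-built RAG pipeline.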
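The pricing shape above can be turned into a simple estimate. This is a sketch only: `estimate_cost` is a hypothetical helper, the per-million-token prices are placeholder inputs rather than real rates, and model output tokens are left out for brevity.

```python
def estimate_cost(index_tokens, queries, retrieved_tokens_per_query,
                  embed_price_per_m, input_price_per_m):
    """Estimate spend: one-off indexing plus per-query retrieved context tokens."""
    indexing = index_tokens / 1_000_000 * embed_price_per_m
    retrieval = queries * retrieved_tokens_per_query / 1_000_000 * input_price_per_m
    return {
        "indexing": indexing,
        "storage": 0.0,          # free, per the list above
        "query_embedding": 0.0,  # free, per the list above
        "retrieval": retrieval,
        "total": indexing + retrieval,
    }


# Placeholder prices (1.00 and 2.00 per million tokens), NOT real rates.
cost = estimate_cost(
    index_tokens=10_000_000,
    queries=1_000,
    retrieved_tokens_per_query=2_000,
    embed_price_per_m=1.00,
    input_price_per_m=2.00,
)
print(cost["indexing"], cost["retrieval"], cost["total"])  # 10.0 4.0 14.0
```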

Use Cases and Adoption Advice

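Multimodal File Search fits best in these scenarios: visual product search (for example natural-language search over an e-commerce catalog), retrieval of charts and figures in research documents, unified retrieval over forms and on-site photos in insurance claims, search over the visual component library of a design system, and combined retrieval of floor plans and interior photos in real estate.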

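When rolling out, validate image embedding quality before batch throughput. Start with 10 to 20 images of varying quality and measure accuracy on visual-semantic queries such as "find a chart that contains blue elements"; scale up only after precision meets the business requirement.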
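A small validation run like this can be scored with a top-k hit rate. The sketch below is one possible metric, not a prescribed one. `hit_rate` is a hypothetical helper, and the queries and file names are made-up test data. In practice the result lists would come from the cited titles in grounding_metadata.

```python
def hit_rate(results: dict, expected: dict, k: int = 3) -> float:
    """Fraction of queries whose top-k retrieved titles include an expected one."""
    if not expected:
        raise ValueError("expected must not be empty")
    hits = 0
    for query, relevant in expected.items():
        top_k = results.get(query, [])[:k]
        if any(title in relevant for title in top_k):
            hits += 1
    return hits / len(expected)


# Made-up labels and retrieval results for two visual queries.
expected = {
    "blue chart": {"chart_blue.png"},
    "red sneaker": {"sneaker_red.png"},
}
results = {
    "blue chart": ["chart_blue.png", "chart_green.png"],
    "red sneaker": ["sneaker_blue.jpeg", "sneaker_white.png"],
}
print(hit_rate(results, expected))  # 0.5
```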

Sources: Google Blog (2026-05-05); Google Developers guide (2026-05-05).

