September 17, 2025 mlops

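Building a Python ETL Pipeline with markitdown: Parsing Word/PDF into Structured Markdown for RAG and LLM Fine-Tuning

How to use markitdown in a Python ETL pipeline to convert Office documents and PDFs while preserving table and image structure, so the output is ready for RAG ingestion and LLM training. Includes implementation parameters and best practices.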


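In AI application development, data preparation is a key bottleneck, especially when building RAG (Retrieval-Augmented Generation) systems or datasets for LLM (Large Language Model) fine-tuning. Enterprise documents usually live in unstructured formats such as Word (.docx) and PDF, and ingesting these files directly loses information, particularly tables, images, and heading hierarchy. Microsoft's markitdown tool offers an efficient solution: it converts Office documents and PDFs into structured Markdown, preserving core elements such as headings, lists, tables, and links, which improves data quality. This article focuses on building a Python ETL (Extract-Transform-Load) pipeline with markitdown for parsing Word and PDF files, with working code, parameter settings, and integration advice so that the output suits RAG ingestion and LLM training.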

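Why choose markitdown as the ETL core?

Traditional document conversion tools such as textract can extract text, but they often ignore structure, which flattens the output and hurts the LLM's semantic understanding. markitdown is designed for LLM pipelines: besides extracting text, it preserves document structure in Markdown syntax. Headings are marked with #, tables are delimited with |, and images are embedded with alt text. The converted files are therefore token-efficient and easy to index in vector stores such as Pinecone or FAISS.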

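markitdown supports many formats, including docx, pdf, pptx, and xlsx. Through an optional integration, it can use Azure Document Intelligence to improve PDF parsing accuracy. For RAG, structured Markdown can improve retrieval precision; for LLM fine-tuning, the preserved tables and image descriptions (optionally generated by an LLM) reduce noisy data. According to markitdown's official documentation, its output is meant to be consumed by text analysis tools rather than to serve as a high-fidelity conversion for human reading, which matches the needs of AI data preparation.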

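In an ETL pipeline, markitdown's advantage is that it is lightweight: it runs in a Python 3.10+ environment without complex dependencies. Compared with full OCR engines such as Tesseract, it focuses on preserving structure, which suits batch processing of enterprise reports, contracts, and similar documents.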

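Installation and environment setup

The first step in building the ETL pipeline is installing markitdown. A virtual environment is recommended to avoid conflicts: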

python -m venv markitdown_env
source markitdown_env/bin/activate  # Linux/Mac
# or on Windows: markitdown_env\Scripts\activate
pip install 'markitdown[all]'  # install all optional dependencies, including docx, pdf, and image support

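The [all] extra pulls in every optional converter dependency. At the time of writing, the pdf extra relies on pdfminer.six, docx on mammoth, and xlsx on pandas with openpyxl; check the package's pyproject.toml for the exact list in your version. If you only process Word and PDF, you can trim the install to pip install 'markitdown[docx,pdf]', which reduces the installed size by roughly 200MB.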

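For advanced PDF parsing, integrate Azure Document Intelligence: create the Azure resource, obtain the endpoint and key, and add the [az-doc-intel] extra at install time. This handles complex layouts and reportedly improves table extraction accuracy by 20-30% (based on Microsoft benchmarks).

Environment variables: if you use an LLM to generate image descriptions, install openai and set the API key: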

pip install openai
export OPENAI_API_KEY=your_key

With this setup the pipeline runs stably, and memory usage should stay below 2GB when processing 100+ files.

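Core ETL pipeline implementation

The ETL pipeline has three stages: extract (read from a file path or directory), transform (process with markitdown), and load (write Markdown to a target directory or database).

Extract stage: file discovery and preprocessing

Use Python's pathlib to scan the input directory, recursing into subfolders. For Word and PDF, filter by extension: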

import os
from pathlib import Path
from typing import List, Sequence

def extract_files(input_dir: str, extensions: Sequence[str] = ('.docx', '.pdf')) -> List[Path]:
    """Collect paths of files with the given extensions, recursing into subfolders."""
    input_path = Path(input_dir)
    files = []
    for ext in extensions:
        files.extend(input_path.rglob(f'*{ext}'))
    return sorted(files)  # sort by path so logs are easy to follow

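Parameter suggestions: set input_dir to '/data/raw/docs' and restrict extensions to .docx and .pdf to avoid unrelated files. Preprocessing includes a file size check (PDFs over 50MB can be skipped or split into chunks) to prevent running out of memory.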
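The size check described above could be sketched as follows. The helper name split_by_size is hypothetical, and the 50MB default simply mirrors the threshold suggested in the text; what to do with the oversized files (skip them or split them) is left to the caller.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

def split_by_size(files: Iterable[Path],
                  max_bytes: int = 50 * 1024 * 1024) -> Tuple[List[Path], List[Path]]:
    """Separate files into (accepted, oversized) by their size on disk."""
    accepted: List[Path] = []
    oversized: List[Path] = []
    for path in files:
        # stat().st_size is the file size in bytes
        if path.stat().st_size > max_bytes:
            oversized.append(path)
        else:
            accepted.append(path)
    return accepted, oversized
```

The accepted list can be passed on to the transform stage, while the oversized list can be logged for manual handling.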

Transform stage: markitdown integration

The core of this stage is markitdown's Python API. Initialize a MarkItDown instance and configure plugins and an LLM (optional):

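import io
import os
from pathlib import Path

from markitdown import MarkItDown
from openai import OpenAI  # used when building llm_client

def convert_to_markdown(file_path: Path, output_dir: str, use_doc_intel: bool = False, llm_client=None) -> str:
    """Convert a single file to Markdown and return the output file name."""
    # Collect options first so doc-intel and LLM settings can be combined
    options = {'enable_plugins': False}  # keep third-party plugins off

    # Azure Document Intelligence configuration
    if use_doc_intel:
        from azure.core.credentials import AzureKeyCredential
        options['docintel_endpoint'] = os.getenv('AZURE_ENDPOINT')
        # Assumption: recent releases take the key as docintel_credential;
        # verify the keyword against your installed markitdown version
        options['docintel_credential'] = AzureKeyCredential(os.environ['AZURE_KEY'])

    # LLM for image descriptions
    if llm_client:
        options.update(llm_client=llm_client, llm_model='gpt-4o-mini',  # economical model
                       llm_prompt='Describe this image concisely for LLM context.')

    md = MarkItDown(**options)

    # convert_stream needs a binary file-like object (changed in 0.1.0)
    with open(file_path, 'rb') as f:
        result = md.convert_stream(io.BytesIO(f.read()))

    output_file = Path(output_dir) / f"{file_path.stem}.md"
    with open(output_file, 'w', encoding='utf-8') as out:
        out.write(result.text_content)

    return output_file.name  # output file name, used for logging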

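Key parameters:

enable_plugins=False: third-party plugins are disabled by default, which avoids unexpected dependencies; to extend the converter, set it to True, and run markitdown --list-plugins on the command line to see which plugins are installed.
convert_stream: the input must be a binary file-like object (such as io.BytesIO); the result's text_content attribute holds the Markdown string.
llm_model: gpt-4o-mini balances cost and quality; customize the prompt to produce concise descriptions (under 50 words), which improves multimodal support in RAG.
use_doc_intel=True: for scanned PDFs, accuracy is higher than with the built-in pdf parser; as a threshold, use the built-in parser for files under 10 pages to save API calls.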

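Table preservation: markitdown converts tables to Markdown table syntax automatically and keeps the columns aligned. Images are embedded as base64 or as path references, which makes later vectorization easier.

Load stage: output and metadata

Save the Markdown to '/data/processed/markdown' and generate a metadata JSON file alongside it (source file path, conversion time, page count, and so on):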

import json
from datetime import datetime

def load_metadata(output_file: str, source_file: str, metadata: dict):
    """Write a JSON sidecar file next to the Markdown output."""
    meta = {
        'source': source_file,
        'output': output_file,
        'converted_at': datetime.now().isoformat(),
        **metadata  # e.g., {'pages': 5, 'has_tables': True}
    }
    with open(f"{output_file}.json", 'w', encoding='utf-8') as f:
        json.dump(meta, f, indent=2)

This makes lineage easy to trace in a RAG system and supports auditing during LLM fine-tuning.
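Some of the metadata fields can be derived from the converted Markdown itself. The sketch below is one possible approach, not part of markitdown: the function name describe_markdown and the heading_count and char_count fields are illustrative assumptions, and table detection relies on spotting a delimiter row such as | --- | --- |.

```python
import re

# Delimiter row of a Markdown table, e.g. "| --- | :---: |"
_TABLE_SEPARATOR = re.compile(r'^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$')
# ATX heading, e.g. "## Section"
_HEADING = re.compile(r'^#{1,6}\s+\S')

def describe_markdown(text: str) -> dict:
    """Derive simple structural metadata from a Markdown string."""
    lines = text.splitlines()
    return {
        # Require a pipe so a plain "---" horizontal rule is not counted as a table
        'has_tables': any(bool(_TABLE_SEPARATOR.match(line)) and '|' in line
                          for line in lines),
        'heading_count': sum(1 for line in lines if _HEADING.match(line)),
        'char_count': len(text),
    }
```

The returned dictionary can be passed as the metadata argument of load_metadata. Page counts are not recoverable from the Markdown and would have to come from the source file.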

Complete pipeline script

Putting the pieces together, with error handling and parallel processing (multiprocessing for speed):

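import logging
import multiprocessing as mp
from functools import partial

logging.basicConfig(filename='file.log', level=logging.INFO)

def _convert_one(file_path, output_dir, use_doc_intel=True, use_llm=True):
    """Worker: convert one file and log failures instead of raising."""
    try:
        # Build the client inside the worker, since client objects may not pickle
        client = OpenAI() if use_llm else None
        return convert_to_markdown(file_path, output_dir, use_doc_intel, client)
    except Exception:
        logging.exception("Conversion failed: %s", file_path)
        return None

def etl_pipeline(input_dir: str, output_dir: str, max_workers: int = 4):
    """Full ETL pipeline."""
    os.makedirs(output_dir, exist_ok=True)

    files = extract_files(input_dir)
    worker = partial(_convert_one, output_dir=output_dir)

    with mp.Pool(max_workers) as pool:
        results = pool.map(worker, files)

    succeeded = [r for r in results if r is not None]
    print(f"Processed {len(succeeded)} of {len(files)} files.")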

Parameters: max_workers=4 (CPU cores / 2) to avoid I/O bottlenecks. Error handling: wrap the conversion in try-except to catch failures (such as corrupted PDFs) and log them to file.log.

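Best practices for integrating with RAG and LLM fine-tuning

The converted Markdown can be used for RAG directly: load it with LangChain or Haystack, with chunk_size=512 tokens (keeping tables intact) and embedding_model='text-embedding-3-small'. For images, the LLM-generated descriptions serve as alt text and strengthen multimodal retrieval.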
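To illustrate chunking that keeps tables intact, here is a minimal sketch that does not depend on LangChain or Haystack. It is an assumption-laden simplification: chunk_markdown is a hypothetical helper, blocks are split on blank lines, and token counts are approximated by whitespace-separated words rather than a real tokenizer.

```python
import re
from typing import List

def chunk_markdown(text: str, max_tokens: int = 512) -> List[str]:
    """Pack blank-line-separated blocks into chunks of at most max_tokens.

    Tokens are approximated by whitespace-separated words. A block (for
    example a whole table) is never split, so an oversized block simply
    becomes its own chunk.
    """
    blocks = [b.strip() for b in re.split(r'\n\s*\n', text) if b.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for block in blocks:
        block_len = len(block.split())
        # Flush the current chunk if adding this block would exceed the budget
        if current and current_len + block_len > max_tokens:
            chunks.append('\n\n'.join(current))
            current, current_len = [], 0
        current.append(block)
        current_len += block_len
    if current:
        chunks.append('\n\n'.join(current))
    return chunks
```

Because a Markdown table has no blank lines between its rows, it always lands in a single block and therefore in a single chunk. For production use, swapping the word count for the embedding model's tokenizer would give budgets that match the model's limits more closely.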

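For LLM fine-tuning, structured Markdown reduces preprocessing: you can fine-tune models such as Llama-3 directly, with the data formatted as instruction-response pairs. Points to monitor:

Conversion accuracy: sample 10% of the files and manually verify that table completeness exceeds 90%.
Token efficiency: aim for fewer than 1.5x the tokens of the original document.
Rollback strategy: if docintel fails, fall back to the built-in pdf parser.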
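The rollback strategy in the list above can be sketched as a generic "try converters in order" helper. This is an illustrative pattern rather than a markitdown feature; convert_with_fallback and its converters argument are hypothetical names.

```python
import logging
from typing import Callable, Optional, Sequence

logger = logging.getLogger(__name__)

def convert_with_fallback(path: str,
                          converters: Sequence[Callable[[str], str]]) -> str:
    """Try each converter in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for convert in converters:
        try:
            return convert(path)
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            name = getattr(convert, '__name__', repr(convert))
            logger.warning("%s failed for %s: %s", name, path, exc)
            last_error = exc
    raise RuntimeError(f"All converters failed for {path}") from last_error
```

In this pipeline, the first converter could wrap a MarkItDown instance configured with Document Intelligence and the second a plain MarkItDown instance, each returning the text_content of its result.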

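Parameter checklist:

Input thresholds: file size under 100MB, page count under 50.
Output cleanup: remove empty sections, and post-process with pandoc to tidy tables.
Monitoring: Prometheus metrics such as conversion_time_sec (target under 5s per file).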
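As one way to implement the "remove empty sections" cleanup step, the sketch below drops headings that have no body before the next heading of the same or a higher level. The function name remove_empty_sections is hypothetical, and the approach assumes ATX-style headings (# through ######); lines starting with # inside fenced code blocks would be misread as headings, so treat this as a starting point.

```python
import re
from typing import List, Optional

_HEADING = re.compile(r'^(#{1,6})\s+\S')

def remove_empty_sections(text: str) -> str:
    """Drop headings with no content before the next same-or-higher-level heading."""
    kept: List[str] = []          # collected in reverse order
    # Nearest non-blank kept line below the current one:
    # None = end of document, 0 = body text, n > 0 = heading of level n
    below: Optional[int] = None
    for line in reversed(text.splitlines()):
        match = _HEADING.match(line)
        if match:
            level = len(match.group(1))
            if below is None or 0 < below <= level:
                continue          # nothing but a sibling/parent heading follows: skip
            below = level
        elif line.strip():
            below = 0
        kept.append(line)
    cleaned = '\n'.join(reversed(kept))
    # Collapse the blank-line runs left behind by removed headings
    return re.sub(r'\n{3,}', '\n\n', cleaned).strip()
```

Walking the lines from the bottom up means a parent heading whose only children were empty is removed too, because by the time it is reached those children have already been dropped.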

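Potential risks: accuracy on scanned PDFs depends on OCR, so a hybrid mode is recommended (docintel with a tesseract fallback). Test set: benchmark with 50 mixed documents and iterate on the prompt.

Conclusion and extensions

An ETL pipeline built on markitdown simplifies the AI data pipeline: going from raw Office/PDF files to RAG-ready Markdown takes only a few lines of code. For deployment, the pipeline can be containerized with Docker, based on the official Dockerfile, with mounted volumes for batch processing. A future extension is to integrate the markitdown-mcp server for real-time interaction with LLMs such as Claude.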