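Getting LLMs to Read Word Files with Equations: Developing eqword2llm

As a researcher, I work with Word files almost daily. When I tried to have an AI assistant analyze technical documents containing equations, I ran into a problem.

If a Word file is passed to an LLM as-is, the equations are not recognized correctly.

eqword2llm is the tool I developed to solve this.

The problem: why existing tools fall short

Limitations of Pandoc

Pandoc is an excellent conversion tool, but it does not fully handle some Word-specific constructs.

As an example, take a document that uses Word's equation numbering feature.

Pandoc converts it as follows.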

Equation with Field Code

$$\begin{array}{r}
\mathbf{E = m}\mathbf{c}^{\mathbf{2}}\mathbf{\#}\left( \mathbf{}\mathbf{\ SEQ\ Equation\ \backslash*\ ARABIC\ }\mathbf{}\mathbf{1}\mathbf{} \right)
\end{array}$$

Another Equation

$$\begin{array}{r}
\mathbf{F = ma\#}\left( \mathbf{}\mathbf{\ SEQ\ Equation\ \backslash*\ ARABIC\ }\mathbf{}\mathbf{2}\mathbf{} \right)
\end{array}$$

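SEQ Equation, a field code internal to Word, is emitted verbatim into the output. In this form an LLM cannot interpret the equation correctly.

Limitations of Mammoth

Mammoth ignores equations entirely. The heading structure is preserved, but with the equations themselves missing it is unusable for scientific and technical documents.

Features of eqword2llm

1. Clean LaTeX output

Converting the same document with eqword2llm gives: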

Equation with Field Code

$$
E=mc^{2}
$$

Another Equation

$$
F=ma
$$

The field codes are removed, and only the plain LaTeX equation is output.

2. Use of standard LaTeX environments

Compare how a determinant is converted.

Pandoc:

$$|A| = \left| \begin{matrix}
a & b \\
c & d
\end{matrix} \right| = ad - bc$$

eqword2llm:

$$
\left|A\right|=\begin{vmatrix}a & b \\ c & d\end{vmatrix}=ad-bc
$$

eqword2llm uses \begin{vmatrix}, the standard determinant environment. The code is 27% shorter and the intent is clearer.
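As a rough illustration of the idea, the following sketch picks a matrix environment from the delimiter pair around a matrix. It is not eqword2llm's actual code: the `MATRIX_ENVIRONMENTS` table and the `matrix_to_latex` function are hypothetical names, and the input is assumed to be already parsed into rows of LaTeX cell strings.

```python
# Hypothetical mapping from a (left, right) delimiter pair to a LaTeX matrix
# environment. This is an illustrative assumption, not eqword2llm's real table.
MATRIX_ENVIRONMENTS = {
    ("|", "|"): "vmatrix",  # determinant
    ("[", "]"): "bmatrix",  # square brackets
    ("(", ")"): "pmatrix",  # parentheses
}


def matrix_to_latex(rows, open_delim="", close_delim=""):
    """Render rows of LaTeX cell strings in the environment matching the delimiters.

    Unknown or missing delimiters fall back to the plain `matrix` environment
    (a fuller converter would wrap that case in \\left ... \\right itself).
    """
    env = MATRIX_ENVIRONMENTS.get((open_delim, close_delim), "matrix")
    body = r" \\ ".join(" & ".join(row) for row in rows)
    return rf"\begin{{{env}}}{body}\end{{{env}}}"


print(matrix_to_latex([["a", "b"], ["c", "d"]], "|", "|"))
```

Choosing the environment from the delimiters keeps the determinant as a single construct instead of a generic matrix wrapped in separate bar delimiters.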

3. LLM-Ready output formats (v0.4.0 and later)

Starting with v0.4.0, eqword2llm supports output formats designed specifically for passing to an LLM.

Structured Format (with YAML front matter)

eqword2llm document.docx --format structured

Output:

---
format: eqword2llm/v1
source: document.docx
stats:
  sections: 3
  equations: 5
  headings: 8
equations:
  - id: 1
    latex: "E = mc^{2}"
    type: block
  - id: 2
    latex: "F = ma"
    type: block
---

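# Document content...

A list of the equations and summary statistics are attached as metadata, which makes it easier for an LLM to grasp the document structure.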
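On the consuming side, the front matter can be separated from the body before the text is sent to a model. The sketch below assumes the output has exactly the shape shown above (a `---` delimited block followed by Markdown). `split_front_matter` and `extract_equations` are hypothetical helpers, not part of eqword2llm, and the line-based parsing is a stdlib-only shortcut: it does not unescape YAML strings, so a real YAML parser would be more robust.

```python
def split_front_matter(text):
    """Split a '---' delimited front matter block from the Markdown body.

    Returns (front_matter_lines, body). Without front matter, returns ([], text).
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return [], text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return lines[1:i], "\n".join(lines[i + 1:])
    return [], text  # opening delimiter without a closing one


def extract_equations(front_matter_lines):
    """Collect the values of 'latex:' entries without a full YAML parser."""
    equations = []
    for line in front_matter_lines:
        key, sep, value = line.strip().partition(":")
        if sep and key == "latex":
            equations.append(value.strip().strip('"'))
    return equations


# Shortened sample in the shape of the structured output shown above.
SAMPLE_OUTPUT = """---
format: eqword2llm/v1
equations:
  - id: 1
    latex: "E = mc^{2}"
    type: block
---

# Body
"""

front_matter, body = split_front_matter(SAMPLE_OUTPUT)
print(extract_equations(front_matter))
```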

Prompt Format (LLM prompt generation)

eqword2llm document.docx --format prompt

This outputs the conversion result in a form that can be passed directly to an LLM API:

# Document for Analysis

## Document Information
- Source: document.docx
- Equations: 5
- Sections: 3

## Content

[Converted Markdown]

---

## Instructions

This document was converted from a Word file using eqword2llm.
Mathematical equations are formatted in LaTeX:
- Block equations: `$$...$$`
- Inline equations: `$...$`

Please analyze the content. If you need clarification about any equation or
concept, ask the user.

Usage

Installation

pip install eqword2llm

It has zero dependencies and runs on the Python standard library alone.

CLI

# Basic conversion
eqword2llm paper.docx -o paper.md

# Without equation numbers
eqword2llm paper.docx -o paper.md --no-equation-numbers

# With metadata
eqword2llm paper.docx -o paper.md --format structured

# LLM prompt format
eqword2llm paper.docx --format prompt | pbcopy  # copy to the clipboard (pbcopy is macOS-only)

Python API

from eqword2llm import WordToMarkdownConverter

# Basic conversion
converter = WordToMarkdownConverter("paper.docx")
markdown = converter.convert()

# Conversion with metadata
result = converter.convert_structured()
print(f"Equation count: {result.metadata.equation_count}")
for eq in result.metadata.equations:
    print(f"  [{eq.id}] {eq.latex}")

# Generate an LLM prompt
prompt = converter.to_llm_prompt()
# Add custom instructions
prompt = converter.to_llm_prompt(
    instructions="Explain the main equations in this document."
)

Integration with the Claude API

import anthropic
from eqword2llm import WordToMarkdownConverter

# Convert the Word document into an LLM prompt
converter = WordToMarkdownConverter("research_paper.docx")
prompt = converter.to_llm_prompt()

# Send it to Claude
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    messages=[{"role": "user", "content": prompt}]
)

print(response.content[0].text)

How the conversion works

A Word document (.docx) is actually a ZIP file with XML stored inside. Equations are written in a format called OMML (Office Math Markup Language).
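This structure can be inspected with the standard library alone. The following is a minimal sketch rather than eqword2llm's implementation: `count_equations` and `make_demo_docx` are hypothetical helpers, and the demo archive contains only `word/document.xml`, so it is enough for the parser but is not a file Word itself would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs for WordprocessingML and OMML elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
OMML_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def count_equations(docx):
    """Count <m:oMath> elements in the main document part of a .docx.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    return sum(1 for _ in root.iter(f"{{{OMML_NS}}}oMath"))


def make_demo_docx():
    """Build a tiny in-memory archive holding one inline and one display equation."""
    xml = (
        f'<w:document xmlns:w="{W_NS}" xmlns:m="{OMML_NS}"><w:body>'
        "<w:p><m:oMath><m:r><m:t>E=mc</m:t></m:r></m:oMath></w:p>"
        "<w:p><m:oMathPara><m:oMath><m:r><m:t>F=ma</m:t></m:r></m:oMath>"
        "</m:oMathPara></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


print(count_equations(make_demo_docx()))
```

For a real file, `count_equations("paper.docx")` would work the same way, since `zipfile.ZipFile` accepts a path as well.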

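Conversion flow

eqword2llm parses this OMML and converts it to LaTeX. During conversion it:

Removes Word field codes (such as SEQ Equation)
Selects an appropriate LaTeX environment (vmatrix, bmatrix, pmatrix, etc.)
Converts math functions to their standard form (lim → \lim, sin → \sin)
Preserves the heading structure as Markdown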
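The field-code removal step in the list above could be sketched as follows. This is an illustration under stated assumptions, not eqword2llm's actual code: it assumes the field is stored with the usual `w:fldChar` begin/separate/end markers inside the math runs, and that `#` separates the equation from its right-aligned number. `visible_math_text` and `equation_body` are hypothetical names, and the sample XML is a simplified, flat stand-in for what Word really writes.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
M = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def visible_math_text(omath):
    """Concatenate <m:t> text in document order, skipping everything inside a field.

    Field instructions live in <w:instrText>, which is never collected, and the
    cached field result (the number) sits between "begin" and "end".
    """
    depth = 0  # > 0 while between a fldChar "begin" and its matching "end"
    parts = []
    for node in omath.iter():
        if node.tag == f"{{{W}}}fldChar":
            kind = node.get(f"{{{W}}}fldCharType")
            if kind == "begin":
                depth += 1
            elif kind == "end":
                depth = max(0, depth - 1)
        elif node.tag == f"{{{M}}}t" and depth == 0:
            parts.append(node.text or "")
    return "".join(parts)


def equation_body(omath):
    """Drop the numbering part, assuming '#' marks where the number begins."""
    return visible_math_text(omath).split("#", 1)[0]


# Simplified stand-in for a numbered equation such as F=ma#(2).
SAMPLE = (
    f'<m:oMath xmlns:m="{M}" xmlns:w="{W}">'
    "<m:r><m:t>F=ma#(</m:t></m:r>"
    '<m:r><w:fldChar w:fldCharType="begin"/></m:r>'
    r"<m:r><w:instrText> SEQ Equation \* ARABIC </w:instrText></m:r>"
    '<m:r><w:fldChar w:fldCharType="separate"/></m:r>'
    "<m:r><m:t>2</m:t></m:r>"
    '<m:r><w:fldChar w:fldCharType="end"/></m:r>'
    "<m:r><m:t>)</m:t></m:r>"
    "</m:oMath>"
)

print(equation_body(ET.fromstring(SAMPLE)))
```

Tracking the field state at the XML level avoids having to clean up instruction text after it has already been turned into LaTeX.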

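Comparison with Pandoc

Feature	Pandoc	eqword2llm
Equation conversion	△ Partial	✅
Field code handling	❌	✅
Markdown headings	❌	✅
Determinant environment	`\left|...\right|`	`\begin{vmatrix}`
Vector notation	`\overset{\rightarrow}{v}`	`\vec{v}`
Metadata output	❌	✅
LLM prompt generation	❌	✅
Dependencies	Many	Zero

See comparison-details.md for a detailed comparison.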

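Future plans

Features currently under consideration:

Table conversion: Word tables → Markdown tables
Image extraction: saving embedded images as external files
Cross-references: preserving references to equations

Summary

eqword2llm is a tool focused on a single goal: getting LLMs to read Word documents with equations accurately.

I hope it proves useful whenever you want an AI assistant to analyze documents containing equations, such as research papers, technical reports, and lecture materials.

Repository: https://github.com/manabelab/eqword2llm
PyPI: https://pypi.org/project/eqword2llm/
License: MIT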
