to-markdown
A versatile, TypeScript-first utility for converting PDF, DOCX, HTML, Excel, CSV, and more into clean Markdown — ready for RAG pipelines and LLM context windows.
What's in the box
Built for anyone building a RAG pipeline or document workflow that needs to normalise heterogeneous file formats into a single, LLM-friendly representation. Each capability is opt-in: use the parts that fit, leave the rest.
Multi-format support
Converts PDF, DOCX, HTML, Excel, CSV, and other formats into structured Markdown with a single API.
Promise-based API
Simple async interface that returns Markdown strings with predictable structure for downstream parsing.
TypeScript first
Written in TypeScript with full type definitions and zero-config import in modern toolchains.
Customisable conversion
Options to control table handling, image extraction, heading depth, and chunking-friendly output.
RAG-ready output
Output shape is tuned for ingestion: clean headings, stable IDs, and minimal noise from source styling.
Modular & fast
Per-format adapters keep the bundle lean — only load the converters you actually need.
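One way to picture the per-format adapter idea is a registry of loaders that are only invoked on first use. The sketch below is an illustration, not the library's actual internals: `Adapter`, `AdapterLoader`, `registry`, `loadAdapter`, and `convert` are hypothetical names, and inputs are plain strings to keep the example short.

```typescript
// Hypothetical sketch of lazily loaded per-format adapters.
// None of these names are taken from the to-markdown API.
type Adapter = (input: string) => Promise<string>;
type AdapterLoader = () => Promise<Adapter>;

// Turns simple comma-separated text into a Markdown table (no quoted-field support).
function csvToMarkdownTable(csv: string): string {
  const rows = csv
    .trim()
    .split(/\r?\n/)
    .map((line) => line.split(",").map((cell) => cell.trim()));
  const header = rows[0] ?? [];
  const body = rows.slice(1);
  const toRow = (cells: string[]): string => `| ${cells.join(" | ")} |`;
  const divider = `|${header.map(() => "---").join("|")}|`;
  return [toRow(header), divider, ...body.map(toRow)].join("\n");
}

// The registry stores loaders, not adapters, so nothing is built until requested.
// In a real package each loader would more likely wrap a dynamic import() of a
// separate module; inline functions stand in for that here.
const registry = new Map<string, AdapterLoader>([
  ["txt", async () => async (input: string) => input.trim()],
  ["csv", async () => async (input: string) => csvToMarkdownTable(input)],
]);

// Cache the pending promise so each adapter is loaded at most once.
const loaded = new Map<string, Promise<Adapter>>();

function loadAdapter(format: string): Promise<Adapter> {
  const cached = loaded.get(format);
  if (cached) return cached;
  const loader = registry.get(format);
  if (!loader) {
    return Promise.reject(new Error(`No adapter registered for "${format}"`));
  }
  const pending = loader();
  loaded.set(format, pending);
  return pending;
}

// Promise-based entry point: resolve the adapter, then run it.
async function convert(format: string, input: string): Promise<string> {
  const adapter = await loadAdapter(format);
  return adapter(input);
}
```

Under these assumptions, `convert("csv", "name,qty\napple,3")` resolves to a three-line Markdown table, and formats that are never requested never have their loader called.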
How it runs
A small, modular pipeline: detect the source format, run the right adapter, normalise the structure, and emit clean Markdown — ready to chunk and embed.
Inputs
- PDF · DOCX
- HTML · plain text
- Excel · CSV
- Bring-your-own buffer
Options
- imageMode
- tableStrategy
- headingDepth
- chunkHint
Output
- Clean Markdown
- Stable headings
- Per-doc metadata
- Streaming-friendly
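The detect, adapt, normalise, emit flow described under "How it runs" can be sketched roughly as follows. This is a minimal illustration under stated assumptions: every function and type name is hypothetical, `headingDepth` is assumed to mean "maximum heading level", the HTML check is deliberately crude, and the flow is kept synchronous for brevity even though the library describes a Promise-based API.

```typescript
// Hypothetical pipeline sketch; none of these names come from the to-markdown API.
type Format = "pdf" | "zip-office" | "html" | "text";
type Adapter = (bytes: Uint8Array) => string;

interface ConvertOptions {
  // Assumed meaning: deepest heading level allowed in the output (1 to 6).
  headingDepth?: number;
}

// Step 1: detect the source format from leading bytes.
function detectFormat(bytes: Uint8Array): Format {
  const startsWith = (sig: number[]): boolean => sig.every((b, i) => bytes[i] === b);
  if (startsWith([0x25, 0x50, 0x44, 0x46, 0x2d])) return "pdf"; // "%PDF-"
  if (startsWith([0x50, 0x4b, 0x03, 0x04])) return "zip-office"; // ZIP container (DOCX, XLSX)
  if (startsWith([0x3c])) return "html"; // "<" (crude; ignores leading whitespace)
  return "text";
}

// Step 3a: clamp ATX headings so none is deeper than maxDepth.
// Note: this sketch does not skip "#" lines inside fenced code blocks.
function clampHeadings(markdown: string, maxDepth: number): string {
  const limit = Math.max(1, Math.min(6, maxDepth));
  return markdown
    .split("\n")
    .map((line) => {
      const match = /^(#{1,6})\s+(.*)$/.exec(line);
      if (!match) return line;
      const [, hashes = "", title = ""] = match;
      return `${"#".repeat(Math.min(hashes.length, limit))} ${title.trim()}`;
    })
    .join("\n");
}

// Step 3b: collapse runs of blank lines and trim the edges.
function normalise(markdown: string, options: ConvertOptions): string {
  return clampHeadings(markdown, options.headingDepth ?? 6)
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}

// Steps 1 to 4 combined: detect, run the matching adapter, normalise, emit.
function toMarkdown(
  bytes: Uint8Array,
  adapters: Partial<Record<Format, Adapter>>,
  options: ConvertOptions = {}
): string {
  const format = detectFormat(bytes);
  const adapter = adapters[format]; // Step 2: pick the adapter for this format
  if (!adapter) throw new Error(`No adapter available for format "${format}"`);
  return normalise(adapter(bytes), options);
}
```

Keeping detection and normalisation as pure functions means each stage can be tested without any real PDF or DOCX parser; the adapters passed to `toMarkdown` are where format-specific work would live.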
Quickstart
Install, configure, run. The example below is the smallest piece of code that does something useful in production.
How it compares
Against the utilities teams most often weigh, with a focus on operational concerns rather than feature inventories.
| Capability | to-markdown | pandoc | unstructured.io | turndown |
|---|---|---|---|---|
| PDF | ● native | ◐ partial | ● native | ○ missing |
| DOCX | ● native | ● native | ● native | ○ missing |
| HTML | ● native | ● native | ● native | ● native |
| Excel / CSV | ● native | ◐ partial | ● native | ○ missing |
| TypeScript | ● native | ○ missing | ○ missing | ● native |
| Open source | ● native | ● native | ◐ partial | ● native |