How I Built a PDF to EPUB Converter That Actually Works
I got fed up with PDF converters that mangle everything, so I built my own in one night. Here's how I taught a web app to tell scanned pages from real text.
I've been hoarding PDFs for years. Technical papers, scanned book chapters, research notes. The problem? Reading a PDF on a Kindle is a miserable experience — tiny text, no reflow, constant pinch-to-zoom. I needed EPUBs.
So I tried every converter I could find. Calibre is powerful but the UI is from 2006. Online tools either inject watermarks, compress images into soup, or completely butcher the structure. None of them could handle scanned documents — they'd try to OCR everything, including pages that already had clean text.
That's when the core insight hit me: PDFs have three distinct content types, and every converter conflates them.
- Native text — text objects embedded in the PDF. Already perfect. Just extract and reflow.
- Scanned images — pictures of pages. Need OCR to recover text.
- Mixed pages — tables, figures, equations with surrounding text. Need careful extraction to avoid double-processing.
The fix was simple in concept: classify every page before deciding how to process it.
```typescript
type PageType = 'native-text' | 'scanned-image' | 'mixed' | 'image-heavy';

function classifyPage(page: PDFPage): PageType {
  const textItems = page.getTextContent();
  const imageCount = page.getImages().length;
  // Plenty of text, no images: the embedded text layer is already clean.
  if (textItems.length > 20 && imageCount === 0) return 'native-text';
  // Almost no text but at least one image: a scanned page, send to OCR.
  if (textItems.length < 5 && imageCount >= 1) return 'scanned-image';
  // Substantial text alongside images: extract the text, OCR only the images.
  if (textItems.length > 10 && imageCount > 0) return 'mixed';
  // Everything else (figure-dominated pages) takes the conservative path.
  return 'image-heavy';
}
```
Once you know what each page is, conversion becomes deterministic: native text gets extracted with font metrics to infer headings, scanned images get sent to OCR, mixed pages get both.
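The routing step can be sketched like this. `Page`, `extractText`, and `ocr` are simplified stand-ins for illustration, not real PDF.js or OCR APIs; the point is that once a page has a type, processing is a plain switch:

```typescript
type PageType = 'native-text' | 'scanned-image' | 'mixed' | 'image-heavy';

interface Page { textItems: string[] }

// Stand-in processors: a real version would reflow text using font
// metrics, or hand a rendered page image to an OCR engine like Tesseract.
const extractText = (p: Page): string => p.textItems.join(' ');
const ocr = (_p: Page): string => '[ocr text]';

function processPage(page: Page, kind: PageType): string {
  switch (kind) {
    case 'native-text':
      return extractText(page);          // text layer is already clean
    case 'scanned-image':
    case 'image-heavy':
      return ocr(page);                  // no usable text layer
    case 'mixed':
      // Take the embedded text and OCR only what it misses,
      // so nothing gets processed twice.
      return extractText(page) + ' ' + ocr(page);
  }
}
```

Because `PageType` is a discriminated union, TypeScript checks that every page type has a branch, which keeps the dispatch honest as new types are added.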
The Tech Stack
I went with Next.js 16 and TypeScript because I wanted fast iteration without fighting the framework. Tailwind v4 for styling, shadcn/ui for components. The whole conversion runs client-side using PDF.js — no server uploads, no privacy concerns. Your files never leave your browser.
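As a concrete sketch of the client-side scan: PDF.js exposes the text layer through an async API (`getDocument(...).promise`, then `page.getTextContent()`, which resolves to an object with an `items` array, and pages are 1-indexed). The interfaces below are pared-down stand-ins so the function can be shown without a browser; in the real app the document would come from `pdfjsLib.getDocument(arrayBuffer).promise`:

```typescript
// Pared-down shapes mirroring the parts of the PDF.js API used here.
interface TextContentLike { items: unknown[] }
interface PageLike { getTextContent(): Promise<TextContentLike> }
interface DocLike { numPages: number; getPage(n: number): Promise<PageLike> }

// Count text-layer items per page: the raw signal the classifier runs on.
async function countTextItems(doc: DocLike): Promise<number[]> {
  const counts: number[] = [];
  for (let i = 1; i <= doc.numPages; i++) {  // PDF.js pages start at 1
    const page = await doc.getPage(i);
    const content = await page.getTextContent();
    counts.push(content.items.length);
  }
  return counts;
}
```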
The EPUB generation was the hard part. EPUB is just a ZIP of HTML files with specific metadata, but the structure has to be right or e-readers reject it silently. I spent more time debugging EPUB packaging than on the actual OCR logic.
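For orientation, the part of the packaging that trips up e-readers is small but strict: the archive's first entry must be an uncompressed file named `mimetype` containing exactly `application/epub+zip`, and `META-INF/container.xml` must point at the package document (the OPF manifest, which lists every file and the reading order). The `OEBPS/content.opf` path below is a common convention, not a requirement:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- META-INF/container.xml: tells the reader where the package document lives -->
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf"
              media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
```

Get the `mimetype` ordering or compression wrong and many readers fail with no error at all, which is exactly the kind of silent rejection that made packaging the slowest part to debug.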
What Surprised Me
File sizes were wild. The Feynman Lectures scan went from 87 MB to 14 MB. That's not compression — that's replacing 560 pages of images with text. Native-text PDFs shrink even more because you're stripping layout overhead.
I deleted the settings panel. My first version had toggles for OCR engine, output format, image quality, chapter detection sensitivity... it was overwhelming. I cut it all. The app picks smart defaults and shows you what it did afterward. Much better.
Warm colors matter. I started with the typical dark-blue tech aesthetic. It felt like a terminal. Switching to amber and parchment — bookish tones — completely changed how the app felt. People trust warm tones for documents.
What's Next
Math equation support is the big gap — LaTeX renders in PDFs as vector paths that confuse OCR. Batch queue processing for converting entire folders. And eventually a CLI for power users who want to pipe this into their workflows.
Drop a PDF, get an EPUB. Works on scanned books, clean documents, and everything in between. Let me know what breaks — I'm building this for people who actually read.