📄 Local LLM PDF OCR

Transform scanned and written documents into fully searchable, selectable PDFs using the power of Local LLM Vision.

PDF LLM OCR is a next-generation OCR tool that moves beyond traditional Tesseract-based scanning. By leveraging OCR Vision Language Models (VLMs) like olmOCR running locally on your machine, it "reads" documents with human-like understanding while keeping 100% of your data private.

✨ Features

🧠 AI-Powered Vision: Uses advanced VLMs to transcribe text with high accuracy, even on complex layouts or noisy scans.
🤝 Hybrid Alignment Strategy: Combines Surya OCR Detection for precise bounding boxes with Local LLM for perfect text content via position-based alignment.
⚡ 10-21x Faster Detection: Uses detection-only mode (skips slow recognition) and batch processing for maximum speed.
🔒 100% Local & Private: No cloud APIs, no subscription fees. Run it entirely offline using LM Studio.
🔍 Searchable Outputs: Embeds an invisible text layer directly into your PDF, making it compatible with valid PDF readers for searching (Ctrl+F) and selecting.
🖥️ Dual Interfaces:
- Web UI: An interface with Drag & Drop, Dark Mode, and Real-time progress tracking.
- CLI: A robust command-line tool for power users and batch automation, featuring a "lively" terminal UI.
⚡ Real-time Feedback: Watch your document process page-by-page with live web sockets or animated terminal bars.

🏗️ Architecture

graph TD
    A[Input PDF] --> B[PDF to Image Conversion]
    B --> C[Batch Processing]

    subgraph "Phase 1: Layout Detection (Surya)"
        C --> D[Surya DetectionPredictor]
        D --> E[Bounding Boxes]
        E --> F[Sorted by Reading Order]
    end

    subgraph "Phase 2: Text Extraction (Local LLM)"
        C --> G[OlmOCR Vision Model]
        G --> H[Pure Text Content]
    end

    F --> I[Position-Based Aligner]
    H --> I

    I -->|Distribute by Box Width| J[Aligned Text Blocks]
    J --> K[Sandwich PDF Generator]
    K --> L[Searchable PDF Output]

How It Works

Batch Layout Detection: Surya's DetectionPredictor processes all pages at once, extracting bounding boxes without slow text recognition (~1s total vs ~20s per page with recognition).
LLM Text Extraction: A local vision model (OlmOCR) reads each page with human-like understanding, handling handwriting and complex layouts perfectly.
Position-Based Alignment: The aligner distributes LLM text across detected boxes proportionally by box width in reading order—no fuzzy matching needed.
Sandwich PDF: The original page is rendered as an image with invisible, searchable text overlaid using PyMuPDF.

🚀 Getting Started

Prerequisites

Python 3.10+
LM Studio: Download and install LM Studio.
- Load a Vision Model (highly recommended: allenai/olmocr-2-7b).
- Start the Local Server at default port 1234.

Configuration

Create a .env file in the root directory to configure your Local LLM:

LLM_API_BASE=http://localhost:1234/v1
LLM_MODEL=allenai/olmocr-2-7b

Installation

This project is managed with uv for lightning-fast dependency management.

Install uv (if not installed):
```
pip install uv
```

Clone the repository:

git clone https://github.com/ahnafnafee/pdf-ocr-llm.git
cd pdf-ocr-llm

Sync Dependencies:
```
uv sync
```

Usage

1. 🌐 Web Interface (Recommended)

The easiest way to use the tool. Features a modern dashboard with Dark Mode and Text Preview.

Start the Server:

uv run uvicorn server:app --reload --port 8000

Open your browser to http://localhost:8000.
Drag & Drop your PDF.
Watch the magic happen! ✨
- Real-time Progress: Track per-page OCR status.
- Preview: Click "View Text" to inspect the raw AI extraction.
- Dark Mode: Toggle the moon icon for a sleek dark theme.

2. 💻 Command Line Interface (CLI)

Perfect for developers or integrating into scripts.

Run the OCR tool on any PDF:

uv run main.py input.pdf output_ocr.pdf

Options:

Option	Description
`input_pdf`	Path to input PDF (required)
`output_pdf`	Path to output PDF (optional, defaults to `<input>_ocr.pdf`)
`-v`, `--verbose`	Enable debug logging (alignment details, box counts)
`-q`, `--quiet`	Suppress all output except errors
`--dpi <int>`	DPI for image rendering (default: 200)
`--pages <range>`	Page range to process, e.g., `1-3,5` (default: all)
`--api-base <url>`	Override LLM API base URL
`--model <name>`	Override LLM model name

Examples:

# Basic usage (auto-generates input_ocr.pdf)
uv run main.py scan.pdf

# Process specific pages with higher quality
uv run main.py document.pdf output.pdf --pages 1-5 --dpi 300

# Use a different model with verbose output
uv run main.py report.pdf --model "custom-model" --verbose

You'll see beautiful animated progress bars showing batch detection and per-page LLM processing.

📁 Project Structure

local-llm-pdf-ocr/
├── src/pdf_ocr/           # Core package
│   ├── core/              # OCR processing modules
│   │   ├── aligner.py     # Hybrid text alignment
│   │   ├── ocr.py         # LLM OCR processor
│   │   └── pdf.py         # PDF handling utilities
│   └── utils/             # Utility modules
│       └── tqdm_patch.py  # Progress bar silencer
├── scripts/               # Debug and visualization tools
├── static/                # Web UI assets
├── examples/              # Sample PDFs
├── main.py                # CLI entry point
└── server.py              # Web server

🛠️ Tech Stack

Backend: FastAPI (Async Web Framework)
Frontend: Vanilla JS + CSS Variables
PDF Processing: PyMuPDF (Fitz)
Layout Detection: Surya OCR (Detection-only mode)
AI Integration: OpenAI Client (compatible with Local LLM servers)
CLI UI: Rich (Terminal formatting)

⚡ Performance

Document Type	Detection Time	Speedup vs Recognition
Digital PDF	~1s	21x faster
Handwritten	~1s	10x faster
Hybrid Form	~1s	11x faster

Detection uses batch processing—all pages in one call.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License: MIT

ahnafnafee/local-llm-pdf-ocr