GitHunt
AH

ahnafnafee/local-llm-pdf-ocr

Convert scanned PDFs into searchable text locally using Vision LLMs (olmOCR). 100% private, offline, and free. Features a modern Web UI & CLI.

πŸ“„ Local LLM PDF OCR

Python
FastAPI
License
Local AI

Transform scanned and written documents into fully searchable, selectable PDFs using the power of Local LLM Vision.

PDF LLM OCR is a next-generation OCR tool that moves beyond traditional Tesseract-based scanning. By leveraging OCR Vision Language Models (VLMs) like olmOCR running locally on your machine, it "reads" documents with human-like understanding while keeping 100% of your data private.


✨ Features

  • 🧠 AI-Powered Vision: Uses advanced VLMs to transcribe text with high accuracy, even on complex layouts or noisy scans.
  • 🀝 Hybrid Alignment Strategy: Combines Surya OCR Detection for precise bounding boxes with Local LLM for perfect text content via position-based alignment.
  • ⚑ 10-21x Faster Detection: Uses detection-only mode (skips slow recognition) and batch processing for maximum speed.
  • πŸ”’ 100% Local & Private: No cloud APIs, no subscription fees. Run it entirely offline using LM Studio.
  • πŸ” Searchable Outputs: Embeds an invisible text layer directly into your PDF, making it compatible with valid PDF readers for searching (Ctrl+F) and selecting.
  • πŸ–₯️ Dual Interfaces:
    • Web UI: An interface with Drag & Drop, Dark Mode, and Real-time progress tracking.
    • CLI: A robust command-line tool for power users and batch automation, featuring a "lively" terminal UI.
  • ⚑ Real-time Feedback: Watch your document process page-by-page with live web sockets or animated terminal bars.

πŸ—οΈ Architecture

graph TD
    A[Input PDF] --> B[PDF to Image Conversion]
    B --> C[Batch Processing]

    subgraph "Phase 1: Layout Detection (Surya)"
        C --> D[Surya DetectionPredictor]
        D --> E[Bounding Boxes]
        E --> F[Sorted by Reading Order]
    end

    subgraph "Phase 2: Text Extraction (Local LLM)"
        C --> G[OlmOCR Vision Model]
        G --> H[Pure Text Content]
    end

    F --> I[Position-Based Aligner]
    H --> I

    I -->|Distribute by Box Width| J[Aligned Text Blocks]
    J --> K[Sandwich PDF Generator]
    K --> L[Searchable PDF Output]
Loading

How It Works

  1. Batch Layout Detection: Surya's DetectionPredictor processes all pages at once, extracting bounding boxes without slow text recognition (~1s total vs ~20s per page with recognition).

  2. LLM Text Extraction: A local vision model (OlmOCR) reads each page with human-like understanding, handling handwriting and complex layouts perfectly.

  3. Position-Based Alignment: The aligner distributes LLM text across detected boxes proportionally by box width in reading orderβ€”no fuzzy matching needed.

  4. Sandwich PDF: The original page is rendered as an image with invisible, searchable text overlaid using PyMuPDF.


πŸš€ Getting Started

Prerequisites

  1. Python 3.10+
  2. LM Studio: Download and install LM Studio.
    • Load a Vision Model (highly recommended: allenai/olmocr-2-7b).
    • Start the Local Server at default port 1234.

Configuration

Create a .env file in the root directory to configure your Local LLM:

LLM_API_BASE=http://localhost:1234/v1
LLM_MODEL=allenai/olmocr-2-7b

Installation

This project is managed with uv for lightning-fast dependency management.

  1. Install uv (if not installed):

    pip install uv
  2. Clone the repository:

    git clone https://github.com/ahnafnafee/pdf-ocr-llm.git
    cd pdf-ocr-llm
  3. Sync Dependencies:

    uv sync

Usage

The easiest way to use the tool. Features a modern dashboard with Dark Mode and Text Preview.

  1. Start the Server:
    uv run uvicorn server:app --reload --port 8000
  2. Open your browser to http://localhost:8000.
  3. Drag & Drop your PDF.
  4. Watch the magic happen! ✨
    • Real-time Progress: Track per-page OCR status.
    • Preview: Click "View Text" to inspect the raw AI extraction.
    • Dark Mode: Toggle the moon icon for a sleek dark theme.

2. πŸ’» Command Line Interface (CLI)

Perfect for developers or integrating into scripts.

Run the OCR tool on any PDF:

uv run main.py input.pdf output_ocr.pdf

Options:

Option Description
input_pdf Path to input PDF (required)
output_pdf Path to output PDF (optional, defaults to <input>_ocr.pdf)
-v, --verbose Enable debug logging (alignment details, box counts)
-q, --quiet Suppress all output except errors
--dpi <int> DPI for image rendering (default: 200)
--pages <range> Page range to process, e.g., 1-3,5 (default: all)
--api-base <url> Override LLM API base URL
--model <name> Override LLM model name

Examples:

# Basic usage (auto-generates input_ocr.pdf)
uv run main.py scan.pdf

# Process specific pages with higher quality
uv run main.py document.pdf output.pdf --pages 1-5 --dpi 300

# Use a different model with verbose output
uv run main.py report.pdf --model "custom-model" --verbose

You'll see beautiful animated progress bars showing batch detection and per-page LLM processing.


πŸ“ Project Structure

local-llm-pdf-ocr/
β”œβ”€β”€ src/pdf_ocr/           # Core package
β”‚   β”œβ”€β”€ core/              # OCR processing modules
β”‚   β”‚   β”œβ”€β”€ aligner.py     # Hybrid text alignment
β”‚   β”‚   β”œβ”€β”€ ocr.py         # LLM OCR processor
β”‚   β”‚   └── pdf.py         # PDF handling utilities
β”‚   └── utils/             # Utility modules
β”‚       └── tqdm_patch.py  # Progress bar silencer
β”œβ”€β”€ scripts/               # Debug and visualization tools
β”œβ”€β”€ static/                # Web UI assets
β”œβ”€β”€ examples/              # Sample PDFs
β”œβ”€β”€ main.py                # CLI entry point
└── server.py              # Web server

πŸ› οΈ Tech Stack

  • Backend: FastAPI (Async Web Framework)
  • Frontend: Vanilla JS + CSS Variables
  • PDF Processing: PyMuPDF (Fitz)
  • Layout Detection: Surya OCR (Detection-only mode)
  • AI Integration: OpenAI Client (compatible with Local LLM servers)
  • CLI UI: Rich (Terminal formatting)

⚑ Performance

Document Type Detection Time Speedup vs Recognition
Digital PDF ~1s 21x faster
Handwritten ~1s 10x faster
Hybrid Form ~1s 11x faster

Detection uses batch processingβ€”all pages in one call.


🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License: MIT

ahnafnafee/local-llm-pdf-ocr | GitHunt