thiagoaramizo/file-to-md

File (Local OCR) to Markdown Service

A local document processing service designed to extract structured text (Markdown) from various file formats using OCR (Tesseract) and native parsers.

This project follows Clean Architecture principles, ensuring decoupling between business rules, frameworks, and infrastructure details.


Features

  • Offline Processing: No dependency on external APIs (Cloud Vision, AWS Textract, etc.).
  • Multi-format Support: Native support for:
    • PDF: Hybrid text extraction (native text + OCR for scanned pages).
    • DOCX: Structure preservation (headings, paragraphs, tables).
    • CSV / XLSX: Intelligent conversion of spreadsheets to Markdown tables.
    • Images: Direct OCR via Tesseract.
  • Standardized Output: All content is returned in Markdown, ideal for LLMs (RAG) or indexing.
  • REST API: Simple and direct interface via FastAPI.
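As a rough illustration of the spreadsheet-to-Markdown conversion listed above, the sketch below turns CSV text into a Markdown table using only the standard library. The function name `csv_to_markdown` is hypothetical and the project's actual parser may differ, for example in how it aligns columns.

```python
import csv
import io


def csv_to_markdown(text: str) -> str:
    """Convert CSV text into a Markdown table; the first row is the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    header, *body = rows
    width = len(header)

    def fmt(cells):
        # Pad or trim each row to the header width and escape pipe characters.
        cells = (list(cells) + [""] * width)[:width]
        escaped = (c.strip().replace("|", "\\|") for c in cells)
        return "| " + " | ".join(escaped) + " |"

    lines = [fmt(header), "|" + "|".join(["---"] * width) + "|"]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)


print(csv_to_markdown("name,age\nAlice,30\nBob,25"))
```

Short rows are padded with empty cells so that every line of the table has the same number of columns.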

Security Features

This service implements several security best practices:

  • Input Validation: Enforces a strict file size limit (10 MB by default) and validates file types by their magic bytes (MIME sniffing) rather than by the declared type, to prevent malicious file uploads.
  • Host Protection: TrustedHostMiddleware prevents Host Header Injection attacks.
  • CORS Configuration: Restrictive CORS policy configurable via environment variables.
  • Safe Parsing: Uses defusedxml and safe library configurations to prevent XML External Entity (XXE) attacks.
  • Non-root Execution: Runs as a non-privileged user inside the Docker container.
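The input validation described above can be sketched roughly as follows. This is a minimal standard-library illustration, not the project's code: the signature table and the names `sniff_mime` and `validate_upload` are assumptions, and a real service would more likely delegate type detection to libmagic. Plain-text formats such as CSV have no signature and would need a separate check.

```python
import base64
from typing import Optional, Tuple

MAX_BYTES = 10 * 1024 * 1024  # 10 MB, the default limit stated above

# Hypothetical signature table: leading bytes -> MIME type.
SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"PK\x03\x04": "application/zip",  # DOCX and XLSX are ZIP containers
}


def sniff_mime(data: bytes) -> Optional[str]:
    """Return the MIME type implied by the leading bytes, or None if unknown."""
    for magic, mime in SIGNATURES.items():
        if data.startswith(magic):
            return mime
    return None


def validate_upload(content_base64: str, max_bytes: int = MAX_BYTES) -> Tuple[bytes, str]:
    """Decode a Base64 upload, enforce the size limit, and sniff its type."""
    try:
        data = base64.b64decode(content_base64, validate=True)
    except ValueError as exc:  # binascii.Error is a subclass of ValueError
        raise ValueError("content is not valid Base64") from exc
    if len(data) > max_bytes:
        raise ValueError("file exceeds the size limit")
    mime = sniff_mime(data)
    if mime is None:
        raise ValueError("unrecognized file type")
    return data, mime


sample = base64.b64encode(b"%PDF-1.7 example").decode("ascii")
print(validate_upload(sample)[1])  # application/pdf
```

The check relies on the decoded bytes, so a client cannot bypass it by declaring a misleading filename or MIME type.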

Architecture

The project is organized following Clean Architecture:

src/
├── domain/                  # Core Layer (Pure)
│   ├── entities/            # Business Entities (e.g., Document)
│   └── interfaces/          # Contracts (DocumentParser, OCRProvider)
│
├── application/             # Use Cases
│   ├── use_cases/           # Orchestration (ProcessDocumentUseCase)
│   └── dto/                 # Data Transfer Objects (Input/Output)
│
├── infrastructure/          # Technical Details
│   ├── ocr/                 # Tesseract Implementation
│   └── parsers/             # Parsing Implementations (PDF, Docx, etc.)
│
└── presentation/            # External Entry Point
    └── api/                 # REST API (FastAPI)
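To make the layering concrete, here is a minimal sketch of how the domain contracts might look. The names Document, DocumentParser, and OCRProvider come from the tree above; the method names (`parse`, `image_to_text`), the fields, and the `FakeOCR` and `ImageParser` classes are assumptions made for illustration only.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Domain entity: the Markdown extracted from one file."""
    filename: str
    markdown: str


class OCRProvider(ABC):
    """Contract for any OCR engine (Tesseract would live in infrastructure/ocr)."""

    @abstractmethod
    def image_to_text(self, image_bytes: bytes) -> str:
        ...


class DocumentParser(ABC):
    """Contract for format-specific parsers (infrastructure/parsers)."""

    @abstractmethod
    def parse(self, filename: str, content: bytes) -> Document:
        ...


class FakeOCR(OCRProvider):
    """Stand-in OCR engine, so the sketch runs without Tesseract."""

    def image_to_text(self, image_bytes: bytes) -> str:
        return "recognized text"


class ImageParser(DocumentParser):
    """Depends only on the OCRProvider contract, not on a concrete engine."""

    def __init__(self, ocr: OCRProvider) -> None:
        self._ocr = ocr

    def parse(self, filename: str, content: bytes) -> Document:
        text = self._ocr.image_to_text(content)
        return Document(filename, f"# {filename}\n\n{text}")


print(ImageParser(FakeOCR()).parse("scan.png", b"\x89PNG").markdown)
```

Because the parser receives its OCR engine through the constructor, unit tests can swap in a fake provider without installing Tesseract.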

Processing Pipeline

  1. The API receives the file as a Base64-encoded string in a JSON request body.
  2. The use case decodes the content and selects the appropriate parser via a factory.
  3. The parser extracts the text, triggering OCR if necessary.
  4. A Document entity is created, and the formatted Markdown is returned.
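The four steps above can be sketched as a small use case. Only ProcessDocumentUseCase and Document are names taken from this README; `ProcessDocumentInput`, `ParserFactory`, `for_mime_type`, and the plain-text parser are hypothetical stand-ins, and the heading format simply mirrors the response example.

```python
import base64
from dataclasses import dataclass
from typing import Callable, Dict

Parser = Callable[[bytes], str]  # simplified parser contract: bytes in, text out


@dataclass(frozen=True)
class ProcessDocumentInput:
    filename: str
    mime_type: str
    content_base64: str


@dataclass(frozen=True)
class Document:
    filename: str
    markdown: str


def parse_plain_text(content: bytes) -> str:
    """Trivial parser used here in place of the real PDF/DOCX/OCR parsers."""
    return content.decode("utf-8")


class ParserFactory:
    """Maps a MIME type to the parser that handles it."""

    def __init__(self, parsers: Dict[str, Parser]) -> None:
        self._parsers = parsers

    def for_mime_type(self, mime_type: str) -> Parser:
        try:
            return self._parsers[mime_type]
        except KeyError:
            raise ValueError(f"unsupported MIME type: {mime_type}") from None


class ProcessDocumentUseCase:
    def __init__(self, factory: ParserFactory) -> None:
        self._factory = factory

    def execute(self, request: ProcessDocumentInput) -> Document:
        content = base64.b64decode(request.content_base64)       # step 2: decode
        parser = self._factory.for_mime_type(request.mime_type)  # step 2: select parser
        body = parser(content)                                   # step 3: extract text
        markdown = f"# {request.filename}\n\n{body}"             # step 4: build entity
        return Document(request.filename, markdown)


use_case = ProcessDocumentUseCase(ParserFactory({"text/plain": parse_plain_text}))
encoded = base64.b64encode(b"hello").decode("ascii")
print(use_case.execute(ProcessDocumentInput("note.txt", "text/plain", encoded)).markdown)
```

Supporting a new format in this sketch means registering one more entry in the factory, with no change to the use case itself.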

Running with Docker

The easiest way to run the service is with Docker, which automatically configures all system dependencies (Tesseract, Poppler).

Prerequisites

  • Docker and Docker Compose installed.

Steps

  1. Build and Start:

    docker compose up -d --build
  2. Check Status:
    The service will be running at http://localhost:8000.

    • Healthcheck: http://localhost:8000/health
    • Documentation (Swagger): http://localhost:8000/docs
  3. Stop the Service:

    docker compose down

Configuration (Environment Variables)

For production environments, you should configure the following environment variables in docker-compose.yml or your deployment system:

  • ALLOWED_ORIGINS: Comma-separated list of allowed origins for CORS (e.g., https://myapp.com,http://localhost:3000). Defaults to *.
  • ALLOWED_HOSTS: Comma-separated list of allowed hosts (e.g., api.myapp.com,localhost). Defaults to *.
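A minimal sketch of how these two variables might be read, assuming they are plain comma-separated strings as described above. The helper names `parse_list` and `load_settings` are hypothetical, and the service's real configuration code may differ.

```python
import os
from typing import Dict, List, Mapping


def parse_list(raw: str) -> List[str]:
    """Split a comma-separated value, dropping whitespace and empty entries."""
    items = (item.strip() for item in raw.split(","))
    return [item for item in items if item]


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, List[str]]:
    """Read both variables, falling back to the documented default of '*'."""
    return {
        "allowed_origins": parse_list(env.get("ALLOWED_ORIGINS", "*")),
        "allowed_hosts": parse_list(env.get("ALLOWED_HOSTS", "*")),
    }


example_env = {"ALLOWED_ORIGINS": "https://myapp.com, http://localhost:3000"}
print(load_settings(example_env))
```

In a FastAPI application, lists like these would typically be passed to `CORSMiddleware` (as `allow_origins`) and `TrustedHostMiddleware` (as `allowed_hosts`).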

Local Execution (Development)

To run the service outside Docker, you need to install the system dependencies manually.

1. System Dependencies

  • macOS (Homebrew):
    brew install tesseract tesseract-lang poppler libmagic
  • Ubuntu/Debian:
    sudo apt-get install tesseract-ocr tesseract-ocr-por poppler-utils libmagic1

2. Python Environment

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

3. Run Application

python src/main.py

Testing the API

Via Web Interface (Included)

We provide a simple HTML interface for quick testing.
Simply open the file below in your browser:

tests/interface.html

Via cURL

curl -X POST "http://localhost:8000/process" \
     -H "Content-Type: application/json" \
     -d '{
           "filename": "test.csv",
           "mimeType": "text/csv",
           "contentBase64": "bmFtZSxhZ2UKQWxpY2UsMzAKQm9iLDI1"
         }'

Response Example

{
  "markdown": "# test.csv\n\n| name | age |\n|---:|---:|\n| Alice | 30 |\n| Bob | 25 |"
}
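The same request can be built from Python with the standard library. This sketch reuses the field names (`filename`, `mimeType`, `contentBase64`) and the `/process` endpoint from the cURL example; the helper names `build_payload` and `process_file` are hypothetical, and `process_file` only works while the service is running locally.

```python
import base64
import json
import urllib.request


def build_payload(filename: str, mime_type: str, content: bytes) -> dict:
    """Build the JSON body expected by POST /process."""
    return {
        "filename": filename,
        "mimeType": mime_type,
        "contentBase64": base64.b64encode(content).decode("ascii"),
    }


def process_file(payload: dict, base_url: str = "http://localhost:8000") -> str:
    """Send the payload to the service and return the 'markdown' field."""
    request = urllib.request.Request(
        f"{base_url}/process",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())["markdown"]


payload = build_payload("test.csv", "text/csv", b"name,age\nAlice,30\nBob,25")
print(json.dumps(payload, indent=2))
# With the service running, process_file(payload) would return the Markdown text.
```

The encoded content printed here is the same Base64 string used in the cURL example.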

Automated Tests

To run the unit tests, activate the virtual environment first, then run:

export PYTHONPATH=$PYTHONPATH:.
python tests/test_service.py