nsourlos/OCR_and_RAG
Tests of OCR and RAG with LLMs
OCR and RAG Experiments
Description
This project benchmarks and demonstrates a wide range of Optical Character Recognition (OCR) and Retrieval-Augmented Generation (RAG) techniques for extracting, cleaning, and querying information from PDF documents. It covers both text-based and image-based PDFs, with a special focus on handling mathematical equations and complex layouts. The notebook provides code and commentary for using and comparing popular OCR tools, PDF parsers, and RAG pipelines, including integration with state-of-the-art LLMs and Vision-Language Models.
Table of Contents
- Installation
- Usage
- Supported Tools & Models
- Example Workflow
- Results & Recommendations
- References
- License
Installation
- Clone the repository and navigate to the project directory.
- Install dependencies:
  pip install -r requirements.txt
- Set up API keys
  Create a .env file in your project directory with your API keys:
  OPENAI_API_KEY_DRACO=your_openai_api_key
  MISTRAL_API_KEY=your_mistral_api_key
  GEMINI_API_KEY=your_gemini_api_key
  COHERE_API_KEY=your_cohere_api_key
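To check that the keys are visible to Python before running the notebook, a small loader like the one below may help. It is a minimal, standard-library-only sketch: the helper names (parse_env_text, load_env_file, missing_keys) are hypothetical and not taken from the notebook, which may well load the file differently (for example with python-dotenv). Only the four key names come from this README.

```python
import os
from pathlib import Path

# Key names as listed in this README; everything else here is illustrative.
REQUIRED_KEYS = [
    "OPENAI_API_KEY_DRACO",
    "MISTRAL_API_KEY",
    "GEMINI_API_KEY",
    "COHERE_API_KEY",
]


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Read a .env file and export its keys without overwriting existing ones."""
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    values = parse_env_text(env_path.read_text(encoding="utf-8"))
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values


def missing_keys(required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are unset or empty in the environment."""
    return [key for key in required if not os.environ.get(key)]


if __name__ == "__main__":
    load_env_file(".env")
    print("Missing keys:", missing_keys())
```

Keys for tools you do not plan to run can stay unset; the check only reports them.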
Usage
- Open the notebook
  Launch Jupyter and open ocr_RAG_tests.ipynb.
- Configure file paths
  Place your PDF files in the pdfs directory or update the files_path and pdf_file variables as needed.
- Run the cells
  The notebook is organized into sections for each tool and workflow. You can run all cells or focus on the tools/models you want to benchmark.
- Output
  - Cleaned and formatted Markdown files are saved to your Desktop.
  - JSON files with parsed document data are also generated.
  - Results and recommendations are printed in the notebook.
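The path setup and output steps above can be sketched roughly as follows. This is an illustration under assumptions, not the notebook's actual code: files_path and pdf_file are the variable names mentioned in this README, while find_pdfs, save_outputs, and the output file naming (PDF stem plus .md / .json) are hypothetical.

```python
import json
from pathlib import Path


def find_pdfs(files_path: str = "pdfs") -> list[Path]:
    """List the PDF files in the input directory, sorted by name."""
    # Note: the pattern is case-sensitive on most Linux filesystems.
    return sorted(Path(files_path).glob("*.pdf"))


def save_outputs(pdf_file: Path, markdown: str, parsed: dict, out_dir: Path) -> tuple[Path, Path]:
    """Write the cleaned Markdown and the parsed JSON next to each other."""
    out_dir.mkdir(parents=True, exist_ok=True)
    md_path = out_dir / f"{pdf_file.stem}.md"      # hypothetical naming scheme
    json_path = out_dir / f"{pdf_file.stem}.json"  # hypothetical naming scheme
    md_path.write_text(markdown, encoding="utf-8")
    json_path.write_text(json.dumps(parsed, indent=2, ensure_ascii=False), encoding="utf-8")
    return md_path, json_path


if __name__ == "__main__":
    out_dir = Path.home() / "Desktop"  # the README says outputs go to the Desktop
    for pdf_file in find_pdfs("pdfs"):
        # Placeholder content; in the notebook this would come from an OCR tool or parser.
        save_outputs(pdf_file, f"# {pdf_file.stem}\n", {"source": pdf_file.name}, out_dir)
```

If you prefer not to write to the Desktop, pointing out_dir at a project-local folder keeps the generated files alongside the notebook.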
Supported Tools & Models
The notebook includes code and benchmarks for:
- PDF Text Extraction:
  - OpenAI OCR
  - Marked-pdf
  - docling
  - Pytesseract
  - Mistral OCR
  - surya-ocr
  - alibaba-damo/mgp-str-base
  - LatexOCR
  - zerox
  - Ollama OCR
  - X-PLUG/mPLUG-DocOwl
  - OlmOCR
  - GOT OCR
  - Nougat OCR
  - MegaParse
- RAG & LLMs:
  - OpenAI GPT-4o, GPT-4o-mini
  - Gemini 2.5
  - Cohere Embed v4
  - ColPali (via Byaldi)
  - Qwen2-VL-2B/7B
  - Visual RAG pipelines
Results & Recommendations
- Best overall, especially for equations and complex layouts:
  - OpenAI GPT-4o Vision (via API) and marked-pdf.
- Good for contracts and simple text:
  - docling, mistral, pytesseract.
- Visual RAG:
  - ColPali + Qwen2-VL pipeline is promising for multimodal retrieval and QA.
- Other tools:
  - Many open-source OCR tools struggle with equations and complex formatting.
See the notebook for detailed benchmarks and code for each tool.
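For intuition about the retrieval half of the ColPali + Qwen2-VL pipeline: ColPali ranks pages with a late-interaction (MaxSim) score, where each query token embedding is matched to its most similar page patch embedding and the per-token maxima are summed. The sketch below shows that scoring rule on toy 2-dimensional vectors. It is not the notebook's code: in practice Byaldi computes embeddings and scores internally, and the function names and vectors here are purely illustrative.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, page_vecs):
    """Late-interaction score: best page match per query vector, summed."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs, pages, k=3):
    """Return the top-k (page_id, score) pairs, highest score first."""
    scored = [(maxsim_score(query_vecs, vecs), page_id) for page_id, vecs in pages.items()]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(page_id, score) for score, page_id in scored[:k]]


if __name__ == "__main__":
    # Toy stand-ins for query token embeddings and page patch embeddings.
    query = [[1.0, 0.0], [0.0, 1.0]]
    pages = {
        "page_1": [[1.0, 0.0], [0.5, 0.5]],
        "page_2": [[0.0, 0.2], [0.1, 0.1]],
    }
    for page_id, score in rank_pages(query, pages, k=2):
        print(page_id, round(score, 3))
```

In a visual RAG setup of this kind, the top-ranked page images would then be passed, together with the question, to a vision-language model such as Qwen2-VL for answering.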
References
- OpenAI Cookbook: Parse PDF Docs for RAG
- Marker
- Docling
- MistralAI
- Byaldi (ColPali)
- Qwen2-VL
- Cohere
- Google Gemini
Note:
Some tools require API keys and/or GPU support. See the notebook comments for installation and usage tips for each tool.