🧾 PDF Summarizer with OCR, Language Detection & BART Summarization

This Python project extracts and summarizes content from PDFs, including scanned/image-based ones. It falls back to OCR for pages with no extractable text, detects the language (English or Malay), and summarizes the content with a pretrained NLP model.
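
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the actual code in the script: `page_text`, `MIN_CHARS`, and the two callables are hypothetical names, and the 20-character threshold is an assumption you would tune for your own documents.

```python
from typing import Callable, Optional, Tuple

MIN_CHARS = 20  # assumed threshold below which a page is treated as scanned


def page_text(extract_text: Callable[[], Optional[str]],
              run_ocr: Callable[[], Optional[str]],
              min_chars: int = MIN_CHARS) -> Tuple[str, str]:
    """Return (text, source), where source is 'text' or 'ocr'."""
    # Try the embedded text layer first; some extractors return None for empty pages.
    text = (extract_text() or "").strip()
    if len(text) >= min_chars:
        return text, "text"
    # Too little embedded text: treat the page as scanned and fall back to OCR.
    return (run_ocr() or "").strip(), "ocr"
```

In practice `extract_text` could wrap pdfplumber's `page.extract_text()` and `run_ocr` could wrap `pytesseract.image_to_string(image)`; passing them in as callables means OCR only runs when it is needed.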

📦 Features

📄 Handles both text-based and scanned PDFs
🧠 Summarizes using facebook/bart-large-cnn from Hugging Face
🏷️ Language detection (langdetect) — supports English (en) and Malay (ms)
🔍 OCR via Tesseract for image-based pages
🧹 Intelligent gibberish detection and filtering
📸 Image preprocessing with OpenCV to improve OCR accuracy
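
One way to restrict detection to the two supported languages is sketched below. The function name `supported_language` is hypothetical, and the `id` to `ms` alias is an assumption: a detector may label Malay text as Indonesian, so check what yours returns before relying on this mapping.

```python
from typing import Callable, Optional

# Assumption: the detector may report Malay as Indonesian ("id"),
# so that code is folded into "ms". Remove the alias if this is not wanted.
ALIASES = {"id": "ms"}
SUPPORTED = {"en", "ms"}


def supported_language(text: str, detect: Callable[[str], str]) -> Optional[str]:
    """Return 'en' or 'ms', or None if the text is in another language
    or the detector cannot decide."""
    if not text.strip():
        return None
    try:
        code = detect(text)
    except Exception:
        # Detectors typically raise on text with no usable features (e.g. digits only).
        return None
    code = ALIASES.get(code, code)
    return code if code in SUPPORTED else None
```

With langdetect installed, this would be called as `supported_language(text, detect)` after `from langdetect import detect`.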

📁 Folder Structure

document summarization/
├── test.py                  # Your main script
├── README.md                # This file
├── assets/
│   └── sample.pdf           # Example PDF

⚙️ Requirements

Python 3.9 or higher recommended (the pinned numpy==1.26.4 below does not support Python 3.8)

🔌 Required Libraries

Install everything in one go:

pip install pdfplumber pytesseract Pillow langdetect transformers pdf2image opencv-python torch numpy

Or create a requirements.txt file and run:

pip install -r requirements.txt

requirements.txt contents:

pdfplumber==0.10.2
pytesseract==0.3.10
Pillow==10.2.0
langdetect==1.0.9
transformers==4.39.3
pdf2image==1.17.0
opencv-python==4.9.0.80
torch>=1.13.0
numpy==1.26.4
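
If you want to confirm that your environment matches the pins above, a small check like the one below may help. It is only a sketch using the standard library; `parse_pins` and `check_pins` are hypothetical helper names, and lines that do not use `==` (such as `torch>=1.13.0`) are skipped rather than evaluated.

```python
from importlib import metadata


def parse_pins(lines):
    """Yield (name, version) for each 'name==version' line; skip everything else."""
    for line in lines:
        line = line.strip()
        if "==" in line and not line.startswith("#"):
            name, version = line.split("==", 1)
            yield name.strip(), version.strip()


def check_pins(pins):
    """Return a list of readable mismatches; an empty list means every pin matches."""
    problems = []
    for name, wanted in pins:
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (wanted {wanted})")
            continue
        if found != wanted:
            problems.append(f"{name}: found {found}, wanted {wanted}")
    return problems
```

To use it, open requirements.txt, pass the file object to `parse_pins`, and print whatever `check_pins` returns.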

🛠️ Setup Guide

1. Install Tesseract OCR

Tesseract is used for reading text from scanned images.

Windows:
Download and install from https://github.com/tesseract-ocr/tesseract

After installation, set the path in your code:
```
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
```
macOS:
```
brew install tesseract
```
Linux (Ubuntu):
```
sudo apt install tesseract-ocr
```

2. Install Poppler

Poppler is required by pdf2image to convert PDFs into images.

Windows:
Download from http://blog.alivate.com.au/poppler-windows/
Extract the archive and add its bin folder to your system PATH.
macOS:
```
brew install poppler
```
Linux:
```
sudo apt install poppler-utils
```
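
To check that both external tools are reachable before running the script, something like the following could be used. It is a sketch under the assumption that Tesseract is on PATH as `tesseract` and that Poppler provides `pdftoppm`; `missing_tools` is a hypothetical helper name.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of required executables that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


# Example:
#   missing = missing_tools()
#   if missing:
#       print("Not found on PATH:", ", ".join(missing))
```

On Windows, Tesseract does not have to be on PATH if you set `pytesseract.pytesseract.tesseract_cmd` explicitly, so a missing `tesseract` entry is not necessarily a problem there.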

🚀 How to Use

1. Set the PDF path

Edit the pdf_path in your script:

pdf_path = r"C:\path_to_your_file.pdf"
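
If you would rather not edit the script for every file, the path could be read from the command line instead. This is a hypothetical alternative, not how the script currently works; `parse_pdf_path` is an illustrative name.

```python
import argparse
from pathlib import Path


def parse_pdf_path(argv=None) -> Path:
    """Read the PDF path from the command line and check that the file exists."""
    parser = argparse.ArgumentParser(description="Summarize a PDF page by page.")
    parser.add_argument("pdf", type=Path, help="path to the PDF file")
    args = parser.parse_args(argv)
    if not args.pdf.is_file():
        # parser.error prints a usage message and exits with status 2.
        parser.error(f"file not found: {args.pdf}")
    return args.pdf
```

The script would then start with `pdf_path = parse_pdf_path()` and be run with the PDF path as its only argument.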

2. Run your script

python test.py

3. Output

You’ll see logs for:

OCR processing time
Language detection result
Summary output per page

📝 Output Format

Each page will return:

Page X (Language: en/ms):
<summary>

Unreadable or gibberish pages will be flagged and skipped.
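
The output format and the gibberish check could look roughly like this. The heuristic is an assumption made for illustration (the script's actual filter may differ): a token counts as word-like if it is alphabetic, at least two characters long, and contains a vowel, and a page is flagged when fewer than half of its tokens qualify. All names here are hypothetical.

```python
VOWELS = set("aeiouAEIOU")
PUNCT = ".,;:!?()[]\"'"


def is_wordlike(token: str) -> bool:
    """True for alphabetic tokens of 2+ characters that contain a vowel."""
    core = token.strip(PUNCT)
    return len(core) >= 2 and core.isalpha() and any(c in VOWELS for c in core)


def looks_like_gibberish(text: str, min_word_ratio: float = 0.5) -> bool:
    """Flag empty text, or text where too few tokens look like real words."""
    tokens = text.split()
    if not tokens:
        return True
    wordlike = sum(1 for t in tokens if is_wordlike(t))
    return wordlike / len(tokens) < min_word_ratio


def format_page(page_number: int, lang: str, summary: str) -> str:
    """Render one page in the 'Page X (Language: en/ms):' format."""
    return f"Page {page_number} (Language: {lang}):\n{summary}"
```

Single-letter words such as "a" are counted as non-words by this rule, which is usually harmless for full pages but can misfire on very short text.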

📃 License

This project is open-source and uses the MIT License.

🙋‍♂️ Author

Built with ❤️ by Amier (Ole Kacak)
Feel free to contribute, fork, or suggest improvements.

olekacak/SSM-Summarization