olekacak/SSM-Summarization

A Python project for document summarization using various NLP techniques. The project includes a basic setup with a .gitignore file, necessary dependencies listed in requirements.txt, and a script (text.py) for processing documents. It is designed to help automate and streamline document summarization tasks.

🧾 PDF Summarizer with OCR, Language Detection & BART Summarization

This Python project extracts and summarizes content from PDFs, including scanned/image-based ones. It uses OCR for unreadable pages, detects the language (English or Malay), and summarizes the content using a pretrained NLP model.
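The per-page flow described above can be sketched roughly as follows. This is a minimal sketch, not the project's actual code: the names `needs_ocr` and `extract_page_texts`, the 20-character threshold, and the 300 dpi setting are assumptions chosen for illustration.

```python
def needs_ocr(text, min_chars=20):
    """Return True when a page's embedded text is too short to be useful.

    min_chars is an illustrative threshold, not a value from the project.
    """
    return len((text or "").strip()) < min_chars


def extract_page_texts(pdf_path, ocr_lang="eng"):
    """Yield (page_number, text, used_ocr) for every page of the PDF."""
    # Third-party imports live inside the function so that needs_ocr()
    # can be used on its own without them installed.
    import pdfplumber
    import pytesseract
    from pdf2image import convert_from_path

    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            used_ocr = False
            if needs_ocr(text):
                # Render only this page to an image, then run Tesseract on it.
                images = convert_from_path(
                    pdf_path, dpi=300, first_page=number, last_page=number
                )
                text = pytesseract.image_to_string(images[0], lang=ocr_lang)
                used_ocr = True
            yield number, text, used_ocr
```

Trying the embedded text layer first keeps text-based PDFs fast, since OCR only runs on pages where extraction returns little or nothing.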


📦 Features

  • 📄 Handles both text-based and scanned PDFs
  • 🧠 Summarizes using facebook/bart-large-cnn from Hugging Face
  • 🏷️ Language detection (langdetect): supports English (en) and Malay (ms)
  • 🔍 OCR via Tesseract for image-based pages
  • 🧹 Intelligent gibberish detection and filtering
  • 📸 Image preprocessing with OpenCV to improve OCR accuracy
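As an illustration of the summarization feature, here is a minimal sketch of calling facebook/bart-large-cnn through the Hugging Face `pipeline` API. The chunking helper, the 400-word chunk size, and the `max_length`/`min_length` values are assumptions for illustration, not settings taken from the project. Chunking is included because the model accepts a limited input length, so long pages need to be split or truncated.

```python
def chunk_words(text, max_words=400):
    """Split text into pieces of at most max_words whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def summarize(text, max_words=400):
    """Summarize text chunk by chunk and join the partial summaries."""
    from transformers import pipeline  # imported lazily; model loading is slow

    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    parts = []
    for chunk in chunk_words(text, max_words):
        result = summarizer(
            chunk, max_length=130, min_length=30,
            do_sample=False, truncation=True,
        )
        parts.append(result[0]["summary_text"])
    return " ".join(parts)
```

In a real script you would likely build the pipeline once and reuse it for every page rather than reloading the model on each call.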

📁 Folder Structure

document summarization/
├── text.py                  # Your main script
├── README.md                # This file
├── assets/
│   └── sample.pdf           # Example PDF

⚙️ Requirements

Python 3.9 or higher recommended (the pinned numpy==1.26.4 does not support Python 3.8)

🔌 Required Libraries

Install everything in one go:

pip install pdfplumber pytesseract Pillow langdetect transformers pdf2image opencv-python torch numpy

Or create a requirements.txt file and run:

pip install -r requirements.txt

requirements.txt contents:

pdfplumber==0.10.2
pytesseract==0.3.10
Pillow==10.2.0
langdetect==1.0.9
transformers==4.39.3
pdf2image==1.17.0
opencv-python==4.9.0.80
torch>=1.13.0
numpy==1.26.4

🛠️ Setup Guide

1. Install Tesseract OCR

Tesseract is used for reading text from scanned images.

  • Windows:
    Download and install from https://github.com/tesseract-ocr/tesseract

    After installation, set the path in your code:

    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
  • macOS:

    brew install tesseract
  • Linux (Ubuntu):

    sudo apt install tesseract-ocr
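If you would rather not hard-code the path, the lookup can be automated. The sketch below is one possible approach, not part of the project: `resolve_tesseract_cmd` and `configure_pytesseract` are hypothetical helper names, and the Windows fallback is simply the install path shown above.

```python
import shutil
import sys

# Install location from the Windows step above; adjust if yours differs.
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(platform=sys.platform, which=shutil.which):
    """Return a path to the tesseract binary, or None if none can be guessed."""
    found = which("tesseract")
    if found:
        return found
    if platform.startswith("win"):
        # Not on PATH: fall back to the usual install location (unchecked).
        return WINDOWS_DEFAULT
    return None


def configure_pytesseract():
    """Point pytesseract at the resolved binary, or fail with a clear error."""
    import pytesseract

    cmd = resolve_tesseract_cmd()
    if cmd is None:
        raise RuntimeError(
            "Tesseract not found; install it or set the path manually."
        )
    pytesseract.pytesseract.tesseract_cmd = cmd
```

On macOS and Linux the package managers normally put `tesseract` on PATH, so no explicit path is needed there.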

2. Install Poppler

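Poppler is required by pdf2image to convert PDFs into images.

  • Windows:
    Download a Poppler build for Windows, extract it, and add its bin folder to your PATH

  • macOS:

    brew install poppler
  • Linux (Ubuntu):

    sudo apt install poppler-utils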

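A quick way to confirm that Poppler is visible to Python before converting anything is sketched below. The helper names are hypothetical. `pdftoppm` and `pdfinfo` are the Poppler command-line tools that pdf2image calls, and `poppler_path` is pdf2image's own parameter for pointing at a Poppler `bin` folder that is not on PATH.

```python
import shutil


def poppler_available(which=shutil.which):
    """Return True if Poppler's pdftoppm and pdfinfo tools are on PATH."""
    return all(which(tool) is not None for tool in ("pdftoppm", "pdfinfo"))


def pdf_to_images(pdf_path, dpi=300, poppler_path=None):
    """Render each page of a PDF to a PIL image via pdf2image."""
    from pdf2image import convert_from_path

    if poppler_path is None and not poppler_available():
        raise RuntimeError(
            "Poppler not found on PATH; install it or pass poppler_path."
        )
    return convert_from_path(pdf_path, dpi=dpi, poppler_path=poppler_path)
```

Failing early with a clear message tends to be easier to debug than the error pdf2image raises when it cannot find the binaries.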

🚀 How to Use

1. Set the PDF path

Edit the pdf_path in your script:

pdf_path = r"C:\path_to_your_file.pdf"
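The raw string prefix (r"...") keeps Windows backslashes from being read as escape sequences. If you want the script to fail early on a wrong path, a small check like the following could be added. This is an optional sketch and `validate_pdf_path` is a hypothetical helper, not something the project defines.

```python
from pathlib import Path


def validate_pdf_path(raw_path):
    """Return a Path for raw_path, raising if it is not an existing .pdf file."""
    path = Path(raw_path).expanduser()
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {path.name}")
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    return path
```

You would then call `pdf_path = validate_pdf_path(pdf_path)` before any extraction starts.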

2. Run your script

python text.py

3. Output

You'll see logs for:

  • OCR processing time
  • Language detection result
  • Summary output per page
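For reference, here is a rough sketch of how a timed language-detection log line could be produced with langdetect. It is not the project's code. One labeled assumption: as far as I can tell, langdetect ships an Indonesian (`id`) profile but no separate Malay (`ms`) one, so this sketch maps `id` to `ms`. The names `LANG_ALIASES`, `normalize_lang`, and `detect_language` are hypothetical.

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("summarizer")

# Assumption: Indonesian ("id") results are treated as Malay ("ms").
LANG_ALIASES = {"id": "ms"}
SUPPORTED = {"en", "ms"}


def normalize_lang(code):
    """Map a raw detector code to 'en', 'ms', or None if unsupported."""
    code = LANG_ALIASES.get(code, code)
    return code if code in SUPPORTED else None


def detect_language(text):
    """Detect the language of text with langdetect and log the result."""
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # make results repeatable between runs
    start = time.perf_counter()
    try:
        raw = detect(text)
    except LangDetectException:
        raw = None  # raised when the text has no usable features
    elapsed = time.perf_counter() - start
    lang = normalize_lang(raw)
    log.info("Language detection: raw=%s, used=%s (%.3fs)", raw, lang, elapsed)
    return lang
```

The same `time.perf_counter()` pattern can wrap the OCR call to report OCR processing time per page.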

📝 Output Format

Each page will return:

Page X (Language: en/ms):
<summary>

Unreadable or gibberish pages will be flagged and skipped.
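To make the format and the skip rule concrete, here is a rough sketch. The gibberish check is a simple stand-in heuristic (share of word-like tokens), not necessarily the one the script uses, and the thresholds and function names are assumptions.

```python
import re


def looks_like_gibberish(text, min_words=5, min_wordlike_ratio=0.6):
    """Heuristically flag OCR output that is mostly noise."""
    tokens = text.split()
    if len(tokens) < min_words:
        return True
    # Word-like: two or more letters, optionally followed by one punctuation mark.
    wordlike = sum(
        1 for t in tokens if re.fullmatch(r"[A-Za-z]{2,}[.,;:!?]?", t)
    )
    return wordlike / len(tokens) < min_wordlike_ratio


def format_page(page_number, lang, summary):
    """Render one page in the 'Page X (Language: ..)' layout shown above."""
    return f"Page {page_number} (Language: {lang}):\n{summary}"
```

A heuristic this simple will misjudge pages that are mostly numbers or tables, so treat the thresholds as starting points to tune against your own PDFs.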


📃 License

This project is open-source and uses the MIT License.


🙋‍♂️ Author

Built with ❤️ by Amier (Ole Kacak)
Feel free to contribute, fork, or suggest improvements.