OtenMoten/pdf-alchemist
It's designed for transmuting PDFs into HTML. Harness the power of OCR, image processing, and web technologies to unlock the secrets within your PDF documents.
PDF Alchemist: A PDF to HTML Transmuter
Welcome to the realm of PDF Alchemist, where the secrets of PDFs are transmuted into HTML.
Project Overview
This Python application, lovingly named PDF Alchemist, is an open-source toolkit that combines the arcane arts of PDF parsing, OCR, image processing, and HTML generation. It's designed for those who seek to unlock the knowledge sealed within the enigmatic tomes we call PDFs.
This project brings together a fellowship of powerful components:
- PDFParser: The Document Detective, powered by PyMuPDF
- OCREngine: The Text Archaeologist, empowered by Tesseract
- ImageProcessor: The Digital Alchemist, enhanced by Pillow
- HTMLGenerator: The Web Illusionist, crafted with Dominate
- ProgressTracker: The Expedition Chronicler, utilizing Python's built-in logging module
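As an illustration of the last component, here is a minimal sketch of how a progress tracker could be built on Python's `logging` module. The class body, the `page_done` method, and the `pdf_alchemist` logger name are assumptions made for this example; the project's actual ProgressTracker may look quite different.

```python
import logging


class ProgressTracker:
    """Hypothetical tracker that logs per-page progress via the logging module."""

    def __init__(self, total_pages: int, logger_name: str = "pdf_alchemist") -> None:
        if total_pages <= 0:
            raise ValueError("total_pages must be positive")
        self.total_pages = total_pages
        self.completed = 0
        self.logger = logging.getLogger(logger_name)

    def page_done(self, page_number: int) -> float:
        """Record one finished page and return the completed fraction (0.0 to 1.0)."""
        # Clamp so that extra calls never report more than 100%.
        self.completed = min(self.completed + 1, self.total_pages)
        fraction = self.completed / self.total_pages
        self.logger.info(
            "Page %d processed (%d/%d, %.0f%%)",
            page_number, self.completed, self.total_pages, fraction * 100,
        )
        return fraction


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
    tracker = ProgressTracker(total_pages=3)
    for page in range(1, 4):
        tracker.page_done(page)
```

Routing progress through a named logger, rather than `print`, lets the caller decide where the chronicle ends up (console, file, or nowhere) without touching the tracker itself.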
Capabilities
- Unearth text and images from PDF archives
- Decipher text using advanced OCR incantations
- Transmute images into optimized, base64-encoded artifacts
- Weave extracted elements into responsive HTML tapestries
- Chronicle the expedition with detailed logs and progress tracking
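To make the base64 and HTML steps above concrete, the following sketch embeds raw image bytes as a data URI and weaves text and images into a simple responsive page. It deliberately uses only the standard library instead of Pillow and Dominate, and the function names `to_data_uri` and `render_page` are hypothetical, not the project's API.

```python
import base64
from html import escape
from typing import List


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI usable in an <img> src."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def render_page(title: str, paragraphs: List[str], image_uris: List[str]) -> str:
    """Build a standalone HTML page from extracted text and embedded images."""
    body = [f"<h1>{escape(title)}</h1>"]
    # Escape extracted text so characters like < and & cannot break the markup.
    body += [f"<p>{escape(text)}</p>" for text in paragraphs]
    # max-width keeps images from overflowing narrow screens.
    body += [
        f'<img src="{uri}" alt="" style="max-width: 100%; height: auto;">'
        for uri in image_uris
    ]
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n<head>\n<meta charset="utf-8">\n'
        '<meta name="viewport" content="width=device-width, initial-scale=1">\n'
        f"<title>{escape(title)}</title>\n</head>\n<body>\n"
        + "\n".join(body)
        + "\n</body>\n</html>\n"
    )
```

Embedding images as data URIs keeps each HTML page self-contained, at the cost of roughly a third more bytes than the original image file.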
Installation
To establish your own PDF Alchemist's laboratory:
- Clone this arcane repository:
    git clone https://github.com/team-bitfuture/pdf-alchemist.git
- Enter the sacred circle:
    cd pdf-alchemist
- Summon the required artifacts:
    pip install -r requirements.txt
- Ensure you possess the Tesseract grimoire. If not, acquire it here.
Usage
To initiate the PDF transmutation ritual:
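import os

# `main` is the project's entry-point function; import it from the module that defines it.

if __name__ == "__main__":
    pdf_path = "input.pdf"   # PDF to transmute
    output_dir = "output"    # directory that receives the generated HTML
    os.makedirs(output_dir, exist_ok=True)
    main(pdf_path, output_dir)

This will transmute your PDF into a series of HTML pages, complete with extracted text, images, and layout information.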
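If you would rather not hard-code the paths, a small command-line wrapper is one option. The sketch below is an assumption, not part of the project: `build_parser`, `run`, and the `-o/--output-dir` flag are hypothetical names, and the project's `main(pdf_path, output_dir)` function is passed in as the `transmute` argument.

```python
import argparse
import os
from typing import Callable, Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Describe the two inputs the ritual needs: a PDF and an output directory."""
    parser = argparse.ArgumentParser(description="Convert a PDF into HTML pages.")
    parser.add_argument("pdf_path", help="path to the input PDF")
    parser.add_argument(
        "-o", "--output-dir", default="output",
        help="directory for the generated HTML (default: output)",
    )
    return parser


def run(transmute: Callable[[str, str], None],
        argv: Optional[Sequence[str]] = None) -> str:
    """Parse arguments, create the output directory, and call the converter."""
    args = build_parser().parse_args(argv)
    os.makedirs(args.output_dir, exist_ok=True)
    transmute(args.pdf_path, args.output_dir)
    return args.output_dir
```

With this in place, calling `run(main)` from a script would read the paths from the command line, for example `python script.py input.pdf -o output`, where `script.py` stands for whatever file holds the wrapper.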
Running Tests
To ensure your PDF Alchemist is operating at peak efficiency:
pytest tests/
This will execute a series of arcane trials, testing each component of the PDF Alchemist.
Contributing
We welcome fellow arcane researchers to join our quest. If you wish to contribute:
- Fork the repository
- Create your feature branch (git checkout -b feature/MagicSpell)
- Commit your changes (git commit -m 'Add MagicSpell')
- Push to the branch (git push origin feature/MagicSpell)
- Open a Pull Request
License
This project is licensed under the GPL-3.0 License - see the LICENSE.md file for details.
Authors
- Kevin Ossenbrück - Archmage of PDF Transformation - ossenbrück.de
See also the list of contributors who participated in this arcane project.
Connect with Team BitFuture
- Website: team-bitfuture.de
- Email: info@team-bitfuture.de
May your PDFs always yield their secrets, and your HTML render with perfection.