120 results for “topic:pdf-to-text”
Get your documents ready for gen AI
Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.
Knowledge Agents and Management in the Cloud
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.
Table structure recognition dataset of the paper: Complicated Table Structure Recognition
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
A Python package for converting PDFs to Markdown while extracting images and tables and generating text descriptions for the extracted tables and images using several LLM clients, among other features. Markdrop is available on PyPI.
A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each is tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.
Pure TypeScript, cross-platform module for extracting text, images, and tabular data from PDFs. Run 🤗 directly in your browser or in Node.js
OCR library to extract text & tables from PDF files and images. Convert any image or PDF to CSV / TXT / JSON / Searchable PDF.
PDF text data extraction web app with OCR for scanned documents
Sample code for the Datalogics C++, Java, and .NET interfaces of the Adobe PDF Library
C# and VB.NET samples for Docotic.Pdf library
CLI for extracting text from PDF files (and possibly tables)
[ACL 2025 🔥] A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding
A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, PDF, with GROBID, LangChain, listen as podcast. Customize your own pipelines.
Standalone .NET converter library that requires neither Adobe Acrobat components nor Microsoft Office Interop assemblies, for converting PDF, DOCX, XLSX, HTML, image, CSV, RTF, and TXT files in the .NET Framework
The code base of the front-end of nocodefunctions.com
Batch-convert PDF to text and extract data from PDFs in Python
Graphlit Platform
Simple PHP PDF to Text class
Lightweight, memory-safe client library for extracting plain text from PDF files.
Simple PDF-to-text conversion in Python using PDFtk and PyPDF2
API for obtaining daily tide table data, using web scraping with PHP.
Bangla PDF to text converter that works on Windows, macOS, and Linux without any extra downloads or configurations.
Advanced PDF/Document Translator with interactive comparison. Built on IBM Docling.
Aspose.PDF for JavaScript via C++
ClassroomLM allows educational institutions to create specialized AI assistants for their classrooms.
Build a RAG preprocessing pipeline