"topic:pdf-to-text" — Search

120 results for “topic:pdf-to-text”

Get your documents ready for gen AI

ai, convert, document-parser, document-parsing, documents, docx, html, markdown, pdf, pdf-converter, pdf-to-json, pdf-to-text, pptx, tables, xlsx

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

HTML · 14.2k · 1.2k · Updated just now

data-pipelines, deep-learning, document-image-analysis, document-image-processing, document-parser, document-parsing, docx, donut, information-retrieval, langchain, llm, machine-learning, ml, natural-language-processing, nlp, ocr, pdf, pdf-to-json, pdf-to-text, preprocessing

run-llama/llama_cloud_services

Knowledge Agents and Management in the Cloud

TypeScript · 4.2k · 472 · Updated 9 hours ago

document, document-parser, document-parsing, docx-to-markdown, parsing, pdf, pdf-document-processor, pdf-to-excel, pdf-to-json, pdf-to-markdown, pdf-to-text, ppt-to-json, ppt-to-markdown, pptx, structured-data, tables

enoch3712/ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

Python · 1.5k · 144 · Updated 9 hours ago

ai, document-image-analysis, document-intelligence, document-parsing, document-processing, langchain, llm, machine-learning, nlp, ocr, openai, pdf, pdf-to-text, python

yfedoseev/pdf_oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

Rust · 410 · 39 · Updated 7 hours ago

data-extraction, document-processing, fast, image-extraction, llm, markdown, pdf, pdf-editor, pdf-generation, pdf-library, pdf-parser, pdf-to-markdown, pdf-to-text, pyo3, python, rag, rust, text-extraction

Academic-Hammer/SciTSR

Table structure recognition dataset of the paper: Complicated Table Structure Recognition

Python · 380 · 59 · Updated 9 hours ago

pdf-to-text, pdf2txt, table-structure-recognition

pd3f/pd3f

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

HTML · 330 · 40 · Updated 9 hours ago

extract-text, language-model, machine-learning, ocr, parsr, pd3f, pdf, pdf-to-text, pipeline, python, text-extraction

shoryasethia/markdrop

A Python package for converting PDFs to Markdown while extracting images and tables and generating text descriptions of the extracted tables and images using several LLM clients, among other features. Markdrop is available on PyPI.

Python · 195 · 16 · Updated 1 hour ago

agents, docling, image-to-text, llm, markdrop, marker, markitdown, open-source, pdf-to-markdown, pdf-to-text, pypi-package, table-to-text

GiftMungmeeprued/document-parsers-list

A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.

177 · 3 · Updated 9 hours ago

data-pipeline, document-image-processing, document-parser, document-parsing, langchain, ocr, pdf, pdf-to-text, preprocessing

mehmet-kozan/pdf-parse

Pure TypeScript, cross-platform module for extracting text, images, and tabular data from PDFs. Run 🤗 directly in your browser or in Node.js

TypeScript · 156 · 15 · Updated 2 hours ago

pdf, pdf-parse, pdf-parser, pdf-screenshot, pdf-table, pdf-thumbnail, pdf-to-image, pdf-to-text, pdf-tools, pdf-utils, pdf-viewer, pdf2image, pdf2json, pdf2pic, pdf2text, pdfjs, pdfjs-dist, turkey

NanoNets/ocr-python

OCR library to extract text & tables from PDF files and images. Convert any image or PDF to CSV / TXT / JSON / Searchable PDF.

Jupyter Notebook · 126 · 17 · Updated 9 hours ago

extract-table, extract-text-from-image, extract-text-from-pdf, image-to-text, image-to-text-converter, ocr, pdf, pdf-to-csv, pdf-to-json, pdf-to-text, pytesseract-ocr, python, searchable-pdf, table-extract, tesseract, textract

nainiayoub/pdf-text-data-extractor

PDF text data extraction web app with OCR for scanned documents

Python · 96 · 49 · Updated 9 hours ago

ocr, ocr-python, ocr-text-reader, pdf, pdf-to-text, python, streamlit, streamlit-webapp, text-extraction

datalogics/adobe-pdf-library-samples

Sample code for the Datalogics C++, Java, and .NET interfaces of the Adobe PDF Library

84 · 58 · Updated 9 hours ago

ocr, ocr-pdf, pdf, pdf-compression, pdf-conversion, pdf-converter, pdf-document, pdf-generation, pdf-lib, pdf-manipulation, pdf-merger, pdf-parser, pdf-render, pdf-split, pdf-to-image, pdf-to-office, pdf-to-text, pdf-tools, pdfa

BitMiracle/Docotic.Pdf.Samples

C# and VB.NET samples for Docotic.Pdf library

Visual Basic .NET · 80 · 39 · Updated 9 hours ago

docotic-pdf, extract-images, extract-text, html-to-pdf, images-to-pdf, net-core, pdf-annotation, pdf-compression, pdf-flattener, pdf-forms, pdf-generation, pdf-library, pdf-manipulation, pdf-merge, pdf-parser, pdf-signature, pdf-to-image, pdf-to-text, print-pdf, sign-pdf

galkahana/pdf-text-extraction

CLI for extracting text from PDF files (and possibly tables)

C++ · 79 · 22 · Updated 9 hours ago

pdf, pdf-to-text

mbzuai-oryx/KITAB-Bench

[ACL 2025 🔥] A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding

Python · 65 · 4 · Updated 9 hours ago

arabic, benchmark, layout-detection, ocr, pdf-to-text, table-detection, vlms, vqa

papercast-dev/papercast

A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, PDF, with GROBID, LangChain, listen as podcast. Customize your own pipelines.

Python · 54 · 3 · Updated 9 hours ago

arxiv, dag, document-parser, document-parsing, grobid, nlp, pdf-converter, pdf-document-processor, pdf-to-text, pipeline, podcast, python, semantic-scholar, tts

iditectweb/converter

Standalone .NET converter library that converts PDF, DOCX, XLSX, HTML, image, CSV, RTF, and TXT files in the .NET Framework without requiring Adobe Acrobat components or Microsoft Office Interop assemblies

C# · 42 · 12 · Updated 9 hours ago