"topic:document-parsing" — Search

79 results for “topic:document-parsing”

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

Python · 72.0k · 9.9k · Updated just now

ai4science, chineseocr, document-parsing, document-translation, kie, ocr, paddleocr-vl, pdf-extractor-rag, pdf-parser, pdf2markdown, pp-ocr, pp-structure, rag

docling-project/docling

Get your documents ready for gen AI

Python · 55.5k · 3.7k · Updated 1 hour ago

ai, convert, document-parser, document-parsing, documents, docx, html, markdown, pdf, pdf-converter, pdf-to-json, pdf-to-text, pptx, tables, xlsx

Unstructured-IO/unstructured

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking and embedding.

HTML · 14.2k · 1.2k · Updated just now

data-pipelines, deep-learning, document-image-analysis, document-image-processing, document-parser, document-parsing, docx, donut, information-retrieval, langchain, llm, machine-learning, ml, natural-language-processing, nlp, ocr, pdf, pdf-to-json, pdf-to-text, preprocessing

run-llama/llama_cloud_services

Knowledge Agents and Management in the Cloud

TypeScript · 4.2k · 472 · Updated 1 day ago

document, document-parser, document-parsing, docx-to-markdown, parsing, pdf, pdf-document-processor, pdf-to-excel, pdf-to-json, pdf-to-markdown, pdf-to-text, ppt-to-json, ppt-to-markdown, pptx, structured-data, tables

opendataloader-project/opendataloader-pdf

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

Java · 1.9k · 133 · Updated 1 hour ago

accessibility, ai, bounding-box, document-parsing, eaa, html, json, langchain, markdown, ocr, ocr-recognition, pdf, pdf-accessibility, pdf-converter, pdf-extraction, pdf-parser, pdf-ua, rag, tables, tagged-pdf

enoch3712/ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

Python · 1.5k · 144 · Updated 1 day ago

ai, document-image-analysis, document-intelligence, document-parsing, document-processing, langchain, llm, machine-learning, nlp, ocr, openai, pdf, pdf-to-text, python

NanoNets/docstrange

Extract and convert data from any document, image, PDF, Word doc, PPT, or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

Python · 1.4k · 124 · Updated 1 day ago

ai, document-parser, document-parsing, image-to-markdown, llm, markdown, ocr, pdf-parser, pdf-to-json, pdf-to-markdown, structured-data, structured-data-capture, tables

Topdu/OpenOCR

OpenOCR: An Open-Source Toolkit for General-OCR Research and Applications, integrates a unified training and evaluation benchmark, commercial-grade OCR and Document Parsing systems, and faithful reproductions of the core implementations from a wide range of academic papers.

Python · 1.3k · 111 · Updated 1 hour ago

chineseocr, document-analysis, document-parsing, document-processing, ocr, ocr-pytorch, scene-text-detection, scene-text-recognition

edenai/edenai-apis

Eden AI: simplify the use and deployment of AI technologies by providing a unique API that connects to the best possible AI engines

Python · 470 · 70 · Updated 1 day ago

aggregator, ai, ai-as-a-service, api, computer-vision, document-parsing, image-processing, machine-translation, natural-language-processing, nlp, ocr, optical-character-recognition, pre-trained-model, python, speech-recognition, speech-to-text, text-to-speech, video-recognition

harishdeivanayagam/rowfill

Open-source spreadsheet platform for deep research and document processing

TypeScript · 368 · 20 · Updated 1 month ago

document, document-extraction, document-parsing, image-ocr, langgraph, llama, llm, nextjs, ocr, ocr-javascript, ollama, openai, pdf, pdfs, unstructured, unstructured-data, vision, vision-api

GiftMungmeeprued/document-parsers-list

A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each is tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.

177 · 3 · Updated 2 days ago

data-pipeline, document-image-processing, document-parser, document-parsing, langchain, ocr, pdf, pdf-to-text, preprocessing

AdemBoukhris457/Documents-Parsing-Lab

Jupyter notebooks testing different OCR models for document parsing (Dolphin, MonkeyOCR, Marker, Nanonets, ...)

Jupyter Notebook · 78 · 9 · Updated 3 weeks ago

ai, document-parsing, genai, ocr, parsing-data

CycloneBoy/pdf_table

A Unified Toolkit for Deep Learning-Based Table Extraction

Python · 59 · 9 · Updated 3 weeks ago

ai, document-parsing, layout-analysis, ocr, pdf, pdf-to-html, table, table-recognition

arthurpanhku/Arthor-Agent

AI agent for security teams: automate assessment of documents, questionnaires & reports. Multi-format parsing, RAG knowledge base, OpenAI/Ollama. Risks, compliance gaps, remediations. MIT.

Python · 56 · 7 · Updated just now

ai-agent, compliance, document-parsing, fastapi, llm, mcp, ollama, openai, openclaw, rag, security, security-assessment

papercast-dev/papercast

A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, or PDF with GROBID and LangChain, and listen to them as podcasts. Customize your own pipelines.

Python · 54 · 3 · Updated 2 days ago

arxiv, dag, document-parser, document-parsing, grobid, nlp, pdf-converter, pdf-document-processor, pdf-to-text, pipeline, podcast, python, semantic-scholar, tts

Unstructured-IO/community (Archived)

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

29 · 7 · Updated 6 months ago

community, data-pipeline, deep-learning, document-ai, document-parsing, machine-learning, nlp-parsing, ocr-python, open-source, preprocessing-data

docling-project/docling4j

Docling4j brings the functionalities of Docling in document understanding to Java® projects

Java · 26 · 5 · Updated 2 weeks ago

ai, docling, document-parser, document-parsing, document-understanding, documents, java, pdf, pdf-converter, pdf-to-json

Hyland/DocumentFilters

Document Filters is an SDK for applications like content indexing, e-discovery, data migration, and feeding data into AI/ML models by extracting data from unstructured sources. It gives the ability to perform deep inspection, data extraction, output manipulation, and conversion for virtually any type of document, in any programming language.

C++ · 25 · 2 · Updated 2 weeks ago

ai, convert, document-parser, document-parsing, docx, html, llm, machine-learning, markdown, ml, pdf, pdf-converter, pptx, preprocessing, sdk, tables, xlsx

acenji/ats

Applicant Tracking System (ATS): A powerful platform leveraging generative AI and soft-match algorithms to analyze resumes against job descriptions. Built with React and Node.js, it streamlines hiring insights. Future plans include expanding to investor pitches and other structured documents.

JavaScript · 15 · 6 · Updated 1 week ago

applicant-tracking-system, ats, document-parsing, generative-ai, investor-pitches, job-matching, keyword-extraction, nlp-machine-learning, nodejs, reactjs, resume-analysis, sorting-algorithms

aimagelab/mugat

Official implementation of our ECCVW paper "μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context"

Python · 11 · 0 · Updated 6 months ago

document-parsing, ocr, transformer

J-sephB-lt-n/pdf-bank-statement-parser

Tool for converting First National Bank (FNB) bank statement PDFs into useful structured data

Python · 6 · 3 · Updated 3 months ago

bank, banking, document-parsing, financial-analysis, first-national-bank, fnb, pdf-parser, pdf-parsing, python

Kathan-max/RAG-Enhanced-Chatbot-with-LoRA-Fine-Tuning

Transform your documents into intelligent conversations. This open-source RAG chatbot combines semantic search with fine-tuned language models (LLaMA, Qwen2.5VL-3B) to deliver accurate, context-aware responses from your own knowledge base. Join our community!

Python · 6 · 2 · Updated 1 week ago

conversational-ai, data-pipeline, document-parsing, fastapi, huggingface-transformers, large-language-models, llama, lora-fine-tuning, machine-learning, nlp, ocr-recognition, python, qwen2-5-vl, rag-chatbot, reactjs, retrieval-augmented-generation, semantic-embeddings, semantic-search, semantic-segmentation, tailwindcss

rithulkamesh/docproc

Document Intelligence Platform — Extract, refine, and query documents with vision LLMs and config-driven RAG.

Python · 5 · 1 · Updated 2 days ago

content-extraction, data-extraction, document-analysis, document-parsing, equation-detection, layout-analysis, machine-learning, mathematical-symbols, ocr, pdf-processing, pdf-text-extraction, python, region-detection, text-classification, text-extraction

baughmann/tikara

The metadata and text content extractor for almost every file type.

Python · 5 · 0 · Updated 1 month ago

apache-tika, content-extraction, document-parsing, document-processing, docx, image-to-text, java, language-detection, llm, metadata, metadata-extraction, ml, natural-language-processing, ocr, pdf-to-text, retrieval-augmented-generation, text-extraction, text-mining

Anmol-Baranwal/doc-parsing

Python scripts to parse and structure invoice data from PDFs using OpenAI, Anthropic and Invofox APIs

Python · 4 · 0 · Updated 2 months ago

api, claude, claude-sonnet, document-parsing, gpt-4o, parsing, pdf-document-processor

syw2014/langparse

LangParse is a universal document parsing and text chunking engine for LLM or Agent applications — Documents In, Knowledge Out.

Python · 4 · 0 · Updated 2 months ago

chunking, document-parsing, document-processing, docx, excel, excel-to-markdown, llm, markdown, pdf-converter, pdf-to-markdown, pdf-to-text, rag

renswickd/document-parser-collection

This is a collection of various document parsers and hands-on examples for constructing structured data for your RAG applications.

Python · 4 · 0 · Updated 2 months ago

amazon-textract, azure-document-intelligence, document-parsing, llama-parse, mistral-ocr, unstructured-io

Bharathyalagi/OCR-Document-parser

Smart OCR application built with Tesseract and Streamlit that extracts structured data from inputs

Python · 4 · 0 · Updated 2 months ago

ai, automation, data-extraction, deep-learning, document-parser, document-parsing, image-processing, invoice-parser, machine-learning, ocr, pdf, python, tesseract, tesseract-ocr

Rushi-Balapure/pdf_2_json_extractor

A high-performance Python library for extracting structured content from PDF documents with layout-aware text extraction. pdf_to_json preserves document structure, including headings (H1-H6) and body text, and outputs clean JSON.

Python · 3 · 1 · Updated 1 month ago

cli-tool, cpu-only, cross-platform, data-extraction, document-parsing, document-processing, json, layout-analysis, nlp, offline, pdf, pdf-extraction, pdf-parser, pdf-processing, pdf-to-json, python, python-library, structure-extraction, text-extraction

MegrezAI/LeapRAG

LeapRAG is an open-source platform that integrates advanced RAG technology with Google’s A2A protocol, enabling users to build context-aware, data-driven agents. These agents are automatically A2A-compliant and can be discovered and used by any compatible client without extra development.

Python · 3 · 0 · Updated 2 months ago

a2a, a2a-protocol, agent-to-agent, agents, chatgpt, deepseek, document-parser, document-parsing, llm, nlp, ollama, openai, pdf, pdf-to-text, rag, retrieval-augmented-generation
