"topic:content-extraction" — Search

176 results for “topic:content-extraction”

🔥 Official Firecrawl MCP Server - Adds powerful web scraping and search to Cursor, Claude and any other LLM clients.

batch-processing claude content-extraction data-collection firecrawl firecrawl-ai javascript-rendering llm-tools mcp mcp-server model-context-protocol search-api web-crawler web-scraping

vakra-dev/reader

Open-source, production-grade web scraping engine built for LLMs. Scrape and crawl the entire web, clean markdown, ready for your agents.

TypeScript 47132 Updated 8 hours ago

ai ai-agents ai-crawler ai-scraper anti-bot cloudflare-bypass content-extraction crawler data-extraction headless-browser html-to-markdown llm markdown nodejs proxy-rotation scraper typescript web-crawler web-data-extraction web-scraping

graphlit/graphlit-mcp-server

Model Context Protocol (MCP) Server for Graphlit Platform

TypeScript 37252 Updated 1 day ago

claude content-extraction content-ingestion data-collection llm-tools mcp-server model-context-protocol search-api unstructured-data web-crawler web-scraping

currentslab/extractnet

A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package

HTML 29926 Updated 1 day ago

author-extraction content-extraction date-extraction machine-learning news news-articles news-extraction news-extractor python text-cleaning text-mining web-scraping webscraping

pinkpixel-dev/web-scout-mcp

A powerful MCP server extension providing web search and content extraction capabilities. Integrates DuckDuckGo search functionality and URL content extraction into your MCP environment, enabling AI assistants to search the web and extract webpage content programmatically.

JavaScript 12111 Updated 2 days ago

ai-assistant ai-tools cheerio content-extraction crawler duckduckgo duckduckgo-search google-search mcp mcp-server web-content web-crawler web-scraper web-scraping web-search web-search-agent

mvasilkov/readability2

Readability2 converts HTML to plain text.

TypeScript 10915 Updated 3 months ago

content-extraction html javascript plaintext readability

tuffstuff9/nextjs-pdf-parser

Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.

TypeScript 6711 Updated 1 month ago

content-extraction filepond nextjs nextjs-pdf nextjs-pdf-parse nextjs-pdf-parser nextjs-pdf-parsing pdf-parse pdf-parser pdf-parsing pdf-upload pdf2json react-pdf react-pdf-parser

gregors/boilerpipe-ruby

Pure Ruby implementation of the Boilerpipe content extraction algorithm, tuned for online articles

Ruby 435 Updated 1 year ago

boilerpipe boilerpipe-algorithm content-extraction news webscraping

oiwn/dom-content-extraction

DOM Based Content Extraction via Text Density

Rust 402 Updated 3 days ago

content-extraction dom-based rust scraping web-crawling

nikitautiu/learnhtml

Web content extraction using machine learning

HTML 349 Updated 6 months ago

content-extraction deep-learning html

blessonism/openclaw-skills

A collection of OpenClaw Agent Skills — search, analysis, content extraction, and more.

Python 331 Updated 1 hour ago

ai-agent content-extraction github-explorer multi-source-search openclaw search skills

spences10/mcp-jinaai-reader (Archived)

🔍 Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader

JavaScript 316 Updated 3 days ago

content-extraction documentation-tool jinaai llm-tools mcp model-context-protocol text-extraction web-content web-scraping

k-kolomeitsev/agent-browser-workspace

Local browser toolkit for AI agents: deep research and browser-use automation with local Chrome (CDP) + Playwright. Flexible, extensible scripts for web navigation, extraction, and workflow automation, built for reproducible research and agent-driven browsing.

JavaScript 241 Updated 11 hours ago

agentic-ai ai-agents browser-automation cdp chrome chrome-devtools-protocol cli content-extraction data-extraction deep-research local-first markdown nodejs pdf playwright reproducible-research research-assistant web-automation web-research web-scraping

gdamdam/sumo

Tool that extracts the text from web article URLs and provides word frequencies, entity recognition, automatic summaries, and more

Python 205 Updated 1 year ago

automatic-summarization content-extraction entity-recognition nlp nltk semantic-analysis sentence-extraction

pdfix/pdfix_sdk_example_cpp

Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...

C++ 194 Updated 3 days ago

accessibility autotag content-extraction conversion converter digital-signature extract-data html metadata pdf pdf-converter pdf-data-extraction pdf-forms pdf-manipulation pdf2html pdfua sign tagging watermark wcag

timoteostewart/benson

Benson turns a list of URLs into mp3s of the contents of each web page - take control over your reading backlog!

Python 161 Updated 1 month ago

boilerplate-removal content-extraction productivity web-scraping

amirthfultehrani/Youtube-Transcript-Copier

A userscript that adds a button to YouTube video pages for copying the transcript with or without timestamps.

JavaScript 162 Updated 4 days ago

accessibility automation browser-extension clipboard content-extraction data-extraction greasemonkey helper javascript productivity tampermonkey text-extraction tool transcript userscript utilities video violentmonkey web youtube

bencmc/youtube_video_summarizer

This repository houses a Python application for extracting YouTube video transcripts and summarizing their content.

Python 157 Updated 3 months ago

content-extraction gpt-35-turbo langchain-python natural natural-language-processing openai python streamlit text-processing text-summarization transcript-analysis video-processing youtube-api

jocmp/mercury-parser

Extract meaningful content from the chaos of a web page

JavaScript 156 Updated 2 days ago

article-parser content-extraction html-parser javascript mercury-parser nodejs readability reader-mode rss web-scraping

manooll/webfetch-mcp

Live Web Access for Your Local AI — Tunable Search & Clean Content Extraction

JavaScript 145 Updated 6 days ago

ai-data ai-tools api-free content-extraction data-fetching docker llm-integration lmstudio local-llm mcp-server model-context-protocol nodejs open-source privacy scraper search-engine searxng self-hosted url-fetch web-search

LandWhale2/TD-Spider

Simple web crawler in Go, based on text density

Go 130 Updated 3 weeks ago

content-extraction data-mining dom golang keyword-search opensource scraping text-density web-crawler

developer0hye/anytomd-rs

Pure Rust document-to-Markdown converter for LLM workflows (DOCX, PPTX, XLSX, HTML, CSV, JSON, XML, images).

Rust 130 Updated 1 day ago

anytomd content-extraction converter csv docx html image-extraction json llm markdown pptx rust text-processing xlsx xml

peremenov/seize (Archived)

Seize is a lightweight Node or browser web-page content extractor inspired by Arc90 Readability and Safari Reader

HTML 121 Updated 1 year ago

content-extraction dom extract readability reader text-score

kamjin3086/Crawell

📸 Crawell: a free browser extension for one-click image and article extraction, Markdown conversion, and bulk download, with 100% local processing.

TypeScript 122 Updated 3 weeks ago

browser-extension chrome-extension content-extraction edge-extension firefox-addon image-downloader markdown privacy-first react tailwindcss typescript web-scraping

helioLJ/youtube-transcript-copier

Chrome extension to copy YouTube transcripts with AI-friendly features

JavaScript 111 Updated 1 week ago

accessibility-tools browser-extension chatgpt-tools chrome-extension clipboard-manager content-extraction i18n javascript llm-tools productivity-tools transcript-copier youtube-api youtube-extension youtube-transcript

vakharwalad23/mark-minion

The Ultimate Web Content Extraction & Conversion Tool for AI/LLM Applications. Convert almost any web content into clean Markdown with intelligent AI processing.

TypeScript 111 Updated 3 months ago

ai-powered cloudflare-worker content-extraction document-processing markdown-conversion puppeteer tweets-extraction typescript web-scraping

zeoagency/mobile-first-indexing-tool

Mobile First Indexing Tool

Python 103 Updated 5 months ago

aws-lambda aws-layers content-extraction lighthouse mfi seo seo-tool

ctokx/url-to-markdown

Convert webpages to clean Markdown for LLM and RAG workflows. Browser-based UI + Node.js CLI with selector drilling, metadata extraction, and batch processing.

JavaScript 70 Updated 2 days ago

cli content-extraction developer-tools html-parser html-to-markdown javascript llm markdown nodejs rag turndown web-scraping

wszqkzqk/qt-web-extractor

Web content extraction engine backed by Qt WebEngine.

Python 70 Updated 23 hours ago

chromium content-extraction headless-browser open-webui pdf-extraction pyside6 qtwebengine web-scraping

leroyanders/acrticle-scrapper

This Python-based repository hosts a sophisticated service designed for scraping web articles and converting them into Markdown format. The core functionality of this service includes extracting the main content of articles, such as headlines, key paragraphs, and associated images, and then seamlessly transforming this content into well-structured…

Python 51 Updated 11 months ago

article-parser content-creation-tools content-extraction data-archiving html-to-markdown-converter image-downloading markdown-conversion metadata-extraction python web-scraping
