23 results for “topic:corpus-builder”
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Crawler for linguistic corpora
Praaline is an open-source system to manage, annotate, visualise and analyse spoken language corpora
Collector and speech cutter for LibriVox audiobooks
No description provided.
Ebook Corpus - A parser and extractor for electronic books
TTS plugin for dictpress
Article title, authors, date and body extraction dataset.
Katya, or The Liberated Corpus: a text corpus that lets you request and scrape any web resource.
Automated text preprocessing pipeline for large corpora. Features customizable filters for diacritics, stop words, punctuation, and regex.
A corpus builder for evaluation of plagiarism detection tools
The canonical resources to build the backend for a corpus/repository management framework for Crow, the Corpus and Repository of Writing
Extract text from Vikidia/Wikipedia articles [fr]
EU AI Act RAG — End-to-end retrieval-augmented generation pipeline: SPARQL corpus builder, Cloudflare Workers AI backend, and Streamlit playground for querying Regulation (EU) 2024/1689
Crawl Ask.fm Q&A lists and create a corpus for ML.
The user interface for the Corpus & Repository of Writing, built in Angular
A text corpus management system for the German linguistics department of the University of Basel.
Polish-language chatbot based on the Transformer architecture, trained on movie subtitles collected by web scraping.
App and scripts that work with the corpus builder CorpusCook to update a corpus with corrections of wrong predictions
CLI tool to redact and publish spam/phishing emails as a public research corpus.
Builds Wikipedia corpora in I5 (a TEI-based format)
Corpus Development Software for Machine Translation
A Scrapy-based web scraper for collecting Kurdish text data from websites. The tool recursively crawls specified domains, extracts article content using Trafilatura, and filters results by language using Facebook's fastText language identification model.