107 results for “topic:corpus-tools”
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
A very simple news crawler with a funny name
Bitextor generates translation memories from multilingual websites
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean HTML bodies. Perfect for Deep Research Agents, RAG applications, and training data generation.
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Python library for handling audio datasets.
OpusFilter - Parallel corpus processing toolkit
An advanced, extensible web front-end for the Manatee-open corpus search engine
Utilities for Processing the Switchboard Dialogue Act Corpus
An open source reimplementation of Benny Brodda's BETA in Python
SpeCT - Speech Corpus Toolkit for Praat. Documentation: https://lennes.github.io/spect/
A parser for annotated MuseScore 3 files.
Python library for extracting quantitative, reproducible metrics of multi-level alignment between speakers in naturalistic language corpora.
A set of workflows for corpus building through OCR, post-correction and normalisation
Multi-Language Dataset Cleaner/Creator for Mozilla's DeepSpeech Framework
Tools for filtering and cleaning parallel and monolingual corpora for machine translation and other natural language processing tasks.
Reading the data from OPIEC - an Open Information Extraction corpus
Utilities for Processing the Meeting Recorder Dialogue Act Corpus
Rezonator: Dynamics of human engagement
Praaline is an open-source system to manage, annotate, visualise and analyse spoken language corpora
MFTE (Multi Feature Tagger of English) Python is a Python version of Le Foll's Perl-based MFTE. It is extended with semantic tags from Biber (2006) and Biber et al. (1999), along with other specific tags.
Yet another search platform for linguistic corpora.
Collector and speech cutter for LibriVox audiobooks
Searching in-memory corpus with Corpus Query Language (CQL)
An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and Billboard charts
Software for multi-level annotation of linguistic corpora
An Interactive Tool for Annotating Discourse Structure and Text Improvement