44 results for “topic:machine-learning-dataset”
Chinese and English NER datasets, English-Chinese machine translation datasets, and Chinese word segmentation datasets.
A malware classifier dataset built from the header field values of Portable Executable files
jazznet dataset of piano patterns for music audio machine learning research
A large, free audio sample database (10M words pronounced), a test bed for voice activity detection algorithms and for single-syllable word recognition
2D Geometric shapes generator
We currently maintain 488 data sets as a service to the machine learning community. You may view all data sets through our searchable interface. For a general overview of the Repository, please visit our About page. For information about citing data sets in publications, please read our citation policy. If you wish to donate a data set, please consult our donation policy. For any other questions, feel free to contact the Repository librarians.
SPREAD is a large-scale synthetic dataset for image- and point-cloud- based tasks in forestry.
A duplicate-free variant of the CIFAR test set.
Extract Japanese characters database.
UCLA Dining Hall Menus Dataset
Corpus of Coq code related to MathComp including several machine-readable representations
Classification dataset for distinguishing cat and dog images
Marktplaats.nl (Dutch Classifieds) Listing Scraper
This repo is the dataset for the paper "A New Dataset and Methodology for Malicious URL Classification"
OpenFrameworks program that generates training data from font-faces installed on your Mac.
Corpus of manually classified comments for machine learning (offensive comment filtering)
Given a product name, this Python program downloads all the images, including those on paginated result pages.
Batch download images from iNaturalist observations. GUI app for creating ML datasets, biodiversity research, and citizen science projects. No coding required - standalone executables for Windows, macOS & Linux.
This repository serves as a collection point for market data from Bybit. Aimed at facilitating machine learning model creation and finetuning.
Rupiah Banknotes Dataset is a collection of Indonesian currency images (Rp1,000, Rp2,000, Rp5,000, Rp10,000, Rp20,000, Rp50,000, and Rp100,000) designed for Machine Learning (ML) and Computer Vision (CV) tasks.
CSV datasets for ML/AI models from captured network traffic during ZAP scanning with web applications like Django, Flask, React, Vue and Spring - Anti-Nex training datasets
Simple task for mixed image-graph data
📚 Authoritative P2 microcontroller documentation: architecture, PASM2/Spin2 languages, smart pins, and examples. Optimized for AI training, developer education, and technical reference
Public dataset of Agent Manifest declarations registered through the Agent Manifest registry.
Generate captchas for ML tasks in parallel.
Tools for a deep learning in physics research course
A dataset toolbox for preparing and analyzing conversational datasets, including CSV splitting, CSV → Parquet conversion, dataset statistics, Parquet cleaning and sorting, HuggingFace–style metadata generation, and batched chain insertion into PostgreSQL — with Rich progress, multiprocessing, and 32 GB-RAM-friendly batching.
A full Discord dataset pipeline with an end-to-end flow from raw Discord data to a final Parquet dataset with full statistics. Every stage is independent, idempotent, and CLI-driven for ease of automation.
sentence polarity dataset v1.0 (includes sentence polarity dataset README v1.0): 5331 positive and 5331 negative processed sentences / snippets. Introduced in Pang/Lee ACL 2005. Released July 2005.
A dataset toolbox for preparing and analyzing conversational datasets, including CSV splitting, CSV → Parquet conversion, dataset statistics, dialogue-turn filtering, turn-based filtering, token and turn analysis, Parquet cleaning and sorting, HuggingFace–style metadata generation, and batched chain insertion into PostgreSQL.