89 results for “topic:data-centric-ai”
Cleanlab's open-source library is the standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Refine high-quality datasets and visual AI models
A Doctor for your data
The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact.
Interactively explore unstructured datasets from your dataframe.
Automatically find issues in image datasets and practice data-centric computer vision.
A curated, but incomplete, list of data-centric AI resources.
Resources for Data Centric AI
Curated list of open source tooling for data-centric AI on unstructured data.
Lab assignments for Introduction to Data-Centric AI, MIT IAP 2024 👩🏽‍💻
Official PyTorch implementation of the paper "Dataset Distillation with Neural Characteristic Function: A Minmax Perspective" (NCFM) in CVPR 2025 (Full Score, Highlight).
🏭 Mega Scale Multimodal DataPipeline for SOTA Foundation Models
[ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
[NeurIPS 2021] WRENCH: Weak supeRvision bENCHmark
[NeurIPS 2023] This is the code for the paper `Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias`.
pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
Notebooks demonstrating example applications of the cleanlab library
Papers about training data quality management for ML models.
Introduction to Data-Centric AI, MIT IAP 2024 🤖
OpenDataVal: a Unified Benchmark for Data Valuation in Python (NeurIPS 2023)
Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning
nbsynthetic is a simple and robust tabular synthetic data generation library for small and medium-sized datasets
[ECCV 2022] Official Implementation for Unsupervised Selective Labeling for More Effective Semi-Supervised Learning
A list of data-efficient and data-centric LLM (Large Language Model) papers. Our Survey Paper: Towards Efficient LLM Post Training: A Data-centric Perspective
A Data Centric NER annotation tool for your Named Entity Recognition projects
A curated list of awesome open source tools and commercial products to catalog, version, and manage data 🚀
Trending projects & awesome papers about data-centric LLM studies.
This data-centric AI repository implements a robust deep learning method (LFBNet) for fully automated tumor segmentation in whole-body 18F-FDG PET/CT images.
Intriguing Properties of Data Attribution on Diffusion Models (ICLR 2024)
[NeurIPS 2024] 🧼🔎 A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates and label errors.