pemagrg1/Nepali-Datasets

A list of Nepali Dataset sources. (Hoping that it will encourage everyone to research more on Nepali language)

Nepali-Datasets: Comprehensive NLP Resource Collection

A verified and curated collection of Nepali datasets for NLP research, development, and benchmarking. This resource aggregates 100+ datasets across 20+ categories to support research on Nepali, a low-resource language.

NOTE: I hope this encourages more research on the Nepali language. You are welcome to add any sources that are not listed here. 📌


Benchmarks & Standards

Comprehensive evaluation frameworks and shared tasks for Nepali NLP.

  • NLUE (Nepali Language Understanding Evaluation) ✓ - 9 classification + 3 structural prediction tasks (sentiment, hate speech, toxicity, QA, NER). arXiv: 2411.19244
  • Nep-gLUE Benchmark - Official Nepali GLUE-style benchmark (7 NLU tasks). Limited direct access; see NLUE for comprehensive alternatives.
  • FLORES-101 Evaluation Benchmark - Machine translation evaluation across 101 languages including Nepali. GitHub: facebookresearch/flores
  • IndicBench - Benchmark for 11 Indic languages including Nepali (13 tasks). New 2025 addition.
  • SemEval 2026 Task 9 - Polarization type classification with Nepali data. Hosted on Codabench. New 2026.

Nepali Text Corpus

Large-scale text collections for language modeling, pre-training, and linguistic analysis.

Ultra-Large Corpora (>1GB)

  • Nepali-Text-Corpus (IRIISNEPAL) ✓ - 6.4M articles, 10.1 GB. Largest Nepali corpus, drawn from 99 news websites. State-of-the-art pre-training resource. HF: IRIISNEPAL/nepali-text-corpus | arXiv: 2411.15734

  • OSCAR Corpus Nepali ✓ - 3.8 GB, 100M+ sentences from Common Crawl. Kaggle: hsebarp/oscar-corpus-nepali

  • CC100-Nepali ✓ - Common Crawl 2019 subset, 200GB uncompressed. Foundation data for multilingual models. MetaText: cc100-nepali

  • Lamsal (2020) Corpus - 12M+ words, professionally compiled. Note: the original DOI returns a 404; consider the IRIISNEPAL corpus as the primary substitute.
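
For the corpora above that are hosted on Hugging Face, a minimal sketch of sampling a few records without downloading the whole dataset might look like the following. It assumes the `datasets` library is installed and that the repository ID shown in the list resolves on the Hub. The split name `train` and the helper names `take`, `field_names`, and `sample_corpus` are illustrative assumptions. Column names are not assumed; the code only reports whichever fields the sampled records carry.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def take(records: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Return the first n records from any iterable without reading the rest."""
    return list(islice(records, n))


def field_names(records: List[Dict[str, Any]]) -> List[str]:
    """Collect the field names seen across the records, in first-seen order."""
    seen: List[str] = []
    for record in records:
        for key in record:
            if key not in seen:
                seen.append(key)
    return seen


def sample_corpus(repo_id: str = "IRIISNEPAL/nepali-text-corpus",
                  split: str = "train",
                  n: int = 3) -> List[Dict[str, Any]]:
    """Stream the first n records of a Hub dataset (needs network access).

    The repo ID is copied from the list above; the split name is an assumption.
    """
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    stream = load_dataset(repo_id, split=split, streaming=True)
    return take(stream, n)
```

Calling `field_names(sample_corpus())` would then list the available columns, which is a reasonable first check before committing to a multi-gigabyte download.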

Large Curated Collections (100MB-1GB)

Specialized Text Collections


Classification Datasets

News classification, topic modeling, and text categorization.


Named Entity Recognition (NER) Datasets

Annotated datasets for entity recognition (person, organization, location, etc.).


Sentiment Analysis & Hate Speech Datasets

Social media, news, and online text with sentiment/toxicity annotations.

Sentiment Analysis

Hate Speech & Offensive Language


Question Answering (QA) Datasets

Extractive, generative, and domain-specific QA datasets.


Summarization Datasets

Abstractive & extractive summarization, headline generation.


Speech Datasets (ASR & TTS)

Audio data for automatic speech recognition and text-to-speech synthesis.

Large-Scale ASR

TTS & Synthesized Speech

Speech Analysis & Emotion

Multilingual Benchmarks

  • Google FLEURS ✓ - Multilingual benchmark including Nepali (101 languages). HF: google/fleurs

Image & Video Datasets (Computer Vision)

Datasets for image/video captioning, object detection, and multimodal learning.

Sign Language & Gesture

Image Captioning & Multimodal

Face Recognition & Emotion

Domain-Specific Objects


OCR & Handwriting Datasets

Character recognition, document digitization, and license plate detection.

Handwriting & Character Recognition

License Plate & Vehicle Recognition

Academic OCR Research

  • Nepali Handwritten Character Recognition (Zenodo) ✓ - Research dataset with detailed annotations. Zenodo: 7472398

  • Improving Tesseract-OCR for Nepali (Zenodo) ✓ - 5,000+ images with preprocessing techniques (DOI: 10.5281/zenodo.4361896). Zenodo: 4361896


Translation Datasets

Parallel corpora for machine translation and low-resource language pairs.

Large-Scale Parallel Corpora

  • English-Nepali Parallel Corpus (Kathmandu University) ✓ - 1,800,000 sentence pairs; a gold standard for EN-NE MT and the largest parallel resource listed here. ELRA: W0077

  • Kathmandu University English-Nepali Corpus ✓ - 1.8M sentence pairs (direct source confirmation). AI4Bharat: indicnlp_catalog

Medium-Scale Corpora

Multilingual & Specialized

Historical & Shared Tasks

  • WMT19 Parallel Corpus ✓ - Shared task corpus with filtering challenge. statmt.org/wmt19

  • English - Nepali translated strings - UI/software localization strings. Note: the original link returns a 503, and the TDIL-DC alternative is not a direct download; use the ELRA entry above.


Word Embeddings & Pre-trained Models

Pre-computed word vectors and language models with training datasets.

Word Embeddings

Large Language Models & Transformers


Lexicons, Linguistics & Resources

Linguistic resources, dictionaries, and instruction-tuned datasets.

Dictionaries & Word Lists

Morphology & Syntax

Instruction Tuning & Multilingual

  • Bactrian-X (Instruction Tuning) ✓ - Nepali included in multilingual instruction-tuning dataset (50+ languages). HF: MBZUAI/Bactrian-X

  • Aya Dataset (Instruction Tuning) ✓ - Nepali included in community-driven instruction dataset (101 languages). HF: cohere/aya_dataset


Code-Mixed & Multilingual NLP Datasets

Datasets for code-mixing, cross-lingual learning, and low-resource adaptation.

  • Code-Mixed Nepali-English Abuse Detection ✓ - 5,000 Nepali-English code-mixed comments from social media. arXiv: 2504.21026. New 2025.

  • Nepali-English Code-Switched LID, POS, NER, Sentiment ✓ - Complete NLP pipeline for code-mixed data. GitHub: sagorbrur/codeswitch

  • CLE Parallel Corpus (AI4Bharat) ✓ - English-Nepali-Urdu parallel data. Multilingual. GitHub: AI4Bharat/indicnlp_catalog
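
When triaging text for the code-mixed datasets above, a rough first pass is to tag each token by script. The sketch below is only a heuristic under stated assumptions: it detects script (Devanagari versus Latin), not language, so romanized Nepali written in Latin letters would be tagged `latin`. It is not the method used by any of the listed datasets or tools, and the function names are illustrative.

```python
import re
from typing import List, Tuple

# Devanagari occupies the Unicode block U+0900 to U+097F.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")
LATIN = re.compile(r"[A-Za-z]")


def tag_token(token: str) -> str:
    """Label a token by the script(s) of the characters it contains."""
    has_devanagari = bool(DEVANAGARI.search(token))
    has_latin = bool(LATIN.search(token))
    if has_devanagari and has_latin:
        return "mixed"
    if has_devanagari:
        return "devanagari"
    if has_latin:
        return "latin"
    return "other"  # digits, punctuation, emoji, other scripts


def tag_sentence(sentence: str) -> List[Tuple[str, str]]:
    """Split on whitespace and tag every token."""
    return [(token, tag_token(token)) for token in sentence.split()]


def is_code_mixed(sentence: str) -> bool:
    """True if the sentence contains both Devanagari and Latin material."""
    tags = {tag for _, tag in tag_sentence(sentence)}
    return "mixed" in tags or {"devanagari", "latin"} <= tags
```

A script-level filter like this is cheap enough to run over a whole crawl before handing candidate sentences to a proper language-identification model.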


Specialized Collections & Aggregators

One-stop resources for finding related Nepali datasets.


Open Data & Government Resources

Official government datasets and open data portals.

  • Open Data Nepal ✓ - Official open data portal with 500+ government datasets (health, education, infrastructure). opendatanepal.com

  • Census Nepal ✓ - Official census data from Central Bureau of Statistics (demographic, geographic, economic). censusnepal.cbs.gov.np/results

  • Local Government of Nepal - Municipal & district government data (federal structure). Note: Original link insufficient; recommend using Open Data Nepal instead.


Tools & NLP Frameworks

Complete NLP toolkits and utilities for Nepali processing.


Research Papers & Benchmarks

Peer-reviewed publications on Nepali NLP and related work.

Recent & High-Impact (2024-2026)

  • NepaliGPT: A Generative Language Model for the Nepali Language ✓ - Recent LLM research. arXiv: 2506.16399

  • NLUE (Nepali Language Understanding Evaluation) ✓ - 9 NLU tasks with comprehensive benchmark. arXiv: 2411.19244

  • IRIISNEPAL RoBERTa: State-of-the-art Nepali LM ✓ - 27.5 GB training corpus from 99 news sites. arXiv: 2411.15734

  • Code-Mixed Nepali-English Abuse Detection ✓ - 5k annotated code-mixed dataset. arXiv: 2504.21026

  • Nepali Transformers@NLU of Devanagari Script Languages 2025 ✓ - Transformer architectures for Devanagari. ACL: 2025.chipsal-1.36

Sentiment Analysis & Classification

  • Aspect Based Sentiment Analysis of Nepali Text Using SVM and Naive Bayes ✓ - Comparative ML approach. ResearchGate

  • An Analysis of Classification Algorithms for Nepali News ✓ - Benchmark of various classifiers. ResearchGate

  • Nepali Text Document Classification Using Deep Neural Network ✓ - Deep learning approaches. NEPJOL

  • Application of Nepali Large Language Models to Improve Sentiment ✓ - LLM applications. ACM. New 2024.

NLP Tasks & Applications

  • A Machine Learning Approach to Anaphora Resolution in Nepali Language ✓ - Pronoun resolution task. IEEE

  • Nepali Image Captioning ✓ - Vision-language multimodal task. IEEE: 8947436

  • Named-Entity Based Sentiment Analysis of Nepali News Media Texts ✓ - NER + sentiment joint modeling. ACL Anthology

  • Topic Modeling for Nepali Political News ✓ - Topic analysis in news domain. IEEE: 11004776. New.

  • NepKanun: A RAG-Based Nepali Legal Assistant ✓ - RAG systems for legal domain. OpenReview. New 2025.

  • Exploring NLP Challenges for Nepali ✓ - Overview of remaining challenges. Preprints: 202409.1229. New 2024.

Linguistic & Historical

  • Natural language processing for Nepali text: a review ✓ - Comprehensive NLP review. Springer

  • A Descriptive Grammar of Nepali and an Analyzed Corpus ✓ - Linguistic grammar reference. Google Books

  • Nepali Spell Checker 1.1 and the Thesaurus ✓ - Early spell checking research. Wayback: NEP05.pdf

  • Nepali Spell Checker ✓ - Earlier spell checking work. Wayback: NEP04.pdf

Research Aggregators


Ethical Considerations

  • Sentiment/Hate Speech Data: Contains potentially offensive language; bias mitigation recommended for model training
  • Social Media Data (Tweets, Instagram): May contain personal information; handle it in compliance with GDPR and other applicable privacy rules
  • Copyright: Wikipedia, news articles sourced responsibly; attribution recommended
  • Multilingual Data: Code-mixed datasets reflect real-world language use; social biases may be present
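
As one illustration of the privacy point above, the sketch below masks a few obvious identifiers (URLs, email addresses, and @-mentions) in social media text before it is shared or used for training. The patterns and placeholder tokens are assumptions chosen for illustration. They are deliberately simple, will miss names, phone numbers, and other identifiers, and are not a substitute for a proper privacy review.

```python
import re

# Simple illustrative patterns; real-world PII removal needs far more care.
URL = re.compile(r"https?://\S+")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
MENTION = re.compile(r"(?<!\w)@\w+")


def scrub(text: str) -> str:
    """Replace URLs, email addresses, and @-mentions with placeholder tokens."""
    text = URL.sub("<URL>", text)
    # Emails are masked before mentions so the "@domain" part is not
    # mistaken for a user handle.
    text = EMAIL.sub("<EMAIL>", text)
    text = MENTION.sub("<USER>", text)
    return text
```

For example, `scrub("@ram_01 ले https://example.com/x हेर्नुहोस्")` leaves the Nepali words untouched while replacing the handle and the link with placeholders.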

How to Contribute

  1. Verify Link: Test that the dataset is publicly accessible
  2. Document Metadata: Include name, size, domain, language(s), and annotation scheme
  3. Format Entry: Follow category structure with title, description, link
  4. Submit PR: To pemagrg1/Nepali-Datasets
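
The steps above can be sketched as a small pre-submission check. Everything here is hypothetical tooling, not part of the repository: the `Entry` fields mirror the metadata named in step 2, the output line layout is an assumption modeled on the bullets in this list, and `is_reachable` is a best-effort check using only the Python standard library (some servers reject `HEAD` requests, so a `False` result still deserves a manual look).

```python
import urllib.request
from dataclasses import dataclass
from typing import List

REQUIRED = ("name", "size", "domain", "languages", "annotation", "description", "link")


@dataclass
class Entry:
    name: str
    size: str
    domain: str
    languages: List[str]
    annotation: str
    description: str
    link: str


def missing_fields(entry: Entry) -> List[str]:
    """Step 2: return the names of metadata fields that are empty."""
    return [field for field in REQUIRED if not getattr(entry, field)]


def is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Step 1: best-effort check that the link answers with a non-error status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):  # URLError/HTTPError are OSError subclasses
        return False


def format_entry(entry: Entry) -> str:
    """Step 3: render one bullet line; the layout is an assumed convention."""
    problems = missing_fields(entry)
    if problems:
        raise ValueError(f"missing metadata: {', '.join(problems)}")
    languages = ", ".join(entry.languages)
    return (f"  • {entry.name} - {entry.description} "
            f"({entry.size}; {entry.domain}; {languages}; {entry.annotation}). {entry.link}")
```

Running `missing_fields` and `is_reachable` before opening a pull request catches incomplete or dead entries early, which keeps review focused on whether the dataset belongs in the list.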

Additional Resources

Contributors

Created February 21, 2022
Updated March 12, 2026