RedGhoul/awesome-arxiv

A curated, researcher-focused guide to arXiv tools, workflows, datasets, and infrastructure for discovering, reading, analyzing, and building on top of scientific papers.

Awesome arXiv

A toolkit for working with scientific preprints: discover new research, read papers in depth, track trends, and build tools on top of arXiv, the open-access archive of scholarly preprints.

⚡ Be a good citizen: arXiv serves millions of researchers daily. When building automated tools, use official interfaces (API / OAI‑PMH / RSS / bulk access) and respect their usage guidelines.

Official arXiv Access & Interfaces

The official interfaces for interacting with arXiv at scale. Whether you're building crawlers, mirrors, analytics platforms, or research tooling, start here so that your tools stay within arXiv's rate limits and usage terms.

  • arXiv API — RESTful interface for programmatic search and metadata retrieval using Atom XML; perfect for lightweight querying, prototyping, and real-time paper lookups.
  • arXiv API User Manual — comprehensive guide covering paging strategies, efficient batching, advanced query syntax, and best practices for production use.
  • OAI‑PMH (metadata harvesting) — industry-standard protocol for maintaining complete, continuously synchronized metadata mirrors; the gold standard for archival and analytics platforms.
  • RSS / Atom feeds — lightweight, real-time streams perfect for alerting systems, newsletters, and monitoring specific subject categories without heavy API overhead.
  • Bulk data access overview — comprehensive directory of all bulk access methods, including S3 buckets, third‑party mirrors, and snapshot archives for large-scale research.
  • Bulk PDF + source via Amazon S3 (requester pays) — complete corpus access: millions of PDFs and LaTeX sources available via S3 for large‑scale NLP, computer vision, and meta-research projects.
  • Robots / crawling guidance — critical rules and best practices for automated access; required reading to avoid IP bans and ensure sustainable scraping.
  • HTML papers on arXiv — native, semantically structured HTML rendering of papers; increasingly valuable for accessibility, automated extraction, and mobile-friendly reading.
  • arXiv Labs — experimental features and integrations showcased directly on arXiv abstract pages; see cutting-edge tools before they become mainstream.
  • arXivLabs Showcase — curated examples of community-built tools that have been officially integrated into arXiv's interface; inspiration for your own projects.
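
As a rough illustration of the arXiv API entry above, the sketch below builds a paged query URL and pulls IDs and titles out of an Atom response. The endpoint and the `search_query`, `start`, and `max_results` parameters follow the API user manual as far as I know; the helper names and the sample feed are hypothetical, and no network request is made here.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
# Endpoint as given in the arXiv API user manual; verify it before relying on it.
API_URL = "http://export.arxiv.org/api/query"


def build_query_url(search_query, start=0, max_results=10):
    """Return the URL for one page of arXiv API results."""
    params = {"search_query": search_query, "start": start, "max_results": max_results}
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_feed(atom_xml):
    """Extract (id, title) pairs from an Atom feed given as a string."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.findall(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id", default="").strip()
        # Titles may be wrapped across lines, so collapse the whitespace.
        title = " ".join(entry.findtext(ATOM + "title", default="").split())
        entries.append((entry_id, title))
    return entries


# Hand-written stand-in for an API response (not real arXiv data).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>A hypothetical
      example title</title>
  </entry>
</feed>"""

print(build_query_url("cat:cs.LG AND all:transformer", max_results=5))
print(parse_feed(SAMPLE_FEED))
```

To fetch real results, pass the URL to `urllib.request.urlopen` and feed the decoded body to `parse_feed`, pausing between requests as the usage guidelines ask.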

Search & Discovery

Transform how you navigate the research landscape. These tools go beyond simple keyword matching to help you discover hidden connections, understand paper relationships, and map the intellectual terrain of your field.

  • arXivLens — AI-powered arXiv explorer that generates key findings, structured summaries, and cross-paper insights; perfect for rapid triage, trend spotting, and getting up to speed on unfamiliar topics.
  • alphaXiv — collaborative reading platform with open, line-by-line discussions and Q&A directly embedded on papers; learn from the community's collective understanding.
  • ArxivXplorer — semantic paper exploration using embedding-based similarity; discover papers by conceptual similarity rather than keyword matching.
  • ar5iv — beautifully rendered HTML5 versions of arXiv papers optimized for fast scanning, mobile reading, and accessibility; often faster than PDFs for quick skimming.
  • Connected Papers — interactive similarity graphs that visualize paper relationships; identify foundational work, follow-up research, and discover papers you might have missed.
  • Litmaps — citation network explorer that maps how ideas flow through the literature over time; track research evolution and find critical papers in citation chains.
  • Paperscape — stunning visual maps of arXiv's category structure and subfields; explore research landscapes through interactive cartography.
  • ResearchRabbit — intelligent recommendation engine that learns from your interests to suggest relevant authors, topics, and paper collections; your personal research assistant.
  • Semantic Scholar — massive-scale scholarly search engine with deep citation analysis, impact metrics, and AI-powered paper understanding; one of the most comprehensive research databases.
  • arXiv Sanity Preserver — legendary interface for sorting, filtering, and discovering trending papers; particularly powerful for identifying what the community is excited about.
  • dblp — the definitive bibliographic database for computer science publications; authoritative source for publication records, author disambiguation, and venue information.

Notifications & Recommenders

Never miss what matters. With hundreds of papers published daily, staying current is overwhelming. These tools cut through the noise to deliver personalized, relevant research directly to your inbox.

  • AlphaSignal — curated daily alerts highlighting trending papers, breakthrough models, and significant developments in ML/AI; stay ahead of the curve without information overload.
  • Hugging Face Papers — community-powered curation of the most impactful ML papers with discussions, code links, and model cards; the pulse of the ML community.
  • Scholar Inbox — intelligent, personalized research digests that aggregate papers from arXiv, bioRxiv, medRxiv, and more; one inbox for all your preprint needs.
  • arXivDigest — AI-powered email summaries that learn your preferences and recommend papers tailored to your research interests; like having a research assistant scan arXiv for you.
  • arXiv e-mail alerts ("my arXiv") — official, reliable subject‑based notifications directly from arXiv; the foundation for any paper tracking workflow.

Reading, Annotation & Browser Enhancers

Read smarter, not harder. These tools transform paper reading from a passive activity into an active, organized, and deeply understood research practice.

  • Explainpaper — highlight any confusing passage and get instant AI-powered explanations in plain language; break down dense technical jargon and complex concepts on the fly.
  • Elicit — AI research assistant that extracts key information, creates structured literature review tables, and answers questions across multiple papers; revolutionize your literature review process.
  • Sioyek — lightning-fast, keyboard-driven PDF reader built specifically for technical documents; navigate papers with vim-like efficiency and never lose your place.
  • PaperMemory — intelligent personal paper database that automatically enriches metadata, tracks your reading progress, and helps you remember what you've learned; your second brain for research.
  • Zotero arXiv workflows — seamless integration for managing arXiv preprints alongside published papers with proper versioning, DOI linking, and citation tracking; professional-grade reference management.

Bibliography & Citation Utilities

Master your references. Proper citation management saves hours and ensures reproducibility. These tools handle the tedious work so you can focus on research.

  • Zotero — powerful, open‑source reference manager with one-click browser capture, cloud sync, and seamless integration with Word, LaTeX, and Google Docs; the researcher's Swiss Army knife.
  • Better BibTeX — essential Zotero extension that generates stable citation keys, enables reproducible BibTeX exports, and maintains consistency across documents; critical for LaTeX workflows.
  • JabRef — dedicated BibTeX editor with advanced search, duplicate detection, and bibliography quality checks; perfect for LaTeX purists who want full control.
  • arXiv to BibTeX — instant BibTeX generation from arXiv IDs; no more manual entry or formatting errors when citing preprints.
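
For readers who want to see what an arXiv BibTeX entry involves, here is a minimal sketch that formats a `@misc` entry with the `eprint`, `archivePrefix`, and `primaryClass` fields commonly used for preprints. The function name, the citation-key scheme, and the sample metadata are all hypothetical; the tools listed above may choose keys and fields differently.

```python
import re


def make_bibtex(arxiv_id, title, authors, year, primary_class):
    """Format a @misc BibTeX entry for an arXiv preprint."""
    # Hypothetical key scheme: first author's surname + year + first title word.
    surname = re.sub(r"[^a-z0-9]", "", authors[0].split()[-1].lower())
    first_word = re.sub(r"[^a-z0-9]", "", title.split()[0].lower())
    key = f"{surname}{year}{first_word}"
    fields = [
        ("title", title),
        ("author", " and ".join(authors)),
        ("year", str(year)),
        ("eprint", arxiv_id),
        ("archivePrefix", "arXiv"),
        ("primaryClass", primary_class),
    ]
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@misc{{{key},\n{body}\n}}"


# Placeholder metadata, not a real paper.
print(make_bibtex("0000.00000", "Hypothetical Preprint Title",
                  ["Ada Example", "Bo Sample"], 2024, "cs.LG"))
```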

Parsing, Conversion & Extraction

Turn papers into data. Extract structured information, convert formats, and unlock the full potential of research papers for NLP, search engines, summarization systems, and large‑scale analysis.

  • Docling — state-of-the-art PDF parser that converts documents into clean, structured formats (JSON, Markdown, HTML) with exceptional accuracy for tables, figures, and equations.
  • GROBID — industry-standard scholarly PDF parser that extracts metadata, citations, and full text into TEI/XML format; powers many major research platforms and libraries.
  • Science Parse — fast, production-ready PDF parsing pipeline from AllenAI that outputs structured JSON with sections, citations, and metadata; optimized for speed and reliability.
  • LaTeXML — comprehensive LaTeX to XML/HTML/MathML converter that preserves mathematical notation and document structure; essential for processing arXiv source files.
  • pdf2htmlEX — sophisticated PDF to HTML converter that maintains precise layout, fonts, and positioning; perfect for creating web-friendly versions of papers.
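
Once a parser such as GROBID has produced TEI/XML, downstream code mostly walks that XML. The sketch below reads a title and the cited-work titles from a TEI string using the standard TEI namespace. The sample document is hand-written and far simpler than real parser output, and the element paths are assumptions about a typical layout.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written, simplified TEI; real parser output carries much more detail.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Hypothetical Paper</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <back>
      <listBibl>
        <biblStruct><analytic><title>First cited work</title></analytic></biblStruct>
        <biblStruct><analytic><title>Second cited work</title></analytic></biblStruct>
      </listBibl>
    </back>
  </text>
</TEI>"""


def summarize_tei(tei_xml):
    """Return the document title and the titles of its bibliography entries."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    references = [
        bibl.findtext(".//tei:title", default="", namespaces=TEI_NS)
        for bibl in root.findall(".//tei:listBibl/tei:biblStruct", TEI_NS)
    ]
    return {"title": title, "references": references}


print(summarize_tei(SAMPLE_TEI))
```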

SDKs, CLIs & Developer Tooling

Build powerful research tools. Programmatic access to arXiv enables automation, custom pipelines, and sophisticated research infrastructure. These libraries and tools make integration seamless.

  • arxiv.py — elegant Python wrapper for the arXiv API with intuitive search, metadata retrieval, and paper download capabilities; the go-to library for Python-based arXiv tools.
  • arxiv-dl — blazing-fast command-line tool for bulk downloading papers with parallel processing, progress tracking, and flexible filtering options; perfect for building local paper collections.
  • ArXivScraper — flexible Python scraper for harvesting papers by category, date range, or custom queries; ideal for building datasets and monitoring specific research areas.

Metadata & Knowledge Graphs

Map the research universe. These knowledge graphs and metadata services link papers, authors, institutions, and ideas, which enables citation analysis, author disambiguation, and research analytics.

  • OpenAlex — open scholarly knowledge graph covering hundreds of millions of works plus authors, institutions, and concepts, with a free API and data snapshots.
  • Semantic Scholar API — comprehensive API providing citation‑aware scholarly metadata, paper embeddings, and research insights; powers many academic search and recommendation systems.
  • Crossref — authoritative source for DOI metadata, publication linking, and cross-publisher citation data; the backbone of scholarly communication infrastructure.
  • OpenCitations — open, accessible citation graphs that reveal how ideas flow through the research literature; essential for understanding research impact and knowledge networks.
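
Joining arXiv records to these services usually starts with identifier mapping. The sketch below derives the versionless arXiv DOI (the `10.48550/arXiv.<id>` form) and builds lookup URLs for OpenAlex and the Semantic Scholar Graph API. The URL shapes reflect my understanding of those APIs and should be checked against their documentation; whether a given service actually resolves an arXiv DOI varies by record. The helper names are hypothetical.

```python
import re


def strip_version(arxiv_id):
    """Drop a trailing version suffix such as 'v2'."""
    return re.sub(r"v\d+$", "", arxiv_id.strip())


def arxiv_doi(arxiv_id):
    """Return the versionless DOI that arXiv registers for a preprint."""
    return "10.48550/arXiv." + strip_version(arxiv_id)


def lookup_urls(arxiv_id):
    """Build metadata lookup URLs (assumed URL shapes; verify before use)."""
    base_id = strip_version(arxiv_id)
    return {
        "openalex": "https://api.openalex.org/works/https://doi.org/" + arxiv_doi(base_id),
        "semantic_scholar": (
            "https://api.semanticscholar.org/graph/v1/paper/ARXIV:"
            + base_id
            + "?fields=title,year,citationCount"
        ),
    }


# Placeholder ID, not a real paper.
for service, url in lookup_urls("0000.00000v2").items():
    print(service, url)
```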

Datasets & Corpora

Ready-to-use research data. Pre-processed, cleaned, and structured datasets that jumpstart your ML models, benchmarks, and large‑scale research studies without the hassle of data collection and preprocessing.

  • Cornell arXiv Dataset — comprehensive metadata snapshot covering 1.7M+ arXiv papers with abstracts, categories, authors, and publication dates; perfect for exploratory analysis and dataset construction.
  • unarXive — full-text corpus built from arXiv LaTeX sources, with structured plain text, annotated in-text citations, and links to metadata; suited to citation-context studies and NLP models trained on LaTeX-derived text.
  • S2ORC — massive open research corpus from AllenAI containing millions of full-text papers with structured metadata, citations, and extracted information; a goldmine for NLP research.
  • ogbn-arxiv — standardized citation graph benchmark dataset for evaluating graph neural networks and node classification models; widely used in graph ML research.
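
Metadata snapshots like the one above are often distributed as JSON Lines, one record per line, which makes streaming straightforward. The sketch below counts primary categories without loading the whole file. It assumes each record has a `categories` field holding a space-separated string with the primary category first; that layout and the sample records are assumptions to check against the dataset's own documentation.

```python
import io
import json
from collections import Counter


def count_primary_categories(lines):
    """Count the first listed category of each JSON Lines record."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        categories = record.get("categories", "").split()
        if categories:
            counts[categories[0]] += 1
    return counts


# In-memory stand-in for a snapshot file (records are placeholders).
SAMPLE = io.StringIO(
    '{"id": "0000.00001", "title": "Hypothetical A", "categories": "cs.LG stat.ML"}\n'
    '{"id": "0000.00002", "title": "Hypothetical B", "categories": "cs.LG"}\n'
    '{"id": "0000.00003", "title": "Hypothetical C", "categories": "hep-th"}\n'
)
print(count_primary_categories(SAMPLE).most_common())
```

For a real snapshot, replace `SAMPLE` with an open file handle; iterating the handle keeps memory use flat.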

Typical Researcher Workflows

This section shows practical, end-to-end workflows for working with arXiv papers, from first exposure to large-scale dataset creation. If you're new to arXiv-based research, start here.

1. Entering a New Research Field

Goal: Quickly understand the landscape, key papers, and major contributors.

  • Start with broad keyword or category searches using Semantic Scholar, arXiv Sanity Preserver, or arXivLens.
  • Pick a strong seed paper and expand outward using Connected Papers or Litmaps.
  • Skim abstracts and introductions via ar5iv or native arXiv HTML.
  • Use Explainpaper or SciSpace Copilot for unfamiliar sections.
  • Identify recurring authors, labs, and venues via OpenAlex or dblp.

2. Staying Up-to-Date in an Active Area

Goal: Monitor new work without being overwhelmed.

  • Subscribe to arXiv e-mail alerts ("my arXiv") or subject RSS feeds.
  • Layer personalized recommenders such as arXivDigest, Scholar Inbox, Hugging Face Papers, or AlphaSignal.
  • Use popularity and trend signals in arXiv Sanity or arXivLens to triage what to read.
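
A simple way to triage a subject feed is keyword filtering. The sketch below scans the items of an RSS document for keywords in the title or description. It assumes a plain RSS 2.0 layout with un-namespaced `item`, `title`, `link`, and `description` elements; the sample feed is hand-written, and the real feed format should be confirmed before relying on this.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a subject feed (items are placeholders).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Hypothetical category feed</title>
    <item>
      <title>Hypothetical diffusion paper</title>
      <link>https://arxiv.org/abs/0000.00001</link>
      <description>We study a sampling method.</description>
    </item>
    <item>
      <title>Hypothetical graph paper</title>
      <link>https://arxiv.org/abs/0000.00002</link>
      <description>We study node classification.</description>
    </item>
  </channel>
</rss>"""


def matching_items(rss_xml, keywords):
    """Return (title, link) for items whose title or description mentions a keyword."""
    root = ET.fromstring(rss_xml)
    wanted = [keyword.lower() for keyword in keywords]
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        summary = item.findtext("description", default="")
        text = (title + " " + summary).lower()
        if any(keyword in text for keyword in wanted):
            hits.append((title, item.findtext("link", default="")))
    return hits


print(matching_items(SAMPLE_RSS, ["diffusion", "node classification"]))
```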

3. Deep Reading & Note-Taking

Goal: Build durable understanding and personal research memory.

  • Read PDFs in Sioyek for fast keyboard navigation and annotation.
  • Switch to HTML views (ar5iv, arXiv HTML) for structural skimming.
  • Highlight and explain dense passages using Explainpaper.
  • Track read papers and notes in Zotero (with Better BibTeX) or PaperMemory.

4. Citation & Literature Graph Analysis

Goal: Understand idea flow, prior art, and research gaps.

  • Traverse backward and forward citations via Semantic Scholar or OpenCitations.
  • Analyze author, institution, and venue networks using OpenAlex.
  • Look for under-explored intersections or missing baselines.
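
Backward and forward traversal can be pictured on a small in-memory graph. In the sketch below, `REFERENCES` maps each paper to the papers it cites (backward links); inverting it gives the cited-by map (forward links), and a breadth-first search collects everything within a chosen number of hops. The paper labels are placeholders; in practice the edges would come from a citation service.

```python
from collections import deque


def invert(references):
    """Turn a paper -> cited papers map into a paper -> citing papers map."""
    cited_by = {}
    for paper, refs in references.items():
        for ref in refs:
            cited_by.setdefault(ref, set()).add(paper)
    return cited_by


def reachable(graph, start, max_depth):
    """Breadth-first search: papers within max_depth hops of start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    seen.discard(start)
    return seen


# Placeholder graph: C and D cite B, and B cites A.
REFERENCES = {"C": {"B"}, "B": {"A"}, "D": {"B"}}
CITED_BY = invert(REFERENCES)

print(sorted(reachable(REFERENCES, "C", max_depth=2)))  # prior work behind C
print(sorted(reachable(CITED_BY, "A", max_depth=2)))    # later work building on A
```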

5. Dataset & Corpus Creation (ML / NLP / Meta-Research)

Goal: Turn papers into structured, machine-readable data.

  • Harvest metadata via arXiv API, OAI-PMH, or OpenAlex.
  • Acquire full text using arXiv S3 bulk access or curated Kaggle / Hugging Face datasets.
  • Parse PDFs or LaTeX using Docling, GROBID, or LaTeXML.
  • Enrich with citations (OpenCitations), authors/institutions (OpenAlex), and OA links (Unpaywall).
  • Build embedding indexes, search engines, summarization datasets, or benchmarks.
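
The metadata-harvesting step can be sketched with OAI-PMH's `ListRecords` verb and its resumption tokens. The code below builds request URLs and reads record identifiers plus the next token from a response page. The verb, arguments, and XML namespace follow the OAI-PMH 2.0 protocol; the base URL, the `arXiv` metadata prefix, the `cs` set name, and the sample page are assumptions to verify against arXiv's OAI-PMH documentation. No request is sent.

```python
import urllib.parse
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
# Assumed base URL; check arXiv's OAI-PMH page for the current endpoint.
BASE_URL = "https://oaipmh.arxiv.org/oai"


def list_records_url(metadata_prefix="arXiv", set_spec=None, from_date=None, token=None):
    """Build a ListRecords URL; a resumption token replaces all other arguments."""
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
        if from_date:
            params["from"] = from_date
    return BASE_URL + "?" + urllib.parse.urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None)."""
    root = ET.fromstring(xml_text)
    identifiers = [
        header.findtext("oai:identifier", default="", namespaces=OAI_NS)
        for header in root.findall(".//oai:record/oai:header", OAI_NS)
    ]
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS).strip()
    return identifiers, token or None


# Hand-written stand-in for one response page.
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:arXiv.org:0000.00001</identifier></header></record>
    <record><header><identifier>oai:arXiv.org:0000.00002</identifier></header></record>
    <resumptionToken>hypothetical-token</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

ids, next_token = parse_page(SAMPLE_PAGE)
print(ids)
print(list_records_url(token=next_token))
```

A full harvest would loop: request the first URL, parse the page, and keep requesting with the returned token until it comes back empty, waiting between requests as the service asks.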

6. Reproducibility & Long-Term Research Hygiene

Goal: Ensure work remains reproducible and extensible.

  • Record the exact arXiv version you used (the vN suffix on the arXiv ID) and the DOI when one is available.
  • Prefer stable identifiers (arXiv ID, DOI, OpenAlex ID).
  • Archive metadata snapshots and document extraction pipelines.
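
Version tracking is easier when identifiers are stored in a normalized form. The sketch below splits an arXiv identifier into its base ID and optional version number, covering both the current `YYMM.NNNNN` scheme and the older `archive/YYMMNNN` scheme. The regular expressions are my approximation of those schemes, not an official grammar, and the sample IDs are placeholders.

```python
import re

# Approximate patterns for the two identifier schemes (not an official grammar).
NEW_STYLE = re.compile(r"^(?P<base>\d{4}\.\d{4,5})(?:v(?P<version>\d+))?$")
OLD_STYLE = re.compile(r"^(?P<base>[a-z-]+(?:\.[A-Z]{2})?/\d{7})(?:v(?P<version>\d+))?$")


def split_arxiv_id(raw):
    """Return (base_id, version), where version is None if not specified."""
    text = raw.strip()
    if text.lower().startswith("arxiv:"):
        text = text[len("arxiv:"):]
    for pattern in (NEW_STYLE, OLD_STYLE):
        match = pattern.match(text)
        if match:
            version = match.group("version")
            return match.group("base"), int(version) if version else None
    raise ValueError(f"not a recognized arXiv identifier: {raw!r}")


for sample in ["arXiv:0000.00000v2", "0000.00000", "hep-th/0000000v1"]:
    print(sample, split_arxiv_id(sample))
```

Storing the base ID and version as separate fields makes it straightforward to notice when a cited preprint has been revised.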

Contributions are welcome. Prefer stable links, concise descriptions, and open tools.