
BhaveshBytess/Research-Paper-Analyzer

Automated research paper analysis: PDF → JSON with evidence extraction using LLMs (DeepSeek, Gemma). Extracts methods, results, datasets, and claims with precise evidence grounding.

Research Paper Analyzer

Automated extraction of structured data from scientific papers with evidence grounding and validation.

Badges: Streamlit App · Python 3.10+ · License: MIT · Code style: black · Success Rate · Benchmark Papers · Evidence Precision · Numeric Consistency · PyMuPDF · Pydantic · Streamlit · DeepSeek


🚀 Live Demo

Try it now: https://research-paper-analyzer-ack6bpdauvevnlnfbx7gpz.streamlit.app

Note: Demo uses DeepSeek v3.1 free tier. First run may take 30-60 seconds for model initialization.


Demo

Overview

Research Paper Analyzer transforms scientific PDFs into structured, machine-readable JSON with page-level evidence grounding. Built for researchers, ML engineers, and literature review automation, it extracts methods, results, datasets, and claims while maintaining traceability to source text.

Key differentiator: Evidence-grounded extraction with numeric consistency validation, not just LLM scraping.

PDF Input → Layout Analysis → LLM Extraction → Schema Validation → Evidence Linking → Structured JSON
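
The stage chain above can be sketched as a simple function composition. This is illustrative only: the stage callables, their names, and their signatures are assumptions, not this project's actual API.

```python
def analyze(pdf_path, parse, extract, validate, attach_evidence):
    """Run the pipeline stages in order; each stage is a pluggable callable."""
    pages = parse(pdf_path)                # PDF input + layout analysis
    raw = extract(pages)                   # LLM extraction
    record = validate(raw)                 # schema validation / repair
    return attach_evidence(record, pages)  # evidence linking -> structured JSON
```

Keeping each stage a separate callable is what makes the pipeline model-agnostic: swapping the LLM backend only changes the `extract` step.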

Why This Exists

The Problem

  • Manual paper analysis doesn't scale
  • Existing tools extract text but lose structure
  • LLM outputs are unreliable without validation
  • No traceability from claims to source evidence

This Solution

  • ✅ Structured extraction with enforced schema
  • ✅ Evidence grounding: every claim links to page + snippet
  • ✅ Numeric consistency checks: catch hallucinated metrics
  • ✅ Model-agnostic: works with DeepSeek, Gemma, Claude, GPT
  • ✅ Production-validated: 100% success rate on 10 diverse papers

Features

Core Pipeline

  • PDF Parsing: Multi-layout understanding (text, figures, tables, equations)
  • Context Building: Semantic chunking for 5 extraction heads (metadata, methods, results, limitations, summary)
  • LLM Extraction: Parallel extraction with automatic repair
  • Schema Enforcement: Pydantic models + JSON schema validation
  • Evidence Attachment: Fuzzy matching (85% threshold) with page references
  • Consistency Validation: Range checks, baseline logic, unit verification
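
As a rough illustration of the evidence-attachment step, here is a stdlib-only fuzzy matcher in the spirit of the 85% threshold above. The function name, the sliding-window strategy, and the page dictionary shape are assumptions; the repo's evidence_matcher.py may work differently.

```python
import difflib

def match_evidence(claim: str, pages: dict, threshold: float = 0.85):
    """Return (page, snippet) for the best fuzzy match of `claim`, or None."""
    best = None
    for page_no, text in pages.items():
        window = len(claim)
        # Slide a claim-sized window across the page text at half-window steps.
        for start in range(0, max(1, len(text) - window + 1), max(1, window // 2)):
            snippet = text[start:start + window]
            score = difflib.SequenceMatcher(
                None, claim.lower(), snippet.lower()
            ).ratio()
            if best is None or score > best[0]:
                best = (score, page_no, snippet)
    if best and best[0] >= threshold:
        return best[1], best[2]
    return None
```

Anything scoring below the threshold is rejected, so a claim with no plausible source text simply gets no evidence link instead of a wrong one.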

Evaluation Metrics (Production-Validated)

Metric               Score   Status
JSON Validity        100%    ✅ Schema compliance
Evidence Precision   81%     ✅ Grounding quality
Field Coverage       100%    ✅ Complete extraction
Numeric Consistency  100%    ✅ Zero hallucinations
Summary Alignment    58%     🟡 Context matching

Benchmarked on 10 real papers (7-29 pages) including "Attention is All You Need"

User Interfaces

  • Streamlit Web UI: Interactive upload, extraction, visualization
  • CLI Tool: Batch processing with checkpoint/resume
  • Python API: Programmatic access for pipelines

Architecture

┌─────────────────────────────────────────────────────────────┐
│                         INPUT LAYER                         │
│  PDF Upload → PyMuPDF Parser → Text + Layout Extraction     │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│                      PROCESSING LAYER                       │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │  Metadata   │  │   Methods   │  │   Results   │          │
│  │  Extractor  │  │  Extractor  │  │  Extractor  │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
│         ↓                ↓                ↓                 │
│  ┌────────────────────────────────────────────────┐         │
│  │         LLM Backend (DeepSeek/Gemma)           │         │
│  └────────────────────────────────────────────────┘         │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│                      VALIDATION LAYER                       │
│  JSON Repair → Schema Validation → Numeric Consistency      │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│                       EVIDENCE LAYER                        │
│  Fuzzy Matching → Page Linking → Snippet Extraction         │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│                        OUTPUT LAYER                         │
│  Structured JSON + Evidence + Evaluation Metrics            │
└─────────────────────────────────────────────────────────────┘

Quick Start

Installation

# Clone repository
git clone https://github.com/BhaveshBytess/research-paper-analyzer.git
cd research-paper-analyzer

# Create virtual environment (Python 3.10+)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set API key (OpenRouter for DeepSeek)
export OPENROUTER_API_KEY="your-key-here"

Usage

Web UI (Recommended)

# Local
cd research-paper-analyzer
streamlit run app/app.py

# Or visit the live demo:
# https://research-paper-analyzer-ack6bpdauvevnlnfbx7gpz.streamlit.app

CLI (Single Paper)

python run_now.py /path/to/paper.pdf

CLI (Batch Processing)

python batch_deepseek_inline.py
# Processes 2 papers at a time with auto-resume
# Results saved to batch_eval_results/
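
A minimal sketch of the checkpoint/resume behavior described above, assuming a JSON checkpoint file inside batch_eval_results/. The function and file names here are illustrative; batch_deepseek_inline.py may record progress differently.

```python
import json
import pathlib

def run_batch(pdfs, process, checkpoint="batch_eval_results/checkpoint.json"):
    """Process each PDF once, recording progress so a rerun resumes where it stopped."""
    path = pathlib.Path(checkpoint)
    done = set(json.loads(path.read_text())) if path.exists() else set()
    for pdf in pdfs:
        if pdf in done:
            continue  # finished on an earlier run; skip on resume
        process(pdf)
        done.add(pdf)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(sorted(done)))  # checkpoint after every paper
```

Because the checkpoint is rewritten after every paper, a crash or rate-limit failure mid-batch loses at most the paper currently in flight.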

Python API

from research_paper_analyzer import extract_paper

result = extract_paper(
    pdf_path="paper.pdf",
    model="deepseek",
    validate=True,
    attach_evidence=True
)

print(result.json(indent=2))

Output Schema

Core Fields

{
  "title": "string",
  "authors": ["string"],
  "year": 2024,
  "venue": "string | null",
  "arxiv_id": "string | null",
  "methods": [
    {
      "name": "string",
      "category": "CNN | Transformer | GNN | ...",
      "components": ["string"],
      "description": "string"
    }
  ],
  "results": [
    {
      "dataset": "string",
      "metric": "string",
      "value": 0.95,
      "unit": "%" | "points" | null,
      "split": "test | val | train",
      "higher_is_better": true,
      "baseline": "string | null",
      "ours_is": "string | null",
      "confidence": 0.9
    }
  ],
  "tasks": ["string"],
  "datasets": ["string"],
  "limitations": "string | null",
  "ethics": "string | null",
  "summary": "string",
  "evidence": {
    "title": [{"page": 1, "snippet": "..."}],
    "methods": [{"page": 3, "snippet": "..."}],
    "results": [{"page": 7, "snippet": "..."}]
  }
}

Validation Rules

  • ✅ All numeric results must have valid value (not null)
  • ✅ Percentages constrained to [0, 100]
  • ✅ Confidence scores constrained to [0, 1]
  • ✅ higher_is_better logic enforced vs. baseline
  • ✅ Evidence keys must match extracted fields
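
The first three rules can be expressed as a small stand-alone check. This is a sketch of the idea only; the project's actual validator is built on Pydantic models, and the function name here is an assumption.

```python
def validate_result(r: dict) -> list:
    """Return a list of rule violations for one extracted result (empty = valid)."""
    errors = []
    value = r.get("value")
    if value is None:
        errors.append("value must not be null")
    elif r.get("unit") == "%" and not 0 <= value <= 100:
        errors.append("percentage outside [0, 100]")
    conf = r.get("confidence")
    if conf is not None and not 0 <= conf <= 1:
        errors.append("confidence outside [0, 1]")
    return errors
```

Range checks like these are what make a hallucinated "128% accuracy" fail validation instead of silently landing in the output JSON.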

Benchmarks

Performance (10 Papers, Mixed Domains)

Metric               Target        Achieved   Notes
JSON Validity        100%          100%       All outputs schema-compliant
Evidence Precision   ≥70%          81%        Grounding to source text
Field Coverage       100%          100%       No missing required fields
Numeric Consistency  100%          100%       Zero hallucinated metrics
Processing Speed     <2 min/paper  ~2 min     On free-tier API

Test Set Details

  • Papers: 10 (GNN methods, transformers, graph learning)
  • Page range: 7-29 pages
  • Venues: ICLR, NIPS, arXiv
  • Success rate: 100% (10/10 papers extracted)
  • Perfect papers: 2 (all metrics = 1.00)

Landmark paper tested: "Attention is All You Need" (Vaswani et al.) - successfully extracted all 8 authors, transformer components, and BLEU scores.


Project Structure

research-paper-analyzer/
├── research-paper-analyzer/
│   ├── app.py                    # Streamlit UI
│   ├── pdf_parser.py             # PyMuPDF extraction
│   ├── llm_extractor.py          # LLM extraction logic
│   ├── schema.py                 # Pydantic models
│   ├── evidence_matcher.py       # Fuzzy evidence linking
│   └── eval_metrics.py           # Consistency validation
├── batch_deepseek_inline.py      # Batch evaluation script
├── create_visualizations.py      # Metric visualization
├── requirements.txt              # Python dependencies
├── README.md                     # This file
├── batch_eval_results/           # Evaluation results
│   ├── results.csv               # Metrics table
│   ├── visualizations/           # 8 analysis charts
│   └── summary/                  # Detailed reports
├── samples/                      # Test papers + results
└── datastore/                    # Cache + intermediate data

Development

Running Tests

# Unit tests (TODO: expand coverage)
pytest tests/

# Integration test on sample paper
python test_consistency.py

Adding a New LLM Backend

  1. Implement BaseLLMExtractor interface in llm_extractor.py
  2. Add model config to schema.py
  3. Update run_now.py with new model option
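
To make step 1 concrete, here is a hedged sketch of what the BaseLLMExtractor interface might look like. Only the class name comes from this repo; the method names, the shared JSON-parsing helper, and the stub subclass are all illustrative assumptions.

```python
import json
from abc import ABC, abstractmethod

class BaseLLMExtractor(ABC):
    """Interface sketch for pluggable LLM backends."""

    def __init__(self, model_name: str):
        self.model_name = model_name

    @abstractmethod
    def complete(self, prompt: str) -> str:
        """Return the raw model completion for a prompt."""

    def extract_json(self, prompt: str) -> dict:
        # Shared behavior: every backend's raw output is parsed as JSON.
        return json.loads(self.complete(prompt))

class StubExtractor(BaseLLMExtractor):
    """Stand-in backend useful for tests: returns a canned payload."""

    def complete(self, prompt: str) -> str:
        return '{"title": "stub"}'
```

Under this design, a new backend only has to implement the completion call; shared concerns like JSON parsing and repair stay in the base class.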

Contributing

See CONTRIBUTING.md for:

  • Code style (Black, isort)
  • PR checklist
  • Issue templates
  • Architecture decisions

Known Limitations

Current Scope

  • โŒ No OCR support โ€” requires digital PDFs (not scanned images)
  • โŒ No figure extraction โ€” text-only for now
  • โŒ English papers only โ€” no multilingual support yet
  • โš ๏ธ Free-tier rate limits โ€” 16 req/min on OpenRouter (manageable for batch)

Improvement Areas

  • 🟡 Summary alignment (58%): threshold tuning needed
  • 🟡 Complex table parsing: nested tables occasionally missed
  • 🟡 Citation extraction: not yet implemented

Non-Issues

  • ✅ Numeric consistency: validated at 100% (production-ready)
  • ✅ Schema compliance: 100% across all tests
  • ✅ Evidence grounding: 81% precision (excellent)

Roadmap

v1.1 (Current)

  • Core extraction pipeline
  • Evidence grounding
  • Numeric consistency validation
  • Batch evaluation system
  • Comprehensive benchmarks

v1.2 (Next)

  • OCR support (scanned PDFs)
  • Figure caption extraction
  • Citation graph parsing
  • Multi-paper comparison UI
  • Active learning for uncertain extractions

v2.0 (Future)

  • Multilingual support (non-English papers)
  • Table structure extraction
  • Equation parsing (LaTeX)
  • Real-time collaboration (multi-user annotation)
  • API service deployment (FastAPI + Docker)

Citation

If you use this tool in your research, please cite:

@software{research_paper_analyzer_2024,
  author = {Bhavesh Bytess},
  title = {Research Paper Analyzer: Evidence-Grounded PDF Extraction},
  year = {2024},
  url = {https://github.com/BhaveshBytess/research-paper-analyzer}
}

License

MIT License - see LICENSE for details.


Acknowledgments

  • PyMuPDF for robust PDF parsing
  • OpenRouter for LLM API access
  • DeepSeek for high-quality extraction
  • Streamlit for rapid UI prototyping

Contact & Support

Maintained by: Bhavesh Bytess
Status: Active development, production-validated, seeking contributors



Last Updated: 2025-11-03
Version: 1.1.0
Production Status: ✅ Validated (100% success rate on 10 papers)
