pszemraj/codedupes

Detect duplicate and unused Python code via AST hashing, Jaccard similarity, and semantic embeddings (ModernBERT, C2LLM, EmbeddingGemma). CLI and Python API with hybrid synthesis.

codedupes

codedupes detects duplicate and potentially unused Python code with:

  • Traditional AST/token matching (exact + Jaccard near-duplicate)
  • Semantic matching with model-profile embeddings (default gte-modernbert-base)
  • Heuristic unused-code detection
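To make the "exact + Jaccard near-duplicate" idea concrete, here is a minimal sketch using only the standard library. It is not codedupes' implementation: the helper names (`ast_fingerprint`, `token_set`, `jaccard`) are hypothetical, and the real tool may normalize code and choose tokens differently.

```python
import ast
import hashlib
import io
import tokenize


def ast_fingerprint(source: str) -> str:
    """Hash the AST dump, so comments and whitespace do not change the result."""
    tree = ast.parse(source)
    # ast.dump omits line/column attributes by default, so layout is ignored.
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()


def token_set(source: str) -> set[str]:
    """Collect the distinct name, operator, number, and string tokens."""
    keep = {tokenize.NAME, tokenize.OP, tokenize.NUMBER, tokenize.STRING}
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return {tok.string for tok in tokens if tok.type in keep}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = "def add(x, y):\n    return x + y\n"
reformatted = "def add(x, y):  # sum two values\n\n    return x + y\n"
variant = "def add(x, y):\n    return x - y\n"

# Exact match: identical structure despite the comment and blank line.
print(ast_fingerprint(original) == ast_fingerprint(reformatted))
# Near match: different structure, but a high token overlap.
print(jaccard(token_set(original), token_set(variant)))
```

An exact AST hash catches copies that differ only in layout, while a Jaccard score over tokens surfaces functions that were copied and then lightly edited.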

Install

pip install "codedupes @ git+https://github.com/pszemraj/codedupes.git"

Optional GPU extras:

pip install "codedupes[gpu] @ git+https://github.com/pszemraj/codedupes.git"

Requires Python 3.11+. Details are in docs/install.md.

Quick Start

codedupes check ./src
codedupes search ./src "normalize request payload"
codedupes info

codedupes check defaults to a hybrid-first report:

  • one combined duplicate list (Hybrid Duplicates)
  • likely dead code (potentially_unused)

Use --show-all to include raw traditional + raw semantic duplicate lists.
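As a rough illustration of what a "potentially unused" heuristic can look like, the sketch below flags functions whose names are never mentioned anywhere in the same source. This is an assumption-laden toy, not the tool's algorithm: the function name `find_potentially_unused` is hypothetical, it looks at a single file only, and it treats any name or attribute mention as a use, which errs on the side of reporting less.

```python
import ast


def find_potentially_unused(source: str) -> list[str]:
    """Return function names that are defined but never referenced in source."""
    tree = ast.parse(source)

    defined = set()
    referenced = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            defined.add(node.name)
        elif isinstance(node, ast.Name):
            # Covers direct calls and plain references such as callbacks.
            referenced.add(node.id)
        elif isinstance(node, ast.Attribute):
            # Covers method-style access such as obj.helper().
            referenced.add(node.attr)

    # Dunder methods are invoked implicitly, so never report them.
    return sorted(
        name
        for name in defined - referenced
        if not (name.startswith("__") and name.endswith("__"))
    )


sample = (
    "def used():\n"
    "    return 1\n"
    "\n"
    "def orphan():\n"
    "    return 2\n"
    "\n"
    "print(used())\n"
)
print(find_potentially_unused(sample))
```

Dynamic dispatch, re-exports, and entry points defined outside the scanned code can all defeat a check like this, which is why such results are best read as candidates for review.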

Documentation

Primary docs live under docs/.

Notes and limits

  • Call graph and unused detection are heuristic and conservative by default.
  • Semantic model-profile defaults and task behavior are defined in
    docs/model-profiles.md.
  • Analysis defaults (semantic candidate scope, tiny-traditional filtering, hybrid gates) are defined in
    docs/analysis-defaults.md.
  • Semantic analysis may download model weights on first use.
  • Extraction skips common artifact/cache directories by default (__pycache__, .venv, etc).
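The directory skipping described in the last note can be sketched with `os.walk`, pruning unwanted directories in place so they are never descended into. The `SKIP_DIRS` set and the `iter_python_files` name are assumptions for illustration; the tool's actual skip list is longer than the two entries named above and is not reproduced here.

```python
import os
from pathlib import Path

# Assumed skip list: only the two directories named in the notes above.
SKIP_DIRS = {"__pycache__", ".venv"}


def iter_python_files(root):
    """Yield .py files under root, never descending into skipped directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Assigning to dirnames[:] prunes the walk in place.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        for name in sorted(filenames):
            if name.endswith(".py"):
                yield Path(dirpath) / name


if __name__ == "__main__":
    for path in iter_python_files("."):
        print(path)
```

Pruning during the walk, rather than filtering paths afterwards, avoids scanning large virtual environments at all.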

Languages

Python 100.0%

Apache License 2.0
Created February 14, 2026
Updated February 15, 2026