mateogon/knowledge-os

Standalone Knowledge OS for importing, chunking, embedding, and semantically searching books and documents.

Knowledge OS

Standalone MVP for the ingestion layer described in PLAN.md.

Current scope

  • Import EPUB, MOBI, and AZW3
  • Normalize non-EPUB sources to EPUB with Calibre
  • Extract ordered chapter text into library/<book_id>/content
  • Persist metadata.json, source artifacts, book.full.txt, and SQLite records
  • Re-import safely without duplicating books
  • Chunk imported books into derived/chunks.jsonl
  • Benchmark embedding models against chunked books
  • Current recommended benchmark candidates: sentence-transformers/all-MiniLM-L6-v2 and google/embeddinggemma-300m
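
One common way to make re-import safe is to key each book on a hash of its source bytes, so the same file always maps to the same library/<book_id> folder. The sketch below assumes that approach; this README does not say how book_id is actually derived, and content_book_id is a hypothetical helper, not part of the project.

```python
import hashlib
from pathlib import Path


def content_book_id(source: Path, length: int = 16) -> str:
    """Hypothetical helper: derive a stable id from the file's bytes.

    Assumption: the real project may compute book_id differently.
    """
    digest = hashlib.sha256()
    with source.open("rb") as fh:
        # Read in 1 MiB blocks so large books are not loaded at once.
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()[:length]
```

With an id like this, an importer can check whether the target folder or SQLite row already exists and skip or update it instead of creating a duplicate.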

Setup

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e . -r requirements-dev.txt

Optional embeddings stack:

python -m pip install -r requirements-embeddings.txt

Optional GUI stack:

python -m pip install -r requirements-gui.txt

Usage

python main.py import /absolute/path/to/book.epub
python main.py chunk <book_id>
python main.py embed <book_id>
python main.py search <book_id> "trauma does not exist"
python main.py search-all trauma
python main.py
python main.py benchmark-embeddings <book_id> sentence-transformers/all-MiniLM-L6-v2
python main.py benchmark-embeddings <book_id> google/embeddinggemma-300m
python main.py benchmark-retrieval <book_id> sentence-transformers/all-MiniLM-L6-v2 benchmarks/retrieval_queries.json
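
For intuition about what a semantic search over embedded chunks involves, here is a minimal sketch that ranks chunks by cosine similarity to a query vector. It is an illustration only: the field names chunk_id and embedding, and the functions cosine and top_k, are assumptions, not the project's search implementation.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: Sequence[float], chunks: list[dict], k: int = 3) -> list[tuple[float, str]]:
    """Return the k best (score, chunk_id) pairs, highest score first.

    Assumes each chunk dict has hypothetical 'chunk_id' and 'embedding' keys.
    """
    scored = [(cosine(query_vec, c["embedding"]), c["chunk_id"]) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

In practice the query vector would come from the same embedding model that produced the chunk embeddings; mixing models makes the scores meaningless.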

Optional env vars:

  • KOS_LIBRARY_PATH
  • KOS_DB_PATH
  • KOS_CALIBRE_PATH
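
A minimal sketch of how these variables could be resolved with fallbacks. The fallback values are assumptions for illustration: the relative library folder mirrors the paths used in this README, the database filename is hypothetical, and only the Windows Calibre path comes from this document. resolve_paths is not a function from the project.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Default taken from this README; used only on Windows.
WINDOWS_CALIBRE = r"C:\Program Files\Calibre2\ebook-convert.exe"


def resolve_paths(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: read the KOS_* variables with assumed defaults."""
    env = os.environ if env is None else env
    library = Path(env.get("KOS_LIBRARY_PATH", "library"))  # assumed default
    # Assumed default: a SQLite file inside the library folder.
    db = Path(env.get("KOS_DB_PATH", str(library / "knowledge_os.sqlite3")))
    # Assumed non-Windows fallback: rely on ebook-convert being on PATH.
    calibre_default = WINDOWS_CALIBRE if os.name == "nt" else "ebook-convert"
    calibre = env.get("KOS_CALIBRE_PATH", calibre_default)
    return {"library": library, "db": db, "calibre": calibre}
```

Passing a plain dict instead of os.environ makes the resolution easy to exercise in tests.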

Derived outputs (chunks, embeddings, index state, and benchmarks):

  • library/<book_id>/derived/chunks.jsonl
  • library/<book_id>/derived/chunks_embeddings_*.jsonl
  • library/<book_id>/derived/index_state.json
  • library/<book_id>/derived/embedding_benchmark_*.json
  • library/<book_id>/derived/retrieval_benchmark_*.json
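
The .jsonl files hold one JSON object per line, so they can be streamed without loading a whole book into memory. The reader below is a generic sketch; it makes no assumption about which fields each record contains, and iter_jsonl is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)
```

For example, `sum(1 for _ in iter_jsonl(Path("library") / book_id / "derived" / "chunks.jsonl"))` counts the chunks for a given book_id.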

Default Calibre path on Windows:

  • C:\Program Files\Calibre2\ebook-convert.exe

Languages

Python 100.0%

Created March 9, 2026
Updated March 9, 2026