mateogon/knowledge-os
Standalone Knowledge OS for importing, chunking, embedding, and semantically searching books and documents.
Knowledge OS
Standalone MVP for the ingestion layer described in PLAN.md.
Current scope
- Import EPUB, MOBI, and AZW3
- Normalize non-EPUB sources to EPUB with Calibre
- Extract ordered chapter text into library/<book_id>/content
- Persist metadata.json, source artifacts, book.full.txt, and SQLite records
- Re-import safely without duplicating books
- Chunk imported books into derived/chunks.jsonl
- Benchmark embedding models against chunked books
- Current recommended benchmark candidates: all-MiniLM-L6-v2 and google/embeddinggemma-300m
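As a rough illustration of the chunking step, the sketch below splits text into overlapping character windows and writes one JSON object per line. The strategy, the window sizes, and the field names (`chunk_id`, `start`, `text`) are assumptions made for this example; the project's actual chunker and chunks.jsonl schema may differ.

```python
import json
from pathlib import Path


def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[dict]:
    """Split text into overlapping character windows (hypothetical strategy)."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    chunks = []
    step = max_chars - overlap  # each window starts this far after the previous one
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append(
            {"chunk_id": index, "start": start, "text": text[start:start + max_chars]}
        )
        if start + max_chars >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def write_chunks_jsonl(chunks: list[dict], path: Path) -> None:
    """Write one JSON object per line (the JSON Lines layout)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for chunk in chunks:
            handle.write(json.dumps(chunk, ensure_ascii=False) + "\n")
```

The overlap keeps a sentence that straddles a window boundary visible in both neighbouring chunks, which tends to help retrieval at the cost of some duplicated text.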
Setup
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e . -r requirements-dev.txt
Optional embeddings stack:
python -m pip install -r requirements-embeddings.txt
Optional GUI stack:
python -m pip install -r requirements-gui.txt
Usage
python main.py import /absolute/path/to/book.epub
python main.py chunk <book_id>
python main.py embed <book_id>
python main.py search <book_id> "trauma does not exist"
python main.py search-all trauma
python main.py
python main.py benchmark-embeddings <book_id> sentence-transformers/all-MiniLM-L6-v2
python main.py benchmark-embeddings <book_id> google/embeddinggemma-300m
python main.py benchmark-retrieval <book_id> sentence-transformers/all-MiniLM-L6-v2 benchmarks/retrieval_queries.json
Optional env vars:
- KOS_LIBRARY_PATH
- KOS_DB_PATH
- KOS_CALIBRE_PATH
Chunk outputs:
- library/<book_id>/derived/chunks.jsonl
- library/<book_id>/derived/chunks_embeddings_*.jsonl
- library/<book_id>/derived/index_state.json
- library/<book_id>/derived/embedding_benchmark_*.json
- library/<book_id>/derived/retrieval_benchmark_*.json
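To show how an embeddings file of this kind could back a semantic search, here is a minimal sketch that ranks stored chunks by cosine similarity against a query vector. It assumes each JSONL record carries a `text` string and an `embedding` list of floats; those field names, and the helper names, are hypothetical and not taken from the project.

```python
import json
import math
from pathlib import Path


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def load_embedded_chunks(path: Path) -> list[dict]:
    """Read one JSON record per non-empty line."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def top_k(query_vector: list[float], records: list[dict], k: int = 3) -> list[tuple[float, dict]]:
    """Return the k highest-scoring (score, record) pairs, best first."""
    # "embedding" is an assumed field name for the stored vector.
    scored = [(cosine_similarity(query_vector, r["embedding"]), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The query vector would have to come from the same embedding model that produced the stored vectors, since scores across different models are not comparable.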
Default Calibre path on Windows:
C:\Program Files\Calibre2\ebook-convert.exe
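One plausible way to combine the KOS_CALIBRE_PATH override with the Windows default is sketched below. The function name and the non-Windows fallback (looking up `ebook-convert` on PATH) are assumptions for illustration; only the environment variable name and the Windows default path come from this README.

```python
import os
import shutil
from typing import Mapping, Optional

# Default taken from this README; applies to Windows only.
WINDOWS_CALIBRE_DEFAULT = r"C:\Program Files\Calibre2\ebook-convert.exe"


def resolve_calibre_path(
    env: Optional[Mapping[str, str]] = None, platform: str = os.name
) -> Optional[str]:
    """Pick the ebook-convert executable: env override first, then a default."""
    env = os.environ if env is None else env
    override = env.get("KOS_CALIBRE_PATH")
    if override:  # an unset or empty variable falls through to the defaults
        return override
    if platform == "nt":
        return WINDOWS_CALIBRE_DEFAULT
    # Assumed fallback on other systems: search PATH; None if not installed.
    return shutil.which("ebook-convert")
```

Passing `env` and `platform` as parameters keeps the function easy to test without touching the real environment.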