oguzhankir/omnichunk

Chunk code, prose, and markup files with structure awareness.

omnichunk is a Python library that splits files into smaller pieces while keeping useful context:

Code: respects function/class boundaries, includes scope and import information
Markdown: respects headings and sections
JSON/YAML/TOML: splits by top-level keys/sections
HTML/XML: splits by elements
Mixed files: handles notebooks and Python files with long docstrings

Each chunk includes:

The original text slice
Byte and line ranges for lossless reconstruction
Context (scope, entities, headings, imports, siblings)
Optional contextualized_text for embeddings

The library is deterministic and works without external APIs.

Installation

pip install omnichunk

Optional extras:

pip install "omnichunk[tiktoken]"        # tiktoken tokenizer support
pip install "omnichunk[transformers]"    # HuggingFace tokenizer support
pip install "omnichunk[all-languages]"   # Extended language grammars
pip install "omnichunk[langchain]"       # LangChain Document export support
pip install "omnichunk[llamaindex]"      # LlamaIndex Document export support
pip install "omnichunk[profiling]"       # py-spy / line-profiler helpers
pip install "omnichunk[rust]"            # maturin tooling for Rust backend PoC
pip install "omnichunk[dev]"             # Development tools
pip install "omnichunk[pinecone]"        # Vector DB adapter extra (no client lib)
pip install "omnichunk[weaviate]"        # Vector DB adapter extra (no client lib)
pip install "omnichunk[supabase]"        # Vector DB adapter extra (no client lib)
pip install "omnichunk[vectordb]"        # Meta-group for all vector export extras (empty deps)
pip install "omnichunk[semantic]"        # Marker extra (semantic stack uses core numpy only)
pip install "omnichunk[graph]"           # Marker extra (GraphRAG uses existing chunk entities)

Examples

Runnable scripts and Jupyter notebooks live under examples/. They cover chunking, hierarchical trees, incremental diffs, token budgets, semantic boundaries, GraphRAG, vector export shapes, and the plugin API. See examples/README.md for how to run them.

CLI

omnichunk ./src --glob "**/*.py" --max-size 512 --size-unit chars --format jsonl > chunks.jsonl
omnichunk app.py --max-size 256 --size-unit chars --stats
omnichunk app.py --max-size 256 --size-unit chars --nws-backend python
omnichunk README.md --format csv --output chunks.csv
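
The first command above writes its output to chunks.jsonl. A minimal sketch of reading such a file back, assuming only the JSONL convention of one JSON object per line; the field names in the sample are hypothetical and are not taken from omnichunk's actual output schema:

```python
import json
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed object per non-blank line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample; a real run would pass open("chunks.jsonl", encoding="utf-8").
sample = [
    '{"index": 0, "text": "import os"}\n',
    "\n",
    '{"index": 1, "text": "def hello(): ..."}\n',
]
records = list(iter_jsonl(sample))
print(len(records))
```

Because iter_jsonl accepts any iterable of lines, the same function works on an open file handle without loading the whole file into memory.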

Quick start

One-shot API

from omnichunk import chunk

code = """
import os

def hello(name: str) -> str:
    return f"hello {name}"
"""

chunks = chunk("example.py", code, max_chunk_size=128, size_unit="chars")

for c in chunks:
    print(c.index, c.byte_range, c.context.breadcrumb)
    print(c.contextualized_text)

Reusable `Chunker`

from omnichunk import Chunker

chunker = Chunker(
    max_chunk_size=1024,
    min_chunk_size=80,
    tokenizer="cl100k_base",
    context_mode="full",
    overlap=0.1,
    overlap_lines=1,
)

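source_code = "..."   # contents of api.py
chunks = chunker.chunk("api.py", source_code)

large_source = "..."  # contents of large.py
for c in chunker.stream("large.py", large_source):
    consume(c)        # consume is your own callback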
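
In the streaming loop above, consume stands for your own code. One common pattern is to group streamed chunks into fixed-size batches before embedding them. A sketch, where batched is a hypothetical helper that accepts any iterable and is not part of omnichunk:

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Group an iterable into lists of at most `size` items, lazily."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Stand-in for chunker.stream(...): any iterator works here.
fake_stream = (f"chunk-{i}" for i in range(5))
batches = list(batched(fake_stream, 2))
print(batches)
```

The helper never materializes the full stream, so memory use stays bounded by the batch size.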

Async API

import asyncio
from omnichunk import Chunker

chunker = Chunker(max_chunk_size=1024, size_unit="tokens")

# Single file async
chunks = asyncio.run(chunker.achunk("api.py", source_code))

# Async streaming
async def process():
    async for chunk in chunker.astream("large.py", large_source):
        consume(chunk)

asyncio.run(process())

# Async batch (concurrent)
results = asyncio.run(chunker.abatch(
    [
        {"filepath": "a.py", "code": code_a},
        {"filepath": "b.ts", "code": code_b},
    ],
    concurrency=8,
))

# Synchronous batch (same input shape as abatch)
batch_results = chunker.batch(
    [
        {"filepath": "a.py", "code": code_a},
        {"filepath": "b.ts", "code": code_b},
        {"filepath": "README.md", "code": readme_md},
    ],
    concurrency=8,
)

# Every matching file under a directory
directory_results = chunker.chunk_directory(
    "./src",
    glob="**/*.py",
    exclude=["**/tests/**"],
    concurrency=8,
)

# Flatten per-file results, then serialize
all_chunks = [chunk for result in directory_results for chunk in result.chunks]

jsonl_payload = chunker.to_jsonl(all_chunks)
csv_payload = chunker.to_csv(all_chunks)

# Size statistics and quality scores
stats = chunker.chunk_stats(all_chunks, size_unit="chars")
quality = chunker.quality_scores(
    all_chunks,
    min_chunk_size=80,
    max_chunk_size=1024,
    size_unit="chars",
)

# Framework exports (need the langchain / llamaindex extras)
langchain_docs = chunker.to_langchain_docs(all_chunks)
llamaindex_docs = chunker.to_llamaindex_docs(all_chunks)

# Vector DB–ready rows (you compute embeddings elsewhere)
from omnichunk import chunks_to_pinecone_vectors, chunks_to_supabase_rows

emb = [[0.1, 0.2, 0.3] for _ in all_chunks]  # same length as chunks
pinecone_batch = chunks_to_pinecone_vectors(all_chunks, emb, namespace="my_ns")
weaviate_batch = chunker.to_weaviate_objects(all_chunks, emb, class_name="Doc")
supabase_rows = chunks_to_supabase_rows(all_chunks, emb)
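
The abatch, batch, and chunk_directory calls above each take concurrency=8. A reasonable reading is that this caps how many files are processed at once, but that interpretation is an assumption. The generic asyncio pattern for such a cap is a semaphore, sketched here with a hypothetical worker; this is not omnichunk's implementation:

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def run_bounded(
    items: Sequence[str],
    worker: Callable[[str], Awaitable[int]],
    concurrency: int,
) -> List[int]:
    """Run worker over items with at most `concurrency` tasks in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item: str) -> int:
        async with semaphore:
            return await worker(item)

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_chunk_count(source: str) -> int:
    # Hypothetical stand-in for per-file chunking work.
    await asyncio.sleep(0)
    return len(source.splitlines())


line_counts = asyncio.run(
    run_bounded(["a\nb", "c", "d\ne\nf"], fake_chunk_count, concurrency=2)
)
print(line_counts)
```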

Semantic chunking

Semantic boundaries are computed from embeddings that you supply (semantic_embed_fn); omnichunk never calls an external API.

import numpy as np
from omnichunk import Chunker

def embed(texts):
    # Replace with your actual embedding model
    return np.random.default_rng(0).standard_normal((len(texts), 384))

chunker = Chunker(max_chunk_size=512, size_unit="tokens")
essay = "Your prose here…"
chunks = chunker.semantic_chunk("essay.md", essay, embed_fn=embed)

For code and other non-prose content types, structural engines are used even if semantic=True.

Topic shift detection

from omnichunk.semantic import detect_topic_shifts, split_sentences

text = "Your document…"
sentences_with_offsets = split_sentences(text)
sentences = [s for s, _, _ in sentences_with_offsets]
shifts = detect_topic_shifts(sentences, window=5, threshold=0.4)
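
One common way to detect topic shifts is to compare the text on either side of each sentence boundary and flag the boundaries where similarity drops. The sketch below uses bag-of-words cosine similarity and only the standard library. It illustrates the general technique and does not describe how detect_topic_shifts is implemented; the function name and the meaning given to window and threshold here are assumptions.

```python
import math
import re
from collections import Counter
from typing import List


def _bag(sentences: List[str]) -> Counter:
    return Counter(w for s in sentences for w in re.findall(r"[a-z]+", s.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def naive_topic_shifts(
    sentences: List[str], window: int = 2, threshold: float = 0.1
) -> List[int]:
    """Return indices i where sentence i appears to start a new topic.

    A boundary is flagged when the `window` sentences before i and the
    `window` sentences from i onward have cosine similarity below `threshold`.
    """
    shifts = []
    for i in range(1, len(sentences)):
        left = _bag(sentences[max(0, i - window):i])
        right = _bag(sentences[i:i + window])
        if _cosine(left, right) < threshold:
            shifts.append(i)
    return shifts


sentences = [
    "Cats purr when cats relax.",
    "Cats sleep most of the day.",
    "Compilers parse source code.",
    "Compilers then optimize the parsed code.",
]
print(naive_topic_shifts(sentences))
```

Word overlap is a crude proxy for meaning; swapping _bag and _cosine for real sentence embeddings keeps the same windowed comparison.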

GraphRAG: entity-chunk graph

from omnichunk import Chunker, build_chunk_graph

source = "class MyClass:\n    pass\n"
chunks = Chunker().chunk("repo.py", source)
graph = build_chunk_graph(chunks)
print(graph.entity_chunks("MyClass"))       # chunk indices containing MyClass
print(graph.chunk_neighbors(0))             # chunks sharing entities with chunk 0
data = graph.to_dict()                      # JSON-serializable
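
Conceptually, an entity-chunk graph is a bipartite mapping: each entity points to the chunks that mention it, and two chunks are neighbors when they share at least one entity. A standalone sketch of that idea follows. The class and its input shape are hypothetical and do not mirror the internals of build_chunk_graph.

```python
from collections import defaultdict
from typing import Dict, List, Set


class TinyEntityGraph:
    """Hypothetical illustration: maps entity names to chunk indices."""

    def __init__(self, chunk_entities: List[Set[str]]) -> None:
        self._chunk_entities = chunk_entities
        self._entity_to_chunks: Dict[str, Set[int]] = defaultdict(set)
        for index, entities in enumerate(chunk_entities):
            for name in entities:
                self._entity_to_chunks[name].add(index)

    def entity_chunks(self, name: str) -> List[int]:
        """Indices of chunks that mention the entity."""
        return sorted(self._entity_to_chunks.get(name, set()))

    def chunk_neighbors(self, index: int) -> List[int]:
        """Indices of other chunks sharing at least one entity."""
        neighbors: Set[int] = set()
        for name in self._chunk_entities[index]:
            neighbors |= self._entity_to_chunks[name]
        neighbors.discard(index)
        return sorted(neighbors)


tiny = TinyEntityGraph([{"MyClass", "helper"}, {"helper"}, {"Other"}])
print(tiny.entity_chunks("helper"))
print(tiny.chunk_neighbors(0))
```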

Hierarchical chunking (multi-level RAG)

from omnichunk import Chunker

chunker = Chunker(size_unit="chars")
source = "..."  # your file contents
tree = chunker.hierarchical_chunk(
    "service.py", source,
    levels=[64, 256, 1024],   # leaf → root
)

small_chunks = tree.leaves()   # embed and index these
large_chunks = tree.roots()    # pass these to LLM as context
parent = tree.parent(small_chunks[0])  # navigate up
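
The parent lookup can be understood as range containment: a leaf's parent is the chunk at the next level whose byte range encloses the leaf's range. The sketch below shows that relationship with plain tuples. It assumes that leaf boundaries never straddle a parent boundary, which may not match how hierarchical_chunk actually builds its tree; find_parent is a hypothetical helper.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int]  # (start, end), end exclusive


def find_parent(leaf: Range, parents: List[Range]) -> Optional[Range]:
    """Return the first parent range that fully contains the leaf range."""
    for start, end in parents:
        if start <= leaf[0] and leaf[1] <= end:
            return (start, end)
    return None


leaves = [(0, 60), (60, 120), (120, 200), (200, 250)]
parents = [(0, 120), (120, 250)]

parent_of = {leaf: find_parent(leaf, parents) for leaf in leaves}
print(parent_of[(120, 200)])
```

With this mapping, retrieval can match on a small leaf and then hand the enclosing parent range to the LLM as context.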

Incremental / differential chunking

from omnichunk import Chunker

chunker = Chunker(max_chunk_size=512, size_unit="chars")
old_source = "..."  # previous file contents
old_chunks = chunker.chunk("api.py", old_source)
new_source = "..."  # updated file contents
diff = chunker.chunk_diff(
    "api.py",
    new_source,
    previous_chunks=old_chunks,
)
# diff.added        → upsert to vector DB
# diff.removed_ids  → delete from vector DB
# diff.unchanged    → skip re-embedding
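
A diff of this shape can be approximated by content hashing: chunks whose text hash already exists are unchanged, new hashes are added, and hashes that vanished are removed. The sketch below is a generic illustration using hashlib, not the actual chunk_diff algorithm; the function name, the id scheme, and the return shape are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def _digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def hash_diff(
    old_texts: List[str], new_texts: List[str]
) -> Tuple[List[str], List[str], List[str]]:
    """Return (added, removed_ids, unchanged).

    added and unchanged are chunk texts from new_texts; removed_ids are the
    SHA-256 digests of old chunks that no longer appear.
    """
    old_by_id: Dict[str, str] = {_digest(t): t for t in old_texts}
    new_ids = {_digest(t) for t in new_texts}
    added = [t for t in new_texts if _digest(t) not in old_by_id]
    unchanged = [t for t in new_texts if _digest(t) in old_by_id]
    removed_ids = [cid for cid in old_by_id if cid not in new_ids]
    return added, removed_ids, unchanged


old = ["import os\n", "def a():\n    return 1\n", "def b():\n    return 2\n"]
new = ["import os\n", "def a():\n    return 10\n", "def b():\n    return 2\n"]
added, removed_ids, unchanged = hash_diff(old, new)
print(len(added), len(removed_ids), len(unchanged))
```

Only the one edited function needs re-embedding in this example; the import line and the untouched function are skipped.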

Token budget optimizer

from omnichunk.budget import TokenBudgetOptimizer

optimizer = TokenBudgetOptimizer(budget=4096, strategy="greedy")
result = optimizer.select(retrieved_chunks, scores=relevance_scores)
# result.selected → pass to LLM
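
A greedy strategy for this kind of problem typically walks the candidates in descending score order and keeps each one that still fits. The following sketch shows that approach on plain token counts and scores. It is an assumption about what "greedy" means here, not a copy of TokenBudgetOptimizer, and greedy_select is a hypothetical name.

```python
from typing import List, Sequence


def greedy_select(
    token_counts: Sequence[int], scores: Sequence[float], budget: int
) -> List[int]:
    """Return indices of selected items, in original order.

    Items are considered from highest to lowest score; an item is kept
    only if adding it does not exceed the token budget.
    """
    if len(token_counts) != len(scores):
        raise ValueError("token_counts and scores must have the same length")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    used = 0
    chosen = []
    for i in order:
        if used + token_counts[i] <= budget:
            chosen.append(i)
            used += token_counts[i]
    return sorted(chosen)


selected = greedy_select(
    token_counts=[300, 500, 200, 400],
    scores=[0.9, 0.8, 0.4, 0.7],
    budget=1000,
)
print(selected)
```

Greedy selection is fast but not optimal: here the third-best item (400 tokens) is skipped because it no longer fits, and a lower-scored 200-token item fills the remaining space instead.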

Vector database export (serialization)

Adapters produce plain dicts and lists only; none of these extras installs a Pinecone, Weaviate, or Supabase client. You compute embeddings yourself and pass them as a list parallel to the chunks:

chunks_to_pinecone_vectors / Chunker.to_pinecone_vectors — id, values, metadata (+ optional namespace per row)
chunks_to_weaviate_objects / Chunker.to_weaviate_objects — class, vector, properties
chunks_to_supabase_rows / Chunker.to_supabase_rows — content, embedding, plus flat metadata columns
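
Passing embeddings in parallel means the i-th embedding belongs to the i-th chunk, so a length mismatch should fail fast. A sketch of that pairing using the id, values, and metadata keys listed above; the helper name, the id scheme, and the metadata contents are hypothetical:

```python
from typing import Dict, List, Sequence


def to_vector_rows(
    texts: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    prefix: str = "doc",
) -> List[Dict]:
    """Pair each text with its embedding; raise if the lists differ in length."""
    if len(texts) != len(embeddings):
        raise ValueError(f"{len(texts)} chunks but {len(embeddings)} embeddings")
    return [
        {
            "id": f"{prefix}-{i}",        # hypothetical id scheme
            "values": list(vector),
            "metadata": {"text": text},   # hypothetical metadata
        }
        for i, (text, vector) in enumerate(zip(texts, embeddings))
    ]


rows = to_vector_rows(["alpha", "beta"], [[0.1, 0.2], [0.3, 0.4]])
print(rows[1]["id"], rows[1]["values"])
```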

Plugin API

from omnichunk import register_parser, register_formatter, Chunker

def my_parse(filepath: str, content: str):
    # Return a tree-sitter-like tree, or None to use the built-in parser.
    return None

register_parser("python", my_parse, overwrite=True)

def my_fmt(chunks):
    return str(len(chunks))

register_formatter("count", my_fmt)
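
Registration APIs of this kind are usually backed by a name-to-callable dictionary with an overwrite guard. A generic sketch of that pattern follows. None of these names come from omnichunk, and its registry may behave differently, for example in how it treats duplicate names.

```python
from typing import Callable, Dict, List

_FORMATTERS: Dict[str, Callable[[List[str]], str]] = {}


def register(
    name: str, fn: Callable[[List[str]], str], overwrite: bool = False
) -> None:
    """Store fn under name; refuse to replace an entry unless overwrite=True."""
    if name in _FORMATTERS and not overwrite:
        raise ValueError(f"formatter {name!r} is already registered")
    _FORMATTERS[name] = fn


def render(name: str, chunks: List[str]) -> str:
    """Look up a registered formatter by name and apply it."""
    return _FORMATTERS[name](chunks)


register("count", lambda chunks: str(len(chunks)))
print(render("count", ["a", "b", "c"]))
```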

File API

from omnichunk import chunk_file

chunks = chunk_file("path/to/file.py")

Directory API

from omnichunk import chunk_directory

results = chunk_directory("./src", glob="**/*.py", max_chunk_size=512, size_unit="chars")

for result in results:
    if result.error:
        print("error", result.filepath, result.error)
    else:
        print(result.filepath, len(result.chunks))
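
Since each result carries either chunks or an error, a run can be summarized without stopping at the first failure. The sketch below uses a hypothetical stand-in class that exposes only the three attributes used above (filepath, chunks, error), so it runs without omnichunk installed:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FakeResult:
    """Hypothetical stand-in exposing the three attributes used above."""
    filepath: str
    chunks: List[str] = field(default_factory=list)
    error: Optional[str] = None


def summarize(results: List[FakeResult]) -> Tuple[int, List[str]]:
    """Return (total chunk count, list of failed file paths)."""
    total = sum(len(r.chunks) for r in results if not r.error)
    failed = [r.filepath for r in results if r.error]
    return total, failed


fake_results = [
    FakeResult("a.py", chunks=["c1", "c2"]),
    FakeResult("b.py", error="decode error"),
    FakeResult("c.py", chunks=["c3"]),
]
total, failed = summarize(fake_results)
print(total, failed)
```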

Chunk model

Every Chunk includes raw content, exact offsets, and rich context:

text: exact source slice (lossless reconstruction)
contextualized_text: embedding-ready representation
byte_range, line_range
context: scope, entities, siblings, imports, headings, section metadata
token_count, char_count, nws_count
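
nws_count presumably stands for a non-whitespace character count, which is less sensitive to indentation and blank lines than char_count. Assuming that reading, the three size measures relate as in this sketch; the whitespace-split token count is a crude stand-in for a real tokenizer, and none of these functions are omnichunk's own:

```python
def char_count(text: str) -> int:
    """Total number of characters, whitespace included."""
    return len(text)


def nws_count(text: str) -> int:
    """Count characters that are not whitespace (assumed meaning of NWS)."""
    return sum(1 for ch in text if not ch.isspace())


def rough_token_count(text: str) -> int:
    # Crude stand-in: real tokenizers such as tiktoken split differently.
    return len(text.split())


snippet = "def f():\n    return 1\n"
print(char_count(snippet), nws_count(snippet), rough_token_count(snippet))
```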

Supported content

Code

Python
JavaScript / TypeScript
Rust
Go
Java
C / C++ / C#
Ruby / PHP / Kotlin / Swift (grammar-dependent)

Prose

Markdown
Plaintext

Markdown fenced blocks are delegated by language:

fenced code (python, ts, etc.) routes to CodeEngine
fenced markup (json, yaml, toml, html, xml) routes to MarkupEngine
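
The delegation rule can be pictured as a lookup from the fence's info string to an engine. A sketch with a regex-based fence scanner follows. The engine names come from the list above, while the language sets, the "unrouted" label, and the function itself are assumptions made for illustration.

```python
import re
from typing import List, Tuple

CODE_LANGS = {"python", "ts"}  # the "etc." in the list above is not expanded here
MARKUP_LANGS = {"json", "yaml", "toml", "html", "xml"}

FENCE = "`" * 3  # three backticks, built indirectly so this example stays fence-safe
FENCE_RE = re.compile(
    rf"^{FENCE}(\w+)[^\n]*\n(.*?)^{FENCE}[ \t]*$", re.MULTILINE | re.DOTALL
)


def route_fences(markdown: str) -> List[Tuple[str, str]]:
    """Return (language, engine) for each fenced block with an info string."""
    routed = []
    for match in FENCE_RE.finditer(markdown):
        lang = match.group(1).lower()
        if lang in CODE_LANGS:
            engine = "CodeEngine"
        elif lang in MARKUP_LANGS:
            engine = "MarkupEngine"
        else:
            engine = "unrouted"  # hypothetical label for anything else
        routed.append((lang, engine))
    return routed


doc = f"# Title\n\n{FENCE}python\nprint(1)\n{FENCE}\n\n{FENCE}yaml\na: 1\n{FENCE}\n"
print(route_fences(doc))
```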

Markup

JSON
YAML
TOML
HTML / XML

Hybrid

Python with heavy docstrings
Notebook-style # %% cell files

Architecture

src/omnichunk/
├── chunker.py
├── cli.py
├── quality.py
├── serialization.py
├── types.py
├── engine/
│   ├── router.py
│   ├── code_engine.py
│   ├── prose_engine.py
│   ├── markup_engine.py
│   └── hybrid_engine.py
├── parser/
│   ├── tree_sitter.py
│   ├── markdown_parser.py
│   ├── html_parser.py
│   └── languages.py
├── context/
│   ├── entities.py
│   ├── scope.py
│   ├── siblings.py
│   ├── imports.py
│   └── format.py
├── sizing/
│   ├── nws.py
│   ├── tokenizers.py
│   └── counter.py
└── windowing/
    ├── greedy.py
    ├── merge.py
    ├── split.py
    └── overlap.py

Determinism & integrity guarantees

omnichunk is built to preserve source fidelity:

Chunk boundaries are deterministic
Empty/whitespace-only chunks are dropped
Chunks are contiguous and non-overlapping in source order
Byte range integrity is validated in tests:

original_bytes = source.encode("utf-8")
for chunk in chunks:
    assert original_bytes[chunk.byte_range.start:chunk.byte_range.end].decode("utf-8") == chunk.text
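
Contiguity can be checked in the same spirit. The sketch below splits a string with a toy line-based splitter (hypothetical, standing in for real chunk byte ranges), then verifies that the ranges tile the source with no gaps or overlaps and that concatenating the slices reproduces the original bytes:

```python
from typing import List, Tuple


def line_byte_ranges(source: str) -> List[Tuple[int, int]]:
    """Toy splitter: one (start, end) byte range per line, end exclusive."""
    ranges = []
    offset = 0
    for line in source.splitlines(keepends=True):
        size = len(line.encode("utf-8"))
        ranges.append((offset, offset + size))
        offset += size
    return ranges


def is_contiguous(ranges: List[Tuple[int, int]], total: int) -> bool:
    """True if ranges start at 0, end at total, and each begins where the previous ended."""
    expected = 0
    for start, end in ranges:
        if start != expected or end <= start:
            return False
        expected = end
    return expected == total


source = "naïve = 1\nprint('héllo')\n"
data = source.encode("utf-8")
ranges = line_byte_ranges(source)
rebuilt = b"".join(data[s:e] for s, e in ranges)
print(is_contiguous(ranges, len(data)), rebuilt == data)
```

The non-ASCII characters matter: byte offsets and character offsets diverge as soon as a multi-byte character appears, which is why the ranges are computed on encoded bytes.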

Testing

Run the test suite:

pytest -q

Run benchmark scenarios:

python benchmarks/run_benchmarks.py
python benchmarks/run_comparisons.py
python benchmarks/run_quality_report.py
python benchmarks/run_large_corpus.py --mode mega-python --repeat 120
python benchmarks/run_hotspot_profile.py --mode mega-python --repeat 120 --limit 30

Run repository checks:

python scripts/check_ai_rules_sync.py
python scripts/check_benchmarks.py
python scripts/check_benchmarks.py --run-quality

The current suite covers:

API usage (chunk, chunk_file, Chunker)
Code/prose/markup/hybrid behavior
Context metadata (imports, siblings, scope, headings)
Sizing/tokenization/NWS logic
Overlap behavior
Edge cases (empty input, unicode, malformed syntax, range contiguity)

Contributing

Contribution and project process files:

CONTRIBUTING.md
CODE_OF_CONDUCT.md
SECURITY.md
GOVERNANCE.md
MAINTAINERS.md
ROADMAP.md
ARCHITECTURE.md

Install dev tooling and run pre-commit hooks:

pip install -e ".[dev]"
pre-commit install
pre-commit run --all-files

Notes

Tree-sitter grammars are resolved dynamically and cached per language.
If a parser is unavailable, the system degrades gracefully with fallback heuristics.
contextualized_text is optimized for embedding quality while preserving raw text separately.
