oguzhankir/omnichunk
Structure-aware text chunking library for code, prose, and markup files. Intelligently splits files into context-rich chunks while preserving semantic boundaries. Supports 15+ programming languages, deterministic output, and zero external dependencies. Perfect for RAG systems, code analysis, and LLM context optimization.
Chunk code, prose, and markup files with structure awareness.
omnichunk is a Python library that splits files into smaller pieces while keeping useful context:
- Code: respects function/class boundaries, includes scope and import information
- Markdown: respects headings and sections
- JSON/YAML/TOML: splits by top-level keys/sections
- HTML/XML: splits by elements
- Mixed files: handles notebooks and Python files with long docstrings
Each chunk includes:
- The original text slice
- Byte and line ranges for lossless reconstruction
- Context (scope, entities, headings, imports, siblings)
- Optional contextualized_text for embeddings
The library is deterministic and works without external APIs.
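As a rough illustration of what "byte ranges for lossless reconstruction" means, here is a minimal sketch that does not use omnichunk at all. `ByteSpan` and `split_fixed` are hypothetical names invented for this example, and the fixed-size splitter is a naive stand-in for structure-aware chunking; only the idea of slicing the original UTF-8 bytes by half-open ranges is taken from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ByteSpan:
    """Hypothetical stand-in for a chunk's byte range (half-open: start <= i < end)."""
    start: int
    end: int


def split_fixed(source: str, size: int) -> list[tuple[ByteSpan, str]]:
    """Naive splitter for illustration only: cut every `size` characters."""
    pieces = []
    offset = 0
    for i in range(0, len(source), size):
        text = source[i:i + size]
        n_bytes = len(text.encode("utf-8"))  # byte length differs from char length
        pieces.append((ByteSpan(offset, offset + n_bytes), text))
        offset += n_bytes
    return pieces


source = "def greet():\n    return 'héllo wörld'\n"
pieces = split_fixed(source, 10)
raw = source.encode("utf-8")

# Each slice of the original bytes decodes back to exactly the chunk text.
for span, text in pieces:
    assert raw[span.start:span.end].decode("utf-8") == text

# Contiguous, non-overlapping spans concatenate back to the full source.
rebuilt = b"".join(raw[s.start:s.end] for s, _ in pieces).decode("utf-8")
```

Ranges are tracked in bytes rather than characters, so non-ASCII text (such as the accented letters here) still round-trips exactly.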
Installation
pip install omnichunk
Optional extras:
pip install omnichunk[tiktoken] # tiktoken tokenizer support
pip install omnichunk[transformers] # HuggingFace tokenizer support
pip install omnichunk[all-languages] # Extended language grammars
pip install omnichunk[langchain] # LangChain Document export support
pip install omnichunk[llamaindex] # LlamaIndex Document export support
pip install omnichunk[profiling] # py-spy / line-profiler helpers
pip install omnichunk[rust] # maturin tooling for Rust backend PoC
pip install omnichunk[dev] # Development tools
pip install omnichunk[pinecone] # Vector DB adapter extra (no client lib)
pip install omnichunk[weaviate] # Vector DB adapter extra (no client lib)
pip install omnichunk[supabase] # Vector DB adapter extra (no client lib)
pip install omnichunk[vectordb] # Meta-group for all vector export extras (empty deps)
pip install omnichunk[semantic] # Marker extra (semantic stack uses core numpy only)
pip install omnichunk[graph] # Marker extra (GraphRAG uses existing chunk entities)
Examples
Runnable scripts and Jupyter notebooks live under examples/. They cover chunking, hierarchical trees, incremental diffs, token budgets, semantic boundaries, GraphRAG, vector export shapes, and the plugin API. See examples/README.md for how to run them.
CLI
omnichunk ./src --glob "**/*.py" --max-size 512 --size-unit chars --format jsonl > chunks.jsonl
omnichunk app.py --max-size 256 --size-unit chars --stats
omnichunk app.py --max-size 256 --size-unit chars --nws-backend python
omnichunk README.md --format csv --output chunks.csv
Quick start
One-shot API
from omnichunk import chunk
code = """
import os
def hello(name: str) -> str:
    return f"hello {name}"
"""
chunks = chunk("example.py", code, max_chunk_size=128, size_unit="chars")
for c in chunks:
    print(c.index, c.byte_range, c.context.breadcrumb)
    print(c.contextualized_text)
Reusable Chunker
from omnichunk import Chunker
chunker = Chunker(
    max_chunk_size=1024,
    min_chunk_size=80,
    tokenizer="cl100k_base",
    context_mode="full",
    overlap=0.1,
    overlap_lines=1,
)
chunks = chunker.chunk("api.py", source_code)
for c in chunker.stream("large.py", large_source):
    consume(c)
Async API
import asyncio
from omnichunk import Chunker
chunker = Chunker(max_chunk_size=1024, size_unit="tokens")
# Single file async
chunks = asyncio.run(chunker.achunk("api.py", source_code))
# Async streaming
async def process():
    async for chunk in chunker.astream("large.py", large_source):
        consume(chunk)
# Async batch (concurrent)
results = asyncio.run(chunker.abatch(
    [
        {"filepath": "a.py", "code": code_a},
        {"filepath": "b.ts", "code": code_b},
    ],
    concurrency=8,
))

batch_results = chunker.batch(
    [
        {"filepath": "a.py", "code": code_a},
        {"filepath": "b.ts", "code": code_b},
        {"filepath": "README.md", "code": readme_md},
    ],
    concurrency=8,
)
directory_results = chunker.chunk_directory(
    "./src",
    glob="**/*.py",
    exclude=["**/tests/**"],
    concurrency=8,
)
all_chunks = [chunk for result in directory_results for chunk in result.chunks]
jsonl_payload = chunker.to_jsonl(all_chunks)
csv_payload = chunker.to_csv(all_chunks)
stats = chunker.chunk_stats(all_chunks, size_unit="chars")
quality = chunker.quality_scores(
    all_chunks,
    min_chunk_size=80,
    max_chunk_size=1024,
    size_unit="chars",
)
langchain_docs = chunker.to_langchain_docs(all_chunks)
llamaindex_docs = chunker.to_llamaindex_docs(all_chunks)
# Vector DB–ready rows (you compute embeddings elsewhere)
from omnichunk import chunks_to_pinecone_vectors, chunks_to_supabase_rows
emb = [[0.1, 0.2, 0.3] for _ in all_chunks] # same length as chunks
pinecone_batch = chunks_to_pinecone_vectors(all_chunks, emb, namespace="my_ns")
weaviate_batch = chunker.to_weaviate_objects(all_chunks, emb, class_name="Doc")
supabase_rows = chunks_to_supabase_rows(all_chunks, emb)
Semantic chunking
Embeddings for semantic boundary detection come from a user-supplied function (semantic_embed_fn). omnichunk never calls an external API.
import numpy as np
from omnichunk import Chunker
def embed(texts):
    # Replace with your actual embedding model
    return np.random.default_rng(0).standard_normal((len(texts), 384))
chunker = Chunker(max_chunk_size=512, size_unit="tokens")
essay = "Your prose here…"
chunks = chunker.semantic_chunk("essay.md", essay, embed_fn=embed)
For code and other non-prose content types, structural engines are used even if semantic=True.
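To make the role of a user-supplied embedding function more concrete, the sketch below shows one common way embeddings can be turned into boundaries: start a new segment wherever the cosine similarity between consecutive sentences drops below a threshold. This is a generic technique, not a description of omnichunk's internal algorithm. `toy_embed`, `find_boundaries`, and the threshold value are assumptions made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_embed(texts):
    """Hypothetical embedding: keyword counts. Replace with a real model."""
    vocab = ["cat", "dog", "tax", "income"]
    return [[float(t.lower().count(w)) for w in vocab] for t in texts]


def find_boundaries(sentences, embed_fn, threshold=0.5):
    """Return each index i where a new segment would start at sentences[i]."""
    vectors = embed_fn(sentences)
    return [
        i
        for i in range(1, len(sentences))
        if cosine(vectors[i - 1], vectors[i]) < threshold
    ]


sentences = [
    "The cat chased the dog.",
    "The dog ignored the cat.",
    "Income tax is due in April.",
    "Tax rates depend on income.",
]
boundaries = find_boundaries(sentences, toy_embed)
print(boundaries)
```

Because the embedding function is passed in as a plain callable, swapping the toy version for a local model requires no other changes.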
Topic shift detection
from omnichunk.semantic import detect_topic_shifts, split_sentences
text = "Your document…"
sentences_with_offsets = split_sentences(text)
sentences = [s for s, _, _ in sentences_with_offsets]
shifts = detect_topic_shifts(sentences, window=5, threshold=0.4)
GraphRAG: entity-chunk graph
from omnichunk import Chunker, build_chunk_graph
source = "class MyClass:\n pass\n"
chunks = Chunker().chunk("repo.py", source)
graph = build_chunk_graph(chunks)
print(graph.entity_chunks("MyClass")) # chunk indices containing MyClass
print(graph.chunk_neighbors(0)) # chunks sharing entities with chunk 0
data = graph.to_dict()  # JSON-serializable
Hierarchical chunking (multi-level RAG)
from omnichunk import Chunker
chunker = Chunker(size_unit="chars")
source = "..." # your file contents
tree = chunker.hierarchical_chunk(
    "service.py", source,
    levels=[64, 256, 1024],  # leaf → root
)
small_chunks = tree.leaves()  # embed and index these
large_chunks = tree.roots()  # pass these to LLM as context
parent = tree.parent(small_chunks[0])  # navigate up
Incremental / differential chunking
from omnichunk import Chunker
chunker = Chunker(max_chunk_size=512, size_unit="chars")
new_source = "..." # updated file contents
diff = chunker.chunk_diff(
    "api.py",
    new_source,
    previous_chunks=old_chunks,
)
# diff.added → upsert to vector DB
# diff.removed_ids → delete from vector DB
# diff.unchanged → skip re-embedding
Token budget optimizer
from omnichunk.budget import TokenBudgetOptimizer
optimizer = TokenBudgetOptimizer(budget=4096, strategy="greedy")
result = optimizer.select(retrieved_chunks, scores=relevance_scores)
# result.selected → pass to LLM
Vector database export (serialization)
Adapters produce plain dicts/lists only—no Pinecone, Weaviate, or Supabase client is installed by these extras. You compute embeddings yourself and pass parallel lists:
- chunks_to_pinecone_vectors / Chunker.to_pinecone_vectors: id, values, metadata (+ optional namespace per row)
- chunks_to_weaviate_objects / Chunker.to_weaviate_objects: class, vector, properties
- chunks_to_supabase_rows / Chunker.to_supabase_rows: content, embedding, plus flat metadata columns
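The row shapes listed above can be pictured with a small standalone sketch. It does not call omnichunk. The `*_like` helpers, the id scheme, and the specific metadata fields (`filepath`, `chunk_index`, `text`) are assumptions for illustration, so the real adapters may emit different metadata; only the top-level keys come from the list above.

```python
def _check_parallel(chunks, embeddings):
    """Chunks and embeddings are parallel lists, so lengths must match."""
    if len(chunks) != len(embeddings):
        raise ValueError("chunks and embeddings must have the same length")


def to_pinecone_like(chunks, embeddings, namespace=None):
    _check_parallel(chunks, embeddings)
    rows = []
    for chunk, vector in zip(chunks, embeddings):
        row = {
            "id": f"{chunk['filepath']}:{chunk['index']}",  # hypothetical id scheme
            "values": list(vector),
            "metadata": {"filepath": chunk["filepath"], "text": chunk["text"]},
        }
        if namespace is not None:
            row["namespace"] = namespace
        rows.append(row)
    return rows


def to_weaviate_like(chunks, embeddings, class_name):
    _check_parallel(chunks, embeddings)
    return [
        {
            "class": class_name,
            "vector": list(vector),
            "properties": {"filepath": chunk["filepath"], "text": chunk["text"]},
        }
        for chunk, vector in zip(chunks, embeddings)
    ]


def to_supabase_like(chunks, embeddings):
    _check_parallel(chunks, embeddings)
    return [
        {
            "content": chunk["text"],
            "embedding": list(vector),
            "filepath": chunk["filepath"],   # flat metadata columns
            "chunk_index": chunk["index"],
        }
        for chunk, vector in zip(chunks, embeddings)
    ]


# Plain dicts stand in for chunk objects; embeddings are computed elsewhere.
chunks = [
    {"filepath": "a.py", "index": 0, "text": "import os"},
    {"filepath": "a.py", "index": 1, "text": "def f(): ..."},
]
embeddings = [[0.1, 0.2], [0.3, 0.4]]

pinecone_rows = to_pinecone_like(chunks, embeddings, namespace="my_ns")
weaviate_rows = to_weaviate_like(chunks, embeddings, class_name="Doc")
supabase_rows = to_supabase_like(chunks, embeddings)
```

Since the output is plain dicts and lists, it can be passed to whichever client library you already use, or serialized to JSON for a batch job.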
Plugin API
Register custom parsers or formatters at import time (no edits to omnichunk core):
from omnichunk import register_parser, register_formatter, Chunker
def my_parse(filepath: str, content: str):
    # Return a tree-sitter-like tree, or None to use the built-in parser.
    return None
register_parser("python", my_parse, overwrite=True)
def my_fmt(chunks):
    return str(len(chunks))
register_formatter("count", my_fmt)
File API
from omnichunk import chunk_file
chunks = chunk_file("path/to/file.py")
Directory API
from omnichunk import chunk_directory
results = chunk_directory("./src", glob="**/*.py", max_chunk_size=512, size_unit="chars")
for result in results:
    if result.error:
        print("error", result.filepath, result.error)
    else:
        print(result.filepath, len(result.chunks))
Chunk model
Every Chunk includes raw content, exact offsets, and rich context:
- text: exact source slice (lossless reconstruction)
- contextualized_text: embedding-ready representation
- byte_range, line_range
- context: scope, entities, siblings, imports, headings, section metadata
- token_count, char_count, nws_count
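The size fields can be illustrated with a short sketch. It assumes that nws_count means "non-whitespace character count", which this README does not spell out, and `size_counts` is a hypothetical helper, not part of omnichunk. token_count is left out because it depends on the tokenizer in use.

```python
def size_counts(text: str) -> dict:
    """Character count and (assumed) non-whitespace character count."""
    return {
        "char_count": len(text),
        "nws_count": sum(1 for ch in text if not ch.isspace()),
    }


indented = "def f():\n    return 1\n"
compact = "def f():\n  return 1\n"

counts = size_counts(indented)
print(counts)
print(size_counts(compact))
```

Under this reading, a non-whitespace count gives the same size for the same code regardless of indentation width, whereas a raw character count does not.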
Supported content
Code
- Python
- JavaScript / TypeScript
- Rust
- Go
- Java
- C / C++ / C#
- Ruby / PHP / Kotlin / Swift (grammar-dependent)
Prose
- Markdown
- Plaintext
Markdown fenced blocks are delegated by language:
- fenced code (python, ts, etc.) routes to CodeEngine
- fenced markup (json, yaml, toml, html, xml) routes to MarkupEngine
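The delegation rule above amounts to a lookup on the fence's language tag. Below is a minimal sketch of that idea. `route_fence` is a hypothetical function, the code-language set contains only the two tags named above (the real set is presumably larger), and returning None for unknown or missing tags is an assumption rather than documented behavior.

```python
from typing import Optional

# Only the tags named in this README; the real sets may differ.
CODE_LANGS = {"python", "ts"}
MARKUP_LANGS = {"json", "yaml", "toml", "html", "xml"}


def route_fence(info_string: str) -> Optional[str]:
    """Map a fenced block's info string (e.g. "python") to an engine name."""
    parts = info_string.strip().split()
    if not parts:
        return None  # untagged fence: handling is not specified here
    lang = parts[0].lower()
    if lang in MARKUP_LANGS:
        return "MarkupEngine"
    if lang in CODE_LANGS:
        return "CodeEngine"
    return None  # unknown tag: handling is not specified here


for tag in ["python", "YAML", "ts", ""]:
    print(repr(tag), "->", route_fence(tag))
```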
Markup
- JSON
- YAML
- TOML
- HTML / XML
Hybrid
- Python with heavy docstrings
- Notebook-style # %% cell files
Architecture
src/omnichunk/
├── chunker.py
├── cli.py
├── quality.py
├── serialization.py
├── types.py
├── engine/
│ ├── router.py
│ ├── code_engine.py
│ ├── prose_engine.py
│ ├── markup_engine.py
│ └── hybrid_engine.py
├── parser/
│ ├── tree_sitter.py
│ ├── markdown_parser.py
│ ├── html_parser.py
│ └── languages.py
├── context/
│ ├── entities.py
│ ├── scope.py
│ ├── siblings.py
│ ├── imports.py
│ └── format.py
├── sizing/
│ ├── nws.py
│ ├── tokenizers.py
│ └── counter.py
└── windowing/
  ├── greedy.py
  ├── merge.py
  ├── split.py
  └── overlap.py
Determinism & integrity guarantees
omnichunk is built to preserve source fidelity:
- Chunk boundaries are deterministic
- Empty/whitespace-only chunks are dropped
- Chunks are contiguous and non-overlapping in source order
- Byte range integrity is validated in tests:
original_bytes = source.encode("utf-8")
for chunk in chunks:
    assert original_bytes[chunk.byte_range.start:chunk.byte_range.end].decode("utf-8") == chunk.text
Testing
Run the test suite:
pytest -q
Run benchmark scenarios:
python benchmarks/run_benchmarks.py
python benchmarks/run_comparisons.py
python benchmarks/run_quality_report.py
python benchmarks/run_large_corpus.py --mode mega-python --repeat 120
python benchmarks/run_hotspot_profile.py --mode mega-python --repeat 120 --limit 30
Run repository checks:
python scripts/check_ai_rules_sync.py
python scripts/check_benchmarks.py
python scripts/check_benchmarks.py --run-quality
Current suite covers:
- API usage (chunk, chunk_file, Chunker)
- Code/prose/markup/hybrid behavior
- Context metadata (imports, siblings, scope, headings)
- Sizing/tokenization/NWS logic
- Overlap behavior
- Edge cases (empty input, unicode, malformed syntax, range contiguity)
Contributing
Contribution and project process files:
- CONTRIBUTING.md
- CODE_OF_CONDUCT.md
- SECURITY.md
- GOVERNANCE.md
- MAINTAINERS.md
- ROADMAP.md
- ARCHITECTURE.md
Install dev tooling and run pre-commit hooks:
pip install -e .[dev]
pre-commit install
pre-commit run --all-files
Notes
- Tree-sitter grammars are resolved dynamically and cached per language.
- If a parser is unavailable, the system degrades gracefully with fallback heuristics.
- contextualized_text is optimized for embedding quality while preserving raw text separately.
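The graceful-degradation note above can be pictured as a try-then-fall-back pattern. The sketch below is a generic illustration, not omnichunk's actual fallback logic: `split_with_fallback`, the blank-line heuristic, and the `broken_parser` stand-in are all assumptions invented for the example.

```python
def split_with_fallback(text, parser=None):
    """Use `parser` when it yields units; otherwise split on blank lines."""
    if parser is not None:
        try:
            units = parser(text)
            if units:
                return units
        except Exception:
            pass  # parser unavailable or failed: fall through to the heuristic
    return [block for block in text.split("\n\n") if block.strip()]


def broken_parser(text):
    """Stand-in for a missing grammar."""
    raise RuntimeError("grammar not installed")


text = "import os\n\ndef f():\n    return 1\n\nprint(f())"
units = split_with_fallback(text, parser=broken_parser)
print(units)
```

The caller gets usable output either way, although the fallback units follow blank lines rather than syntax.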
