oguzhankir/omnichunk

Structure-aware text chunking library for code, prose, and markup files. Intelligently splits files into context-rich chunks while preserving semantic boundaries. Supports 15+ programming languages, deterministic output, and zero external dependencies. Perfect for RAG systems, code analysis, and LLM context optimization.

omnichunk

Chunk code, prose, and markup files with structure awareness.

omnichunk is a Python library that splits files into smaller pieces while keeping useful context:

  • Code: respects function/class boundaries, includes scope and import information
  • Markdown: respects headings and sections
  • JSON/YAML/TOML: splits by top-level keys/sections
  • HTML/XML: splits by elements
  • Mixed files: handles notebooks and Python files with long docstrings

Each chunk includes:

  • The original text slice
  • Byte and line ranges for lossless reconstruction
  • Context (scope, entities, headings, imports, siblings)
  • Optional contextualized_text for embeddings

The library is deterministic and works without external APIs.

Installation

pip install omnichunk

Optional extras:

pip install "omnichunk[tiktoken]"        # tiktoken tokenizer support
pip install "omnichunk[transformers]"    # HuggingFace tokenizer support
pip install "omnichunk[all-languages]"   # Extended language grammars
pip install "omnichunk[langchain]"       # LangChain Document export support
pip install "omnichunk[llamaindex]"      # LlamaIndex Document export support
pip install "omnichunk[profiling]"       # py-spy / line-profiler helpers
pip install "omnichunk[rust]"            # maturin tooling for Rust backend PoC
pip install "omnichunk[dev]"             # Development tools
pip install "omnichunk[pinecone]"        # Vector DB adapter extra (no client lib)
pip install "omnichunk[weaviate]"        # Vector DB adapter extra (no client lib)
pip install "omnichunk[supabase]"        # Vector DB adapter extra (no client lib)
pip install "omnichunk[vectordb]"        # Meta-group for all vector export extras (empty deps)
pip install "omnichunk[semantic]"        # Marker extra (semantic stack uses core numpy only)
pip install "omnichunk[graph]"           # Marker extra (GraphRAG uses existing chunk entities)

Examples

Runnable scripts and Jupyter notebooks live under examples/. They cover chunking, hierarchical trees, incremental diffs, token budgets, semantic boundaries, GraphRAG, vector export shapes, and the plugin API. See examples/README.md for how to run them.

CLI

omnichunk ./src --glob "**/*.py" --max-size 512 --size-unit chars --format jsonl > chunks.jsonl
omnichunk app.py --max-size 256 --size-unit chars --stats
omnichunk app.py --max-size 256 --size-unit chars --nws-backend python
omnichunk README.md --format csv --output chunks.csv

Quick start

One-shot API

from omnichunk import chunk

code = """
import os

def hello(name: str) -> str:
    return f"hello {name}"
"""

chunks = chunk("example.py", code, max_chunk_size=128, size_unit="chars")

for c in chunks:
    print(c.index, c.byte_range, c.context.breadcrumb)
    print(c.contextualized_text)

Reusable Chunker

from omnichunk import Chunker

chunker = Chunker(
    max_chunk_size=1024,
    min_chunk_size=80,
    tokenizer="cl100k_base",
    context_mode="full",
    overlap=0.1,
    overlap_lines=1,
)

chunks = chunker.chunk("api.py", source_code)

for c in chunker.stream("large.py", large_source):
    consume(c)

Async API

import asyncio
from omnichunk import Chunker

chunker = Chunker(max_chunk_size=1024, size_unit="tokens")

# Single file async
chunks = asyncio.run(chunker.achunk("api.py", source_code))

# Async streaming
async def process():
    async for chunk in chunker.astream("large.py", large_source):
        consume(chunk)

asyncio.run(process())

# Async batch (concurrent)
results = asyncio.run(chunker.abatch(
    [
        {"filepath": "a.py", "code": code_a},
        {"filepath": "b.ts", "code": code_b},
    ],
    concurrency=8,
))

# Synchronous batch over the same input shape
batch_results = chunker.batch(
    [
        {"filepath": "a.py", "code": code_a},
        {"filepath": "b.ts", "code": code_b},
        {"filepath": "README.md", "code": readme_md},
    ],
    concurrency=8,
)

directory_results = chunker.chunk_directory(
    "./src",
    glob="**/*.py",
    exclude=["**/tests/**"],
    concurrency=8,
)

all_chunks = [chunk for result in directory_results for chunk in result.chunks]

jsonl_payload = chunker.to_jsonl(all_chunks)
csv_payload = chunker.to_csv(all_chunks)

stats = chunker.chunk_stats(all_chunks, size_unit="chars")
quality = chunker.quality_scores(
    all_chunks,
    min_chunk_size=80,
    max_chunk_size=1024,
    size_unit="chars",
)

langchain_docs = chunker.to_langchain_docs(all_chunks)
llamaindex_docs = chunker.to_llamaindex_docs(all_chunks)

# Vector DB–ready rows (you compute embeddings elsewhere)
from omnichunk import chunks_to_pinecone_vectors, chunks_to_supabase_rows

emb = [[0.1, 0.2, 0.3] for _ in all_chunks]  # same length as chunks
pinecone_batch = chunks_to_pinecone_vectors(all_chunks, emb, namespace="my_ns")
weaviate_batch = chunker.to_weaviate_objects(all_chunks, emb, class_name="Doc")
supabase_rows = chunks_to_supabase_rows(all_chunks, emb)

Semantic chunking

Semantic boundaries are derived from embeddings produced by a function you supply (semantic_embed_fn). omnichunk never calls an external API.

import numpy as np
from omnichunk import Chunker

def embed(texts):
    # Replace with your actual embedding model
    return np.random.default_rng(0).standard_normal((len(texts), 384))

chunker = Chunker(max_chunk_size=512, size_unit="tokens")
essay = "Your prose here…"
chunks = chunker.semantic_chunk("essay.md", essay, embed_fn=embed)

For code and other non-prose content types, structural engines are used even if semantic=True.

Topic shift detection

from omnichunk.semantic import detect_topic_shifts, split_sentences

text = "Your document…"
sentences_with_offsets = split_sentences(text)
sentences = [s for s, _, _ in sentences_with_offsets]
shifts = detect_topic_shifts(sentences, window=5, threshold=0.4)
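
To give a feel for what a windowed topic-shift check can look like, here is a stdlib-only sketch. It is not omnichunk's algorithm: the bag-of-words "embedding", the `topic_shifts` helper, and the meaning given to `window` and `threshold` here are all assumptions made for illustration.

```python
import math
from collections import Counter


def bow(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def topic_shifts(sentences, window=2, threshold=0.1):
    """Return indices i where the window before i and the window from i diverge."""
    shifts = []
    for i in range(1, len(sentences)):
        left = sum((bow(s) for s in sentences[max(0, i - window):i]), Counter())
        right = sum((bow(s) for s in sentences[i:i + window]), Counter())
        if cosine(left, right) < threshold:
            shifts.append(i)
    return shifts


sentences = [
    "the parser reads the file",
    "the parser builds the tree",
    "bananas ripen in warm kitchens",
    "warm kitchens ripen bananas quickly",
]
shifts = topic_shifts(sentences)
```

With these four sentences the only boundary reported is index 2, where the vocabulary changes completely.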

GraphRAG: entity-chunk graph

from omnichunk import Chunker, build_chunk_graph

source = "class MyClass:\n    pass\n"
chunks = Chunker().chunk("repo.py", source)
graph = build_chunk_graph(chunks)
print(graph.entity_chunks("MyClass"))       # chunk indices containing MyClass
print(graph.chunk_neighbors(0))             # chunks sharing entities with chunk 0
data = graph.to_dict()                      # JSON-serializable
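
Conceptually, an entity-chunk graph behaves like an inverted index from entity names to chunk indices, where two chunks are neighbors when they share an entity. The stdlib-only sketch below illustrates that idea; the input shape (one set of entity names per chunk) and the helper names `build_index` and `neighbors` are assumptions, not omnichunk internals.

```python
from collections import defaultdict


def build_index(chunk_entities):
    """Map each entity name to the sorted chunk indices that mention it."""
    index = defaultdict(set)
    for i, names in enumerate(chunk_entities):
        for name in names:
            index[name].add(i)
    return {name: sorted(ids) for name, ids in index.items()}


def neighbors(index, chunk_index):
    """Chunks that share at least one entity with chunk_index."""
    linked = set()
    for ids in index.values():
        if chunk_index in ids:
            linked.update(ids)
    linked.discard(chunk_index)
    return sorted(linked)


# Hypothetical entity sets for four chunks of one file.
chunk_entities = [
    {"MyClass", "helper"},
    {"helper"},
    {"main", "MyClass"},
    {"unrelated"},
]
index = build_index(chunk_entities)
```

Here `index["MyClass"]` lists chunks 0 and 2, and chunk 3 has no neighbors because it shares no entity with the others.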

Hierarchical chunking (multi-level RAG)

from omnichunk import Chunker

chunker = Chunker(size_unit="chars")
source = "..."  # your file contents
tree = chunker.hierarchical_chunk(
    "service.py", source,
    levels=[64, 256, 1024],   # leaf → root
)

small_chunks = tree.leaves()   # embed and index these
large_chunks = tree.roots()    # pass these to LLM as context
parent = tree.parent(small_chunks[0])  # navigate up
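
The leaf-to-root relationship can be pictured as repeated packing: small spans are merged into larger ones until each level's size cap is reached. The sketch below shows that with plain character offsets and the standard library only. It uses fixed-width leaves and ignores syntax boundaries, which omnichunk respects, so treat `pack` and `build_levels` as hypothetical helpers rather than the library's method.

```python
def pack(spans, cap):
    """Greedily merge adjacent (start, end) spans into groups no longer than cap."""
    groups, current = [], []
    for span in spans:
        # Flush the current group if adding this span would exceed the cap.
        if current and span[1] - current[0][0] > cap:
            groups.append(current)
            current = []
        current.append(span)
    if current:
        groups.append(current)
    return groups


def build_levels(length, levels):
    """Return one list of (start, end) spans per level, smallest level first."""
    leaf = levels[0]
    spans = [(i, min(i + leaf, length)) for i in range(0, length, leaf)]
    out = [spans]
    for cap in levels[1:]:
        spans = [(group[0][0], group[-1][1]) for group in pack(spans, cap)]
        out.append(spans)
    return out


tiers = build_levels(length=600, levels=[64, 256, 1024])
```

For a 600-character input this yields ten leaves, three mid-level spans, and a single root span covering the whole input.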

Incremental / differential chunking

from omnichunk import Chunker

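chunker = Chunker(max_chunk_size=512, size_unit="chars")
old_source = "..."  # previous file contents
new_source = "..."  # updated file contents
old_chunks = chunker.chunk("api.py", old_source)  # or chunks loaded from your index
diff = chunker.chunk_diff(
    "api.py",
    new_source,
    previous_chunks=old_chunks,
)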
# diff.added        → upsert to vector DB
# diff.removed_ids  → delete from vector DB
# diff.unchanged    → skip re-embedding
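
One way to reason about the three buckets is as a set comparison over content-derived IDs. The sketch below illustrates that idea with the standard library only; the hashing scheme and the `chunk_id` and `diff_chunks` helpers are assumptions for illustration and say nothing about how chunk_diff is implemented.

```python
import hashlib


def chunk_id(filepath, text):
    """Hypothetical stable ID: hash of the file path and the chunk text."""
    digest = hashlib.sha256(f"{filepath}::{text}".encode("utf-8"))
    return digest.hexdigest()[:16]


def diff_chunks(filepath, old_texts, new_texts):
    """Classify new chunks as added or unchanged, and list IDs that disappeared."""
    old_ids = {chunk_id(filepath, t) for t in old_texts}
    new_by_id = {chunk_id(filepath, t): t for t in new_texts}
    added = [t for cid, t in new_by_id.items() if cid not in old_ids]
    unchanged = [t for cid, t in new_by_id.items() if cid in old_ids]
    removed_ids = sorted(old_ids - set(new_by_id))
    return added, removed_ids, unchanged


old = ["def a():\n    return 1\n", "def b():\n    return 2\n"]
new = ["def a():\n    return 1\n", "def b():\n    return 3\n"]
added, removed_ids, unchanged = diff_chunks("api.py", old, new)
```

Only the edited function lands in `added`, its previous version's ID lands in `removed_ids`, and the untouched function needs no re-embedding.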

Token budget optimizer

from omnichunk.budget import TokenBudgetOptimizer

optimizer = TokenBudgetOptimizer(budget=4096, strategy="greedy")
result = optimizer.select(retrieved_chunks, scores=relevance_scores)
# result.selected → pass to LLM
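
A greedy strategy for this kind of selection is often written as "sort by score, keep whatever still fits". The sketch below shows that pattern with the standard library only. The `(text, token_count)` tuples and the `greedy_select` helper are illustrative assumptions and do not describe TokenBudgetOptimizer's internals.

```python
def greedy_select(items, scores, budget):
    """Pick items by descending score while the running token total fits.

    items  : list of (text, token_count) tuples (hypothetical shape)
    scores : one relevance score per item, same length as items
    budget : maximum total token_count of the selection
    """
    if len(items) != len(scores):
        raise ValueError("items and scores must have the same length")
    order = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    selected, used = [], 0
    for i in order:
        tokens = items[i][1]
        if used + tokens <= budget:
            selected.append(i)
            used += tokens
    # Return indices in original order so the assembled context reads naturally.
    return sorted(selected), used


items = [("intro", 300), ("core logic", 700), ("helpers", 500), ("tests", 400)]
scores = [0.2, 0.9, 0.6, 0.5]
picked, used = greedy_select(items, scores, budget=1200)
```

Greedy selection is simple and fast but not always optimal: a high-scoring large item can crowd out several smaller items whose combined value is higher.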

Vector database export (serialization)

Adapters produce plain dicts/lists only—no Pinecone, Weaviate, or Supabase client is installed by these extras. You compute embeddings yourself and pass parallel lists:

  • chunks_to_pinecone_vectors / Chunker.to_pinecone_vectors: id, values, metadata (+ optional namespace per row)
  • chunks_to_weaviate_objects / Chunker.to_weaviate_objects: class, vector, properties
  • chunks_to_supabase_rows / Chunker.to_supabase_rows: content, embedding, plus flat metadata columns
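
To make the "parallel lists" contract concrete, here is a stdlib-only sketch that zips chunk-like dicts with embeddings into Pinecone-style records. Only the field names id, values, metadata, and namespace come from the list above; the ID format, the metadata keys, the dict-shaped chunks, and the `to_records` helper are assumptions.

```python
def to_records(chunks, embeddings, namespace=None):
    """Pair each chunk with its embedding and emit plain, JSON-friendly dicts."""
    if len(chunks) != len(embeddings):
        raise ValueError("chunks and embeddings must be parallel lists")
    records = []
    for chunk, vector in zip(chunks, embeddings):
        record = {
            "id": f"{chunk['filepath']}:{chunk['index']}",  # hypothetical ID scheme
            "values": [float(x) for x in vector],
            "metadata": {"filepath": chunk["filepath"], "text": chunk["text"]},
        }
        if namespace is not None:
            record["namespace"] = namespace
        records.append(record)
    return records


# Hypothetical chunk dicts; real Chunk objects carry more context.
chunks = [
    {"filepath": "a.py", "index": 0, "text": "import os\n"},
    {"filepath": "a.py", "index": 1, "text": "def hello(): ...\n"},
]
embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
records = to_records(chunks, embeddings, namespace="my_ns")
```

Because the output is plain dicts and lists, it can be serialized with `json.dumps` or handed to whichever client library you install separately.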

Plugin API

Register custom parsers or formatters at import time (no edits to omnichunk core):

from omnichunk import register_parser, register_formatter, Chunker

def my_parse(filepath: str, content: str):
    # Return a tree-sitter-like tree, or None to use the built-in parser.
    return None

register_parser("python", my_parse, overwrite=True)

def my_fmt(chunks):
    return str(len(chunks))

register_formatter("count", my_fmt)
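
For readers unfamiliar with the pattern, a name-keyed registry with an overwrite flag can be sketched in a few lines of standard-library Python. This is a generic illustration, not omnichunk's registry: the `_FORMATTERS` dict, the `register` and `render` helpers, and the choice to raise on duplicate names are all assumptions.

```python
_FORMATTERS = {}


def register(name, fn, overwrite=False):
    """Store fn under name; refuse silent replacement unless overwrite=True."""
    if name in _FORMATTERS and not overwrite:
        raise ValueError(f"formatter {name!r} is already registered")
    _FORMATTERS[name] = fn


def render(name, chunks):
    """Look up a formatter by name and apply it to a list of chunks."""
    try:
        fn = _FORMATTERS[name]
    except KeyError:
        raise KeyError(f"unknown formatter {name!r}") from None
    return fn(chunks)


register("count", lambda chunks: str(len(chunks)))
output = render("count", ["a", "b", "c"])
```

Requiring an explicit `overwrite=True` makes accidental name collisions between plugins visible at import time instead of silently changing behavior.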

File API

from omnichunk import chunk_file

chunks = chunk_file("path/to/file.py")

Directory API

from omnichunk import chunk_directory

results = chunk_directory("./src", glob="**/*.py", max_chunk_size=512, size_unit="chars")

for result in results:
    if result.error:
        print("error", result.filepath, result.error)
    else:
        print(result.filepath, len(result.chunks))

Chunk model

Every Chunk includes raw content, exact offsets, and rich context:

  • text: exact source slice (lossless reconstruction)
  • contextualized_text: embedding-ready representation
  • byte_range, line_range
  • context: scope, entities, siblings, imports, headings, section metadata
  • token_count, char_count, nws_count

Supported content

Code

  • Python
  • JavaScript / TypeScript
  • Rust
  • Go
  • Java
  • C / C++ / C#
  • Ruby / PHP / Kotlin / Swift (grammar-dependent)

Prose

  • Markdown
  • Plaintext

Markdown fenced blocks are delegated by language:

  • fenced code (python, ts, etc.) routes to CodeEngine
  • fenced markup (json, yaml, toml, html, xml) routes to MarkupEngine
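
The delegation rule above amounts to reading each fence's language tag and looking it up in a routing table. A stdlib-only sketch follows. The engine names come from this README, but the regular expression, the partial language tables, and the `route` and `fenced_blocks` helpers are assumptions; untagged or unknown fences return None here, which is also a choice made only for this sketch.

```python
import re

TICKS = "`" * 3  # built at runtime so this sketch can itself live inside a fence
FENCE = re.compile(
    rf"^{TICKS}(\w*)[^\n]*\n(.*?)^{TICKS}[ \t]*$", re.DOTALL | re.MULTILINE
)

# Partial, assumed language tables; the real routing table may differ.
CODE_LANGS = {"python", "py", "ts", "typescript", "js", "javascript"}
MARKUP_LANGS = {"json", "yaml", "toml", "html", "xml"}


def route(lang):
    """Name the engine for a fence language, or None to leave the block as prose."""
    lang = lang.lower()
    if lang in MARKUP_LANGS:
        return "MarkupEngine"
    if lang in CODE_LANGS:
        return "CodeEngine"
    return None


def fenced_blocks(markdown):
    """Yield (language, engine, body) for every fenced block in the document."""
    for match in FENCE.finditer(markdown):
        lang, body = match.group(1), match.group(2)
        yield lang, route(lang), body


markdown = "\n".join([
    "# Title", "",
    TICKS + "python", "print('hi')", TICKS, "",
    TICKS + "json", '{"a": 1}', TICKS, "",
    TICKS, "plain text", TICKS, "",
])
routed = [(lang, engine) for lang, engine, _ in fenced_blocks(markdown)]
```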

Markup

  • JSON
  • YAML
  • TOML
  • HTML / XML

Hybrid

  • Python with heavy docstrings
  • Notebook-style # %% cell files

Architecture

src/omnichunk/
├── chunker.py
├── cli.py
├── quality.py
├── serialization.py
├── types.py
├── engine/
│   ├── router.py
│   ├── code_engine.py
│   ├── prose_engine.py
│   ├── markup_engine.py
│   └── hybrid_engine.py
├── parser/
│   ├── tree_sitter.py
│   ├── markdown_parser.py
│   ├── html_parser.py
│   └── languages.py
├── context/
│   ├── entities.py
│   ├── scope.py
│   ├── siblings.py
│   ├── imports.py
│   └── format.py
├── sizing/
│   ├── nws.py
│   ├── tokenizers.py
│   └── counter.py
└── windowing/
    ├── greedy.py
    ├── merge.py
    ├── split.py
    └── overlap.py

Determinism & integrity guarantees

omnichunk is built to preserve source fidelity:

  • Chunk boundaries are deterministic
  • Empty/whitespace-only chunks are dropped
  • Chunks are contiguous and non-overlapping in source order
  • Byte range integrity is validated in tests:
original_bytes = source.encode("utf-8")
for chunk in chunks:
    assert original_bytes[chunk.byte_range.start:chunk.byte_range.end].decode("utf-8") == chunk.text
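
The difference between byte offsets and character offsets only shows up once the source contains non-ASCII text. The stdlib-only sketch below runs the same round-trip check on such input, using a hypothetical `Span` stand-in and a naive line splitter rather than omnichunk's own types and engines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a chunk: byte offsets plus the text they cover."""
    start: int
    end: int
    text: str


def spans_from_lines(source):
    """Split on line boundaries and record UTF-8 byte offsets for each piece."""
    spans, offset = [], 0
    for line in source.splitlines(keepends=True):
        size = len(line.encode("utf-8"))
        spans.append(Span(offset, offset + size, line))
        offset += size
    return spans


def verify(source, spans):
    """True if every span round-trips and the spans tile the source without gaps."""
    raw = source.encode("utf-8")
    cursor = 0
    for span in spans:
        if span.start != cursor:
            return False
        if raw[span.start:span.end].decode("utf-8") != span.text:
            return False
        cursor = span.end
    return cursor == len(raw)


source = "naïve = 'café'\nprint(naïve)\n"
spans = spans_from_lines(source)
ok = verify(source, spans)
```

The first line is 15 characters but 17 bytes, so slicing the encoded source with character offsets would cut a multi-byte character and fail the check.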

Testing

Run the test suite:

pytest -q

Run benchmark scenarios:

python benchmarks/run_benchmarks.py
python benchmarks/run_comparisons.py
python benchmarks/run_quality_report.py
python benchmarks/run_large_corpus.py --mode mega-python --repeat 120
python benchmarks/run_hotspot_profile.py --mode mega-python --repeat 120 --limit 30

Run repository checks:

python scripts/check_ai_rules_sync.py
python scripts/check_benchmarks.py
python scripts/check_benchmarks.py --run-quality

Current suite covers:

  • API usage (chunk, chunk_file, Chunker)
  • Code/prose/markup/hybrid behavior
  • Context metadata (imports, siblings, scope, headings)
  • Sizing/tokenization/NWS logic
  • Overlap behavior
  • Edge cases (empty input, unicode, malformed syntax, range contiguity)

Contributing

Contribution and project process files:

  • CONTRIBUTING.md
  • CODE_OF_CONDUCT.md
  • SECURITY.md
  • GOVERNANCE.md
  • MAINTAINERS.md
  • ROADMAP.md
  • ARCHITECTURE.md

Install dev tooling and run pre-commit hooks:

pip install -e ".[dev]"
pre-commit install
pre-commit run --all-files

Notes

  • Tree-sitter grammars are resolved dynamically and cached per language.
  • If a parser is unavailable, the system degrades gracefully with fallback heuristics.
  • contextualized_text is optimized for embedding quality while preserving raw text separately.