MSaifAsif/py-cqengine

Python CQEngine

PyCQEngine

High-performance in-memory NoSQL indexing engine for Python object collections, powered by Rust.

The project is in active development and is provided as-is.

Performance: Sub-microsecond point lookups, and up to 100x+ faster than list comprehensions for selective queries on 1,000,000 objects (see the Performance section for per-query figures).

Features

  • 🚀 Blazing Fast: Rust-backed hash & BTree indexing with sub-1 μs point lookups
  • 🔒 Thread-Safe: Concurrent indexing using DashMap (sharded locks) + parking_lot
  • 💡 Simple API: Intuitive query DSL: eq, and_, or_, in_, gt, lt, between
  • ⚡ Fused Materialization: Query + object retrieval in a single Rust→Python call
  • 🌲 Range Queries: BTree indexes for gt / gte / lt / lte / between
  • 🔄 Parallel Execution: Rayon-powered parallel index operations with GIL release
  • 📦 Batch Ingestion: add_many() for efficient bulk loading (~330K obj/s)
  • 🗑️ Memory Lifecycle: remove(), remove_many(), clear(), __del__ support
  • 🎯 Zero-Cost Counting: count() and first(n) without materializing objects
  • 💾 LRU Query Cache: Automatic caching of repeated queries (1,000 entries)
  • 🔗 Weak References: Opt-in use_weakrefs=True mode (objects auto-cleaned when Python GC'd)

Architecture

┌─────────────────────────────────────────┐
│         Python Application              │
│  (User Code + Query DSL)                │
└──────────────┬──────────────────────────┘
               │ PyO3 FFI Boundary
┌──────────────▼──────────────────────────┐
│         Rust Core Engine                │
│  • CollectionManager (Object Registry)  │
│  • HashIndex  (DashMap, O(1) eq)        │
│  • BTreeIndex (BTreeMap, range scans)   │
│  • Fused query_*_objects() methods      │
│  • Rayon parallel intersection/union    │
│  • LRU query cache (parking_lot Mutex)  │
│  • GIL Release (True parallelism)       │
└─────────────────────────────────────────┘

Key Design Principles:

  1. Attribute Extraction: Lambda extractors run once during add(), bypassing Python's tp_getattro overhead during queries
  2. Fused Materialization: Queries execute + materialize objects in a single FFI call, eliminating the IDs→Python→Rust roundtrip
  3. GIL Release: Index operations release the GIL for true multi-core parallelism
  4. Static Dispatch: IndexKind enum avoids vtable overhead for hot-path lookups
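
The first two principles can be illustrated with a pure-Python stand-in. This is only a sketch of the idea, not PyCQEngine's Rust implementation: `TinyIndexedCollection` and its method names are hypothetical and chosen for illustration. It shows extractors running once at insert time, so that an equality query becomes a dictionary lookup and an AND query becomes a set intersection, with objects materialized only at the end.

```python
from collections import defaultdict


class TinyIndexedCollection:
    """Illustrative stand-in, not the real engine: one hash index per attribute."""

    def __init__(self):
        self._objects = {}      # object id -> object (the "registry")
        self._extractors = {}   # attribute name -> extractor callable
        self._indexes = {}      # attribute name -> {value -> set of object ids}

    def add_index(self, name, extractor):
        self._extractors[name] = extractor
        self._indexes[name] = defaultdict(set)

    def add(self, obj):
        oid = id(obj)
        self._objects[oid] = obj
        # Extractors run once here, at insert time, never during a query.
        for name, extractor in self._extractors.items():
            self._indexes[name][extractor(obj)].add(oid)

    def ids_eq(self, name, value):
        # Equality lookup is a single dict access; no attribute reads needed.
        return self._indexes[name].get(value, set())

    def retrieve_and(self, *conditions):
        # AND is a set intersection over the per-attribute id sets.
        id_sets = [self.ids_eq(name, value) for name, value in conditions]
        matched = set.intersection(*id_sets) if id_sets else set()
        return [self._objects[oid] for oid in matched]


class Car:
    def __init__(self, vin, brand, price):
        self.vin, self.brand, self.price = vin, brand, price


cars = TinyIndexedCollection()
cars.add_index("brand", lambda c: c.brand)
cars.add_index("price", lambda c: c.price)
for car in (Car(1, "Tesla", 50000), Car(2, "Ford", 30000), Car(3, "Tesla", 60000)):
    cars.add(car)

teslas_at_60k = cars.retrieve_and(("brand", "Tesla"), ("price", 60000))
print([c.vin for c in teslas_at_60k])  # [3]
```

The stand-in must be told about an index before objects are added; whether the real engine has the same ordering requirement is not stated here.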

Installation

Prerequisites

  • Python 3.11+
  • Rust 1.70+ (install via rustup)

From Source

# Clone the repository
git clone https://github.com/MSaifAsif/py-cqengine.git
cd py-cqengine

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install maturin
pip install maturin

# Build and install
maturin develop --release

Quick Start

from pycqengine import IndexedCollection, Attribute, eq, and_, gt, between

class Car:
    def __init__(self, vin, brand, price):
        self.vin = vin
        self.brand = brand
        self.price = price

# Step 1: Define Attributes (lambda extractors)
VIN = Attribute("vin", lambda c: c.vin)
BRAND = Attribute("brand", lambda c: c.brand)
PRICE = Attribute("price", lambda c: c.price)

# Step 2: Setup Collection
cars = IndexedCollection()
cars.add_index(VIN)                          # Hash index (default)
cars.add_index(BRAND)                        # Hash index
cars.add_index(PRICE, index_type="btree")    # BTree index for range queries

# Step 3: Load Data (use add_many for batch efficiency)
cars.add_many([
    Car(1, "Tesla", 50000),
    Car(2, "Ford", 30000),
    Car(3, "Tesla", 60000),
    Car(4, "BMW", 45000),
])

# Step 4: Query
results = cars.retrieve(eq(BRAND, "Tesla"))
for car in results:
    print(f"VIN: {car.vin}, Brand: {car.brand}, Price: ${car.price}")

# Count without materializing objects
count = cars.retrieve(eq(BRAND, "Tesla")).count()  # ~0.9 μs

# First N results
top3 = cars.retrieve(eq(BRAND, "Tesla")).first(3)  # ~1.2 μs

Query DSL

Equality Query

from pycqengine import eq

# Find all Teslas
results = cars.retrieve(eq(BRAND, "Tesla"))

AND Query (Intersection)

from pycqengine import and_, eq, gt

# Find Teslas priced above $55,000
results = cars.retrieve(and_(
    eq(BRAND, "Tesla"),
    gt(PRICE, 55000)
))

OR Query (Union)

from pycqengine import or_, eq

# Find Tesla or Ford vehicles
results = cars.retrieve(or_(
    eq(BRAND, "Tesla"),
    eq(BRAND, "Ford")
))

IN Query (Membership)

from pycqengine import in_

# Find vehicles from specific brands
results = cars.retrieve(in_(BRAND, ["Tesla", "Ford", "BMW"]))

Range Queries (requires BTree index)

from pycqengine import gt, gte, lt, lte, between

# Price > 40,000
results = cars.retrieve(gt(PRICE, 40000))

# Price >= 30,000
results = cars.retrieve(gte(PRICE, 30000))

# Price < 50,000
results = cars.retrieve(lt(PRICE, 50000))

# 30,000 <= Price <= 50,000 (inclusive)
results = cars.retrieve(between(PRICE, 30000, 50000))
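
To see why range predicates need an ordered index rather than a hash index, here is a small pure-Python analogy using a sorted list and the standard `bisect` module. It is a sketch of the idea only: PyCQEngine's BTree index lives in Rust, and the helper names `ids_gt` and `ids_between` are hypothetical.

```python
from bisect import bisect_left, bisect_right

# (price, vin) pairs kept sorted by price, mimicking an ordered index.
entries = sorted([(50000, 1), (30000, 2), (60000, 3), (45000, 4)])
prices = [price for price, _ in entries]


def ids_gt(threshold):
    # Everything strictly to the right of the last entry equal to threshold.
    start = bisect_right(prices, threshold)
    return [vin for _, vin in entries[start:]]


def ids_between(low, high):
    # Inclusive on both ends, like the between() example above.
    start = bisect_left(prices, low)
    stop = bisect_right(prices, high)
    return [vin for _, vin in entries[start:stop]]


print(ids_gt(40000))              # [4, 1, 3]
print(ids_between(30000, 50000))  # [2, 4, 1]
```

Both helpers locate the range boundaries with a binary search and then slice, which is the property a hash index cannot offer because it keeps no ordering between keys.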

Memory Management

# Remove a single object
cars.remove(car_obj)

# Remove multiple objects
cars.remove_many([car1, car2, car3])

# Clear entire collection
cars.clear()

Weak References

By default, IndexedCollection holds strong references to objects, keeping them alive as long as the collection exists. Enable weak reference mode to let Python's GC reclaim objects when no other references exist:

# Opt-in weak reference mode
cars = IndexedCollection(use_weakrefs=True)
cars.add_index(BRAND)
cars.add_index(PRICE, index_type="btree")

car = Car(1, "Tesla", 50000)
cars.add(car)

# Object is retrievable while reference exists
assert list(cars.retrieve(eq(BRAND, "Tesla"))) == [car]

# Drop the reference so Python's GC can reclaim it
del car

# Explicit garbage collection
cleaned = cars.gc()       # Returns number of dead refs cleaned
print(cars.alive_count)   # Number of still-alive objects

# Dead refs are also cleaned lazily during queries
results = list(cars.retrieve(eq(BRAND, "Tesla")))  # Returns [] (dead ref auto-cleaned)

Notes:

  • Objects that don't support weakrefs (tuples, ints, etc.) automatically fall back to strong refs
  • Query performance has zero overhead in weakref mode
  • Build throughput is ~13% slower (weakref creation + reverse index population)
  • gc() and alive_count scan all objects, so they are suitable for periodic maintenance, not hot loops
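
The fallback described in the first note can be reproduced with the standard `weakref` module alone. The sketch below is an assumption about how such a fallback might look, not the library's code; `make_ref` is a hypothetical helper name.

```python
import gc
import weakref


def make_ref(obj):
    """Return a zero-argument callable that yields obj, or None once it is gone."""
    try:
        return weakref.ref(obj)
    except TypeError:
        # Tuples, ints, and similar built-ins cannot be weakly referenced,
        # so hold them strongly instead.
        return lambda: obj


class Car:
    def __init__(self, vin):
        self.vin = vin


car = Car(1)
car_ref = make_ref(car)             # weak reference
tuple_ref = make_ref((1, "Tesla"))  # falls back to a strong reference

del car
gc.collect()  # not needed on CPython, but makes the cleanup explicit elsewhere

print(car_ref())    # None
print(tuple_ref())  # (1, 'Tesla')
```

The tuple stays reachable through the closure for as long as `tuple_ref` exists, which mirrors the strong-reference behavior the note describes.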

Performance

Benchmarked on macOS ARM64 (Apple Silicon), Python 3.14, Rust 1.93.

100K Objects

| Scenario | Median | Results | vs Python |
|---|---|---|---|
| Point lookup (eq VIN) | 0.8 μs | 1 | 3,290x |
| count() eq(BRAND) | 0.9 μs | 12,500 | 2,377x |
| first(10) eq(BRAND) | 1.2 μs | 10 | n/a |
| AND 2-way list() | 94 μs | 4,167 | 22x |
| AND 3-way list() | 19 μs | 833 | 110x |
| AND 4-way (empty result) | 2.4 μs | 0 | 923x |
| OR 2-way list() | 535 μs | 25,000 | 6.0x |
| IN 3-val list() | 773 μs | 37,500 | 4.6x |
| gt(PRICE, 40000) list() | 1,173 μs | 59,000 | 1.7x |
| between(30k-40k) list() | 425 μs | 21,000 | 7.2x |
| count() gt(PRICE) | 0.6 μs | 59,000 | 4,087x |
| between(narrow) list() | 102 μs | 5,000 | 26.7x |
| AND(eq+gt) mixed list() | 173 μs | 8,500 | 12.3x |
| Build time | 0.30 s | n/a | 334K obj/s |

Scaling to 1M Objects

| Scenario | 100K | 500K | 1M |
|---|---|---|---|
| Point lookup | 0.8 μs (3,290x) | 0.8 μs (16,824x) | 0.8 μs (33,654x) |
| count() eq | 0.9 μs (2,377x) | 1.0 μs (11,441x) | 0.9 μs (23,313x) |
| AND 3-way | 19 μs (110x) | 97 μs (114x) | 215 μs (104x) |
| AND 4-way empty | 2.4 μs (923x) | 2.3 μs (4,794x) | 2.3 μs (9,663x) |
| count() gt | 0.6 μs (4,087x) | 0.6 μs (22,686x) | 0.6 μs (43,840x) |
| between(narrow) | 102 μs (26.7x) | 537 μs (26.6x) | 1,527 μs (18.9x) |
| Build throughput | 334K obj/s | 337K obj/s | 331K obj/s |

Point lookups, counts, and empty-result queries take near-constant time across these sizes, so their speedup over a linear Python scan grows roughly in proportion to collection size.
Selective queries (AND, narrow range) remain 10x to 100x+ faster at all scales.
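
The scaling argument can be checked with a rough stand-in for the comparison. The project's actual baseline lives in its benchmarks directory and is not reproduced here; the sketch below assumes a list-comprehension scan as the Python side and uses a plain dict in place of the Rust index. `measure` is a hypothetical helper, and absolute numbers will differ by machine.

```python
import timeit


def measure(n, repeats=20):
    """Time a full scan against a dict lookup for the same key."""
    records = [{"vin": i} for i in range(n)]
    by_vin = {r["vin"]: r for r in records}  # built once, like an index
    target = n - 1                           # last element: worst case for the scan

    def scan():
        return [r for r in records if r["vin"] == target]

    def lookup():
        return by_vin[target]

    assert scan()[0] is lookup()  # both approaches find the same record

    scan_s = min(timeit.repeat(scan, number=1, repeat=repeats))
    lookup_s = min(timeit.repeat(lookup, number=1000, repeat=repeats)) / 1000
    return scan_s, lookup_s


for n in (10_000, 100_000):
    scan_s, lookup_s = measure(n)
    print(f"n={n:>7}: scan {scan_s * 1e6:9.1f} us, lookup {lookup_s * 1e6:6.3f} us")
```

The scan time should grow with `n` while the lookup time stays roughly flat, which is the same pattern the point-lookup row shows between 100K and 1M objects.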

Development

Project Structure

py-cqengine/
├── src/                    # Rust source code
│   ├── lib.rs             # PyO3 module initialization
│   ├── types.rs           # TypedValue enum (str/int/float/bool)
│   ├── collection.rs      # CollectionManager + query methods
│   ├── index.rs           # Index trait (lookup, insert, remove)
│   ├── hash_index.rs      # DashMap-based O(1) equality index
│   └── btree_index.rs     # BTreeMap-based range index
├── python/pycqengine/     # Python package
│   ├── __init__.py        # Public API exports
│   ├── core.py            # IndexedCollection + ResultSet
│   ├── attribute.py       # Attribute extractor
│   └── query.py           # Query DSL (eq, and_, or_, in_, gt, between...)
├── tests/                 # Python tests (119 tests)
├── benchmarks/            # Performance benchmarks
├── Cargo.toml             # Rust dependencies
└── pyproject.toml         # Python package config

Build Commands

# Development build (with debug symbols)
maturin develop

# Release build (optimized)
maturin develop --release

# Run Python tests
python -m pytest tests/ -v

# Run benchmarks
python benchmarks/run_all.py                           # Standard (100K)
python benchmarks/run_all.py --sizes 100000,500000     # Multi-scale
python benchmarks/run_all.py --quick                   # Fast iteration
python benchmarks/run_all.py --json                    # Save JSON for diffing

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - see LICENSE file for details.