# QuantMini

**High-Performance Data Pipeline for Financial Market Data**

QuantMini is a high-performance quantitative trading data pipeline that ingests financial market data from Polygon.io and converts it to Qlib binary format. It provides an alpha expression framework, integrates with ML models (LightGBM, XGBoost, CatBoost), and includes trading strategies for building ML-driven quantitative trading systems.

A production-ready data pipeline for processing Polygon.io S3 flat files into optimized formats for quantitative analysis and machine learning.
## 🎯 Key Features

- Command-Line Interface: Complete CLI for all operations (`quantmini` command)
- Adaptive Processing: Automatically scales from 24GB workstations to 100GB+ servers
- 70%+ Compression: Optimized Parquet and binary formats
- Sub-Second Queries: Partitioned data lake with predicate pushdown
- Incremental Updates: Process only new data using watermarks
- Apple Silicon Optimized: 2-3x faster on M1/M2/M3 chips
- Production Ready: Monitoring, alerting, validation, and error recovery
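The watermark-based incremental updates work by recording the last date processed per data type and ingesting only newer files. A minimal sketch of the idea (the file layout, function names, and default start date here are hypothetical illustrations, not the project's actual implementation):

```python
import json
from datetime import date, timedelta
from pathlib import Path

# Hypothetical watermark store: one JSON file per data type under data/metadata/
WATERMARK_DIR = Path("data/metadata/watermarks")

def load_watermark(data_type: str) -> date:
    """Return the last successfully processed date, or an assumed default start."""
    path = WATERMARK_DIR / f"{data_type}.json"
    if path.exists():
        return date.fromisoformat(json.loads(path.read_text())["last_date"])
    return date(2024, 1, 1)  # assumed backfill start date

def save_watermark(data_type: str, processed: date) -> None:
    """Advance the watermark after a successful run."""
    WATERMARK_DIR.mkdir(parents=True, exist_ok=True)
    (WATERMARK_DIR / f"{data_type}.json").write_text(
        json.dumps({"last_date": processed.isoformat()})
    )

def dates_to_process(data_type: str, today: date) -> list[date]:
    """Only dates strictly after the watermark need to be ingested."""
    start = load_watermark(data_type) + timedelta(days=1)
    return [start + timedelta(days=i) for i in range((today - start).days + 1)]
```

A crashed run never advances the watermark, so the next invocation simply re-processes from the last good date.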
## 📊 Performance
| Mode | Memory | Throughput | With Optimizations |
|---|---|---|---|
| Streaming | < 32GB | 100K rec/s | 500K rec/s |
| Batch | 32-64GB | 200K rec/s | 1M rec/s |
| Parallel | > 64GB | 500K rec/s | 2M rec/s |
## 🚀 Quick Start

### Prerequisites
- macOS (Apple Silicon or Intel) or Linux
- Python 3.10+
- 24GB+ RAM (recommended: 32GB+)
- 1TB+ storage (SSD recommended)
- Polygon.io account with S3 flat files access
### Installation

- Install the `uv` package manager:

  ```bash
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```

- Clone and set up the project:

  ```bash
  git clone <repository-url>
  cd quantmini

  # Create project structure
  ./create_structure.sh

  # Create and activate virtual environment
  uv venv
  source .venv/bin/activate  # On macOS/Linux
  ```

- Install dependencies:

  ```bash
  uv pip install qlib polygon boto3 aioboto3 polars duckdb pyarrow psutil pyyaml
  ```

- Configure credentials:

  ```bash
  cp config/credentials.yaml.example config/credentials.yaml
  # Edit config/credentials.yaml with your Polygon API keys
  ```

- Run the system profiler:

  ```bash
  python -m src.core.system_profiler
  # This will create config/system_profile.yaml
  ```

### First Run
```bash
# Initialize configuration
quantmini config init

# Edit credentials (add your Polygon.io API keys)
nano config/credentials.yaml

# Run daily pipeline
quantmini pipeline daily --data-type stocks_daily

# Or backfill historical data
quantmini pipeline run --data-type stocks_daily --start-date 2024-01-01 --end-date 2024-12-31

# Query data
quantmini data query --data-type stocks_daily \
    --symbols AAPL MSFT \
    --fields date close volume \
    --start-date 2024-01-01 --end-date 2024-01-31
```

See `CLI.md` for complete CLI documentation.
## 📁 Project Structure (Medallion Architecture)

```
quantmini/
├── config/              # Configuration files
├── src/                 # Source code
│   ├── core/            # System profiling, memory monitoring
│   ├── download/        # S3 downloaders
│   ├── ingest/          # Data ingestion (landing → bronze)
│   ├── storage/         # Parquet storage management
│   ├── features/        # Feature engineering (bronze → silver)
│   ├── transform/       # Binary conversion (silver → gold)
│   ├── query/           # Query engine
│   └── orchestration/   # Pipeline orchestration
├── data/                # Data storage (not in git)
│   ├── landing/         # Landing layer: raw source data
│   │   └── polygon-s3/  # CSV.GZ files from S3
│   ├── bronze/          # Bronze layer: validated Parquet
│   ├── silver/          # Silver layer: feature-enriched Parquet
│   ├── gold/            # Gold layer: ML-ready formats
│   │   └── qlib/        # Qlib binary format
│   └── metadata/        # Watermarks, indexes
├── scripts/             # Command-line scripts
├── tests/               # Test suite
└── docs/                # Documentation
```
## 🔧 Configuration

Edit `config/pipeline_config.yaml` to customize:

- Processing mode: `adaptive`, `streaming`, `batch`, or `parallel`
- Data types: Enable/disable stocks, options, daily, minute data
- Compression: Choose `snappy` (fast) or `zstd` (better compression)
- Features: Configure which features to compute
- Optimizations: Enable Apple Silicon, async downloads, etc.

See the Installation Guide for configuration details.
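For illustration, a trimmed-down config covering those options might look like the fragment below. Only `optimizations.async_downloads.max_concurrent` and `monitoring.profiling.enabled` appear elsewhere in this README; the remaining key names are assumptions, so treat this as a sketch and check the shipped `config/pipeline_config.yaml` for the authoritative schema:

```yaml
# Illustrative fragment -- most key names are assumptions, not the canonical schema
processing:
  mode: adaptive          # adaptive | streaming | batch | parallel

data_types:
  stocks_daily: true
  stocks_minute: false
  options_daily: false
  options_minute: false

storage:
  compression: zstd       # snappy (fast) or zstd (better compression)

optimizations:
  apple_silicon: true
  async_downloads:
    max_concurrent: 8

monitoring:
  profiling:
    enabled: false
```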
## 📚 Documentation
- Architecture Overview: System architecture and design
- Data Pipeline: Pipeline architecture details
- Changelog: Version history and updates
- Contributing Guide: Development guidelines
- Full documentation: https://quantmini.readthedocs.io/
## 🧪 Testing

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=src --cov-report=html

# Run specific test suite
pytest tests/unit/
pytest tests/integration/
pytest tests/performance/
```

## 📈 Monitoring
Access monitoring dashboards:

```bash
# View health status
python scripts/check_health.py

# View performance metrics
cat logs/performance/performance_metrics.json

# Generate report
python scripts/generate_report.py
```

## 📊 Data Types
The pipeline processes four types of data from Polygon.io:
- Stock Daily Aggregates: Daily OHLCV for all US stocks
- Stock Minute Aggregates: Minute-level data per symbol
- Options Daily Aggregates: Daily options data per underlying
- Options Minute Aggregates: Minute-level options data (all contracts)
## 🎨 Architecture (Medallion Pattern)

```
Landing Layer         Bronze Layer          Silver Layer         Gold Layer
(Raw Sources)         (Validated)           (Enriched)           (ML-Ready)
      │                    │                     │                    │
S3 CSV.GZ Files  →  Validated Parquet  →  Feature-Enriched  →  Qlib Binary
   (Polygon)         (Schema Check)        (Indicators)        (Backtesting)
```

- Adaptive Ingestion: Streaming/Batch/Parallel based on available memory
- Feature Engineering: DuckDB/Polars for calculated indicators
- Binary Conversion: Optimized for ML training and backtesting
## 📦 Pipeline Stages (Medallion Architecture)

- Landing: Async S3 downloads to `landing/polygon-s3/`
- Bronze: Ingest and validate to `bronze/` (schema enforcement, type checking)
- Silver: Enrich with features to `silver/` (calculated indicators, returns, alpha)
- Gold: Convert to ML formats in `gold/qlib/` (optimized for backtesting)
- Query: Fast access via DuckDB/Polars from any layer

Data Quality Progression: Landing (raw) → Bronze (validated) → Silver (enriched) → Gold (ML-ready)
## 🔒 Security

- Never commit `config/credentials.yaml` (it is in `.gitignore`)
- Store credentials in environment variables for production
- Use AWS Secrets Manager or similar for cloud deployments
- Rotate API keys regularly
## 🐛 Troubleshooting

### Memory Errors

```bash
# Reduce memory usage
export MAX_MEMORY_GB=16

# Force streaming mode
export PIPELINE_MODE=streaming
```

### S3 Rate Limits

```bash
# Reduce concurrent downloads
# Edit config/pipeline_config.yaml:
# optimizations.async_downloads.max_concurrent: 4
```

### Slow Performance

```bash
# Enable profiling
# Edit config/pipeline_config.yaml:
# monitoring.profiling.enabled: true

# Run and check logs/performance/
```

See the full documentation for more troubleshooting tips.
## 🤝 Contributing

See the Contributing Guide for development guidelines.

## 📄 License

MIT License - see the LICENSE file for details.
## 🙏 Acknowledgments
- Polygon.io: S3 flat files data source
- Qlib: Quantitative investment framework
- Polars: High-performance DataFrame library
- DuckDB: Embedded analytical database
## 📧 Support
- Documentation: https://quantmini.readthedocs.io/
- Issues: GitHub Issues
- Email: zheyuan28@gmail.com
Built with: Python 3.10+, uv, qlib, polygon, polars, duckdb, pyarrow
Optimized for: macOS (Apple Silicon M1/M2/M3), 24GB+ RAM, SSD storage