HIPO (HIgh Performance Multi-Cloud K8's Cluster Orchestration) - Multi-Cloud Kubernetes ML Platform

A modular and scalable infrastructure for deploying machine learning and LLM models across Multiple cloud providers for managing Kubernetes Clusters (MCO- Multi Cloud - Multi Cluster K8's Operation).

NOTE: Still under active Open source contribution software development

Overview

This project provides a comprehensive infrastructure for ML/LLM workflows, including:

Multi-cloud Kubernetes orchestration (AWS EKS and GCP GKE)
GPU-optimized autoscaling for ML/LLM workloads based on GPU Metrics monitoring
Unified Control Plan with K8's SIG CAPI Controller for KaaS - Kubernetes as a Service
Global API gateway with intelligent routing of incoming Requests to Optimize for Cost, Latency Performance
Fault tolerance with cross-cloud failover
Cost optimization across cloud providers
Model training, evaluation, and serving
Streamlit-based UI for MCO platform management
Comprehensive observability with metrics and logging

Project Structure

├── config/             # Kubernetes and model configurations 
├── data/               # Data files
├── logs/               # Log files
├── models/             # Saved models
├── src/                # Source code
│   ├── api/            # API server
│   ├── autoscaling/    # GPU autoscaling components
│   ├── cloud/          # Cloud provider implementations
│   ├── config/         # Configuration management
│   ├── data/           # Data loading and preprocessing
│   ├── gateway/        # API gateway and routing
│   ├── kubernetes/     # Kubernetes orchestration
│   ├── models/         # Model implementations
│   ├── observability/  # Metrics and tracing
│   ├── pipelines/      # Pipeline orchestration
│   ├── secrets/        # Secret management 
│   ├── security/       # Encryption and security
│   ├── ui/             # Streamlit web interface
│   └── utils/          # Utility functions 
└── tests/              # Unit tests

System Design and Architecture:

Installation

Clone the repository:

git clone https://github.com/akramIOT/new_hipo.git

Create a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Usage

Configuration

The platform uses YAML files for configuration of Kubernetes resources, cloud providers, and models:

# kubernetes_config.yaml example
apiVersion: v1
kind: ConfigMap
metadata:
  name: hipo-config
  namespace: ml-models
data:
  log_level: "INFO"
  monitoring_enabled: "true"
  auto_scaling_enabled: "true"
  default_replicas: "2"
  gpu_resource_limit: "1"

Training a Model

python -m src.main --mode train --config config/default_config.yaml --data data/train.csv --model my_model

Making Predictions

python -m src.main --mode predict --config config/default_config.yaml --data data/test.csv --model my_model.pkl --output predictions.csv

Running the API Server

python -m src.main --mode serve --config config/default_config.yaml --model my_model.pkl --port 5000

Running the Streamlit UI

# Make sure streamlit is installed
pip install streamlit

# Run the UI
python src/ui/run_ui.py
# or use streamlit directly
streamlit run src/ui/app.py

API Gateway Endpoints

GET /api/v1/models: List available models
GET /api/v1/models/<model_name>: Get model information
POST /api/v1/models/<model_name>/predict: Make predictions using the model
POST /api/v1/models/<model_name>/generate: Generate text with LLM models
POST /api/v1/models/<model_name>/embed: Get embeddings for input text
POST /api/v1/models/<model_name>/evaluate: Evaluate model performance
GET /api/v1/health: Health check endpoint
GET /api/v1/metrics: Platform metrics endpoint
GET /api/v1/model-weights: List available model weights
POST /api/v1/model-weights/<model_name>: Upload model weights
GET /api/v1/model-weights/<model_name>/<version>: Download model weights

Streamlit UI Features

The platform includes a comprehensive Streamlit-based UI with the following features:

Dashboard: Overall platform status, resource usage, and cost metrics
Model Deployment: Interface for deploying ML/LLM models across cloud providers
Model Inference: Test interface for running inference with deployed models
Configuration: Management of cloud providers, Kubernetes, and model configurations
Monitoring: Real-time metrics, logs, and alerting dashboard
Logs: Searchable log viewer with filtering capabilities

CI/CD Pipeline

The project uses GitHub Actions for continuous integration and deployment. The CI/CD pipeline includes:

Automated linting and code quality checks
Unit and integration testing across multiple Python versions
Security scanning with Bandit and Safety
Python package building and publishing
Docker image building and publishing
Automated deployment to development and production environments

For details about the CI/CD setup and release process, see CI/CD Guide and CI/CD Updates.

Local CI Validation

You can run CI checks locally using the provided validation script:

# Make the script executable if needed
chmod +x scripts/validate_ci.sh

# Run the validation
./scripts/validate_ci.sh

This will check your environment, run code quality tools, and validate configurations before you push your changes.

GitHub Secrets Management

This project requires several GitHub Secrets to be configured for the CI/CD pipeline to function properly. These include:

AWS credentials for deployment and testing
Docker Hub credentials for image publishing
PyPI credentials for package publishing
Codecov token for coverage reporting

For details on setting up the required secrets, see GitHub Secrets Setup.

Development

Adding a New Model

To add a new model, create a new class that inherits from ModelBase:

from src.models.model_base import ModelBase

class MyModel(ModelBase):
    def __init__(self, model_name, **kwargs):
        super().__init__(model_name, **kwargs)
        # Initialize your model

    def train(self, X, y, **kwargs):
        # Implement training logic

    def predict(self, X):
        # Implement prediction logic

    def evaluate(self, X, y):
        # Implement evaluation logic

Creating a Pipeline

from src.pipelines.pipeline import Pipeline

# Create a pipeline
pipeline = Pipeline('my_pipeline')

# Add steps
pipeline.add_step('load_data', load_data_function, data_path='data/train.csv')
pipeline.add_step('preprocess', preprocess_function)
pipeline.add_step('train_model', train_model_function, model_name='my_model')

# Run the pipeline
results = pipeline.run()

Managing Model Weights

The platform includes a secure model weights management system that works across multiple cloud providers:

from src.secrets.secret_manager import SecretManager
from src.cloud.factory import CloudProviderFactory

# Set up cloud providers
factory = CloudProviderFactory()
cloud_providers = {
    "aws": factory.create_provider("aws", aws_config),
    "gcp": factory.create_provider("gcp", gcp_config)
}

# Create secret manager
secret_manager = SecretManager(config, cloud_providers)
secret_manager.start()

# Upload model weights
secret_manager.upload_model_weights("llama-7b", "/path/to/model/weights")

# List available models
models = secret_manager.list_available_models()
for model_name, versions in models.items():
    print(f"{model_name}: {versions}")

# Download latest model weights
secret_manager.download_model_weights("llama-7b", "/output/path")

# Download specific version
secret_manager.download_model_weights("llama-7b", "/output/path", version="20230815")

# Clean up
secret_manager.stop()

Key features of the model weights management system:

Multi-cloud storage: Transparently store and sync weights across AWS S3, GCP Cloud Storage, and other providers
Versioning: Maintain multiple versions of model weights with automatic versioning
Secure access: Fully integrated with the secret management system for secure credentials
Checksumming: Automatic validation of weights integrity during transfers
Cross-cloud replication: Replicate weights across clouds for reliability and high availability
Encryption: End-to-end encryption for model weights

License

MIT License

For questions, suggestions, or contributions, please contact: sheriff.akram.usa@gmail.com

akramIOT/new_hipo