prakhardewangan2005-hash/FleetForge-Hardware-Platform-Validation
Fleet-level infrastructure validation & reliability analysis toolkit with per-node checks, aggregation, and blast-radius insights for go/no-go decisions.
FleetForge: Hardware Platform Validation Toolkit
FleetForge is a stage-aware hardware validation framework for Linux-based datacenter servers.
It models real NPI (New Product Introduction) gates and produces structured JSON evidence
for infra, hardware, and production readiness decisions.
Google Colab (Reproducible Demo)
Colab notebook used to build and validate this project:
https://colab.research.google.com/drive/1J8ElWi3FAXbB2ITDPgsa536UbQJvBt7c#scrollTo=FBwL7xjStacy
Why FleetForge?
Modern infra failures are rarely single-host issues. They are usually platform-, firmware-, or rollout-level problems.
FleetForge helps answer:
- Which hardware component failed?
- Is this a single-host or platform-wide issue?
- Is the failure acceptable in bring-up but blocking for production?
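The single-host versus platform-wide question can be illustrated with a small aggregation over per-node results. This is a minimal sketch, not FleetForge's actual implementation: the record keys (`node`, `platform`, `check`, `passed`), the platform name `gen5`, the `partial` label, and the 50% threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def classify_blast_radius(results, platform_threshold=0.5):
    """Label each failing (platform, check) pair by how widely it fails.

    `results` is a list of dicts with hypothetical keys:
    node, platform, check, passed.
    """
    nodes = defaultdict(set)    # platform -> all nodes seen
    failed = defaultdict(set)   # (platform, check) -> failing nodes
    for r in results:
        nodes[r["platform"]].add(r["node"])
        if not r["passed"]:
            failed[(r["platform"], r["check"])].add(r["node"])

    report = {}
    for (platform, check), bad in failed.items():
        ratio = len(bad) / len(nodes[platform])
        if len(bad) == 1:
            scope = "single-host"       # exactly one node fails this check
        elif ratio >= platform_threshold:
            scope = "platform-wide"     # most of the platform is affected
        else:
            scope = "partial"           # several nodes, below the threshold
        report[(platform, check)] = {
            "scope": scope,
            "failed_nodes": sorted(bad),
            "failure_ratio": ratio,
        }
    return report


# Illustrative data: four nodes on one hypothetical platform.
results = [
    {"node": "n1", "platform": "gen5", "check": "network.iperf_smoke", "passed": False},
    {"node": "n2", "platform": "gen5", "check": "network.iperf_smoke", "passed": False},
    {"node": "n3", "platform": "gen5", "check": "network.iperf_smoke", "passed": False},
    {"node": "n4", "platform": "gen5", "check": "network.iperf_smoke", "passed": True},
    {"node": "n1", "platform": "gen5", "check": "storage.fio_quick", "passed": False},
    {"node": "n2", "platform": "gen5", "check": "storage.fio_quick", "passed": True},
]
report = classify_blast_radius(results)
```

With this data, the network check fails on three of four nodes and is labeled platform-wide, while the storage check fails on one node and is labeled single-host.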
NPI Lifecycle Model
FleetForge models hardware readiness as stage-gated validation:
- Bring-up Validation
- Pre-production Qualification
- Production Readiness
- Post-deployment Verification
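One way to picture stage gating is an ordered set of stages plus a per-check rule for when a failure starts to block. The sketch below is an assumption about how such a policy could look, not the contents of `policy.py`: the `Stage` enum, the `blocking_from` mapping, and `gate_decision` are hypothetical names.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The four stages listed above, in lifecycle order."""
    BRING_UP = 1
    PREPROD_QUALIFICATION = 2
    PROD_READINESS = 3
    POST_DEPLOYMENT = 4


def gate_decision(stage, failures, blocking_from):
    """Return (decision, blocking, tolerated) for a list of failed checks.

    `blocking_from` maps a check id to the earliest stage at which a
    failure blocks. Unknown checks block from bring-up (fail closed).
    """
    blocking = sorted(
        c for c in failures
        if stage >= blocking_from.get(c, Stage.BRING_UP)
    )
    tolerated = sorted(set(failures) - set(blocking))
    decision = "no-go" if blocking else "go"
    return decision, blocking, tolerated


# Hypothetical policy: throughput only blocks from production readiness on.
BLOCKING_FROM = {
    "storage.fio_quick": Stage.PREPROD_QUALIFICATION,
    "network.iperf_smoke": Stage.PROD_READINESS,
}

early = gate_decision(Stage.BRING_UP, ["network.iperf_smoke"], BLOCKING_FROM)
late = gate_decision(Stage.PROD_READINESS, ["network.iperf_smoke"], BLOCKING_FROM)
```

Here the same failure is tolerated in bring-up (`early` is a go) and blocking at production readiness (`late` is a no-go).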
Safe-by-Default Design
- Only safe, read-only checks run by default
- Unsafe / experimental checks never run accidentally
- Explicit opt-in required using flags
- Supports --dry-run to preview execution
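The opt-in behavior can be sketched with `argparse` and a small selection function. This is a hedged illustration rather than the real `fleetforge_cli.py`: the flag names match the usage shown in this README, but the `CHECKS` registry, the `system.inventory` check, and the `build_parser` and `plan` functions are hypothetical.

```python
import argparse

# Hypothetical registry: check id -> whether it is safe (read-only).
CHECKS = {
    "system.inventory": {"safe": True},      # invented placeholder check
    "storage.fio_quick": {"safe": False},
    "network.iperf_smoke": {"safe": False},
}


def build_parser():
    parser = argparse.ArgumentParser(prog="fleetforge_cli.py")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run")
    run.add_argument("--stage", required=True)
    run.add_argument("--dry-run", action="store_true")
    # Repeatable flag: each use appends one check id to the list.
    run.add_argument("--enable-exp", action="append", default=[], metavar="CHECK")
    run.add_argument("--out")
    return parser


def plan(args):
    """Return (selected, skipped): unsafe checks run only when opted in."""
    unknown = set(args.enable_exp) - set(CHECKS)
    if unknown:
        raise SystemExit(f"unknown check(s): {sorted(unknown)}")
    selected, skipped = [], []
    for name, meta in CHECKS.items():
        if meta["safe"] or name in args.enable_exp:
            selected.append(name)
        else:
            skipped.append(name)
    return selected, skipped


parser = build_parser()
default_args = parser.parse_args(["run", "--stage", "preprod_qualification", "--dry-run"])
opt_in_args = parser.parse_args(
    ["run", "--stage", "preprod_qualification", "--enable-exp", "storage.fio_quick"]
)
default_plan = plan(default_args)
opt_in_plan = plan(opt_in_args)
```

With no `--enable-exp` flags, only the safe check is selected. In a dry run, a runner built this way would print the plan instead of executing it.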
Repository Structure
FleetForge/
├── docs/
├── fleetforge/
│   ├── core/
│   │   ├── policy.py              # Stage & safety policy engine
│   │   └── runner.py              # Stage execution logic
│   ├── checks/
│   │   ├── storage/
│   │   │   └── fio_quick.py       # Disk smoke test (unsafe)
│   │   └── network/
│   │       └── iperf_smoke.py     # NIC throughput smoke test (unsafe)
│   └── stages/
│       ├── preprod_qualification.yaml
│       └── prod_readiness.yaml
├── out/
├── runbooks/
├── fleetforge_cli.py
├── requirements.txt
└── README.md
Unsafe / Experimental Checks (Opt-in)
These checks never run by accident:
- storage.fio_quick: Disk I/O smoke test (can generate load)
- network.iperf_smoke: NIC throughput smoke test (requires iperf target)
They must be explicitly enabled:
--enable-exp storage.fio_quick
--enable-exp network.iperf_smoke
Usage
Dry Run (recommended)
python fleetforge_cli.py run \
--stage preprod_qualification \
--dry-run \
--enable-exp storage.fio_quick \
--enable-exp network.iperf_smoke \
--out out/preprod.json
Full Production Readiness Run
python fleetforge_cli.py run \
--stage prod_readiness \
--enable-exp storage.fio_quick \
--enable-exp network.iperf_smoke \
--out out/prod.json
Outputs
FleetForge produces machine-readable JSON artifacts:
- out/preprod.json
- out/prod.json
These are designed to plug directly into:
- CI pipelines
- Infra dashboards
- Capacity & reliability reviews
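As an illustration of the CI use case, a pipeline step could read an artifact and turn failures into a non-zero exit code. The JSON layout below is an assumption, since the artifact schema is not documented here: the `stage`, `results`, `check`, and `status` keys are hypothetical, as are the `failed_checks` and `ci_exit_code` helpers.

```python
import json


def failed_checks(artifact_text):
    """Return the ids of checks marked as failed in a JSON artifact.

    Assumes a hypothetical layout:
    {"stage": ..., "results": [{"check": ..., "status": "pass" | "fail"}]}
    """
    data = json.loads(artifact_text)
    return [
        r["check"]
        for r in data.get("results", [])
        if r.get("status") == "fail"
    ]


def ci_exit_code(artifact_text):
    """Return 1 if any check failed, otherwise 0."""
    return 1 if failed_checks(artifact_text) else 0


# Illustrative artifact, not real FleetForge output.
sample = json.dumps({
    "stage": "prod_readiness",
    "results": [
        {"check": "storage.fio_quick", "status": "pass"},
        {"check": "network.iperf_smoke", "status": "fail"},
    ],
})
exit_code = ci_exit_code(sample)
```

In a pipeline, the text would come from a file such as out/prod.json, and the result would be passed to `sys.exit` so the job fails when any check fails.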
Runbooks
FleetForge links failures to actionable runbooks in runbooks/.
Examples:
- Disk SMART / NVMe health failures
- NIC speed / duplex mismatch
- Throughput regressions
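Linking a failure to a runbook can be as simple as a lookup table. The sketch below is purely illustrative: the category keys and the file names under runbooks/ are hypothetical and are not taken from the repository.

```python
from pathlib import PurePosixPath

RUNBOOK_DIR = PurePosixPath("runbooks")

# Hypothetical mapping from failure category to runbook file name.
RUNBOOKS = {
    "disk_health": "disk_health.md",
    "nic_link": "nic_link.md",
    "throughput_regression": "throughput_regression.md",
}


def runbook_for(category):
    """Return the runbook path for a failure category, or None if unmapped."""
    name = RUNBOOKS.get(category)
    return str(RUNBOOK_DIR / name) if name else None


disk_runbook = runbook_for("disk_health")
missing_runbook = runbook_for("fan_speed")
```

Returning `None` for an unmapped category makes a missing runbook visible to the caller instead of pointing at a file that does not exist.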
Philosophy
"Fail fast in bring-up.
Fail loud before production.
Never fail silently in the field."
FleetForge enforces hardware truth before scale.