
prakhardewangan2005-hash/FleetForge-Hardware-Platform-Validation

Fleet-level infrastructure validation & reliability analysis toolkit with per-node checks, aggregation, and blast-radius insights for go/no-go decisions.

FleetForge: Hardware Platform Validation Toolkit

FleetForge is a stage-aware hardware validation framework for Linux-based datacenter servers.
It models real NPI (New Product Introduction) gates and produces structured JSON evidence
for infra, hardware, and production readiness decisions.


Google Colab (Reproducible Demo)

Colab notebook used to build and validate this project:
https://colab.research.google.com/drive/1J8ElWi3FAXbB2ITDPgsa536UbQJvBt7c#scrollTo=FBwL7xjStacy


Why FleetForge?

Modern infrastructure failures are rarely single-host issues. More often they are platform-, firmware-, or rollout-level problems.

FleetForge helps answer:

  • Which hardware component failed?
  • Is this a single-host or platform-wide issue?
  • Is the failure acceptable in bring-up but blocking for production?
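
The single-host versus platform-wide question can be illustrated with a small aggregation sketch. This is not FleetForge's actual implementation: the function name `blast_radius`, the tuple layout, the check names, and the 50% threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict


def blast_radius(results, platform_threshold=0.5):
    """Classify each failing check by how many hosts it affects.

    results: iterable of (host, check_name, passed) tuples (assumed layout).
    platform_threshold: hypothetical fraction of the fleet at which a
    failure is treated as platform-wide.
    """
    hosts = set()
    failing_hosts = defaultdict(set)
    for host, check, passed in results:
        hosts.add(host)
        if not passed:
            failing_hosts[check].add(host)

    summary = {}
    for check, failed in failing_hosts.items():
        # Fraction is measured against every host seen in the input.
        fraction = len(failed) / len(hosts)
        if len(failed) == 1:
            scope = "single-host"
        elif fraction >= platform_threshold:
            scope = "platform-wide"
        else:
            scope = "partial"
        summary[check] = {"hosts": sorted(failed), "fraction": fraction, "scope": scope}
    return summary


# Hypothetical per-node results for a three-host fleet.
sample = [
    ("n1", "storage.smart", False), ("n2", "storage.smart", True), ("n3", "storage.smart", True),
    ("n1", "nic.speed", False), ("n2", "nic.speed", False), ("n3", "nic.speed", False),
]
report = blast_radius(sample)
```

Here a check that fails on one host is reported as single-host, while one that fails on every host is reported as platform-wide, which is the distinction a go/no-go review would need.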

NPI Lifecycle Model

FleetForge models hardware readiness as stage-gated validation:

  1. Bring-up Validation
  2. Pre-production Qualification
  3. Production Readiness
  4. Post-deployment Verification
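
One way to picture stage gating is an ordered enum plus a table that says from which stage a failing check starts to block. The sketch below is illustrative only: the `BLOCKING_FROM` table, the check names in it, and the `gate_verdict` helper are assumptions, not part of FleetForge's `policy.py`.

```python
from __future__ import annotations

from enum import IntEnum


class Stage(IntEnum):
    """NPI stages in lifecycle order (names mirror the list above)."""
    BRING_UP = 1
    PREPROD_QUALIFICATION = 2
    PROD_READINESS = 3
    POST_DEPLOYMENT = 4


# Hypothetical policy: the earliest stage at which a failing check blocks the gate.
# Before that stage, the failure would only be reported as a warning.
BLOCKING_FROM = {
    "storage.smart_health": Stage.BRING_UP,              # assumed check name
    "network.link_speed": Stage.PREPROD_QUALIFICATION,   # assumed check name
}


def gate_verdict(stage: Stage, failed_checks: list[str]) -> str:
    """Return "no-go" if any failed check is blocking at this stage, else "go"."""
    for name in failed_checks:
        # Checks missing from the table default to blocking from production readiness.
        if stage >= BLOCKING_FROM.get(name, Stage.PROD_READINESS):
            return "no-go"
    return "go"
```

Under these assumptions, a link-speed failure would be tolerated during bring-up but would block pre-production qualification and every later stage.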

Safe-by-Default Design

  • Only safe, read-only checks run by default
  • Unsafe / experimental checks never run accidentally
  • Explicit opt-in is required through flags (--enable-exp)
  • Supports --dry-run to preview execution
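
The rules above can be sketched as a filter over the check list. This is a minimal sketch under assumptions: the `Check` dataclass, `plan_checks`, `run`, and the `cpu.inventory` check are hypothetical names, and real check execution is replaced by a status string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    unsafe: bool = False  # unsafe checks need explicit opt-in


def plan_checks(checks, enabled_exp=()):
    """Keep safe checks; keep unsafe ones only if explicitly named in enabled_exp."""
    enabled = set(enabled_exp)
    return [c for c in checks if not c.unsafe or c.name in enabled]


def run(checks, enabled_exp=(), dry_run=False):
    """Return (check name, status) pairs; with dry_run nothing is executed."""
    results = []
    for check in plan_checks(checks, enabled_exp):
        # Placeholder for real execution, which is out of scope for this sketch.
        status = "planned" if dry_run else "executed"
        results.append((check.name, status))
    return results


CHECKS = [
    Check("cpu.inventory"),                     # hypothetical safe check
    Check("storage.fio_quick", unsafe=True),
    Check("network.iperf_smoke", unsafe=True),
]
```

With no opt-in, `run(CHECKS)` selects only the safe check. Passing `enabled_exp=["storage.fio_quick"]` together with `dry_run=True` lists the disk test as planned without running it.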

Repository Structure

FleetForge/
├── docs/
├── fleetforge/
│   ├── core/
│   │   ├── policy.py                 # Stage & safety policy engine
│   │   └── runner.py                 # Stage execution logic
│   ├── checks/
│   │   ├── storage/
│   │   │   └── fio_quick.py          # Disk smoke test (unsafe)
│   │   └── network/
│   │       └── iperf_smoke.py        # NIC throughput smoke test (unsafe)
│   └── stages/
│       ├── preprod_qualification.yaml
│       └── prod_readiness.yaml
├── out/
├── runbooks/
├── fleetforge_cli.py
├── requirements.txt
└── README.md

Unsafe / Experimental Checks (Opt-in)

These checks never run by accident:

  • storage.fio_quick
    Disk I/O smoke test (can generate load)

  • network.iperf_smoke
    NIC throughput smoke test (requires iperf target)

They must be explicitly enabled:

--enable-exp storage.fio_quick
--enable-exp network.iperf_smoke
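
A repeatable flag like this is commonly parsed with argparse's `append` action. The sketch below shows how such a command line could be parsed. It mirrors the flags used in this README but is an assumption about the parser, not a copy of `fleetforge_cli.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the flags shown in this README."""
    parser = argparse.ArgumentParser(prog="fleetforge_cli.py")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run", help="run one validation stage")
    run.add_argument("--stage", required=True)
    run.add_argument("--dry-run", action="store_true")
    # action="append" collects every occurrence of the flag into one list.
    run.add_argument("--enable-exp", action="append", default=[], metavar="CHECK")
    run.add_argument("--out")
    return parser


args = build_parser().parse_args([
    "run", "--stage", "preprod_qualification",
    "--enable-exp", "storage.fio_quick",
    "--enable-exp", "network.iperf_smoke",
])
print(args.enable_exp)
```

Because the default is an empty list, omitting the flag leaves no experimental checks enabled, which matches the safe-by-default behavior described earlier.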

Usage

Dry-Run Preview of Pre-production Qualification

python fleetforge_cli.py run \
  --stage preprod_qualification \
  --dry-run \
  --enable-exp storage.fio_quick \
  --enable-exp network.iperf_smoke \
  --out out/preprod.json

Full Production Readiness Run

python fleetforge_cli.py run \
  --stage prod_readiness \
  --enable-exp storage.fio_quick \
  --enable-exp network.iperf_smoke \
  --out out/prod.json

Outputs

FleetForge produces machine-readable JSON artifacts:

  • out/preprod.json
  • out/prod.json

These are designed to plug directly into:

  • CI pipelines
  • Infra dashboards
  • Capacity & reliability reviews
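
As an illustration of the CI use case, a pipeline step could read an artifact and turn failures into a non-zero exit code. The JSON layout below (a top-level `checks` list with `name` and `status` fields) is an assumption made for this sketch. The README does not document the real schema, so field names would need to be adapted.

```python
import json


def failed_checks(report: dict) -> list:
    """Return names of checks whose status is not "pass" (assumed schema)."""
    return [c["name"] for c in report.get("checks", []) if c.get("status") != "pass"]


def exit_code(report_text: str) -> int:
    """Map a JSON report to a process exit code: 0 for go, 1 for no-go."""
    report = json.loads(report_text)
    return 1 if failed_checks(report) else 0


# Hypothetical report content, standing in for a file such as out/prod.json.
sample = json.dumps({
    "stage": "prod_readiness",
    "checks": [
        {"name": "storage.fio_quick", "status": "pass"},
        {"name": "network.iperf_smoke", "status": "fail"},
    ],
})
code = exit_code(sample)
```

In a CI job, the same function could be fed the contents of `out/prod.json` and its result passed to `sys.exit`, so that a failing check stops the pipeline.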

Runbooks

FleetForge links failures to actionable runbooks in runbooks/.

Examples:

  • Disk SMART / NVMe health failures
  • NIC speed / duplex mismatch
  • Throughput regressions

Philosophy

“Fail fast in bring-up.
Fail loud before production.
Never fail silently in the field.”

FleetForge enforces hardware truth before scale.