openphysical

The missing eval harness for physical AI.

Benchmark robotics foundation models, convert datasets, compute stats. No framework lock-in.



pip install openphysical
openphysical eval random --benchmark mujoco-locomotion --output results.json

Install · Quickstart · CLI Reference · Python API · Benchmarks · Roadmap · Contributing


The Problem

Evaluating robot learning models is broken.

LLMs have lm-eval-harness -- one tool, 60+ benchmarks, standardized metrics, a public leaderboard. Robotics has nothing equivalent. Every project is its own island:

  • OpenVLA evaluates on LIBERO with its own scripts. Octo evaluates on WidowX with its own scripts. pi0 evaluates on ALOHA with its own scripts. Results are not comparable.
  • Data formats are fragmented -- Open X-Embodiment uses RLDS (requires TensorFlow). Robomimic uses HDF5. LeRobot uses its own v3.0 format. PI's openpi requires LeRobot format for training. Converting between them means writing custom scripts.
  • Benchmarks exist but eval tooling doesn't -- LIBERO has 130 tasks, MetaWorld has 50, ManiSkill has dozens more. But none of them ship a model-agnostic evaluation harness. You must write your own rollout loop, metric aggregation, and reporting for each one.
  • No standardized result format -- every paper reports numbers differently. No two labs use the same episode counts, seeds, or success criteria.

openphysical is an attempt to fix this. It starts with what works today and builds toward the evaluation standard that physical AI needs.

What Works Today

openphysical is a CLI tool and Python library that currently does three things well:

1. Evaluation and benchmarking -- Run any policy against standardized MuJoCo task suites, compute structured metrics (mean/std/min/max reward, success rate, step counts), output JSON. Define custom benchmark suites in TOML.

openphysical eval random --benchmark mujoco-locomotion --output results.json
openphysical eval ./models/ppo_pendulum.zip --benchmark mujoco-control --output results.json

2. Dataset conversion -- Convert between HDF5 (robomimic), RLDS/TFRecord (Open X-Embodiment), LeRobot v3.0, and Parquet. Compute normalization statistics. Fully tested, deterministic output.

openphysical convert dataset.hdf5 --from hdf5 --to lerobot --output ./converted
openphysical stats dataset.hdf5 -f hdf5 -o stats.json

3. Simulation execution -- Run policies in MuJoCo environments, record video, load models from local files or Hugging Face Hub.

openphysical run random --env ant --save ant.mp4
openphysical run hf://user/my-model --env humanoid --save humanoid.mp4

What Does NOT Work Yet

Being honest about the current limits:

  • VLA inference -- Adapter code exists for pi0, OpenVLA, and Octo, but the act() methods are stubs (return zeros). Real GPU inference with actual model weights has not been validated. VLA models cannot participate in the eval pipeline today.
  • Manipulation benchmarks -- Only MuJoCo locomotion/control environments are supported. LIBERO, MetaWorld, ManiSkill, and RoboCasa are not yet integrated.
  • Training -- openphysical does not train models. Use LeRobot, openpi, or robomimic for training.
  • Real robot support -- No hardware abstraction. This is simulation-only.
  • Leaderboard -- No hosted leaderboard yet.

Install

pip install openphysical

Python 3.10+. No Docker. CPU works out of the box.

Optional extras
pip install openphysical[sb3]       # Stable Baselines3 model loading
pip install openphysical[hub]       # Hugging Face Hub downloads
pip install openphysical[convert]   # Dataset conversion (HDF5, RLDS, LeRobot, Parquet)
pip install openphysical[all]       # Everything

Quickstart

Benchmark a model:

openphysical eval random --benchmark mujoco-locomotion --output results.json

Convert a dataset:

openphysical convert dataset.hdf5 --from hdf5 --to lerobot --output ./converted

Run a policy and record video:

openphysical run random --env ant --save ant.mp4

Run a trained model from Hugging Face:

openphysical run hf://user/my-robot-model --env half-cheetah --save cheetah.mp4

Features

| Feature | Description |
|---|---|
| **Eval** | Structured benchmarking -- evaluate any policy against standardized suites with JSON output |
| **Benchmarks** | Built-in + custom -- mujoco-locomotion, mujoco-control, mujoco-full, or define TOML suites |
| **Metrics** | Standardized computation -- mean/std/min/max reward, success rate, step counts, aggregate scores |
| **Convert** | Dataset pipelines -- HDF5, RLDS/TFRecord to LeRobot v3.0 or Parquet, with full metadata |
| **Stats** | Norm stats -- compute normalization statistics (mean/std/min/max/quantiles) from any format |
| **Run** | One-command execution -- run any policy in any MuJoCo environment with video output |
| **Models** | Model loading -- local .zip, Hugging Face hf://, or built-in (random, zero) |
| **Envs** | 11 environments -- Ant, HalfCheetah, Hopper, Humanoid, Swimmer, Walker, Reacher, Pendulum, and more |
| **Extend** | Protocol-based -- no inheritance required, just implement reset() and act() |

Supported Environments

| Alias | Gymnasium ID |
|---|---|
| ant | Ant-v5 |
| half-cheetah | HalfCheetah-v5 |
| hopper | Hopper-v5 |
| humanoid | Humanoid-v5 |
| humanoid-standup | HumanoidStandup-v5 |
| swimmer | Swimmer-v5 |
| walker | Walker2d-v5 |
| reacher | Reacher-v5 |
| pendulum | InvertedPendulum-v5 |
| double-pendulum | InvertedDoublePendulum-v5 |
| pusher | Pusher-v5 |

Any valid Gymnasium environment ID also works directly: openphysical run random --env Hopper-v5

CLI Reference

openphysical eval -- Evaluate against benchmarks
Usage: openphysical eval [OPTIONS] MODEL

Options:
  -b, --benchmark TEXT    Benchmark name or path to TOML file  [required]
  -n, --episodes INTEGER  Override episodes per task
  --max-steps INTEGER     Override max steps per task
  --seed INTEGER          Override seed
  -o, --output PATH       Save results to JSON file

openphysical eval random --benchmark mujoco-control
openphysical eval ./models/ppo_pendulum.zip --benchmark mujoco-control --output results.json

openphysical convert -- Convert robot datasets
Usage: openphysical convert [OPTIONS] SOURCE

Options:
  --from TEXT             Source format (hdf5, rlds)  [required]
  --to TEXT               Target format (parquet, lerobot)  [required]
  -o, --output PATH       Output directory
  --state-key TEXT        HDF5 observation key  [default: joint_positions]
  --task-key TEXT         HDF5 attribute key for task label
  --robot-type TEXT       Robot type for metadata  [default: unknown]
  --fps INTEGER           Dataset FPS  [default: 10]

openphysical convert dataset.hdf5 --from hdf5 --to lerobot --output ./converted
openphysical convert ./rlds_dataset/ --from rlds --to lerobot --output ./converted

openphysical stats -- Compute normalization statistics
Usage: openphysical stats [OPTIONS] SOURCE

Options:
  -f, --format TEXT       Source format (hdf5, rlds)  [required]
  --state-key TEXT        HDF5 observation key  [default: joint_positions]
  --task-key TEXT         HDF5 attribute key for task label
  --include-images        Include image features in stats
  -o, --output PATH       Save stats to JSON file

openphysical stats dataset.hdf5 -f hdf5 -o stats.json
openphysical stats ./rlds_dataset/ -f rlds -o stats.json

openphysical run -- Run a policy in simulation
Usage: openphysical run [OPTIONS] MODEL

Options:
  -e, --env TEXT          Environment name or gymnasium ID  [required]
  -n, --episodes INTEGER  Number of episodes  [default: 1]
  --max-steps INTEGER     Max steps per episode  [default: 1000]
  --render                Open live render window
  -o, --save PATH         Save video to file
  --fps INTEGER           Video FPS  [default: 30]
  --seed INTEGER          Random seed

More commands

openphysical benchmarks          # List built-in benchmark suites
openphysical list models         # List built-in policy types
openphysical list envs           # List environment aliases
openphysical info ant            # Inspect observation/action spaces
openphysical world list          # List MuJoCo world files
openphysical world info f.xml    # Show world file details

Benchmarks

Three built-in benchmark suites ship with openphysical:

| Suite | Tasks | Environments |
|---|---|---|
| mujoco-locomotion | 6 | HalfCheetah, Hopper, Walker2d, Ant, Humanoid, Swimmer |
| mujoco-control | 4 | InvertedPendulum, InvertedDoublePendulum, Reacher, Pusher |
| mujoco-full | 10 | All of the above |

Custom benchmarks are defined as TOML files:

[benchmark]
name = "my-benchmark"
description = "Custom evaluation suite"
version = "1.0"

[[tasks]]
env_id = "HalfCheetah-v5"
name = "cheetah"
episodes = 3
reward_threshold = 4800.0

openphysical eval random --benchmark my_benchmark.toml

Model Resolution

When you run openphysical run <MODEL>, the model argument is resolved through this chain:

MODEL argument
  ├── hf://user/repo           → download from Hugging Face Hub → load as SB3
  ├── hf://user/repo/file.zip  → download specific file from HF Hub
  ├── hf://user/repo/f@rev     → pin to a specific revision
  ├── ./path/to/model.zip      → load local SB3 model
  ├── random                   → uniform random actions
  └── zero                     → zero actions

Python API

Evaluate against benchmarks

from openphysical.benchmark import create_benchmark_registry
from openphysical.eval import EvalRun, save_results
from openphysical.policy import RandomPolicy
from pathlib import Path

registry = create_benchmark_registry()
bench = registry.get("mujoco-control")

eval_run = EvalRun(
    benchmark=bench,
    policy_factory=lambda action_space: RandomPolicy(action_space),
    model_name="random",
)

results = eval_run.run()
print(f"Mean reward: {results.aggregate_mean_reward:.2f}")
print(f"Success rate: {results.aggregate_success_rate:.1%}")
save_results(results, Path("results.json"))

Run episodes

from openphysical.environment import GymnasiumEnv
from openphysical.policy import RandomPolicy
from openphysical.runner import EpisodeRunner
from openphysical.recorder import VideoRecorder
from pathlib import Path

env = GymnasiumEnv("Ant-v5", render_mode="rgb_array")
policy = RandomPolicy(env.action_space)
runner = EpisodeRunner(policy, env, max_steps=500, capture_frames=True)

results = runner.run(num_episodes=3)
for r in results:
    print(f"reward={r.total_reward:.2f} steps={r.steps}")

VideoRecorder.save(results[0].frames, Path("output.mp4"))
env.close()

Custom policies

Any class with reset() and act() works -- no inheritance required:

import numpy as np
from openphysical.types import Action, Observation, SpaceInfo

class MyPolicy:
    def __init__(self, action_space: SpaceInfo) -> None:
        self.space = action_space

    def reset(self) -> None:
        pass

    def act(self, observation: Observation) -> Action:
        return np.clip(observation[:self.space.shape[0]], self.space.low, self.space.high)

Dataset conversion

HDF5 to LeRobot v3.0:

from pathlib import Path
from openphysical.convert.readers import HDF5Reader
from openphysical.convert.lerobot_writer import LeRobotWriter

reader = HDF5Reader(Path("dataset.hdf5"), state_key="joint_positions")
episodes = list(reader.read())
meta = reader.infer_meta()

writer = LeRobotWriter(Path("./output"), meta=meta, fps=30)
writer.write(episodes)

RLDS/TFRecord to LeRobot v3.0:

from pathlib import Path
from openphysical.convert.rlds_reader import RLDSReader
from openphysical.convert.lerobot_writer import LeRobotWriter

reader = RLDSReader(Path("./rlds_dataset"))
episodes = list(reader.read())
meta = reader.infer_meta()

writer = LeRobotWriter(Path("./output"), meta=meta)
writer.write(episodes)

Compute normalization statistics:

from pathlib import Path
from openphysical.convert.readers import HDF5Reader
from openphysical.convert.stats import compute_norm_stats

reader = HDF5Reader(Path("dataset.hdf5"))
stats = compute_norm_stats(reader.read())
stats.save(Path("stats.json"))
print(stats.stats["state"].mean)
print(stats.stats["action"].std)

Load models from Hugging Face

from openphysical.hub import resolve_hf_model

model = resolve_hf_model("hf://user/my-robot-model/ppo_ant.zip")
print(model.local_path)

Architecture

src/openphysical/
    types.py              Protocol definitions (Policy, Environment) + data types
    config.py             Environment aliases and defaults
    policy.py             Built-in policies (RandomPolicy, ZeroPolicy)
    policies_sb3.py       Stable Baselines3 model loader
    policies_vla.py       VLA adapters (pi0, OpenVLA, Octo) [stubs -- see roadmap]
    environment.py        Gymnasium/MuJoCo wrapper
    registry.py           Model and environment registries
    runner.py             Episode execution engine
    recorder.py           Video recording (imageio/ffmpeg)
    hub.py                Hugging Face Hub integration
    benchmark.py          Benchmark/Task definitions + built-in suites
    benchmark_loader.py   TOML benchmark loading + resolution
    metrics.py            Metrics computation
    eval.py               Evaluation orchestrator
    cli.py                CLI (Click)
    convert/              Dataset conversion pipeline
        readers.py        HDF5Reader (robomimic convention)
        rlds_reader.py    RLDSReader (Google RLDS/TFRecord format)
        writers.py        ParquetWriter (Parquet + metadata)
        lerobot_writer.py LeRobotWriter (LeRobot v3.0 format)
        stats.py          Normalization statistics (Welford online algorithm)

Protocol-based architecture -- Policy, Environment, DatasetReader, DatasetWriter are all typing.Protocol classes. No inheritance trees. Just implement the right methods.
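stats.py uses Welford's online algorithm for normalization statistics. As a refresher, the core single-pass update looks like this -- a standalone sketch, not the library's implementation (which works per-dimension over dataset features):

```python
# Standalone sketch of Welford's online mean/variance update, the algorithm
# named for stats.py above. Illustrative only, not openphysical's code.
import math


class RunningStats:
    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0          # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count     # incremental mean update
        self.m2 += delta * (x - self.mean)  # uses both old and new mean
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    @property
    def std(self) -> float:
        return math.sqrt(self.m2 / self.count) if self.count else 0.0
```

The point of the algorithm: one pass over the data, constant memory, and numerically stable even for large datasets -- which is why it suits streaming episodes from a reader.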

Roadmap

The goal is to become the lm-eval-harness for physical AI -- a model-agnostic evaluation tool that works across benchmarks, simulators, and model types.

Where we are

  • Eval framework with 3 MuJoCo benchmark suites, TOML-defined custom suites, structured JSON output
  • Dataset conversion across HDF5, RLDS, LeRobot v3.0, Parquet
  • CLI with 10 commands, all working
  • 408 tests, Protocol-based architecture

Where we're going

| Priority | Area | Description | Status |
|---|---|---|---|
| P0 | LIBERO integration | Add LIBERO's 130 manipulation tasks as a benchmark suite | Not started |
| P0 | MetaWorld integration | Add MetaWorld's 50 tasks (MT10/MT50/ML10/ML45) | Not started |
| P0 | VLA eval pipeline | Wire act_with_image() into the runner so VLA models can be evaluated | Adapter code exists; pipeline integration needed |
| P1 | ManiSkill integration | Broad manipulation + locomotion tasks via SAPIEN | Not started |
| P1 | RoboCasa integration | Kitchen manipulation tasks (365 tasks, 2500 scenes) | Not started |
| P1 | GPU inference | End-to-end VLA inference -- pi0, OpenVLA, Octo with real weights | Adapter stubs exist; needs GPU validation |
| P1 | Standardized result format | Versioned JSON schema for cross-paper comparison | Partial (JSON output exists; schema not formalized) |
| P2 | Leaderboard | Hosted leaderboard for physical AI model comparison | Not started |
| P2 | More VLA adapters | RT-2, Diffusion Policy, ACT, GR00T | Adapter pattern exists |
| P2 | Real-to-sim correlation | SimplerEnv-style eval with ground truth comparison | Research phase |

Contributing

Robotics evaluation is fragmented. Every lab runs its own scripts, reports numbers differently, and results aren't comparable. We're building the tool to fix that.

The core framework is solid -- 408 tests, clean Protocol-based architecture. What we need is help expanding benchmark coverage and model support.

Highest-impact contributions right now:

  1. LIBERO / MetaWorld integration -- Wrap these benchmarks as openphysical benchmark suites so any model can be evaluated with openphysical eval MODEL --benchmark libero. The Environment protocol is reset() + step() + action_space + observation_space.

  2. VLA eval pipeline -- Connect act_with_image() to the episode runner so VLA models can participate in benchmarks. This is the critical missing piece.

  3. GPU testing -- If you have NVIDIA GPUs, help us validate VLA inference end-to-end with real pi0/OpenVLA/Octo weights.

  4. Standardized result schema -- Help define the JSON schema that papers should use to report results. Think: what fields would a leaderboard need?

git clone https://github.com/ArchishmanSengupta/openphysical.git
cd openphysical
make dev
make test

Open an issue or PR. No CLA, no bureaucracy.

Development

git clone https://github.com/ArchishmanSengupta/openphysical.git
cd openphysical
make dev        # install with dev dependencies
make test       # 408 tests
make lint       # ruff check + format check
make format     # auto-format

License

Apache-2.0


LLMs have lm-eval-harness. Physical AI has nothing. We're building it.

Star this repo if you think robotics evaluation should be standardized.