openphysical

The missing eval harness for physical AI.

Benchmark robotics foundation models, convert datasets, compute stats. No framework lock-in.



pip install openphysical
openphysical eval random --benchmark mujoco-locomotion --output results.json

Install · Quickstart · CLI Reference · Python API · Benchmarks · Roadmap · Contributing


The Problem

Evaluating robot learning models is broken.

LLMs have lm-eval-harness -- one tool, 60+ benchmarks, standardized metrics, a public leaderboard. Robotics has nothing equivalent. Every project is its own island:

  • OpenVLA evaluates on LIBERO with its own scripts. Octo evaluates on WidowX with its own scripts. pi0 evaluates on ALOHA with its own scripts. Results are not comparable.
  • Data formats are fragmented -- Open X-Embodiment uses RLDS (requires TensorFlow). Robomimic uses HDF5. LeRobot uses its own v3.0 format. PI's openpi requires LeRobot format for training. Converting between them means writing custom scripts.
  • Benchmarks exist but eval tooling doesn't -- LIBERO has 130 tasks, MetaWorld has 50, ManiSkill has dozens more. But none of them ship a model-agnostic evaluation harness. You must write your own rollout loop, metric aggregation, and reporting for each one.
  • No standardized result format -- every paper reports numbers differently. No two labs use the same episode counts, seeds, or success criteria.

openphysical is an attempt to fix this. It starts with what works today and builds toward the evaluation standard that physical AI needs.

What Works Today

openphysical is a CLI tool and Python library that currently does three things well:

1. Evaluation and benchmarking -- Run any policy against standardized MuJoCo task suites, compute structured metrics (mean/std/min/max reward, success rate, step counts), output JSON. Define custom benchmark suites in TOML.

openphysical eval random --benchmark mujoco-locomotion --output results.json
openphysical eval ./models/ppo_pendulum.zip --benchmark mujoco-control --output results.json

2. Dataset conversion -- Convert between HDF5 (robomimic), RLDS/TFRecord (Open X-Embodiment), LeRobot v3.0, and Parquet. Compute normalization statistics. Fully tested, deterministic output.

openphysical convert dataset.hdf5 --from hdf5 --to lerobot --output ./converted
openphysical stats dataset.hdf5 -f hdf5 -o stats.json

3. Simulation execution -- Run policies in MuJoCo environments, record video, load models from local files or Hugging Face Hub.

openphysical run random --env ant --save ant.mp4
openphysical run hf://user/my-model --env humanoid --save humanoid.mp4

What Does NOT Work Yet

Being honest about the current limits:

  • VLA inference -- Adapter code exists for pi0, OpenVLA, and Octo, but the act() methods are stubs (return zeros). Real GPU inference with actual model weights has not been validated. VLA models cannot participate in the eval pipeline today.
  • Manipulation benchmarks -- Only MuJoCo locomotion/control environments are supported. LIBERO, MetaWorld, ManiSkill, and RoboCasa are not yet integrated.
  • Training -- openphysical does not train models. Use LeRobot, openpi, or robomimic for training.
  • Real robot support -- No hardware abstraction. This is simulation-only.
  • Leaderboard -- No hosted leaderboard yet.

Install

pip install openphysical

Python 3.10+. No Docker. CPU works out of the box.

Optional extras
pip install openphysical[sb3]       # Stable Baselines3 model loading
pip install openphysical[hub]       # Hugging Face Hub downloads
pip install openphysical[convert]   # Dataset conversion (HDF5, RLDS, LeRobot, Parquet)
pip install openphysical[all]       # Everything

Quickstart

Benchmark a model:

openphysical eval random --benchmark mujoco-locomotion --output results.json

Convert a dataset:

openphysical convert dataset.hdf5 --from hdf5 --to lerobot --output ./converted

Run a policy and record video:

openphysical run random --env ant --save ant.mp4

Run a trained model from Hugging Face:

openphysical run hf://user/my-robot-model --env half-cheetah --save cheetah.mp4

Features

| Feature | Description |
|---|---|
| **Eval** | Structured benchmarking -- evaluate any policy against standardized suites with JSON output |
| **Benchmarks** | Built-in + custom -- mujoco-locomotion, mujoco-control, mujoco-full, or define TOML suites |
| **Metrics** | Standardized computation -- mean/std/min/max reward, success rate, step counts, aggregate scores |
| **Convert** | Dataset pipelines -- HDF5, RLDS/TFRecord to LeRobot v3.0 or Parquet, with full metadata |
| **Stats** | Norm stats -- compute normalization statistics (mean/std/min/max/quantiles) from any format |
| **Run** | One-command execution -- run any policy in any MuJoCo environment with video output |
| **Models** | Model loading -- local .zip, Hugging Face hf://, or built-in (random, zero) |
| **Envs** | 11 environments -- Ant, HalfCheetah, Hopper, Humanoid, Swimmer, Walker, Reacher, Pendulum, and more |
| **Extend** | Protocol-based -- no inheritance required, just implement reset() and act() |

Supported Environments

| Alias | Gymnasium ID |
|---|---|
| ant | Ant-v5 |
| half-cheetah | HalfCheetah-v5 |
| hopper | Hopper-v5 |
| humanoid | Humanoid-v5 |
| humanoid-standup | HumanoidStandup-v5 |
| swimmer | Swimmer-v5 |
| walker | Walker2d-v5 |
| reacher | Reacher-v5 |
| pendulum | InvertedPendulum-v5 |
| double-pendulum | InvertedDoublePendulum-v5 |
| pusher | Pusher-v5 |

Any valid Gymnasium environment ID also works directly: openphysical run random --env Hopper-v5

CLI Reference

openphysical eval -- Evaluate against benchmarks
Usage: openphysical eval [OPTIONS] MODEL

Options:
  -b, --benchmark TEXT    Benchmark name or path to TOML file  [required]
  -n, --episodes INTEGER  Override episodes per task
  --max-steps INTEGER     Override max steps per task
  --seed INTEGER          Override seed
  -o, --output PATH       Save results to JSON file

openphysical eval random --benchmark mujoco-control
openphysical eval ./models/ppo_pendulum.zip --benchmark mujoco-control --output results.json

openphysical convert -- Convert robot datasets
Usage: openphysical convert [OPTIONS] SOURCE

Options:
  --from TEXT             Source format (hdf5, rlds)  [required]
  --to TEXT               Target format (parquet, lerobot)  [required]
  -o, --output PATH       Output directory
  --state-key TEXT        HDF5 observation key  [default: joint_positions]
  --task-key TEXT         HDF5 attribute key for task label
  --robot-type TEXT       Robot type for metadata  [default: unknown]
  --fps INTEGER           Dataset FPS  [default: 10]

openphysical convert dataset.hdf5 --from hdf5 --to lerobot --output ./converted
openphysical convert ./rlds_dataset/ --from rlds --to lerobot --output ./converted

openphysical stats -- Compute normalization statistics
Usage: openphysical stats [OPTIONS] SOURCE

Options:
  -f, --format TEXT       Source format (hdf5, rlds)  [required]
  --state-key TEXT        HDF5 observation key  [default: joint_positions]
  --task-key TEXT         HDF5 attribute key for task label
  --include-images        Include image features in stats
  -o, --output PATH       Save stats to JSON file

openphysical stats dataset.hdf5 -f hdf5 -o stats.json
openphysical stats ./rlds_dataset/ -f rlds -o stats.json

openphysical run -- Run a policy in simulation
Usage: openphysical run [OPTIONS] MODEL

Options:
  -e, --env TEXT          Environment name or gymnasium ID  [required]
  -n, --episodes INTEGER  Number of episodes  [default: 1]
  --max-steps INTEGER     Max steps per episode  [default: 1000]
  --render                Open live render window
  -o, --save PATH         Save video to file
  --fps INTEGER           Video FPS  [default: 30]
  --seed INTEGER          Random seed

More commands

openphysical benchmarks          # List built-in benchmark suites
openphysical list models         # List built-in policy types
openphysical list envs           # List environment aliases
openphysical info ant            # Inspect observation/action spaces
openphysical world list          # List MuJoCo world files
openphysical world info f.xml    # Show world file details

Benchmarks

Three built-in benchmark suites ship with openphysical:

| Suite | Tasks | Environments |
|---|---|---|
| mujoco-locomotion | 6 | HalfCheetah, Hopper, Walker2d, Ant, Humanoid, Swimmer |
| mujoco-control | 4 | InvertedPendulum, InvertedDoublePendulum, Reacher, Pusher |
| mujoco-full | 10 | All of the above |

Custom benchmarks are defined as TOML files:

[benchmark]
name = "my-benchmark"
description = "Custom evaluation suite"
version = "1.0"

[[tasks]]
env_id = "HalfCheetah-v5"
name = "cheetah"
episodes = 3
reward_threshold = 4800.0

openphysical eval random --benchmark my_benchmark.toml

Model Resolution

When you run openphysical run <MODEL>, the model argument is resolved through this chain:

MODEL argument
  ├── hf://user/repo           → download from Hugging Face Hub → load as SB3
  ├── hf://user/repo/file.zip  → download specific file from HF Hub
  ├── hf://user/repo/f@rev     → pin to a specific revision
  ├── ./path/to/model.zip      → load local SB3 model
  ├── random                   → uniform random actions
  └── zero                     → zero actions

Python API

Evaluate against benchmarks

from openphysical.benchmark import create_benchmark_registry
from openphysical.eval import EvalRun, save_results
from openphysical.policy import RandomPolicy
from pathlib import Path

registry = create_benchmark_registry()
bench = registry.get("mujoco-control")

eval_run = EvalRun(
    benchmark=bench,
    policy_factory=lambda action_space: RandomPolicy(action_space),
    model_name="random",
)

results = eval_run.run()
print(f"Mean reward: {results.aggregate_mean_reward:.2f}")
print(f"Success rate: {results.aggregate_success_rate:.1%}")
save_results(results, Path("results.json"))

Run episodes

from openphysical.environment import GymnasiumEnv
from openphysical.policy import RandomPolicy
from openphysical.runner import EpisodeRunner
from openphysical.recorder import VideoRecorder
from pathlib import Path

env = GymnasiumEnv("Ant-v5", render_mode="rgb_array")
policy = RandomPolicy(env.action_space)
runner = EpisodeRunner(policy, env, max_steps=500, capture_frames=True)

results = runner.run(num_episodes=3)
for r in results:
    print(f"reward={r.total_reward:.2f} steps={r.steps}")

VideoRecorder.save(results[0].frames, Path("output.mp4"))
env.close()

Custom policies

Any class with reset() and act() works -- no inheritance required:

import numpy as np
from openphysical.types import Action, Observation, SpaceInfo

class MyPolicy:
    def __init__(self, action_space: SpaceInfo) -> None:
        self.space = action_space

    def reset(self) -> None:
        pass

    def act(self, observation: Observation) -> Action:
        return np.clip(observation[:self.space.shape[0]], self.space.low, self.space.high)

Dataset conversion

HDF5 to LeRobot v3.0:

from pathlib import Path
from openphysical.convert.readers import HDF5Reader
from openphysical.convert.lerobot_writer import LeRobotWriter

reader = HDF5Reader(Path("dataset.hdf5"), state_key="joint_positions")
episodes = list(reader.read())
meta = reader.infer_meta()

writer = LeRobotWriter(Path("./output"), meta=meta, fps=30)
writer.write(episodes)

RLDS/TFRecord to LeRobot v3.0:

from pathlib import Path
from openphysical.convert.rlds_reader import RLDSReader
from openphysical.convert.lerobot_writer import LeRobotWriter

reader = RLDSReader(Path("./rlds_dataset"))
episodes = list(reader.read())
meta = reader.infer_meta()

writer = LeRobotWriter(Path("./output"), meta=meta)
writer.write(episodes)

Compute normalization statistics:

from pathlib import Path
from openphysical.convert.readers import HDF5Reader
from openphysical.convert.stats import compute_norm_stats

reader = HDF5Reader(Path("dataset.hdf5"))
stats = compute_norm_stats(reader.read())
stats.save(Path("stats.json"))
print(stats.stats["state"].mean)
print(stats.stats["action"].std)

Load models from Hugging Face

from openphysical.hub import resolve_hf_model

model = resolve_hf_model("hf://user/my-robot-model/ppo_ant.zip")
print(model.local_path)

Architecture

src/openphysical/
    types.py              Protocol definitions (Policy, Environment) + data types
    config.py             Environment aliases and defaults
    policy.py             Built-in policies (RandomPolicy, ZeroPolicy)
    policies_sb3.py       Stable Baselines3 model loader
    policies_vla.py       VLA adapters (pi0, OpenVLA, Octo) [stubs -- see roadmap]
    environment.py        Gymnasium/MuJoCo wrapper
    registry.py           Model and environment registries
    runner.py             Episode execution engine
    recorder.py           Video recording (imageio/ffmpeg)
    hub.py                Hugging Face Hub integration
    benchmark.py          Benchmark/Task definitions + built-in suites
    benchmark_loader.py   TOML benchmark loading + resolution
    metrics.py            Metrics computation
    eval.py               Evaluation orchestrator
    cli.py                CLI (Click)
    convert/              Dataset conversion pipeline
        readers.py        HDF5Reader (robomimic convention)
        rlds_reader.py    RLDSReader (Google RLDS/TFRecord format)
        writers.py        ParquetWriter (Parquet + metadata)
        lerobot_writer.py LeRobotWriter (LeRobot v3.0 format)
        stats.py          Normalization statistics (Welford online algorithm)

Protocol-based architecture -- Policy, Environment, DatasetReader, DatasetWriter are all typing.Protocol classes. No inheritance trees. Just implement the right methods.
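stats.py uses Welford's online algorithm for normalization statistics. As a refresher, the core single-pass update looks like this -- a standalone sketch, not the library's implementation (which works per-dimension over dataset features):

```python
# Standalone sketch of Welford's online mean/variance update, the algorithm
# named for stats.py above. Illustrative only, not openphysical's code.
import math


class RunningStats:
    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0          # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count     # incremental mean update
        self.m2 += delta * (x - self.mean)  # uses both old and new mean
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    @property
    def std(self) -> float:
        return math.sqrt(self.m2 / self.count) if self.count else 0.0
```

The point of the algorithm: one pass over the data, constant memory, and numerically stable even for large datasets -- which is why it suits streaming episodes from a reader.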

Roadmap

The goal is to become the lm-eval-harness for physical AI -- a model-agnostic evaluation tool that works across benchmarks, simulators, and model types.

Where we are

  • Eval framework with 3 MuJoCo benchmark suites, TOML-defined custom suites, structured JSON output
  • Dataset conversion across HDF5, RLDS, LeRobot v3.0, Parquet
  • CLI with 10 commands, all working
  • 408 tests, Protocol-based architecture

Where we're going

| Priority | Area | Description | Status |
|---|---|---|---|
| P0 | LIBERO integration | Add LIBERO's 130 manipulation tasks as a benchmark suite | Not started |
| P0 | MetaWorld integration | Add MetaWorld's 50 tasks (MT10/MT50/ML10/ML45) | Not started |
| P0 | VLA eval pipeline | Wire act_with_image() into the runner so VLA models can be evaluated | Adapter code exists; pipeline integration needed |
| P1 | ManiSkill integration | Broad manipulation + locomotion tasks via SAPIEN | Not started |
| P1 | RoboCasa integration | Kitchen manipulation tasks (365 tasks, 2500 scenes) | Not started |
| P1 | GPU inference | End-to-end VLA inference -- pi0, OpenVLA, Octo with real weights | Adapter stubs exist; needs GPU validation |
| P1 | Standardized result format | Versioned JSON schema for cross-paper comparison | Partial (JSON output exists; schema not formalized) |
| P2 | Leaderboard | Hosted leaderboard for physical AI model comparison | Not started |
| P2 | More VLA adapters | RT-2, Diffusion Policy, ACT, GR00T | Adapter pattern exists |
| P2 | Real-to-sim correlation | SimplerEnv-style eval with ground truth comparison | Research phase |

Contributing

Robotics evaluation is fragmented. Every lab runs its own scripts, reports numbers differently, and results aren't comparable. We're building the tool to fix that.

The core framework is solid -- 408 tests, clean Protocol-based architecture. What we need is help expanding benchmark coverage and model support.

Highest-impact contributions right now:

  1. LIBERO / MetaWorld integration -- Wrap these benchmarks as openphysical benchmark suites so any model can be evaluated with openphysical eval MODEL --benchmark libero. The Environment protocol is reset() + step() + action_space + observation_space.

  2. VLA eval pipeline -- Connect act_with_image() to the episode runner so VLA models can participate in benchmarks. This is the critical missing piece.

  3. GPU testing -- If you have NVIDIA GPUs, help us validate VLA inference end-to-end with real pi0/OpenVLA/Octo weights.

  4. Standardized result schema -- Help define the JSON schema that papers should use to report results. Think: what fields would a leaderboard need?

git clone https://github.com/ArchishmanSengupta/openphysical.git
cd openphysical
make dev
make test

Open an issue or PR. No CLA, no bureaucracy.

Development

git clone https://github.com/ArchishmanSengupta/openphysical.git
cd openphysical
make dev        # install with dev dependencies
make test       # 408 tests
make lint       # ruff check + format check
make format     # auto-format

License

Apache-2.0


LLMs have lm-eval-harness. Physical AI has nothing. We're building it.

Star this repo if you think robotics evaluation should be standardized.