ArchishmanSengupta/openphysical
Benchmark robot policies and convert datasets. No framework lock-in.
openphysical
The missing eval harness for physical AI.
Benchmark robotics foundation models, convert datasets, compute stats. No framework lock-in.
pip install openphysical
openphysical eval random --benchmark mujoco-locomotion --output results.json

Install · Quickstart · CLI Reference · Python API · Benchmarks · Roadmap · Contributing
The Problem
Evaluating robot learning models is broken.
LLMs have lm-eval-harness -- one tool, 60+ benchmarks, standardized metrics, a public leaderboard. Robotics has nothing equivalent. Every project is its own island:
- OpenVLA evaluates on LIBERO with its own scripts. Octo evaluates on WidowX with its own scripts. pi0 evaluates on ALOHA with its own scripts. Results are not comparable.
- Data formats are fragmented -- Open X-Embodiment uses RLDS (requires TensorFlow). Robomimic uses HDF5. LeRobot uses its own v3.0 format. PI's openpi requires LeRobot format for training. Converting between them means writing custom scripts.
- Benchmarks exist but eval tooling doesn't -- LIBERO has 130 tasks, MetaWorld has 50, ManiSkill has dozens more. But none of them ship a model-agnostic evaluation harness. You must write your own rollout loop, metric aggregation, and reporting for each one.
- No standardized result format -- every paper reports numbers differently. No two labs use the same episode counts, seeds, or success criteria.
openphysical is an attempt to fix this -- starting with what works today and building toward the evaluation standard that physical AI needs.
What Works Today
openphysical is a CLI tool and Python library that currently does three things well:
1. Evaluation and benchmarking -- Run any policy against standardized MuJoCo task suites, compute structured metrics (mean/std/min/max reward, success rate, step counts), output JSON. Define custom benchmark suites in TOML.
openphysical eval random --benchmark mujoco-locomotion --output results.json
openphysical eval ./models/ppo_pendulum.zip --benchmark mujoco-control --output results.json

2. Dataset conversion -- Convert between HDF5 (robomimic), RLDS/TFRecord (Open X-Embodiment), LeRobot v3.0, and Parquet. Compute normalization statistics. Fully tested, deterministic output.
openphysical convert dataset.hdf5 --from hdf5 --to lerobot --output ./converted
openphysical stats dataset.hdf5 -f hdf5 -o stats.json

3. Simulation execution -- Run policies in MuJoCo environments, record video, load models from local files or Hugging Face Hub.
openphysical run random --env ant --save ant.mp4
openphysical run hf://user/my-model --env humanoid --save humanoid.mp4

What does NOT work yet
Being honest about the current limits:
- VLA inference -- Adapter code exists for pi0, OpenVLA, and Octo, but the act() methods are stubs (they return zeros). Real GPU inference with actual model weights has not been validated, so VLA models cannot participate in the eval pipeline today.
- Manipulation benchmarks -- Only MuJoCo locomotion/control environments are supported. LIBERO, MetaWorld, ManiSkill, and RoboCasa are not yet integrated.
- Training -- openphysical does not train models. Use LeRobot, openpi, or robomimic for training.
- Real robot support -- No hardware abstraction. This is simulation-only.
- Leaderboard -- No hosted leaderboard yet.
Install
pip install openphysical

Python 3.10+. No Docker. CPU works out of the box.
Optional extras
pip install openphysical[sb3] # Stable Baselines3 model loading
pip install openphysical[hub] # Hugging Face Hub downloads
pip install openphysical[convert] # Dataset conversion (HDF5, RLDS, LeRobot, Parquet)
pip install openphysical[all]      # Everything

Quickstart
Benchmark a model:
openphysical eval random --benchmark mujoco-locomotion --output results.json

Convert a dataset:
openphysical convert dataset.hdf5 --from hdf5 --to lerobot --output ./converted

Run a policy and record video:
openphysical run random --env ant --save ant.mp4

Run a trained model from Hugging Face:
openphysical run hf://user/my-robot-model --env half-cheetah --save cheetah.mp4

Features
| Feature | Summary | Description |
|---|---|---|
| Eval | Structured benchmarking | Evaluate any policy against standardized suites with JSON output |
| Benchmarks | Built-in + custom | mujoco-locomotion, mujoco-control, mujoco-full, or define TOML suites |
| Metrics | Standardized computation | Mean/std/min/max reward, success rate, step counts, aggregate scores |
| Convert | Dataset pipelines | HDF5, RLDS/TFRecord to LeRobot v3.0 or Parquet, with full metadata |
| Stats | Norm stats | Compute normalization statistics (mean/std/min/max/quantiles) from any format |
| Run | One-command execution | Run any policy in any MuJoCo environment with video output |
| Models | Model loading | Local .zip, Hugging Face hf://, or built-in (random, zero) |
| Envs | 11 environments | Ant, HalfCheetah, Hopper, Humanoid, Swimmer, Walker, Reacher, Pendulum, and more |
| Extend | Protocol-based | No inheritance required -- just implement reset() and act() |
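As an illustration of the standardized metrics above, here is a minimal sketch of how per-episode rewards could be aggregated. The `aggregate` helper and its field names are hypothetical, not part of the library's API; the actual computation lives in metrics.py.

```python
import statistics

def aggregate(rewards, steps, threshold=None):
    # Hypothetical helper: collapse per-episode rewards/steps into the
    # summary fields described above (mean/std/min/max, success rate).
    result = {
        "mean_reward": statistics.mean(rewards),
        "std_reward": statistics.pstdev(rewards),
        "min_reward": min(rewards),
        "max_reward": max(rewards),
        "mean_steps": statistics.mean(steps),
    }
    if threshold is not None:
        # An episode counts as a success when it meets the reward threshold
        result["success_rate"] = sum(r >= threshold for r in rewards) / len(rewards)
    return result
```

A benchmark's reward_threshold (see the TOML example below in Benchmarks) is the natural source for the success cutoff.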
Supported Environments
| Alias | Gymnasium ID |
|---|---|
| ant | Ant-v5 |
| half-cheetah | HalfCheetah-v5 |
| hopper | Hopper-v5 |
| humanoid | Humanoid-v5 |
| humanoid-standup | HumanoidStandup-v5 |
| swimmer | Swimmer-v5 |
| walker | Walker2d-v5 |
| reacher | Reacher-v5 |
| pendulum | InvertedPendulum-v5 |
| double-pendulum | InvertedDoublePendulum-v5 |
| pusher | Pusher-v5 |
Any valid Gymnasium environment ID also works directly: openphysical run random --env Hopper-v5
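Conceptually, alias resolution is a dictionary lookup with pass-through. The sketch below is illustrative only (the real table lives in config.py), with a truncated alias map:

```python
# Illustrative subset of the alias table; the full mapping is in config.py.
ENV_ALIASES = {
    "ant": "Ant-v5",
    "half-cheetah": "HalfCheetah-v5",
    "hopper": "Hopper-v5",
    "pendulum": "InvertedPendulum-v5",
}

def resolve_env(name: str) -> str:
    # Known aliases map to their Gymnasium ID; anything else is passed
    # through unchanged and treated as a raw Gymnasium environment ID.
    return ENV_ALIASES.get(name, name)
```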
CLI Reference
openphysical eval -- Evaluate against benchmarks
Usage: openphysical eval [OPTIONS] MODEL
Options:
-b, --benchmark TEXT Benchmark name or path to TOML file [required]
-n, --episodes INTEGER Override episodes per task
--max-steps INTEGER Override max steps per task
--seed INTEGER Override seed
-o, --output PATH Save results to JSON file
openphysical eval random --benchmark mujoco-control
openphysical eval ./models/ppo_pendulum.zip --benchmark mujoco-control --output results.json

openphysical convert -- Convert robot datasets
Usage: openphysical convert [OPTIONS] SOURCE
Options:
--from TEXT Source format (hdf5, rlds) [required]
--to TEXT Target format (parquet, lerobot) [required]
-o, --output PATH Output directory
--state-key TEXT HDF5 observation key [default: joint_positions]
--task-key TEXT HDF5 attribute key for task label
--robot-type TEXT Robot type for metadata [default: unknown]
--fps INTEGER Dataset FPS [default: 10]
openphysical convert dataset.hdf5 --from hdf5 --to lerobot --output ./converted
openphysical convert ./rlds_dataset/ --from rlds --to lerobot --output ./converted

openphysical stats -- Compute normalization statistics
Usage: openphysical stats [OPTIONS] SOURCE
Options:
-f, --format TEXT Source format (hdf5, rlds) [required]
--state-key TEXT HDF5 observation key [default: joint_positions]
--task-key TEXT HDF5 attribute key for task label
--include-images Include image features in stats
-o, --output PATH Save stats to JSON file
openphysical stats dataset.hdf5 -f hdf5 -o stats.json
openphysical stats ./rlds_dataset/ -f rlds -o stats.json

openphysical run -- Run a policy in simulation
Usage: openphysical run [OPTIONS] MODEL
Options:
-e, --env TEXT Environment name or gymnasium ID [required]
-n, --episodes INTEGER Number of episodes [default: 1]
--max-steps INTEGER Max steps per episode [default: 1000]
--render Open live render window
-o, --save PATH Save video to file
--fps INTEGER Video FPS [default: 30]
--seed INTEGER Random seed
More commands
openphysical benchmarks # List built-in benchmark suites
openphysical list models # List built-in policy types
openphysical list envs # List environment aliases
openphysical info ant # Inspect observation/action spaces
openphysical world list # List MuJoCo world files
openphysical world info f.xml   # Show world file details

Benchmarks
Three built-in benchmark suites ship with openphysical:
| Suite | Tasks | Environments |
|---|---|---|
| mujoco-locomotion | 6 | HalfCheetah, Hopper, Walker2d, Ant, Humanoid, Swimmer |
| mujoco-control | 4 | InvertedPendulum, InvertedDoublePendulum, Reacher, Pusher |
| mujoco-full | 10 | All of the above |
Custom benchmarks are defined as TOML files:
[benchmark]
name = "my-benchmark"
description = "Custom evaluation suite"
version = "1.0"
[[tasks]]
env_id = "HalfCheetah-v5"
name = "cheetah"
episodes = 3
reward_threshold = 4800.0

openphysical eval random --benchmark my_benchmark.toml

Model Resolution
When you run openphysical run <MODEL>, the model argument is resolved through this chain:
MODEL argument
├── hf://user/repo → download from Hugging Face Hub → load as SB3
├── hf://user/repo/file.zip → download specific file from HF Hub
├── hf://user/repo/f@rev → pin to a specific revision
├── ./path/to/model.zip → load local SB3 model
├── random → uniform random actions
└── zero → zero actions
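The resolution chain above amounts to prefix dispatch. Here is a simplified sketch (the function name and return shape are illustrative; the real loader also downloads Hub files, handles `@rev` pins, and instantiates SB3 policies):

```python
def resolve_model(spec: str) -> tuple[str, str]:
    # Hypothetical simplification of the resolution chain:
    # builtin names first, then Hub URIs, then local paths.
    if spec in ("random", "zero"):
        return ("builtin", spec)
    if spec.startswith("hf://"):
        return ("huggingface", spec[len("hf://"):])
    return ("local", spec)
```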
Python API
Evaluate against benchmarks
from openphysical.benchmark import create_benchmark_registry
from openphysical.eval import EvalRun, save_results
from openphysical.policy import RandomPolicy
from pathlib import Path
registry = create_benchmark_registry()
bench = registry.get("mujoco-control")
eval_run = EvalRun(
benchmark=bench,
policy_factory=lambda action_space: RandomPolicy(action_space),
model_name="random",
)
results = eval_run.run()
print(f"Mean reward: {results.aggregate_mean_reward:.2f}")
print(f"Success rate: {results.aggregate_success_rate:.1%}")
save_results(results, Path("results.json"))

Run episodes
from openphysical.environment import GymnasiumEnv
from openphysical.policy import RandomPolicy
from openphysical.runner import EpisodeRunner
from openphysical.recorder import VideoRecorder
from pathlib import Path
env = GymnasiumEnv("Ant-v5", render_mode="rgb_array")
policy = RandomPolicy(env.action_space)
runner = EpisodeRunner(policy, env, max_steps=500, capture_frames=True)
results = runner.run(num_episodes=3)
for r in results:
print(f"reward={r.total_reward:.2f} steps={r.steps}")
VideoRecorder.save(results[0].frames, Path("output.mp4"))
env.close()

Custom policies
Any class with reset() and act() works -- no inheritance required:
import numpy as np
from openphysical.types import Action, Observation, SpaceInfo
class MyPolicy:
    def __init__(self, action_space: SpaceInfo) -> None:
        self.space = action_space

    def reset(self) -> None:
        pass

    def act(self, observation: Observation) -> Action:
        # Echo the leading observation dims, clipped to the action bounds
        return np.clip(observation[: self.space.shape[0]], self.space.low, self.space.high)

Dataset conversion
HDF5 to LeRobot v3.0:
from pathlib import Path
from openphysical.convert.readers import HDF5Reader
from openphysical.convert.lerobot_writer import LeRobotWriter
reader = HDF5Reader(Path("dataset.hdf5"), state_key="joint_positions")
episodes = list(reader.read())
meta = reader.infer_meta()
writer = LeRobotWriter(Path("./output"), meta=meta, fps=30)
writer.write(episodes)

RLDS/TFRecord to LeRobot v3.0:
from pathlib import Path
from openphysical.convert.rlds_reader import RLDSReader
from openphysical.convert.lerobot_writer import LeRobotWriter
reader = RLDSReader(Path("./rlds_dataset"))
episodes = list(reader.read())
meta = reader.infer_meta()
writer = LeRobotWriter(Path("./output"), meta=meta)
writer.write(episodes)

Compute normalization statistics:
from pathlib import Path
from openphysical.convert.readers import HDF5Reader
from openphysical.convert.stats import compute_norm_stats
reader = HDF5Reader(Path("dataset.hdf5"))
stats = compute_norm_stats(reader.read())
stats.save(Path("stats.json"))
print(stats.stats["state"].mean)
print(stats.stats["action"].std)Load models from Hugging Face
from openphysical.hub import resolve_hf_model
model = resolve_hf_model("hf://user/my-robot-model/ppo_ant.zip")
print(model.local_path)

Architecture
src/openphysical/
types.py Protocol definitions (Policy, Environment) + data types
config.py Environment aliases and defaults
policy.py Built-in policies (RandomPolicy, ZeroPolicy)
policies_sb3.py Stable Baselines3 model loader
policies_vla.py VLA adapters (pi0, OpenVLA, Octo) [stubs -- see roadmap]
environment.py Gymnasium/MuJoCo wrapper
registry.py Model and environment registries
runner.py Episode execution engine
recorder.py Video recording (imageio/ffmpeg)
hub.py Hugging Face Hub integration
benchmark.py Benchmark/Task definitions + built-in suites
benchmark_loader.py TOML benchmark loading + resolution
metrics.py Metrics computation
eval.py Evaluation orchestrator
cli.py CLI (Click)
convert/ Dataset conversion pipeline
readers.py HDF5Reader (robomimic convention)
rlds_reader.py RLDSReader (Google RLDS/TFRecord format)
writers.py ParquetWriter (Parquet + metadata)
lerobot_writer.py LeRobotWriter (LeRobot v3.0 format)
stats.py Normalization statistics (Welford online algorithm)
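The Welford online algorithm noted for stats.py computes mean and variance in a single pass without storing the data. A minimal self-contained sketch (class name is illustrative, not the library's):

```python
class WelfordAccumulator:
    """One-pass running mean/variance (Welford's online algorithm)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        # Uses the *updated* mean, which is what keeps this numerically stable
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; divide by (count - 1) for the sample variance
        return self.m2 / self.count if self.count else 0.0
```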
Protocol-based architecture -- Policy, Environment, DatasetReader, DatasetWriter are all typing.Protocol classes. No inheritance trees. Just implement the right methods.
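To make the Protocol idea concrete, here is a sketch of what a structural Policy contract looks like. The method names match the custom-policy section above, but this definition is illustrative; see types.py for the real one.

```python
from typing import Protocol, runtime_checkable

import numpy as np

@runtime_checkable
class Policy(Protocol):
    """Illustrative structural contract: any object with these methods qualifies."""
    def reset(self) -> None: ...
    def act(self, observation: np.ndarray) -> np.ndarray: ...

class ConstantPolicy:
    """Satisfies Policy structurally -- no inheritance required."""
    def __init__(self, action_dim: int) -> None:
        self.action_dim = action_dim

    def reset(self) -> None:
        pass

    def act(self, observation: np.ndarray) -> np.ndarray:
        return np.zeros(self.action_dim)
```

Note that runtime_checkable isinstance() checks only verify method presence, not signatures; static type checkers like mypy verify the full signatures.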
Roadmap
The goal is to become the lm-eval-harness for physical AI -- a model-agnostic evaluation tool that works across benchmarks, simulators, and model types.
Where we are
- Eval framework with 3 MuJoCo benchmark suites, TOML-defined custom suites, structured JSON output
- Dataset conversion across HDF5, RLDS, LeRobot v3.0, Parquet
- CLI with 10 commands, all working
- 408 tests, Protocol-based architecture
Where we're going
| Priority | Area | Description | Status |
|---|---|---|---|
| P0 | LIBERO integration | Add LIBERO's 130 manipulation tasks as a benchmark suite | Not started |
| P0 | MetaWorld integration | Add MetaWorld's 50 tasks (MT10/MT50/ML10/ML45) | Not started |
| P0 | VLA eval pipeline | Wire act_with_image() into the runner so VLA models can be evaluated | Adapter code exists, pipeline integration needed |
| P1 | ManiSkill integration | Broad manipulation + locomotion tasks via SAPIEN | Not started |
| P1 | RoboCasa integration | Kitchen manipulation tasks (365 tasks, 2500 scenes) | Not started |
| P1 | GPU inference | End-to-end VLA inference -- pi0, OpenVLA, Octo with real weights | Adapter stubs exist, needs GPU validation |
| P1 | Standardized result format | Versioned JSON schema for cross-paper comparison | Partial (JSON output exists, schema not formalized) |
| P2 | Leaderboard | Hosted leaderboard for physical AI model comparison | Not started |
| P2 | More VLA adapters | RT-2, Diffusion Policy, ACT, GR00T | Adapter pattern exists |
| P2 | Real-to-sim correlation | SimplerEnv-style eval with ground truth comparison | Research phase |
Contributing
Robotics evaluation is fragmented. Every lab runs its own scripts and reports numbers differently, so results aren't comparable. We're building the tool to fix that.
The core framework is solid -- 408 tests, clean Protocol-based architecture. What we need is help expanding benchmark coverage and model support.
Highest-impact contributions right now:
- LIBERO / MetaWorld integration -- Wrap these benchmarks as openphysical benchmark suites so any model can be evaluated with openphysical eval MODEL --benchmark libero. The Environment protocol is reset() + step() + action_space + observation_space.
- VLA eval pipeline -- Connect act_with_image() to the episode runner so VLA models can participate in benchmarks. This is the critical missing piece.
- GPU testing -- If you have NVIDIA GPUs, help us validate VLA inference end-to-end with real pi0/OpenVLA/Octo weights.
- Standardized result schema -- Help define the JSON schema that papers should use to report results. Think: what fields would a leaderboard need?
git clone https://github.com/ArchishmanSengupta/openphysical.git
cd openphysical
make dev
make test

Open an issue or PR. No CLA, no bureaucracy.
Development
git clone https://github.com/ArchishmanSengupta/openphysical.git
cd openphysical
make dev # install with dev dependencies
make test # 408 tests
make lint # ruff check + format check
make format   # auto-format

License
LLMs have lm-eval-harness. Physical AI has nothing. We're building it.
Star this repo if you think robotics evaluation should be standardized.