RoboTwin → MetaSim / Sapien integration#
RoboTwin is a 50-task
dual-arm tabletop benchmark built on SAPIEN 3.0.0b1 + mplib + curobo.
Tasks live under envs/<task>.py and each declares its own scene
setup, success criterion, and scripted-policy data collector via raw
SAPIEN API.
Status#
- Full policy-reproduction pipeline (collect → train → eval): a RoboVerse user can collect RoboTwin expert demos, train a Diffusion Policy with RoboVerse’s own roboverse_learn/il, and evaluate it closed-loop in the native RoboTwin env. This is the same three-step experience as RoboTwin, with a directly comparable success rate. See Policy reproduction. The data path runs through the unmodified data2zarr_dp.py; the eval rolls the policy through RoboTwin’s own take_action interface.
- Object fidelity is exact, suite-wide: the bridge records each manipulated object’s real RoboTwin asset so the replay loads the same mesh/URDF, and it gets the exact instance, not just the category. For mesh objects it captures the model_id; for URDF objects (pot/cabinet/laptop/microwave) it hooks rand_create_sapien_urdf_obj to record the precise instance directory (RoboTwin picks a random modelid per episode, excluding the visual/ dir) plus the model_data.json scale. Objects created multiple times under the same name (e.g. two 001_bottle, three blocks, three bottles in put_bottles_dustbin) are disambiguated per-instance by creation order, so every object is kept; a name-keyed capture silently dropped all but one. A full re-collection found 13 of 50 tasks create same-named duplicates; all now replay every object.
- Rendering is 1:1 in geometry, with a documented engine residual: the side-by-side (sidebyside.py) puts the native RoboTwin render next to the RoboVerse replay from an identical camera and the same bridge trajectory. Robot pose, object instances/positions/motion, table, and ground match frame-for-frame. Both render ray-traced with matched settings (32 samples, path depth 8); the only residual is a background colour tint, because the two SAPIEN builds (RoboTwin’s 3.0.0b1 vs MetaSim’s) use different default RT environment maps and neither sets one explicitly. This is an engine-build difference, not a reproduction error.
- All 50 tasks collect successfully (breadth): a full sweep (tools/robotwin_integration/coverage_sweep.py) ran every registered RoboTwin task through the native code path + data bridge. 50/50 plan and check successfully and emit a dense bimanual trajectory (78–662 frames; some need up to seed 7). This is collection-success across the whole suite, not one hand-picked task.
- Replay parity is measured, not asserted: the parity harness (tools/robotwin_integration/parity_robotwin.py) replays the native command-target stream on RoboVerse-SAPIEN3 and compares RoboVerse’s achieved joint state against RoboTwin’s achieved joint state (entity.get_qpos(), captured by the bridge; not the command target, which would be circular). On beat_block_hammer the per-joint achieved delta converges with replay resolution: 0.44 → 0.088 → 0.027 → 0.0059 rad max (mean 0.033 → 0.0008 rad) at settle = 1/4/8/16. The residual is open-loop replay under-stepping, not a mapping error: same URDF, same backend family.
- Embodiment loads in RoboVerse: RoboTwin’s ALOHA-AgileX (arx5_description_isaac.urdf, 38 DoF: dual 6-DoF arms with 2-finger mimic grippers + mobile base + sensor mast) loads and steps in MetaSim/Sapien3 after one small handler fix (fix/sapien3-passive-joints).
- Native passthrough is 1:1 by construction: with RoboTwin’s deps installed in a dedicated robotwin conda env, RoboTwin/<task> resolves to the live native task (see _passthrough.py), with the same sim, planner, and check_success() as upstream, the way the ManiSkill passthrough is identical to native ManiSkill. The two-env split is required because RoboTwin pins SAPIEN 3.0.0b1 / mplib 0.2.1 / curobo, which conflict with the roboverse env’s SAPIEN.
- Mesh-faithful, 1:1-verified replay: the replay (tools/robotwin_integration/mesh_replay_robotwin.py) loads the real RoboTwin object meshes (rigid GLB/OBJ; URDF articulations baked to textured GLB or driven as articulations when they move, so doors/lids open), with ray-traced rendering (--rt) matching RoboTwin. A native-vs-RoboVerse side-by-side (sidebyside.py, ground truth from native_render.py --replay-bridge) confirms robot pose + object positions + motion + camera + RT lighting are frame-for-frame 1:1 (the native side replays the same bridge trajectory, so it is the identical episode, not a coincidental match).
- Genuine limitations (stated plainly): the bridge/replay path is open-loop state replay. A tight delta proves trajectory fidelity, not dynamical equivalence, and the path runs no planner/policy in RoboVerse. The separate physics object-parity (objects move by contact, not teleported) reaches ≤5 cm for ~26/46 tasks and diverges for complex contact (the open-loop limit). Pixel-level render parity is bounded by the engines (RT vs. RoboTwin’s exact lights); a moving URDF object renders untextured (sapien3’s articulation loader drops .mtl).
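The duplicate-name point above (a name-keyed capture keeps only one of several same-named objects) can be illustrated with a minimal sketch. The helper name key_by_creation_order and the "name#index" key format are assumptions made for illustration; the bridge's actual schema is not shown in this document.

```python
from collections import defaultdict


def key_by_creation_order(created_names):
    """Give every created object a unique key, even when names repeat.

    Hypothetical helper: the "<name>#<index>" format is an assumption.
    """
    seen = defaultdict(int)  # how many times each name has appeared so far
    keys = []
    for name in created_names:
        keys.append(f"{name}#{seen[name]}")
        seen[name] += 1
    return keys


created = ["001_bottle", "001_bottle", "dustbin"]

# Name-keyed capture: the second bottle overwrites the first entry.
name_keyed = {name: index for index, name in enumerate(created)}

# Creation-order capture: every instance gets its own key.
instance_keyed = key_by_creation_order(created)
```

Here name_keyed ends up with two entries for three objects, while instance_keyed keeps all three.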
1:1 visualization — all 50 tasks#
Every task rendered native RoboTwin (left) vs RoboVerse replay (right), same observer pose, frame-for-frame: the RoboVerse replay is driven by the same recorded bridge trajectory (native_render.py --replay-bridge), so robot pose + every object (mesh, instance, pose) line up 1:1 — only cross-engine texture shading differs.
Regenerate any of the 50 clips with one command (swap --task <name> for any task below):
# native RoboTwin (robotwin env) + RoboVerse replay (roboverse env), composited side-by-side
conda run -n roboverse python tools/robotwin_integration/sidebyside.py --task move_can_pot
# -> outputs/robotwin_coverage/sidebyside_move_can_pot.mp4
All 50 task names + regenerate the whole gallery
adjust_bottle · beat_block_hammer · blocks_ranking_rgb · blocks_ranking_size · click_alarmclock
click_bell · dump_bin_bigbin · grab_roller · handover_block · handover_mic
hanging_mug · lift_pot · move_can_pot · move_pillbottle_pad · move_playingcard_away
move_stapler_pad · open_laptop · open_microwave · pick_diverse_bottles · pick_dual_bottles
place_a2b_left · place_a2b_right · place_bread_basket · place_bread_skillet · place_burger_fries
place_can_basket · place_cans_plasticbox · place_container_plate · place_dual_shoes · place_empty_cup
place_fan · place_mouse_pad · place_object_basket · place_object_scale · place_object_stand
place_phone_stand · place_shoe · press_stapler · put_bottles_dustbin · put_object_cabinet
rotate_qrcode · scan_object · shake_bottle · shake_bottle_horizontally · stack_blocks_three
stack_blocks_two · stack_bowls_three · stack_bowls_two · stamp_seal · turn_switch
for t in \
adjust_bottle beat_block_hammer blocks_ranking_rgb blocks_ranking_size \
click_alarmclock click_bell dump_bin_bigbin grab_roller \
handover_block handover_mic hanging_mug lift_pot \
move_can_pot move_pillbottle_pad move_playingcard_away move_stapler_pad \
open_laptop open_microwave pick_diverse_bottles pick_dual_bottles \
place_a2b_left place_a2b_right place_bread_basket place_bread_skillet \
place_burger_fries place_can_basket place_cans_plasticbox place_container_plate \
place_dual_shoes place_empty_cup place_fan place_mouse_pad \
place_object_basket place_object_scale place_object_stand place_phone_stand \
place_shoe press_stapler put_bottles_dustbin put_object_cabinet \
rotate_qrcode scan_object shake_bottle shake_bottle_horizontally \
stack_blocks_three stack_blocks_two stack_bowls_three stack_bowls_two \
stamp_seal turn_switch ; do
conda run -n roboverse python tools/robotwin_integration/sidebyside.py --task "$t"
done
Grasp · tool · press (11)#
beat_block_hammer
click_bell
click_alarmclock
press_stapler
grab_roller
stamp_seal
rotate_qrcode
turn_switch
handover_block
handover_mic
move_playingcard_away
Place onto target (20)#
move_can_pot
move_pillbottle_pad
move_stapler_pad
place_a2b_left
place_a2b_right
place_bread_basket
place_bread_skillet
place_burger_fries
place_can_basket
place_cans_plasticbox
place_container_plate
place_dual_shoes
place_empty_cup
place_fan
place_mouse_pad
place_object_basket
place_object_scale
place_object_stand
place_phone_stand
place_shoe
Bottles · pick · shake (6)#
pick_diverse_bottles
pick_dual_bottles
shake_bottle
shake_bottle_horizontally
adjust_bottle
put_bottles_dustbin
Stack · rank (6)#
stack_blocks_two
stack_blocks_three
stack_bowls_two
stack_bowls_three
blocks_ranking_rgb
blocks_ranking_size
Articulated · container (URDF joints) (7)#
open_laptop
open_microwave
lift_pot
put_object_cabinet
dump_bin_bigbin
hanging_mug
scan_object
MetaSim fix that enables this#
The Sapien3Handler used to crash with KeyError when an active
URDF joint wasn’t enumerated in RobotCfg.actuators. Enumerating every
joint is the norm for most clean academic robots, but it is wrong for any
embodiment that bundles wheels, suspension, or a sensor mast: those
DoFs exist in the URDF but no one wants them in the actuator dict.
The fix (fix/sapien3-passive-joints) switches the lookup to
actuators.get(name) and skips undriven joints. default_joint_positions
gets the same treatment, defaulting to 0.0 for unenumerated joints.
Two-line change in _build_sapien, plus a regression test at
metasim/test/test_sapien3_passive_joints.py.
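The lookup change can be sketched in isolation as follows. This is a simplified stand-in, not the actual body of _build_sapien: the function name, the joint names, and the actuator payload are all hypothetical, and only the actuators.get(name) pattern and the 0.0 default come from the description above.

```python
def build_drive_targets(active_joint_names, actuators, default_joint_positions):
    """Return (driven, initial_qpos) for a URDF whose joints may be undriven.

    Hypothetical stand-in for the lookup inside _build_sapien.
    """
    driven = {}
    initial_qpos = {}
    for name in active_joint_names:
        # Unenumerated joints start at 0.0 instead of raising KeyError.
        initial_qpos[name] = default_joint_positions.get(name, 0.0)
        actuator = actuators.get(name)  # previously actuators[name]
        if actuator is None:
            continue  # passive joint (wheel, suspension, mast): not driven
        driven[name] = actuator
    return driven, initial_qpos


# Hypothetical joint names: one enumerated arm joint, one passive wheel.
driven, qpos0 = build_drive_targets(
    ["left_arm_joint1", "base_wheel_left"],
    actuators={"left_arm_joint1": {"stiffness": 1000.0}},
    default_joint_positions={"left_arm_joint1": 0.3},
)
```

The wheel joint still receives an initial position but no drive, which is the behaviour the fix describes.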
Asset layout#
| Bundle | Size | Needed? |
|---|---|---|
| | 220 MB | Yes: robot URDFs + meshes for all 5 robots |
| | 3.74 GB | Yes, for task scene actors (YCB-style) |
| | 11 GB | Domain-randomization training only |
| Full dataset | 1.47 TB | Demo trajectories + RL checkpoints; not needed for sim parity |
Self-contained replay (RoboTwin is deletable)#
The replay / side-by-side / object-parity pipeline does not need the upstream RoboTwin
checkout at runtime. Every asset a bridge references — object visual/collision meshes, URDF
instances, and the ALOHA-AgileX embodiment — is addressed by its RoboTwin-internal relpath
and resolved through one locator,
roboverse_pack/tasks/robotwin/_locator.py:
- a local RoboTwin clone: $ROBOTWIN_ASSETS or ~/projects/robotwin (dev / fresh collection);
- otherwise the vendored mirror roboverse_data/robotwin/ (HuggingFace RoboVerseOrg/roboverse_data), downloaded on demand, exactly like the mjlab / menagerie locators.
$ROBOTWIN_ASSETS is authoritative: set it to a non-existent path to force the mirror (this is
how the deletability test runs).
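The precedence described above can be sketched as a small function. This is an assumption-laden illustration, not the contents of _locator.py: the function name and signature are hypothetical, and the on-demand download of the mirror is left out.

```python
from pathlib import Path


def resolve_robotwin_root(env, home, mirror):
    """Pick the asset root following the precedence described above.

    Hypothetical sketch; the real logic lives in _locator.py and may differ.
    """
    override = env.get("ROBOTWIN_ASSETS")
    if override is not None:
        # The variable is authoritative: a missing path forces the mirror
        # instead of falling through to the default clone location.
        path = Path(override)
        return path if path.exists() else Path(mirror)
    clone = Path(home) / "projects" / "robotwin"
    return clone if clone.exists() else Path(mirror)


# Forcing the mirror, as the deletability test does.
forced = resolve_robotwin_root(
    {"ROBOTWIN_ASSETS": "/nonexistent/robotwin"},
    home="/nonexistent-home",
    mirror="roboverse_data/robotwin",
)
```

Passing the environment as a plain mapping keeps the sketch easy to exercise without touching the real process environment.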
Vendor the referenced subset once (only what the 50 bridges use — ~1.65 GB objects + 0.78 GB embodiment + slim RGB-stripped trajectories, not the 1.47 TB full dataset):
# Against a RoboTwin clone, copy the referenced subset into roboverse_data/robotwin/
python tools/robotwin_integration/migrate_assets.py # writes manifest.json
# Replay with the clone "deleted" — resolves everything from the mirror:
ROBOTWIN_ASSETS=/nonexistent MUJOCO_GL=egl python \
tools/robotwin_integration/mesh_replay_robotwin.py \
--bridge roboverse_data/robotwin/bridges/move_can_pot.pkl --mode kinematic --video
To make a fresh, clone-less machine work, upload the populated mirror to the HF dataset
(roboverse_data/ is git-ignored; it is the HF-backed store, not committed):
huggingface-cli upload RoboVerseOrg/roboverse_data roboverse_data/robotwin robotwin --repo-type dataset
The embodiment cfg (roboverse_pack/robots/aloha_agilex_cfg.py) resolves through the same locator.
Policy reproduction (same experience as RoboTwin)#
A RoboVerse user can reproduce a RoboTwin policy result end to end — collect
expert demos, train an imitation policy, evaluate it closed-loop — using
RoboVerse’s own imitation-learning stack (roboverse_learn/il), the same
three-step collect → train → eval flow a RoboTwin user runs. The trained
policy is evaluated closed-loop in the native RoboTwin environment (via the
passthrough), so its success rate is directly comparable to RoboTwin’s own
learned-policy baseline (not the ~100% scripted expert planner).
Cross-task results (closed-loop, native RoboTwin, 20 held-out seeds, 400-step
budget): beat_block_hammer 42% (precision strike, validated over two runs
45%+40%), move_can_pot 30% (pick-place), click_bell 50% (simple
single-arm press). All land at RoboTwin’s own DP baseline level — simplest task
highest, as expected. Successful episodes trigger check_success early, so the
policy genuinely completes the task rather than replaying. (A 40-demo / 300-epoch
run overfits to 15%; data volume + RoboTwin-matched n_action_steps=6 close the
gap.) The policy is stochastic (DDPM sampling), so each 20-episode rate has
run-to-run variance (±10–20 percentage points).
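A rough way to see why a 20-episode rate is this noisy is the binomial standard error. The sketch below is a back-of-envelope check that treats episodes as independent trials with a fixed success probability, which is an assumption and not something the eval harness computes.

```python
import math


def success_rate_stderr(successes, episodes):
    """Binomial standard error of an observed success rate."""
    rate = successes / episodes
    return math.sqrt(rate * (1.0 - rate) / episodes)


# 10 successes out of 20 episodes, i.e. the 50% click_bell rate quoted above.
stderr = success_rate_stderr(10, 20)
```

For 10/20 this gives a standard error of roughly 0.11, so a swing of about ten points between two 20-episode runs is expected from sampling alone.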
Eval robustness (read before trusting a number). The DP eval renders the
head-camera every step with RoboTwin’s RT shader (required for train/eval obs
parity). That RT render path intermittently deadlocks headless in upstream
sapien — an episode then hangs to its --per-ep-timeout and is counted a failure.
Two safeguards in eval_dp_robotwin.sh keep this from corrupting a result: it
waits for any DP-training process to release the GPU before loading the policy
server (train→eval contention makes the first inference hang), and it aborts
after 3 consecutive no-result hangs with the server log rather than burning
N × timeout and reporting a misleading 0/N. A genuine all-0/N should always
be investigated as a harness/hang issue, never reported as a policy result; the
rates above are from clean runs where all 20 episodes returned a real
success/failure.
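The consecutive-hang safeguard can be sketched as a tally loop. This is an illustration of the rule described above, not the logic of eval_dp_robotwin.sh itself: the function name and the True/False/None encoding of episode outcomes are hypothetical.

```python
def run_eval(episode_results, max_consecutive_hangs=3):
    """Tally episode outcomes, aborting on a run of no-result hangs.

    episode_results yields True (success), False (failure), or None
    (the episode hit its timeout without returning a result).
    Returns (successes, episodes, hangs).
    """
    successes = episodes = hangs = hangs_in_a_row = 0
    for result in episode_results:
        episodes += 1
        if result is None:
            hangs += 1  # an isolated hang is still counted as a failure
            hangs_in_a_row += 1
            if hangs_in_a_row >= max_consecutive_hangs:
                raise RuntimeError(
                    f"{hangs_in_a_row} consecutive no-result hangs; "
                    "inspect the server log instead of reporting a rate"
                )
            continue
        hangs_in_a_row = 0
        successes += bool(result)
    return successes, episodes, hangs


summary = run_eval([True, None, False, True])
```

Returning the hang count alongside the rate makes it possible to tell a clean run (zero hangs) from one where timeouts depressed the number.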
# 1. COLLECT — expert demos with head-camera RGB (robotwin env).
# One seed per subprocess under a timeout, so a headless-RT hang costs one
# seed, not the batch; gathers N distinct successful episodes.
bash tools/robotwin_integration/collect_demos_robust.sh \
--task beat_block_hammer \
--out-dir ~/projects/robotwin/data/_rv_bridge/bbh_train \
--want 40 --camera head_camera
# 2. TRAIN — RoboVerse Diffusion Policy on the RoboTwin demos (roboverse env).
# Converts bridge pkls -> demo dirs -> zarr (the *unmodified* data2zarr_dp.py)
# -> DP training, with the bimanual 14-D / 240x320 shape overrides.
bash tools/robotwin_integration/train_dp_robotwin.sh \
--task beat_block_hammer \
--bridge-dir ~/projects/robotwin/data/_rv_bridge/bbh_train \
--num 40 --epochs 300 --policy ddpm_unet
# 3. EVAL — closed-loop in native RoboTwin, one command (starts the policy
# server in the roboverse env, runs the env in the robotwin env, reports the
# success rate, tears the server down).
bash tools/robotwin_integration/eval_dp_robotwin.sh \
--task beat_block_hammer \
--ckpt il_outputs/ddpm_unet/beat_block_hammer/checkpoints/300.ckpt \
--num-eval 20 --start-seed 100
The state/action are non-circular: the policy’s state observation is
RoboTwin’s achieved joint qpos (real_vector), and the action it learns is the
command target (vector) — the same two signals the parity harness uses. The
eval rolls the policy out through env.take_action(action, 'qpos'), the exact
closed-loop interface RoboTwin’s own script/eval_policy.py uses (TOPP-
interpolates the 14-D waypoint, steps physics, fires eval_success on
check_success()).
Two implementation notes that make this work across the env split:
- Env-decoupled eval. The DP model + its deps run in the roboverse env, but the only closed-loop RoboTwin env runs in the robotwin env (conflicting SAPIEN/torch). dp_policy_server.py (roboverse env) serves inference over a socket and eval_robotwin_policy.py’s DPPolicy (robotwin env) is a thin client, mirroring RoboTwin’s own policy server/client split. eval_dp_robotwin.sh hides this behind one command.
- numba is optional. The IL image dataset jit-compiles its sampler with numba, which fails to import on numpy ≥ 2.0; it now falls back to a pure-numpy path so training runs on a modern-numpy roboverse env.
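The optional-numba pattern can be sketched as a guarded import with a no-op decorator. This is a generic illustration of the technique, not the dataset's actual code: count_windows is a hypothetical stand-in for the sampler, and catching Exception rather than only ImportError is an assumption about how the import fails.

```python
try:
    from numba import njit  # fast path when numba imports cleanly
except Exception:  # e.g. an import failure against an incompatible numpy
    def njit(*args, **kwargs):
        """No-op stand-in supporting both @njit and @njit(...)."""
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func


@njit
def count_windows(episode_len, horizon):
    """Count start indices whose window of `horizon` steps fits the episode."""
    total = 0
    for start in range(episode_len):
        if start + horizon <= episode_len:
            total += 1
    return total


n_windows = count_windows(10, 4)
```

The decorated function returns the same value on either path; only the speed differs.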
An open-loop action-replay baseline is built into the same eval harness
(--policy replay --bridge <pkl>): it feeds RoboTwin’s recorded action stream
back through take_action (TOPP, not the original curobo plan). On
beat_block_hammer it reproduces success 4/5 closed-loop — a sharp datapoint that
also motivates the reactive DP policy (open-loop replay has no feedback
correction; a learned policy does).
Data bridge#
RoboTwin demos are single-embodiment bimanual: one articulation whose
14-D action [L_arm(6), L_grip, R_arm(6), R_grip] drives both arms.
RoboVerse expresses this as one name-keyed robot entry — the one-robot
case of the same *_v2 format the multi-agent loader uses (see the
multi-agent dataset docs). Because RoboTwin
and RoboVerse both run SAPIEN3, dof-position-target replay reproduces the
recorded motion closely.
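The 14-D layout can be made concrete with a small splitting helper. The function name and the dictionary keys are hypothetical; only the [L_arm(6), L_grip, R_arm(6), R_grip] ordering comes from the description above.

```python
def split_bimanual_action(action):
    """Split a 14-D action laid out as [L_arm(6), L_grip, R_arm(6), R_grip]."""
    if len(action) != 14:
        raise ValueError(f"expected 14 values, got {len(action)}")
    return {
        "left_arm": list(action[0:6]),
        "left_gripper": action[6],
        "right_arm": list(action[7:13]),
        "right_gripper": action[13],
    }


# Using the index as the value makes the slicing easy to read off.
parts = split_bimanual_action(list(range(14)))
```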
The bridge is two halves, one per conda env, hand-off via a plain pickle:
- Collect (robotwin env): tools/robotwin_integration/collect_bridge.py drives a native RoboTwin task (the same _passthrough factory), retries seeds until one plans and checks successfully, and dumps per frame: the command-target vectors, RoboTwin’s achieved qpos real_vectors (entity.get_qpos(), injected via a runtime hook on get_obs, with no upstream edit), the achieved end-effector poses left/right_endpose, the per-frame world pose of every scene object object_traj (rigid actors and URDF articulations via get_all_articulations()), the articulation joint qpos object_joint_traj (so opening doors replay), and each object’s real mesh/URDF path object_meshes.
- Replay (roboverse env): tools/robotwin_integration/mesh_replay_robotwin.py converts the trajectory to *_v2 (shared roboverse_pack.tasks.robotwin._convert) and replays the ALOHA-AgileX embodiment with the real object meshes on SAPIEN3 to video. --mode kinematic is faithful playback (robot + objects teleported to the recorded state each frame); --mode physics drives the robot by command targets and lets objects move by contact (for object-pose parity). --rt ray-traces to match RoboTwin; --observer-cam --cam-pos/--cam-lookat/--fovy set a matched camera. (get_started/10_robotwin_aloha_replay.py is the minimal get-started version with a primitive object proxy.)
- Measure parity (roboverse env): tools/robotwin_integration/parity_robotwin.py reports the per-joint delta between RoboVerse-achieved and RoboTwin-achieved qpos (--settle N replay resolution; --all sweeps every pickle).
- Verify 1:1 (both envs): tools/robotwin_integration/sidebyside.py builds a native-vs-RoboVerse proof video for any task: it renders the RoboTwin ground truth (native_render.py --replay-bridge, which drives the native env from the same bridge trajectory instead of re-planning) and the RoboVerse replay from an identical camera, and composites them frame-for-frame.
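The achieved-vs-achieved comparison reduces to a max and mean absolute delta over all joints and frames. The sketch below shows that reduction on plain nested lists; the function name and the input layout are assumptions, and parity_robotwin.py may compute or report its numbers differently.

```python
def joint_parity(roboverse_qpos, robotwin_qpos):
    """Max and mean absolute per-joint delta (rad) over a trajectory.

    Both arguments are sequences of frames; each frame is a sequence of
    achieved joint positions. Shapes must match.
    """
    if len(roboverse_qpos) != len(robotwin_qpos):
        raise ValueError("trajectories differ in frame count")
    deltas = []
    for frame_a, frame_b in zip(roboverse_qpos, robotwin_qpos):
        if len(frame_a) != len(frame_b):
            raise ValueError("frames differ in joint count")
        deltas.extend(abs(a - b) for a, b in zip(frame_a, frame_b))
    return max(deltas), sum(deltas) / len(deltas)


# Two frames, two joints; only one joint in the first frame differs.
max_delta, mean_delta = joint_parity(
    [[0.0, 0.5], [1.0, 1.5]],
    [[0.0, 0.25], [1.0, 1.5]],
)
```

Comparing achieved state on both sides is what keeps the measurement non-circular: comparing against the command target would only test whether the replay issued the commands it was given.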
# 1. collect a demonstration natively, with achieved state + objects (robotwin env)
conda run -n robotwin env MUJOCO_GL=egl python \
tools/robotwin_integration/collect_bridge.py --task move_can_pot \
--out ~/projects/robotwin/data/_rv_bridge/move_can_pot.pkl
# 1b. (optional) sweep the whole 50-task suite -> coverage.json
conda run -n robotwin env MUJOCO_GL=egl SAPIEN_HEADLESS=1 python \
tools/robotwin_integration/coverage_sweep.py --max-seeds 8
# 2. mesh-faithful, ray-traced replay in RoboVerse (roboverse env)
MUJOCO_GL=egl python tools/robotwin_integration/mesh_replay_robotwin.py \
--bridge ~/projects/robotwin/data/_rv_bridge/move_can_pot.pkl --mode kinematic --video --rt
# 3. measure achieved-vs-achieved joint parity (roboverse env)
MUJOCO_GL=egl python tools/robotwin_integration/parity_robotwin.py \
--bridge ~/projects/robotwin/data/_rv_bridge/move_can_pot.pkl --settle 8
# 4. one-command native-vs-RoboVerse 1:1 side-by-side (roboverse env)
conda run -n roboverse python tools/robotwin_integration/sidebyside.py --task move_can_pot
Native passthrough#
roboverse_pack.tasks.robotwin._passthrough registers all 50 tasks under
RoboTwin/<name> with a lazy entry point. Registration never imports
RoboTwin (safe in any env); making the env imports the native task. Two
runtime quirks are handled in _make_robotwin_env: it chdirs to the
checkout (RoboTwin reads ./assets/... relatively at import) and aliases
warp.torch.* to the warp top level (curobo 0.7.8 expects the old
namespace that warp-lang ≥ 1.5 dropped). This only runs in an env where
RoboTwin’s deps (incl. a curobo built against an sm-matching CUDA nvcc)
are installed.
Setup (RoboTwin env + assets)#
mkdir -p ~/projects && cd ~/projects
git clone --depth 1 https://github.com/RoboTwin-Platform/RoboTwin.git robotwin
cd robotwin && bash script/_install.sh # deps + curobo (needs nvcc)
cd assets && python _download.py && unzip -q '*.zip' # embodiments + objects
Note: on recent GPUs (e.g. sm_120 / RTX 50-series) curobo must be built
with a matching CUDA nvcc (≥ 12.8); install cuda-nvcc of that version in
the env before pip install -e curobo. The embodiment locator
(roboverse_pack/robots/aloha_agilex_cfg.py) searches
~/projects/robotwin/assets/ or $ROBOTWIN_ASSETS. To just confirm the
embodiment loads (no RoboTwin deps needed), run
python -m tools.robotwin_integration.aloha_demo.