01/ Problem
Learn a stable bipedal gait that survives BipedalWalker hardcore-mode terrain: gaps, ladders, stumps, and uneven ground.
TQC + PPO + TRPO trained for hardcore-mode terrain.

Train and benchmark TRPO (~25M steps), PPO (~25M steps), and TQC (5M steps) on continuous body, joint, and LIDAR observations with joint-velocity actions. Custom reward-shaping wrappers encourage stable, balanced gaits without reward hacking.
MuJoCo simulation only. Sim-to-real workflow transferable to Isaac Sim / Isaac Lab.
TQC produced a balanced, natural gait with about one fifth of the sample budget (5M vs. ~25M steps). PPO and TRPO both cleared hardcore mode after reward shaping. 35% reduction in training convergence time vs. baseline.