Tabula Rasa — Sisyphus Reinforcement Learning
Overview
ML2 course project at INN. Although nominally a group project, all technical implementation was done independently. The project trains a reinforcement learning agent with PPO (Proximal Policy Optimization) across three transfer learning phases: 10M steps learning to walk in BipedalWalker-v3, 20M steps learning to push a boulder in a custom SisyphusWalker environment, and 30M steps learning to push the boulder up a progressively steepening slope. It is inspired by Camus's interpretation of the Sisyphus myth: the agent endlessly struggles uphill, yet the training continues. The custom environment was built from scratch on Box2D physics and includes boulder dynamics, a custom reward system, and an exponential slope curve that starts gentle and steepens towards the top.
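As a rough illustration of the slope shape described above, the following sketch computes a normalised exponential height profile. The function names and the values for length, peak height, and steepness are hypothetical and are not taken from sisyphus_env.py; the sketch only shows how an exponential curve can start nearly flat and steepen towards the top.

```python
import math


def slope_height(x: float, length: float = 100.0, peak: float = 20.0, k: float = 3.0) -> float:
    """Terrain height at horizontal position x.

    All defaults are illustrative assumptions, not the project's tuned values.
    The curve is normalised so that height is 0 at x=0 and `peak` at x=length.
    """
    t = min(max(x / length, 0.0), 1.0)  # clamp progress along the slope to [0, 1]
    return peak * (math.exp(k * t) - 1.0) / (math.exp(k) - 1.0)


def slope_profile(n_points: int = 11, length: float = 100.0) -> list[tuple[float, float]]:
    """Sample (x, height) pairs that could serve as terrain polygon vertices."""
    xs = [length * i / (n_points - 1) for i in range(n_points)]
    return [(x, slope_height(x, length=length)) for x in xs]


if __name__ == "__main__":
    for x, h in slope_profile():
        print(f"x={x:6.1f}  height={h:6.2f}")
```

A larger k keeps the start flatter and makes the final stretch steeper, which is one way a "progressively steepening" climb could be tuned.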
My Contributions
- Originated the Sisyphus concept and pitched it to the group
- Trained the BipedalWalker agent from scratch over 10M steps with extensive hyperparameter tuning
- Built the custom SisyphusWalker Gymnasium environment (sisyphus_env.py) with boulder physics, custom reward system, and Box2D rendering
- Implemented transfer learning across all three training phases
- Implemented the slope from scratch as an exponential curve, with friction (2.0) and boulder density (0.5) tuned for agent training
- Fixed rendering flicker by switching from manual Pygame to BipedalWalker's drawlist system
- Produced the project trailer from scratch (filmed twice), including glitch effects and custom AI voiceover: a motivational Bane (Batman) voice generated via Fish.audio for narration, and a Rene Morgan voice for system reboot sequences
- Wrote technical documentation, data flow diagram, README, and interim presentation
- Managed the group as team leader: GitHub setup, Discord, contract, meeting coordination
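The three-phase transfer learning mentioned above could be chained roughly as follows. This is a minimal sketch, not the project's actual training script: it assumes Stable-Baselines3's PPO.load and make_vec_env, assumes the custom environment is already registered under the id SisyphusWalker-v0, and assumes both environments share the same observation and action spaces (otherwise PPO.load rejects the new environment). The Phase class, run_phases function, and save paths are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str       # hypothetical label used for the saved model file
    env_id: str     # Gymnasium environment id
    timesteps: int  # training budget for this phase


# Step budgets from the overview. How the slope is switched on is not
# described here, so this sketch reuses the same env id for phases 2 and 3.
PHASES = [
    Phase("walk", "BipedalWalker-v3", 10_000_000),
    Phase("push", "SisyphusWalker-v0", 20_000_000),
    Phase("slope", "SisyphusWalker-v0", 30_000_000),
]


def total_timesteps(phases: list[Phase]) -> int:
    """Sum of the training budgets across all phases."""
    return sum(p.timesteps for p in phases)


def run_phases(phases: list[Phase], n_envs: int = 32, save_dir: str = "models") -> None:
    """Train each phase, starting from the weights saved by the previous one."""
    # Imported lazily so the phase table can be inspected without SB3 installed.
    from stable_baselines3 import PPO
    from stable_baselines3.common.env_util import make_vec_env

    prev_path = None
    for phase in phases:
        env = make_vec_env(phase.env_id, n_envs=n_envs)
        if prev_path is None:
            model = PPO("MlpPolicy", env, verbose=1)  # phase 1: train from scratch
        else:
            model = PPO.load(prev_path, env=env)      # later phases: reuse weights
        # reset_num_timesteps=False keeps the step counter running across phases.
        model.learn(total_timesteps=phase.timesteps, reset_num_timesteps=False)
        prev_path = f"{save_dir}/{phase.name}"
        model.save(prev_path)
        env.close()
```

Calling run_phases(PHASES) would run the full 60M-step schedule, so a much smaller timestep budget is advisable for a first smoke test.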
Tech Breakdown
Python, Stable-Baselines3 PPO, Farama Gymnasium (BipedalWalker-v3 + custom SisyphusWalker-v0), PyTorch (backend), Box2D (physics), Pygame (rendering), transfer learning across 3 phases, custom reward shaping, CheckpointCallback for model snapshots every 100k steps, TensorBoard logging, 32 parallel environments via make_vec_env
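A minimal sketch of how the pieces listed above might fit together for the first phase, assuming default PPO hyperparameters (the tuned values are not given here). The directory names, name prefix, and function names are hypothetical. One detail worth noting: CheckpointCallback counts calls to env.step(), and each call advances all parallel environments, so the save frequency is divided by the number of environments to get a snapshot roughly every 100k timesteps.

```python
def checkpoint_save_freq(every_timesteps: int, n_envs: int) -> int:
    """Convert a target interval in total timesteps into callback calls.

    With a vectorized environment, one callback call corresponds to n_envs
    timesteps, so the interval is divided by the number of environments.
    """
    return max(every_timesteps // n_envs, 1)


def train_walker(total_timesteps: int = 10_000_000, n_envs: int = 32):
    """Train PPO on BipedalWalker-v3 with checkpoints and TensorBoard logs."""
    # Imported lazily so the helper above works without SB3 installed.
    from stable_baselines3 import PPO
    from stable_baselines3.common.callbacks import CheckpointCallback
    from stable_baselines3.common.env_util import make_vec_env

    env = make_vec_env("BipedalWalker-v3", n_envs=n_envs)
    checkpoint = CheckpointCallback(
        save_freq=checkpoint_save_freq(100_000, n_envs),
        save_path="./checkpoints/",  # hypothetical output directory
        name_prefix="ppo_walker",    # hypothetical file prefix
    )
    model = PPO(
        "MlpPolicy",
        env,
        tensorboard_log="./tb_logs/",  # hypothetical log directory
        verbose=1,
    )
    model.learn(total_timesteps=total_timesteps, callback=checkpoint)
    return model
```

With 32 environments, checkpoint_save_freq(100_000, 32) gives 3125 callback calls between snapshots.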