Zero-g robotics / Unity ML-Agents / imitation learning

Space Robot Rendezvous

A simulated servicing robot learns to rendezvous with a moving target satellite, hold station alongside it, and lightly touch a target point with its robotic end effector in zero gravity.

What this proves

Deep-RL control policy for orbital servicing behavior.

The project models a servicing vehicle, a three-joint robotic arm, and a moving target object in a 2D zero-gravity simulation. Unity ML-Agents trains a policy using Proximal Policy Optimization (PPO), behavioral cloning (BC), and Generative Adversarial Imitation Learning (GAIL) to coordinate station-keeping, joint torques, and soft end-effector contact.

Task: Rendezvous, station-keeping, and soft-touch satellite contact
Dynamics: Custom Euler integration with mass matrix, Coriolis terms, thrusts, and torques
Learning: PPO with behavioral cloning and GAIL imitation-learning signals
Criterion: Target contact with relative velocity below 0.01 m/s

Demo video

Policy in action during a soft-touch attempt.

The demo shows the trained policy moving toward the target object, meeting the station-keeping criteria, and applying arm torques to reach the target point. The report discusses the policy's moderate success rate and its limitations with drift.

Code-level details

Dynamics, observations, actions, and reward shaping.

Custom Dynamics

At each step the simulation computes mass and inertia terms, simplified Coriolis terms, and reaction torques, then uses Euler integration to update the base motion and joint angular velocities.
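As a rough illustration of this kind of update, the sketch below takes one explicit Euler step of the generic rigid-body form M(q) q'' + c(q, q') = tau. It is not the project's code: the function names, the callback interface for the mass matrix and Coriolis vector, and the small linear solver are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def solve_linear(a: Matrix, b: Sequence[float]) -> Vector:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's matrix is not modified.
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("mass matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x


def euler_step(
    q: Vector,
    qd: Vector,
    tau: Vector,
    mass_matrix: Callable[[Vector], Matrix],        # hypothetical callback
    coriolis: Callable[[Vector, Vector], Vector],   # hypothetical callback
    dt: float,
) -> Tuple[Vector, Vector]:
    """Advance generalized coordinates q and velocities qd by one Euler step."""
    m = mass_matrix(q)
    c = coriolis(q, qd)
    # Accelerations from M(q) * qdd = tau - c(q, qd).
    qdd = solve_linear(m, [t - ci for t, ci in zip(tau, c)])
    # Explicit Euler: positions use the old velocities.
    q_next = [qi + dt * vi for qi, vi in zip(q, qd)]
    qd_next = [vi + dt * ai for vi, ai in zip(qd, qdd)]
    return q_next, qd_next
```

Here q would stack the base pose and the three joint angles, and tau would stack thrusts and torques. Explicit Euler is simple but accumulates error with large time steps, which is one plausible contributor to drift in long episodes.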

Discrete Action Mapping

Actions map to base thrust commands of 0, +10, or -10 and arm/base torque commands of 0, +0.01, or -0.01 per decision step.
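A minimal sketch of such a mapping follows. The command magnitudes come from the description above; the index convention (0 for no command, 1 for positive, 2 for negative) and the split of branches into thrust and torque channels are assumptions for illustration only.

```python
from typing import Dict, List, Sequence

# Assumed index convention: 0 = no command, 1 = positive, 2 = negative.
THRUST_LEVELS = (0.0, 10.0, -10.0)
TORQUE_LEVELS = (0.0, 0.01, -0.01)


def decode_actions(branches: Sequence[int], n_thrust: int = 2) -> Dict[str, List[float]]:
    """Turn discrete branch indices into thrust and torque commands.

    The first n_thrust branches are treated as base thrust axes and the
    remaining branches as torque channels (a hypothetical layout).
    """
    for b in branches:
        if b not in (0, 1, 2):
            raise ValueError(f"branch value {b} is outside 0..2")
    thrust = [THRUST_LEVELS[b] for b in branches[:n_thrust]]
    torque = [TORQUE_LEVELS[b] for b in branches[n_thrust:]]
    return {"thrust": thrust, "torque": torque}
```

For example, `decode_actions([1, 2, 0, 1, 2, 0])` would yield two thrust commands followed by four torque commands under this assumed layout.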

Observation Vector

The agent observes relative velocity, end-effector distance, target point, base-frame distance, vehicle and target state, forces, and torques.
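The sketch below shows how a few of these quantities could be assembled from 2D states. It covers only the relative velocity, the two distances, and the target point; the ordering of entries and the function signature are hypothetical, and the remaining state, force, and torque entries are omitted.

```python
import math
from typing import List, Tuple

Vec2 = Tuple[float, float]


def build_observation(
    base_pos: Vec2,
    base_vel: Vec2,
    target_pos: Vec2,
    target_vel: Vec2,
    effector_pos: Vec2,
    target_point: Vec2,
) -> List[float]:
    """Assemble a partial observation vector (assumed ordering)."""
    # Velocity of the target relative to the servicing vehicle.
    rel_vel = (target_vel[0] - base_vel[0], target_vel[1] - base_vel[1])
    # Straight-line distance from the end effector to the target point.
    effector_dist = math.hypot(
        target_point[0] - effector_pos[0], target_point[1] - effector_pos[1]
    )
    # Straight-line distance from the vehicle base to the target object.
    base_dist = math.hypot(target_pos[0] - base_pos[0], target_pos[1] - base_pos[1])
    return [
        rel_vel[0], rel_vel[1],
        effector_dist,
        target_point[0], target_point[1],
        base_dist,
    ]
```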

Station Gate

Arm torques are applied only once the vehicle is within station tolerance and sufficiently close to the target object's velocity.
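One way to express this gate is sketched below. The tolerance parameters and function names are placeholders chosen for the example; the source does not state the actual thresholds or whether the project compares magnitudes or per-axis values.

```python
import math
from typing import List, Sequence, Tuple

Vec2 = Tuple[float, float]


def station_gate_open(
    station_error: Vec2,  # offset from the desired station point
    rel_vel: Vec2,        # velocity relative to the target object
    pos_tol: float,       # hypothetical position tolerance
    vel_tol: float,       # hypothetical velocity tolerance
) -> bool:
    """Return True when both position and velocity errors are within tolerance."""
    return math.hypot(*station_error) <= pos_tol and math.hypot(*rel_vel) <= vel_tol


def gated_arm_torques(torques: Sequence[float], gate_open: bool) -> List[float]:
    """Pass arm torques through only while the station gate is open."""
    return list(torques) if gate_open else [0.0] * len(torques)
```

Gating like this keeps the arm still during the approach, so reaction torques do not disturb the base before the vehicle has settled on station.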

Imitation Data

Human demonstrations used keyboard control and a metronome-timed torque pattern to create more consistent soft-touch training data.

Training Result

The report found that PPO with BC and GAIL signals and no extrinsic reward was the most successful paradigm tested, with a success rate of about 35-50%.
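A success rate like this could be tallied per episode against the soft-touch criterion of contact with relative velocity below 0.01 m/s. The sketch below assumes each episode is summarized as a contact flag and a relative speed at contact; that record format and the function names are hypothetical, and the report may have scored episodes differently.

```python
from typing import Iterable, Tuple

SOFT_TOUCH_SPEED_LIMIT = 0.01  # m/s, the stated contact criterion


def is_soft_touch(contact: bool, rel_speed: float,
                  limit: float = SOFT_TOUCH_SPEED_LIMIT) -> bool:
    """An episode succeeds if contact occurred below the speed limit."""
    return contact and rel_speed < limit


def success_rate(episodes: Iterable[Tuple[bool, float]]) -> float:
    """Fraction of (contact, rel_speed) episode records that are soft touches."""
    records = list(episodes)
    if not records:
        return 0.0
    wins = sum(1 for contact, speed in records if is_soft_touch(contact, speed))
    return wins / len(records)
```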

Simulation telemetry

State plots exported from the simulation.

  • Diagram of data interchange between the math software and the 3D simulation software
  • Joint angles over time
  • Base-frame velocity over time
  • Base forces over time

Future iteration

A strong simulation platform for reliability experiments.

  • Run larger policy batches across random seeds and quantify soft-touch success rate.
  • Separate station-keeping, arm approach, and contact into staged policies or curriculum phases.
  • Increase fidelity of collision/contact modeling and add target shape randomization.
  • Compare pure extrinsic reward, pure imitation, and hybrid reward designs under equal budgets.