This is a Gymnasium (OpenAI Gym) environment designed to train reinforcement learning (RL) agents to control a sailboat. The environment simulates the dynamics of a sailboat and allows the agent to learn tacking behavior to reach a target point.
| Environment | Description |
| --- | --- |
| Sailboat-v0 | The main environment with a continuous action space. |
| SailboatDiscrete-v0 | A variation of the environment with a discrete action space. |
| Motorboat-v0 | An easy test environment with a motorboat instead of a sailboat. |
You can install the latest release using pip:
pip install gym-sailing
Alternatively, you can clone the repository and install it locally.
Bare minimum code to run the environment:
```python
import gymnasium as gym
import gym_sailing  # imported so the sailing environments are available to gym.make

env = gym.make("Sailboat-v0", render_mode="human")
observation, info = env.reset(seed=42)

for _ in range(1000):
    action = env.action_space.sample()  # this is where you would insert your policy
    observation, reward, terminated, truncated, info = env.step(action)

    # Start a new episode once the current one ends
    if terminated or truncated:
        observation, info = env.reset()

env.close()
```
To train an RL agent using stable-baselines3:
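```python
from stable_baselines3 import PPO
import gymnasium as gym
import gym_sailing  # imported so the sailing environments are available to gym.make

env = gym.make("Sailboat-v0")
model = PPO("MlpPolicy", env, verbose=1)

# Train the agent (the step budget below is an assumed placeholder; adjust as needed)
model.learn(total_timesteps=100_000)

# Test the trained model
observation, info = env.reset()
for _ in range(1000):
    action, _ = model.predict(observation)
    observation, reward, terminated, truncated, info = env.step(action)

    # Start a new episode once the current one ends
    if terminated or truncated:
        observation, info = env.reset()
```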
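To judge a policy by its total reward per episode, the per-step rewards need to be summed until the episode ends. The sketch below shows one way to do that against the Gymnasium `reset`/`step` API. `run_episode` and `CountdownEnv` are hypothetical names introduced here; the stand-in environment exists only so the snippet runs without gym-sailing installed.

```python
def run_episode(env, policy, max_steps=3000):
    """Run one episode and return (total_reward, steps_taken)."""
    observation, info = env.reset()
    total_reward = 0.0
    for step in range(1, max_steps + 1):
        action = policy(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            return total_reward, step
    return total_reward, max_steps


class CountdownEnv:
    """Hypothetical stand-in env: pays 1.0 per step and terminates after n steps."""

    def __init__(self, n):
        self.n = n
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        return 0, {}

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= self.n, False, {}


total, steps = run_episode(CountdownEnv(5), policy=lambda obs: 0)
print(total, steps)
```

With a trained stable-baselines3 model, the policy argument could be `lambda obs: model.predict(obs, deterministic=True)[0]`, and `env` would be the environment returned by `gym.make`.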
The observation space includes:
- Boat Speed: The current speed of the boat.
- Boat Heading: The angle of the boat relative to the wind, ranging from $-\pi$ to $\pi$.
- Heading Rate: The rate of change of the boat's heading.
- Course to Target: The angle between the boat's heading and the target, ranging from $-\pi$ to $\pi$.
- Distance to Target: The normalized distance between the boat and the target.
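Both angular observations are confined to the interval from $-\pi$ to $\pi$. The following is a minimal sketch of how such angles can be wrapped, and how a course-to-target angle could be derived from positions. The function names, the coordinate convention (angles measured counterclockwise from the x-axis), and the sign of the result are assumptions for illustration, not the environment's actual code.

```python
import math


def wrap_angle(angle):
    """Wrap any angle in radians into [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def course_to_target(boat_xy, heading, target_xy):
    """Signed angle from the boat's heading to the bearing of the target."""
    dx = target_xy[0] - boat_xy[0]
    dy = target_xy[1] - boat_xy[1]
    bearing = math.atan2(dy, dx)  # direction of the target as seen from the boat
    return wrap_angle(bearing - heading)


# A target directly "above" a boat that is heading along the x-axis
print(course_to_target((0.0, 0.0), 0.0, (0.0, 1.0)))
```

Wrapping after the subtraction keeps the result in range even when the bearing and the heading sit on opposite sides of the $\pm\pi$ discontinuity.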
The action space consists of:
- Rudder Angle: The angle of the rudder, a continuous value from -1 to 1 for Sailboat-v0 and Motorboat-v0, and one of {-1, 0, 1} for SailboatDiscrete-v0.
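A controller written for the continuous range can be reused with the three discrete rudder values by clamping and thresholding its output. The sketch below illustrates the idea; the helper names and the deadband width are hypothetical, and how SailboatDiscrete-v0 encodes its three actions (for example as indices) is not covered here, so check the environment's action space before using anything like this.

```python
def clip_rudder(value):
    """Clamp a continuous rudder command to the range [-1, 1]."""
    return max(-1.0, min(1.0, float(value)))


def discretize_rudder(value, deadband=0.33):
    """Map a continuous command to one of {-1, 0, 1}.

    The deadband width is an arbitrary illustrative choice.
    """
    value = clip_rudder(value)
    if value > deadband:
        return 1
    if value < -deadband:
        return -1
    return 0


print(clip_rudder(2.5), discretize_rudder(0.2), discretize_rudder(-0.9))
```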
The default reward function includes:
- Alive Penalty: A penalty for each time step to encourage the agent to reach the target quickly.
- Target Reward: A reward for reaching the target.
- Course Penalty: A penalty for leaving the course area.
- Progress Reward: A reward for making progress towards the target, using the L8 norm, to encourage the agent to move upwind.
- The environment is terminated if the boat reaches the target or leaves the course area.
- The environment is truncated after 3000 steps.
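The reward terms and episode-ending rules above can be sketched as a single function. Everything numeric here is a placeholder assumption (the penalty and reward magnitudes, the target radius, the square course bounds), and the names are hypothetical. Only the structure follows the description: a per-step penalty, a progress term based on the change in L8-norm distance to the target, a terminal reward or penalty, and truncation at 3000 steps.

```python
def l8_distance(dx, dy):
    """p-norm of the offset (dx, dy) with p = 8."""
    return (abs(dx) ** 8 + abs(dy) ** 8) ** (1 / 8)


def step_outcome(prev_xy, boat_xy, target_xy, step_count,
                 half_width=1.0, target_radius=0.05, max_steps=3000):
    """Return (reward, terminated, truncated) for one transition.

    All constants are illustrative placeholders, not the package's values.
    """
    prev_dist = l8_distance(target_xy[0] - prev_xy[0], target_xy[1] - prev_xy[1])
    dist = l8_distance(target_xy[0] - boat_xy[0], target_xy[1] - boat_xy[1])

    reward = -0.1                  # alive penalty (placeholder)
    reward += prev_dist - dist     # progress reward: distance closed this step

    reached = dist <= target_radius
    out_of_course = abs(boat_xy[0]) > half_width or abs(boat_xy[1]) > half_width
    if reached:
        reward += 100.0            # target reward (placeholder)
    elif out_of_course:
        reward -= 100.0            # course penalty (placeholder)

    terminated = reached or out_of_course
    truncated = (not terminated) and step_count >= max_steps
    return reward, terminated, truncated


print(step_outcome((0.0, 0.0), (0.0, 0.5), (0.0, 0.9), step_count=10))
```

With p = 8 the norm is dominated by the larger component of the offset, so when the target lies mostly upwind, closing the upwind gap counts far more than lateral movement. This is one plausible reading of why the progress term encourages upwind sailing.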
Benchmarks were run using stable-baselines3 with default hyperparameters. Good policies that tack only once tend to achieve a total reward of about 390 in the sailboat environment. PPO seems to perform better, but SAC is also a good option and even converges faster.
Contributions are welcome. Please fork the repository and submit a pull request with your changes. For any questions or suggestions, feel free to open an issue.
Here are some features I'd like to add in the future:
- Add currents of different intensities and directions.
- Add wind shifts.
- Add wind gusts and lulls.
- Make the polar diagram more accurate, using the data from this paper: R. Binns, F. W. Bethwaite, and N. R. Saunders, “Development of A More Realistic Sailing Simulator,” High Performance Yacht Design Conference. RINA, pp. 243–250, Dec. 04, 2002. doi: 10.3940/rina.ya.2002.29.
This project was inspired by this fork:
This project is licensed under the MIT License - see the LICENSE file for details.