Update README with Public Dataset V3 #29

Open. Wants to merge 53 commits into base: rllib.

Commits (53)
c15dc6a
slippi db
vladfi1 Nov 5, 2021
e0c307a
Rename regime to env.
vladfi1 Nov 13, 2021
53bc9c5
Move secrets into json file.
vladfi1 Nov 13, 2021
cd2784f
Decompress with ray.
vladfi1 Nov 13, 2021
b724cd1
Upload dbox files to s3.
vladfi1 Nov 13, 2021
3c73239
In-memory dbox upload.
vladfi1 Nov 22, 2021
78750fd
Upload drive files to s3.
vladfi1 Nov 22, 2021
272a33c
Fix drive_to_s3.
vladfi1 Nov 22, 2021
2d91c8f
Fix flask app.
vladfi1 Nov 22, 2021
42fe063
Decompress raw uploads with ray.
vladfi1 Nov 22, 2021
b691e07
Preprocessing of slippi files.
vladfi1 Dec 9, 2021
c99d30f
Add invulnerable field.
vladfi1 Feb 12, 2022
5f0cb58
Move train filter into preprocessing.
vladfi1 Feb 12, 2022
5481a01
Update peppi version in dockerfile.
vladfi1 Feb 12, 2022
6b6049e
Update cluster yamls.
vladfi1 Feb 12, 2022
499732b
Generate parquet files for machine learning.
vladfi1 Feb 12, 2022
5115c2c
Add lots of notebooks with cleaned outputs.
vladfi1 May 4, 2022
e7018d3
Create datasets as tar archives with pd.DataFrame metadata.
vladfi1 May 4, 2022
dc589f4
Document new dataset format.
vladfi1 May 6, 2022
99f0bee
Train using new data format.
vladfi1 May 12, 2022
6022519
WIP: refactor so that eval works
vladfi1 May 12, 2022
a1a94ac
Get training working again.
vladfi1 May 28, 2022
ccb1e9d
Python 3.8 compatibility.
vladfi1 May 29, 2022
58fab46
Minor bug in eval.
vladfi1 May 29, 2022
4c6b96e
Fix parameter restoring.
vladfi1 May 30, 2022
defc3ac
Don't overlap trajectories, it breaks initial state.
vladfi1 May 30, 2022
4bb2f5b
Headless (exi+ffw) option for Dolphin.
vladfi1 May 30, 2022
8a02964
Update run_dolphin script.
vladfi1 Jun 3, 2022
db4213c
Use metadata to filter replays.
vladfi1 Jun 3, 2022
de5d198
Minimal rllib training.
vladfi1 May 28, 2022
3e91952
wip
vladfi1 Jun 2, 2022
d724248
[minor] type signature
vladfi1 Jun 3, 2022
60acad0
Checkpointing and restoring.
vladfi1 Jun 5, 2022
9822580
Rllib eval script.
vladfi1 Jun 5, 2022
232369f
notebooks
vladfi1 Jun 25, 2022
0159c52
Be robust to slippi EnetDisconnect.
vladfi1 Jun 26, 2022
45798cb
Be robust to slippi EnetDisconnect.
vladfi1 Jun 26, 2022
7ee0edf
Validate player ports.
vladfi1 Jun 28, 2022
f04a39a
Profile multiple dolphins on a single cpu core.
vladfi1 Jun 29, 2022
67bcfea
Fix deps.
vladfi1 Jun 29, 2022
48210da
More options for rllib/tune.
vladfi1 Jun 26, 2022
c8d0b75
Update inference profiling script.
vladfi1 Jul 8, 2022
d445daf
Profile on multiple cpu cores.
vladfi1 Jul 9, 2022
036b8b4
Non-ray dolphin profiling.
vladfi1 Jul 10, 2022
5b64cbc
Option to use ray for profiling dolphin_one_cpu.
vladfi1 Jul 10, 2022
3df493f
Re-enable single-core profiling.
vladfi1 Jul 10, 2022
d007522
Wandb logging.
vladfi1 Jul 10, 2022
4be13b9
Run rllib's ppo algorithm with wandb on aws.
vladfi1 Jul 11, 2022
1436990
Support a2c, impala and ppo.
vladfi1 Jul 11, 2022
4da15ee
WIP: use custom model from slippi_ai.
vladfi1 Jul 12, 2022
cfb803b
Update README with Public Dataset V3
lzardy Aug 26, 2022
5d646c2
Update README.md
lzardy Aug 26, 2022
21a97ee
Update README.md
lzardy Aug 26, 2022
6 changes: 6 additions & 0 deletions .dockerignore
@@ -0,0 +1,6 @@
# Ignore everything
*

# allow only these
!requirements.txt
!docker/
4 changes: 4 additions & 0 deletions .gitignore
@@ -133,10 +133,14 @@ experiments/

# don't track replays
data/AllCompressed/
data/untracked/

# storage for b9 scripts
run_scripts/

# shhhhhhh
secrets.sh
secrets.json

.vscode
saved_models/
57 changes: 41 additions & 16 deletions README.md
@@ -10,15 +10,6 @@ An easy way to get started is with Google Cloud Platform. Launch a VM with the '
pip install --user -r requirements.txt
```

A dataset of processed and compressed slippi replays is available at https://drive.google.com/u/0/uc?id=1O6Njx85-2Te7VAZP6zP51EHa1oIFmS1B. It is a tarball of zipped pickled slp files. Use

```bash
gdown 'https://drive.google.com/u/0/uc?id=1O6Njx85-2Te7VAZP6zP51EHa1oIFmS1B' -O data/
tar -xf data/AllCompressed.tar
```

to expand it. The folder `data/AllCompressed/` will now contain many zipped pickled and formatted slippi replay files. Another useful dataset is https://drive.google.com/uc?id=1ZIfDgkdQdu-ldCx_34e-VxYJwQCpV-i3 which only contains fox dittos.

We use Sacred and MongoDB for logging experiments. While Sacred is installed through requirements.txt, MongoDB needs to be installed separately. Instructions for installing MongoDB on Debian 9 are available here: https://docs.mongodb.com/manual/tutorial/install-mongodb-on-debian/. These commands worked for us:

```bash
@@ -43,28 +34,62 @@ python scripts/train.py

To view the results, you can use [omniboard](https://github.com/vivekratnavel/omniboard), although several [other options](https://github.com/IDSIA/sacred#frontends) are available.

### Processing a preexisting dataset of raw slippi replay files
## Training Data

A preexisting dataset of raw slippi replays is available at https://drive.google.com/file/d/1ab6ovA46tfiPZ2Y3a_yS1J3k3656yQ8f (27G, unzips to 200G). You can place this in the `data/` folder using `gdown <drive link> <destination>`.
### Old Format

A dataset of processed and compressed slippi replays is available at https://drive.google.com/u/0/uc?id=1O6Njx85-2Te7VAZP6zP51EHa1oIFmS1B. It is a tarball of zipped pickled slp files. Use

```bash
gdown 'https://drive.google.com/u/0/uc?id=1O6Njx85-2Te7VAZP6zP51EHa1oIFmS1B' -O data/
tar -xf data/AllCompressed.tar
```

to expand it. The folder `data/AllCompressed/` will now contain many zipped pickled and formatted slippi replay files. Another useful dataset is https://drive.google.com/uc?id=1ZIfDgkdQdu-ldCx_34e-VxYJwQCpV-i3 which only contains fox dittos.

### Processing a preexisting dataset of raw slippi replay files
A preexisting dataset of raw slippi replays is available at https://drive.google.com/file/d/1VqRECRNL8Zy4BFQVIHvoVGtfjz4fi9KC (28G, unzips to 222G). You can place this in the `data/` folder using `gdown <drive link> <destination>`.

The code relies on a small (~3 MB) SQL database, `melee_public_slp_dataset.sqlite3`, located in the `data/` folder.

For updates on this raw slippi replay dataset, the SQL database, or the dataset of processed and compressed slippi replays, check the ai channel of the Slippi Discord.

### New Format

The old data format had a few issues:
- It was potentially insecure due to the use of pickle.
- It used nests of numpy arrays, lacking any structure or specification.
- Being based on pickle, it was tied to the Python language.

The new data format is based on the language-agnostic [Arrow](https://arrow.apache.org/) library, with serialization via [Parquet](https://parquet.apache.org/). You can download the new dataset [here](https://slp-replays.s3.amazonaws.com/prod/datasets/pq/games.tar) as a tar archive. It contains 195,536 files filtered to be valid singles replays; see `slippi_db.preprocessing.is_training_replay` for the exact criteria. An associated metadata file, also in parquet format, is available [here](https://slp-replays.s3.amazonaws.com/prod/datasets/pq/meta.pq). The metadata file can be loaded as a pandas DataFrame:

```python
import pandas as pd
df = pd.read_parquet('meta.pq')
print(df.columns)
```

To access the game files, you can extract the tar, or mount it directly using [ratarmount](https://github.com/mxmlnkn/ratarmount). The tar is a flat directory whose filenames are the md5 hashes of the original .slp replays, corresponding to the "key" column in the metadata. Each file is a gzip-compressed parquet table with a single column called "root".
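
As a minimal sketch (assuming the tar has been extracted or mounted to a `games/` directory, e.g. via `ratarmount games.tar games/`; the directory name is an assumption), the metadata keys map directly to game file paths:

```python
import pandas as pd

meta = pd.read_parquet('meta.pq')
# Each "key" is the md5 hash of an original .slp replay and doubles as
# the filename inside games.tar (the games/ prefix is hypothetical).
game_paths = ['games/' + key for key in meta['key']]
```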

```python
import pyarrow.parquet as pq

game_path = game_paths[0]  # any file from the extracted or mounted tar (see the sketch above)
table = pq.read_table(game_path)  # gzip compression is handled transparently
game = table['root'].combine_chunks()  # pyarrow.StructArray
game[0].as_py()  # nested python dictionary representing the first frame
```

See `slippi_ai/types.py` for utility functions that manipulate pyarrow objects and convert them to the usual Python nests of numpy arrays used in machine learning.
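
For intuition, a recursive conversion along these lines is sketched below. This is a minimal stand-in, not the actual `slippi_ai/types.py` implementation, and the field names come from whatever schema pyarrow reports:

```python
import pyarrow as pa

def to_numpy_nest(array: pa.Array):
    """Recursively convert a pyarrow array into a nest of numpy arrays.

    Structs become dicts keyed by field name; leaf columns become
    numpy arrays (copying where zero-copy is not possible).
    """
    if pa.types.is_struct(array.type):
        return {
            field.name: to_numpy_nest(array.field(field.name))
            for field in array.type
        }
    return array.to_numpy(zero_copy_only=False)

frames = to_numpy_nest(game)  # e.g. the StructArray loaded above
```

In practice, prefer the utilities in `slippi_ai/types.py`, which know the actual game schema.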

## Configuration

Example command configurations:

```bash
python scripts/train.py with dataset.subset=fox_dittos network.name=frame_stack_mlp
python scripts/train.py with network.name=frame_stack_mlp
```

These are some available options:
```
dataset.subset=
all (default)
fox_dittos

network.name=
mlp (default)
frame_stack_mlp
197 changes: 197 additions & 0 deletions clusters/rllib.yaml
@@ -0,0 +1,197 @@
# A unique identifier for the head node and workers of this cluster.
cluster_name: slippi-rllib

# The maximum number of worker nodes to launch in addition to the head
# node.
# Max vCPUs is 384: https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Limits
max_workers: 10

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
image: "vladfi/slippi-ai:rllib"
# image: "rayproject/ray-ml:latest-cpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
container_name: "ray_container"
# If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
# if no cached version is present.
pull_before_run: True
run_options: # Extra options to pass into "docker run"
- --ulimit nofile=65536:65536
# - --shm-size=500M

# Example of running a GPU head with CPU workers
# head_image: "rayproject/ray-ml:latest-gpu"
# Allow Ray to automatically detect GPUs

# worker_image: "rayproject/ray-ml:latest-cpu"
# worker_run_options: []
worker_run_options:
- --shm-size=500M

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
type: aws
region: us-east-1
# Availability zone(s), comma-separated, that nodes may be launched in.
# Nodes are currently spread between zones by a round-robin approach,
# however this implementation detail should not be relied upon.
# availability_zone: us-west-2a,us-west-2b
# Whether to allow node reuse. If set to False, nodes will be terminated
# instead of stopped. If not present, the default is True.
cache_stopped_nodes: False

# How Ray will authenticate with newly launched nodes.
auth:
ssh_user: ec2-user
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below.
# ssh_private_key: /path/to/your/key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
ray.head.default:
# The node type's CPU and GPU resources are auto-detected based on AWS instance type.
# If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
# You can also set custom resources.
# For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
# resources: {"CPU": 1, "GPU": 1, "custom": 5}
resources: {}
# Provider-specific config for this node type, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see:
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
node_config:
# InstanceType: t3.micro
# InstanceType: m5.large
InstanceType: m5.xlarge # 4 CPU, 16 GB ram
ImageId: ami-0329a504ac63e1224 # minimal-ray
# You can provision additional disk space with a conf as follows
BlockDeviceMappings:
- DeviceName: /dev/xvda # matches device name in ami
Ebs:
VolumeSize: 10
VolumeType: gp3
# Additional options in the boto docs.
ray.worker.default:
# The minimum number of worker nodes of this type to launch.
# This number should be >= 0.
min_workers: 2
# The maximum number of worker nodes of this type to launch.
# This takes precedence over min_workers.
max_workers: 10
# The node type's CPU and GPU resources are auto-detected based on AWS instance type.
# If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
# You can also set custom resources.
# For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
# resources: {"CPU": 1, "GPU": 1, "custom": 5}
resources: {"CPU": 2, "worker": 1}
# Provider-specific config for this node type, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see:
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
node_config:
InstanceType: m6i.large # 2 CPU, 8 GB ram
ImageId: ami-0329a504ac63e1224 # minimal-ray
# You can provision additional disk space with a conf as follows
BlockDeviceMappings:
- DeviceName: /dev/xvda # matches device name in ami
Ebs:
VolumeSize: 8
VolumeType: gp3
# Run workers on spot by default. Comment this out to use on-demand.
# NOTE: If relying on spot instances, it is best to specify multiple different instance
# types to avoid interruption when one instance type is experiencing heightened demand.
# Demand information can be found at https://aws.amazon.com/ec2/spot/instance-advisor/
# InstanceMarketOptions:
# MarketType: spot
# Additional options can be found in the boto docs, e.g.
# SpotOptions:
# MaxPrice: MAX_HOURLY_PRICE
# Additional options in the boto docs.

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
# "/path1/on/remote/machine": "/path1/on/local/machine",
# "/path2/on/remote/machine": "/path2/on/local/machine",
# workaround .gitignore not being used in hash_file_contents
"/slippi-ai/slippi_ai": "slippi_ai",
"/slippi-ai/scripts": "scripts",
"/slippi-ai/slippi_db": "slippi_db",
# "/slippi-ai/benchmarks": "benchmarks",
"/slippi-ai/setup.cfg": "setup.cfg",
"/slippi-ai/pyproject.toml": "pyproject.toml",
# "/root/secrets.json": "secrets.json",
"/root/.aws": "~/.aws",
"/root/.wandb": "~/.wandb",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
- "**/.git"
- "**/.git/**"
- "__pycache__/"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
- ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands:
# We need to move things out of / to work around a __pycache__ issue:
# https://ray-distributed.slack.com/archives/CN2RGCHRR/p1638582892031600
- rm -rf /install/slippi-ai && cp -r /slippi-ai /install/slippi-ai
- pip install -e /install/slippi-ai
- aws s3 cp s3://slippi-data/SSBM.iso /install/SSBM.iso

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
- ray stop
- ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
- ray stop
- ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

head_node: {}
worker_nodes: {}
Binary file removed data/melee_public_slp_dataset.sqlite3
22 changes: 22 additions & 0 deletions docker/db.dockerfile
@@ -0,0 +1,22 @@
FROM python:3.9-buster

RUN pip install ray[default]

RUN apt update
RUN apt install -y p7zip-full rsync

# RUN pip install tensorflow

# slippi-specific

WORKDIR /install
COPY requirements.txt .
RUN pip install -r requirements.txt

# build the wheel externally with `maturin build` in the peppi-py repo
# ARG PEPPI_PY_WHL=peppi_py-0.4.3-cp39-abi3-linux_x86_64.whl
# COPY docker/$PEPPI_PY_WHL .
# RUN pip install $PEPPI_PY_WHL

# use peppi-py from pypi
RUN pip install peppi-py
39 changes: 39 additions & 0 deletions docker/rllib.dockerfile
@@ -0,0 +1,39 @@
FROM ubuntu:22.04

ENV INSTALL=/install
WORKDIR $INSTALL

RUN apt update
RUN apt install -y p7zip-full rsync python3-pip micro
RUN apt install -y python-is-python3

# big python deps
RUN pip install -U pip
RUN pip install tensorflow

# dolphin
RUN apt install -y libasound2 libegl1 libgl1 libgdk-pixbuf-2.0-0 libpangocairo-1.0-0 libusb-1.0-0
RUN pip install gdown
RUN gdown "https://drive.google.com/uc?id=1qrXsPiRD4_-voFXxIMBk2EMQ_iF8O9Me" -O dolphin
RUN chmod +x dolphin
RUN ./dolphin --appimage-extract
RUN rm dolphin
ENV DOLPHIN_PATH=$INSTALL/squashfs-root/usr/bin/

# https://serverfault.com/questions/949991/how-to-install-tzdata-on-a-ubuntu-docker-image
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Etc/UTC
RUN apt install -y tzdata

# aws tools (to pull SSBM.iso)
RUN apt install -y awscli

# ray
ARG RAY_WHEEL="https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl"
RUN pip install -U "ray[rllib] @ $RAY_WHEEL"
# RUN pip install ray[rllib]

# slippi-ai
RUN apt install -y git
COPY requirements.txt .
RUN pip install -r requirements.txt