Update README with Public Dataset V3 #29

Open. Wants to merge 53 commits into base: rllib.

Commits (53)
c15dc6a
slippi db
vladfi1 Nov 5, 2021
e0c307a
Rename regime to env.
vladfi1 Nov 13, 2021
53bc9c5
Move secrets into json file.
vladfi1 Nov 13, 2021
cd2784f
Decompress with ray.
vladfi1 Nov 13, 2021
b724cd1
Upload dbox files to s3.
vladfi1 Nov 13, 2021
3c73239
In-memory dbox upload.
vladfi1 Nov 22, 2021
78750fd
Upload drive files to s3.
vladfi1 Nov 22, 2021
272a33c
Fix drive_to_s3.
vladfi1 Nov 22, 2021
2d91c8f
Fix flask app.
vladfi1 Nov 22, 2021
42fe063
Decompress raw uploads with ray.
vladfi1 Nov 22, 2021
b691e07
Preprocessing of slippi files.
vladfi1 Dec 9, 2021
c99d30f
Add invulnerable field.
vladfi1 Feb 12, 2022
5f0cb58
Move train filter into preprocessing.
vladfi1 Feb 12, 2022
5481a01
Update peppi version in dockerfile.
vladfi1 Feb 12, 2022
6b6049e
Update cluster yamls.
vladfi1 Feb 12, 2022
499732b
Generate parquet files for machine learning.
vladfi1 Feb 12, 2022
5115c2c
Add lots of notebooks with cleaned outputs.
vladfi1 May 4, 2022
e7018d3
Create datasets as tar archives with pd.DataFrame metadata.
vladfi1 May 4, 2022
dc589f4
Document new dataset format.
vladfi1 May 6, 2022
99f0bee
Train using new data format.
vladfi1 May 12, 2022
6022519
WIP: refactor so that eval works
vladfi1 May 12, 2022
a1a94ac
Get training working again.
vladfi1 May 28, 2022
ccb1e9d
Python 3.8 compatibility.
vladfi1 May 29, 2022
58fab46
Minor bug in eval.
vladfi1 May 29, 2022
4c6b96e
Fix parameter restoring.
vladfi1 May 30, 2022
defc3ac
Don't overlap trajectories, it breaks initial state.
vladfi1 May 30, 2022
4bb2f5b
Headless (exi+ffw) option for Dolphin.
vladfi1 May 30, 2022
8a02964
Update run_dolphin script.
vladfi1 Jun 3, 2022
db4213c
Use metadata to filter replays.
vladfi1 Jun 3, 2022
de5d198
Minimal rllib training.
vladfi1 May 28, 2022
3e91952
wip
vladfi1 Jun 2, 2022
d724248
[minor] type signature
vladfi1 Jun 3, 2022
60acad0
Checkpointing and restoring.
vladfi1 Jun 5, 2022
9822580
Rllib eval script.
vladfi1 Jun 5, 2022
232369f
notebooks
vladfi1 Jun 25, 2022
0159c52
Be robust to slippi EnetDisconnect.
vladfi1 Jun 26, 2022
45798cb
Be robust to slippi EnetDisconnect.
vladfi1 Jun 26, 2022
7ee0edf
Validate player ports.
vladfi1 Jun 28, 2022
f04a39a
Profile multiple dolphins on a single cpu core.
vladfi1 Jun 29, 2022
67bcfea
Fix deps.
vladfi1 Jun 29, 2022
48210da
More options for rllib/tune.
vladfi1 Jun 26, 2022
c8d0b75
Update inference profiling script.
vladfi1 Jul 8, 2022
d445daf
Profile on multiple cpu cores.
vladfi1 Jul 9, 2022
036b8b4
Non-ray dolphin profiling.
vladfi1 Jul 10, 2022
5b64cbc
Option to use ray for profiling dolphin_one_cpu.
vladfi1 Jul 10, 2022
3df493f
Re-enable single-core profiling.
vladfi1 Jul 10, 2022
d007522
Wandb logging.
vladfi1 Jul 10, 2022
4be13b9
Run rllib's ppo algorithm with wandb on aws.
vladfi1 Jul 11, 2022
1436990
Support a2c, impala and ppo.
vladfi1 Jul 11, 2022
4da15ee
WIP: use custom model from slippi_ai.
vladfi1 Jul 12, 2022
cfb803b
Update README with Public Dataset V3
lzardy Aug 26, 2022
5d646c2
Update README.md
lzardy Aug 26, 2022
21a97ee
Update README.md
lzardy Aug 26, 2022
6 changes: 6 additions & 0 deletions .dockerignore
@@ -0,0 +1,6 @@
# Ignore everything
*

# allow only these
!requirements.txt
!docker/
4 changes: 4 additions & 0 deletions .gitignore
@@ -133,10 +133,14 @@ experiments/

# don't track replays
data/AllCompressed/
data/untracked/

# storage for b9 scripts
run_scripts/

# shhhhhhh
secrets.sh
secrets.json

.vscode
saved_models/
57 changes: 41 additions & 16 deletions README.md
@@ -10,15 +10,6 @@ An easy way to get started is with Google Cloud Platform. Launch a VM with the '
pip install --user -r requirements.txt
```

A dataset of processed and compressed slippi replays is available at https://drive.google.com/u/0/uc?id=1O6Njx85-2Te7VAZP6zP51EHa1oIFmS1B. It is a tarball of zipped pickled slp files. Use

```bash
gdown 'https://drive.google.com/u/0/uc?id=1O6Njx85-2Te7VAZP6zP51EHa1oIFmS1B' -O data/
tar -xf data/AllCompressed.tar
```

to expand it. The folder `data/AllCompressed/` will now contain many zipped pickled and formatted slippi replay files. Another useful dataset is https://drive.google.com/uc?id=1ZIfDgkdQdu-ldCx_34e-VxYJwQCpV-i3 which only contains fox dittos.

We use Sacred and MongoDB for logging experiments. While Sacred is installed through requirements.txt, MongoDB needs to be installed separately. Instructions for installing MongoDB on Debian 9 are available here: https://docs.mongodb.com/manual/tutorial/install-mongodb-on-debian/. These commands worked for us:

```bash
@@ -43,28 +34,62 @@ python scripts/train.py

To view the results, you can use [omniboard](https://github.com/vivekratnavel/omniboard), although several [other options](https://github.com/IDSIA/sacred#frontends) are available.

### Processing a preexisting dataset of raw slippi replay files
## Training Data

A preexisting dataset of raw slippi replays is available at https://drive.google.com/file/d/1ab6ovA46tfiPZ2Y3a_yS1J3k3656yQ8f (27G, unzips to 200G). You can place this in the `data/` folder using `gdown <drive link> <destination>`.
### Old Format

A dataset of processed and compressed slippi replays is available at https://drive.google.com/u/0/uc?id=1O6Njx85-2Te7VAZP6zP51EHa1oIFmS1B. It is a tarball of zipped pickled slp files. Use

```bash
gdown 'https://drive.google.com/u/0/uc?id=1O6Njx85-2Te7VAZP6zP51EHa1oIFmS1B' -O data/
tar -xf data/AllCompressed.tar
```

to expand it. The folder `data/AllCompressed/` will now contain many zipped pickled and formatted slippi replay files. Another useful dataset is https://drive.google.com/uc?id=1ZIfDgkdQdu-ldCx_34e-VxYJwQCpV-i3 which only contains fox dittos.

### Processing a preexisting dataset of raw slippi replay files
A preexisting dataset of raw slippi replays is available at https://drive.google.com/file/d/1VqRECRNL8Zy4BFQVIHvoVGtfjz4fi9KC (28G, unzips to 222G). You can place this in the `data/` folder using `gdown <drive link> <destination>`.

The code relies on a small (~3 MB) SQL database, `melee_public_slp_dataset.sqlite3`, located in the `data/` folder.

For updates on this raw slippi replay dataset, the SQL database, or the dataset of processed and compressed slippi replays, check the ai channel of the Slippi Discord.

### New Format

The old data format had a few issues:
- It was potentially insecure due to the use of pickle.
- It used nests of numpy arrays, lacking any structure or specification.
- Being based on pickle, it was tied to the Python language.

The new data format is based on the language-agnostic [Arrow](https://arrow.apache.org/) library, with serialization via [Parquet](https://parquet.apache.org/). You can download the new dataset [here](https://slp-replays.s3.amazonaws.com/prod/datasets/pq/games.tar) as a tar archive. It contains 195,536 files filtered to be valid singles replays; see `slippi_db.preprocessing.is_training_replay` for the exact criteria. An associated metadata file, also in parquet format, is available [here](https://slp-replays.s3.amazonaws.com/prod/datasets/pq/meta.pq). The metadata file can be loaded as a pandas DataFrame:

```python
import pandas as pd
df = pd.read_parquet('meta.pq')
print(df.columns)
```

To access the game files, you can extract the tar, or mount it directly using [ratarmount](https://github.com/mxmlnkn/ratarmount). The tar is a flat directory whose filenames are the md5 hashes of the original .slp replays, corresponding to the "key" column in the metadata. Each file is a gzip-compressed parquet table with a single column called "root".
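
As a minimal sketch (assuming the tar has been extracted or mounted to a `games/` directory, e.g. via `ratarmount games.tar games/`; the directory name is an assumption), the metadata keys map directly to game file paths:

```python
import pandas as pd

meta = pd.read_parquet('meta.pq')
# Each "key" is the md5 hash of an original .slp replay and doubles as
# the filename inside games.tar (the games/ prefix is hypothetical).
game_paths = ['games/' + key for key in meta['key']]
```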

```python
import pyarrow.parquet as pq

game_path = game_paths[0]  # any file from the extracted or mounted tar (see the sketch above)
table = pq.read_table(game_path)  # gzip compression is handled transparently
game = table['root'].combine_chunks()  # pyarrow.StructArray
game[0].as_py()  # nested python dictionary representing the first frame
```

See `slippi_ai/types.py` for utility functions that manipulate pyarrow objects and convert them to the usual Python nests of numpy arrays used in machine learning.
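
For intuition, a recursive conversion along these lines is sketched below. This is a minimal stand-in, not the actual `slippi_ai/types.py` implementation, and the field names come from whatever schema pyarrow reports:

```python
import pyarrow as pa

def to_numpy_nest(array: pa.Array):
    """Recursively convert a pyarrow array into a nest of numpy arrays.

    Structs become dicts keyed by field name; leaf columns become
    numpy arrays (copying where zero-copy is not possible).
    """
    if pa.types.is_struct(array.type):
        return {
            field.name: to_numpy_nest(array.field(field.name))
            for field in array.type
        }
    return array.to_numpy(zero_copy_only=False)

frames = to_numpy_nest(game)  # e.g. the StructArray loaded above
```

In practice, prefer the utilities in `slippi_ai/types.py`, which know the actual game schema.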

## Configuration

Example command configurations:

```bash
python scripts/train.py with dataset.subset=fox_dittos network.name=frame_stack_mlp
python scripts/train.py with network.name=frame_stack_mlp
```

These are some available options:
```
dataset.subset=
all (default)
fox_dittos

network.name=
mlp (default)
frame_stack_mlp
197 changes: 197 additions & 0 deletions clusters/rllib.yaml
@@ -0,0 +1,197 @@
# A unique identifier for the head node and workers of this cluster.
cluster_name: slippi-rllib

# The maximum number of worker nodes to launch in addition to the head
# node.
# Max vCPUs is 384: https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Limits
max_workers: 10

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
image: "vladfi/slippi-ai:rllib"
# image: "rayproject/ray-ml:latest-cpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
container_name: "ray_container"
# If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
# if no cached version is present.
pull_before_run: True
run_options: # Extra options to pass into "docker run"
- --ulimit nofile=65536:65536
# - --shm-size=500M

# Example of running a GPU head with CPU workers
# head_image: "rayproject/ray-ml:latest-gpu"
# Allow Ray to automatically detect GPUs

# worker_image: "rayproject/ray-ml:latest-cpu"
# worker_run_options: []
worker_run_options:
- --shm-size=500M

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
type: aws
region: us-east-1
# Availability zone(s), comma-separated, that nodes may be launched in.
# Nodes are currently spread between zones by a round-robin approach,
# however this implementation detail should not be relied upon.
# availability_zone: us-west-2a,us-west-2b
# Whether to allow node reuse. If set to False, nodes will be terminated
# instead of stopped. If not present, the default is True.
cache_stopped_nodes: False

# How Ray will authenticate with newly launched nodes.
auth:
ssh_user: ec2-user
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below.
# ssh_private_key: /path/to/your/key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
ray.head.default:
# The node type's CPU and GPU resources are auto-detected based on AWS instance type.
# If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
# You can also set custom resources.
# For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
# resources: {"CPU": 1, "GPU": 1, "custom": 5}
resources: {}
# Provider-specific config for this node type, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see:
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
node_config:
# InstanceType: t3.micro
# InstanceType: m5.large
InstanceType: m5.xlarge # 4 CPU, 16 GB ram
ImageId: ami-0329a504ac63e1224 # minimal-ray
# You can provision additional disk space with a conf as follows
BlockDeviceMappings:
- DeviceName: /dev/xvda # matches device name in ami
Ebs:
VolumeSize: 10
VolumeType: gp3
# Additional options in the boto docs.
ray.worker.default:
# The minimum number of worker nodes of this type to launch.
# This number should be >= 0.
min_workers: 2
# The maximum number of worker nodes of this type to launch.
# This takes precedence over min_workers.
max_workers: 10
# The node type's CPU and GPU resources are auto-detected based on AWS instance type.
# If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
# You can also set custom resources.
# For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
# resources: {"CPU": 1, "GPU": 1, "custom": 5}
resources: {"CPU": 2, "worker": 1}
# Provider-specific config for this node type, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see:
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
node_config:
InstanceType: m6i.large # 2 CPU, 8 GB ram
ImageId: ami-0329a504ac63e1224 # minimal-ray
# You can provision additional disk space with a conf as follows
BlockDeviceMappings:
- DeviceName: /dev/xvda # matches device name in ami
Ebs:
VolumeSize: 8
VolumeType: gp3
# Run workers on spot by default. Comment this out to use on-demand.
# NOTE: If relying on spot instances, it is best to specify multiple different instance
# types to avoid interruption when one instance type is experiencing heightened demand.
# Demand information can be found at https://aws.amazon.com/ec2/spot/instance-advisor/
# InstanceMarketOptions:
# MarketType: spot
# Additional options can be found in the boto docs, e.g.
# SpotOptions:
# MaxPrice: MAX_HOURLY_PRICE
# Additional options in the boto docs.

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
# "/path1/on/remote/machine": "/path1/on/local/machine",
# "/path2/on/remote/machine": "/path2/on/local/machine",
# workaround .gitignore not being used in hash_file_contents
"/slippi-ai/slippi_ai": "slippi_ai",
"/slippi-ai/scripts": "scripts",
"/slippi-ai/slippi_db": "slippi_db",
# "/slippi-ai/benchmarks": "benchmarks",
"/slippi-ai/setup.cfg": "setup.cfg",
"/slippi-ai/pyproject.toml": "pyproject.toml",
# "/root/secrets.json": "secrets.json",
"/root/.aws": "~/.aws",
"/root/.wandb": "~/.wandb",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
- "**/.git"
- "**/.git/**"
- "__pycache__/"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
- ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands:
# We need to move things out of / to work around a __pycache__ issue:
# https://ray-distributed.slack.com/archives/CN2RGCHRR/p1638582892031600
- rm -rf /install/slippi-ai && cp -r /slippi-ai /install/slippi-ai
- pip install -e /install/slippi-ai
- aws s3 cp s3://slippi-data/SSBM.iso /install/SSBM.iso

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
- ray stop
- ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
- ray stop
- ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

head_node: {}
worker_nodes: {}
Binary file removed data/melee_public_slp_dataset.sqlite3
22 changes: 22 additions & 0 deletions docker/db.dockerfile
@@ -0,0 +1,22 @@
FROM python:3.9-buster

RUN pip install ray[default]

RUN apt update
RUN apt install -y p7zip-full rsync

# RUN pip install tensorflow

# slippi-specific

WORKDIR /install
COPY requirements.txt .
RUN pip install -r requirements.txt

# build the wheel externally with `maturin build` in the peppi-py repo
# ARG PEPPI_PY_WHL=peppi_py-0.4.3-cp39-abi3-linux_x86_64.whl
# COPY docker/$PEPPI_PY_WHL .
# RUN pip install $PEPPI_PY_WHL

# use peppi-py from pypi
RUN pip install peppi-py
39 changes: 39 additions & 0 deletions docker/rllib.dockerfile
@@ -0,0 +1,39 @@
FROM ubuntu:22.04

ENV INSTALL=/install
WORKDIR $INSTALL

RUN apt update
RUN apt install -y p7zip-full rsync python3-pip micro
RUN apt install -y python-is-python3

# big python deps
RUN pip install -U pip
RUN pip install tensorflow

# dolphin
RUN apt install -y libasound2 libegl1 libgl1 libgdk-pixbuf-2.0-0 libpangocairo-1.0-0 libusb-1.0-0
RUN pip install gdown
RUN gdown "https://drive.google.com/uc?id=1qrXsPiRD4_-voFXxIMBk2EMQ_iF8O9Me" -O dolphin
RUN chmod +x dolphin
RUN ./dolphin --appimage-extract
RUN rm dolphin
ENV DOLPHIN_PATH=$INSTALL/squashfs-root/usr/bin/

# https://serverfault.com/questions/949991/how-to-install-tzdata-on-a-ubuntu-docker-image
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Etc/UTC
RUN apt install -y tzdata

# aws tools (to pull SSBM.iso)
RUN apt install -y awscli

# ray
ARG RAY_WHEEL="https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl"
RUN pip install -U "ray[rllib] @ $RAY_WHEEL"
# RUN pip install ray[rllib]

# slippi-ai
RUN apt install -y git
COPY requirements.txt .
RUN pip install -r requirements.txt