Add Trial images build to the CI #1457

Merged 10 commits on Mar 10, 2021
1 change: 0 additions & 1 deletion .dockerignore
@@ -1,7 +1,6 @@
.git
.gitignore
docs
examples
manifests
pkg/ui/*/frontend/node_modules
pkg/ui/*/frontend/build
12 changes: 4 additions & 8 deletions examples/v1beta1/README.md
@@ -348,7 +348,8 @@ It will stop port-forward process and delete minikube cluster.
docker.io/kubeflowkatib/mxnet-mnist
```

- Pytorch mnist example with saving metrics to the file, [source](https://github.com/kubeflow/katib/blob/master/examples/v1beta1/file-metrics-collector/mnist.py).
- PyTorch mnist example that saves metrics to a file or prints them to StdOut,
[source](https://github.com/kubeflow/katib/blob/master/examples/v1beta1/pytorch-mnist/mnist.py).

```
docker.io/kubeflowkatib/pytorch-mnist
@@ -372,13 +373,8 @@ docker.io/kubeflowkatib/enas-cnn-cifar10-cpu
docker.io/kubeflowkatib/darts-cnn-cifar10
```

- Pytorch operator mnist example, [source](https://github.com/kubeflow/pytorch-operator/blob/master/examples/mnist/mnist.py).

```
gcr.io/kubeflow-ci/pytorch-dist-mnist-test
```

- Tf operator mnist example, [source](https://github.com/kubeflow/tf-operator/blob/master/examples/v1/mnist_with_summaries/mnist_with_summaries.py).
- TF operator mnist example that writes summary data,
[source](https://github.com/kubeflow/tf-operator/blob/master/examples/v1/mnist_with_summaries/mnist_with_summaries.py).

```
gcr.io/kubeflow-ci/tf-mnist-with-summaries
8 changes: 5 additions & 3 deletions examples/v1beta1/custom-metricscollector-example.yaml
@@ -66,12 +66,14 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-e294a90
# TODO (andreyvelich): Add tag to the image.
image: docker.io/kubeflowkatib/pytorch-mnist:latest
imagePullPolicy: Always
command:
- "python"
- "/opt/mnist/src/mnist.py"
- "python3"
- "/opt/pytorch-mnist/mnist.py"
- "--epochs=1"
- "--log-path=/katib/mnist.log"
- "--lr=${trialParameters.learningRate}"
- "--momentum=${trialParameters.momentum}"
restartPolicy: Never
13 changes: 0 additions & 13 deletions examples/v1beta1/file-metrics-collector/Dockerfile

This file was deleted.

10 changes: 6 additions & 4 deletions examples/v1beta1/file-metricscollector-example.yaml
@@ -53,12 +53,14 @@ spec:
spec:
containers:
- name: training-container
image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-e294a90
# TODO (andreyvelich): Add tag to the image.
image: docker.io/kubeflowkatib/pytorch-mnist:latest
imagePullPolicy: Always
command:
- "python"
- "/opt/mnist/src/mnist.py"
- "--epochs=1"
- "python3"
- "/opt/pytorch-mnist/mnist.py"
- "--epochs=2"
- "--log-path=/katib/mnist.log"
- "--lr=${trialParameters.learningRate}"
- "--momentum=${trialParameters.momentum}"
restartPolicy: Never
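The file metrics collector in this example tails `/katib/mnist.log` and extracts metric lines in Katib's `{metricName: ..., metricValue: ...}` format, as written by the updated `mnist.py`. A minimal sketch of how such a line could be parsed (the regex and function name here are illustrative, not Katib's actual parser):

```python
import re

# Illustrative pattern for log lines such as:
# 2021-03-10T10:00:00Z INFO {metricName: accuracy, metricValue: 0.9813};{metricName: loss, metricValue: 0.0621}
METRIC_RE = re.compile(
    r"\{metricName: (?P<name>[\w-]+), metricValue: (?P<value>[-+]?\d*\.?\d+)\}")

def parse_metrics(line):
    """Return a dict of metric name -> float value found in one log line."""
    return {m.group("name"): float(m.group("value"))
            for m in METRIC_RE.finditer(line)}

line = ("2021-03-10T10:00:00Z INFO "
        "{metricName: accuracy, metricValue: 0.9813};"
        "{metricName: loss, metricValue: 0.0621}")
print(parse_metrics(line))
```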
11 changes: 2 additions & 9 deletions examples/v1beta1/mxnet-mnist/Dockerfile
@@ -1,13 +1,6 @@
FROM ubuntu:16.04
FROM mxnet/python:latest_cpu_native_py3

RUN apt-get update && \
apt-get install -y wget python3-dev gcc && \
wget https://bootstrap.pypa.io/get-pip.py && \
python3 get-pip.py

RUN pip3 install mxnet

ADD . /opt/mxnet-mnist
ADD examples/v1beta1/mxnet-mnist /opt/mxnet-mnist
WORKDIR /opt/mxnet-mnist

RUN chgrp -R 0 /opt/mxnet-mnist \
3 changes: 1 addition & 2 deletions examples/v1beta1/nas/darts-cnn-cifar10/Dockerfile
@@ -4,8 +4,7 @@ FROM pytorch/pytorch:1.0-cuda${cuda_version}-cudnn${cudnn_version}-runtime

ENV TARGET_DIR /opt/nas/darts-cnn-cifar10


ADD . ${TARGET_DIR}
ADD examples/v1beta1/nas/darts-cnn-cifar10 ${TARGET_DIR}
WORKDIR ${TARGET_DIR}

RUN chgrp -R 0 ${TARGET_DIR} \
18 changes: 3 additions & 15 deletions examples/v1beta1/nas/enas-cnn-cifar10/Dockerfile.cpu
@@ -1,24 +1,12 @@
FROM tensorflow/tensorflow:1.12.0
FROM tensorflow/tensorflow:1.15.4-py3

ENV TARGET_DIR /opt/nas/enas-cnn-cifar10

# Install system packages
RUN apt-get update && apt-get install -y software-properties-common && \
add-apt-repository ppa:deadsnakes/ppa && \
apt-get update && \
apt-get install -y --no-install-recommends \
python3-setuptools \
python3-dev \
python3-pip \
git \
graphviz \
wget

ADD . ${TARGET_DIR}
ADD examples/v1beta1/nas/enas-cnn-cifar10 ${TARGET_DIR}
WORKDIR ${TARGET_DIR}

RUN pip3 install --upgrade pip
RUN pip3 install --upgrade --no-cache-dir -r requirements-cpu.txt
RUN pip3 install --upgrade -r requirements.txt
ENV PYTHONPATH ${TARGET_DIR}

RUN chgrp -R 0 ${TARGET_DIR} \
27 changes: 3 additions & 24 deletions examples/v1beta1/nas/enas-cnn-cifar10/Dockerfile.gpu
@@ -1,33 +1,12 @@
ARG cuda_version=10.0
ARG cudnn_version=7
FROM nvidia/cuda:${cuda_version}-cudnn${cudnn_version}-devel
FROM tensorflow/tensorflow:1.15.4-gpu-py3

ENV TARGET_DIR /opt/nas/enas-cnn-cifar10

# Install system packages
RUN apt-get update && apt-get install -y software-properties-common && \
add-apt-repository ppa:deadsnakes/ppa && \
apt-get update && \
apt-get install -y --no-install-recommends \
bzip2 \
g++ \
git \
graphviz \
libgl1-mesa-glx \
libhdf5-dev \
openmpi-bin \
python3 \
python3-pip \
python3-setuptools \
python3-dev \
wget && \
rm -rf /var/lib/apt/lists/*

ADD . ${TARGET_DIR}
ADD examples/v1beta1/nas/enas-cnn-cifar10 ${TARGET_DIR}
WORKDIR ${TARGET_DIR}

RUN pip3 install --upgrade pip
RUN pip3 install --upgrade --no-cache-dir -r requirements-gpu.txt
RUN pip3 install --upgrade -r requirements.txt
ENV PYTHONPATH ${TARGET_DIR}

RUN chgrp -R 0 ${TARGET_DIR} \
2 changes: 0 additions & 2 deletions examples/v1beta1/nas/enas-cnn-cifar10/requirements-cpu.txt

This file was deleted.

2 changes: 0 additions & 2 deletions examples/v1beta1/nas/enas-cnn-cifar10/requirements-gpu.txt

This file was deleted.

1 change: 1 addition & 0 deletions examples/v1beta1/nas/enas-cnn-cifar10/requirements.txt
@@ -0,0 +1 @@
keras==2.2.4
14 changes: 14 additions & 0 deletions examples/v1beta1/pytorch-mnist/Dockerfile
@@ -0,0 +1,14 @@
FROM pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime

ADD examples/v1beta1/pytorch-mnist /opt/pytorch-mnist
WORKDIR /opt/pytorch-mnist

# Add folder for the logs.
RUN mkdir /katib

RUN chgrp -R 0 /opt/pytorch-mnist \
&& chmod -R g+rwX /opt/pytorch-mnist \
&& chgrp -R 0 /katib \
&& chmod -R g+rwX /katib

ENTRYPOINT ["python3", "/opt/pytorch-mnist/mnist.py"]
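Every Dockerfile in this PR now uses repo-relative `ADD` paths (for example `ADD examples/v1beta1/pytorch-mnist /opt/pytorch-mnist` instead of `ADD .`), which implies the images must be built with the repository root as the Docker build context. A sketch of such a build, assuming you are in the katib repo root (the tag is illustrative):

```shell
# Build from the repository root so the repo-relative ADD paths resolve.
docker build \
  -f examples/v1beta1/pytorch-mnist/Dockerfile \
  -t docker.io/kubeflowkatib/pytorch-mnist:latest \
  .
```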
examples/v1beta1/pytorch-mnist/mnist.py
@@ -11,9 +11,13 @@
import torch.nn.functional as F
import torch.optim as optim

WORLD_SIZE = int(os.environ.get('WORLD_SIZE', 1))
# To fix this issue: https://github.com/pytorch/vision/issues/1938.
from six.moves import urllib
opener = urllib.request.build_opener()
opener.addheaders = [("User-agent", "Mozilla/5.0")]
urllib.request.install_opener(opener)

logging.basicConfig(filename='/katib/mnist.log', level=logging.DEBUG)
WORLD_SIZE = int(os.environ.get("WORLD_SIZE", 1))


class Net(nn.Module):
@@ -45,11 +49,10 @@ def train(args, model, device, train_loader, optimizer, epoch):
loss.backward()
optimizer.step()
if batch_idx % args.log_interval == 0:
msg = 'Train Epoch: {} [{}/{} ({:.0f}%)]\tloss={:.4f}'.format(
msg = "Train Epoch: {} [{}/{} ({:.0f}%)]\tloss={:.4f}".format(
epoch, batch_idx * len(data), len(train_loader.dataset),
100. * batch_idx / len(train_loader), loss.item())
print(msg)
logging.debug(msg)
logging.info(msg)
niter = epoch * len(train_loader) + batch_idx


@@ -61,12 +64,12 @@ def test(args, model, device, test_loader, epoch):
for data, target in test_loader:
data, target = data.to(device), target.to(device)
output = model(data)
test_loss += F.nll_loss(output, target, reduction='sum').item() # sum up batch loss
test_loss += F.nll_loss(output, target, reduction="sum").item() # sum up batch loss
pred = output.max(1, keepdim=True)[1] # get the index of the max log-probability
correct += pred.eq(target.view_as(pred)).sum().item()

test_loss /= len(test_loader.dataset)
logging.info('\n{{metricName: accuracy, metricValue: {:.4f}}};{{metricName: loss, metricValue: {:.4f}}}\n'.format(
logging.info("{{metricName: accuracy, metricValue: {:.4f}}};{{metricName: loss, metricValue: {:.4f}}}\n".format(
float(correct) / len(test_loader.dataset), test_loss))


@@ -80,52 +83,70 @@ def is_distributed():

def main():
# Training settings
parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
parser.add_argument('--batch-size', type=int, default=64, metavar='N',
help='input batch size for training (default: 64)')
parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
help='input batch size for testing (default: 1000)')
parser.add_argument('--epochs', type=int, default=10, metavar='N',
help='number of epochs to train (default: 10)')
parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
help='learning rate (default: 0.01)')
parser.add_argument('--momentum', type=float, default=0.5, metavar='M',
help='SGD momentum (default: 0.5)')
parser.add_argument('--no-cuda', action='store_true', default=False,
help='disables CUDA training')
parser.add_argument('--seed', type=int, default=1, metavar='S',
help='random seed (default: 1)')
parser.add_argument('--log-interval', type=int, default=10, metavar='N',
help='how many batches to wait before logging training status')
parser.add_argument('--save-model', action='store_true', default=False,
help='For Saving the current Model')
parser = argparse.ArgumentParser(description="PyTorch MNIST Example")
parser.add_argument("--batch-size", type=int, default=64, metavar="N",
help="input batch size for training (default: 64)")
parser.add_argument("--test-batch-size", type=int, default=1000, metavar="N",
help="input batch size for testing (default: 1000)")
parser.add_argument("--epochs", type=int, default=10, metavar="N",
help="number of epochs to train (default: 10)")
parser.add_argument("--lr", type=float, default=0.01, metavar="LR",
help="learning rate (default: 0.01)")
parser.add_argument("--momentum", type=float, default=0.5, metavar="M",
help="SGD momentum (default: 0.5)")
parser.add_argument("--no-cuda", action="store_true", default=False,
help="disables CUDA training")
parser.add_argument("--seed", type=int, default=1, metavar="S",
help="random seed (default: 1)")
parser.add_argument("--log-interval", type=int, default=10, metavar="N",
help="how many batches to wait before logging training status")
parser.add_argument("--log-path", type=str, default="",
help="Path to save logs. Print to StdOut if log-path is not set")
parser.add_argument("--save-model", action="store_true", default=False,
help="For Saving the current Model")

if dist.is_available():
parser.add_argument('--backend', type=str, help='Distributed backend',
parser.add_argument("--backend", type=str, help="Distributed backend",
choices=[dist.Backend.GLOO, dist.Backend.NCCL, dist.Backend.MPI],
default=dist.Backend.GLOO)
args = parser.parse_args()

# Use this format (%Y-%m-%dT%H:%M:%SZ) to record timestamp of the metrics.
# If log_path is empty print log to StdOut, otherwise print log to the file.
if args.log_path == "":
logging.basicConfig(
format="%(asctime)s %(levelname)-8s %(message)s",
datefmt="%Y-%m-%dT%H:%M:%SZ",
level=logging.DEBUG)
else:
logging.basicConfig(
format="%(asctime)s %(levelname)-8s %(message)s",
datefmt="%Y-%m-%dT%H:%M:%SZ",
level=logging.DEBUG,
filename=args.log_path)

use_cuda = not args.no_cuda and torch.cuda.is_available()
if use_cuda:
print('Using CUDA')
print("Using CUDA")

torch.manual_seed(args.seed)

device = torch.device("cuda" if use_cuda else "cpu")

if should_distribute():
print('Using distributed PyTorch with {} backend'.format(args.backend))
print("Using distributed PyTorch with {} backend".format(args.backend))
dist.init_process_group(backend=args.backend)

kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
kwargs = {"num_workers": 1, "pin_memory": True} if use_cuda else {}
train_loader = torch.utils.data.DataLoader(
datasets.MNIST('../data', train=True, download=True,
datasets.MNIST("../data", train=True, download=True,
transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])),
batch_size=args.batch_size, shuffle=True, **kwargs)
test_loader = torch.utils.data.DataLoader(
datasets.MNIST('../data', train=False, transform=transforms.Compose([
datasets.MNIST("../data", train=False, transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])),
@@ -148,5 +169,5 @@ def main():
torch.save(model.state_dict(), "mnist_cnn.pt")


if __name__ == '__main__':
if __name__ == "__main__":
main()
22 changes: 13 additions & 9 deletions examples/v1beta1/pytorchjob-example.yaml
@@ -8,9 +8,9 @@ spec:
maxTrialCount: 12
maxFailedTrialCount: 3
objective:
type: maximize
goal: 0.99
objectiveMetricName: accuracy
type: minimize
goal: 0.001
objectiveMetricName: loss
algorithm:
algorithmName: random
parameters:
@@ -45,11 +45,13 @@ spec:
spec:
containers:
- name: pytorch
image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
# TODO (andreyvelich): Add tag to the image.
image: docker.io/kubeflowkatib/pytorch-mnist:latest
imagePullPolicy: Always
command:
- "python"
- "/var/mnist.py"
- "python3"
- "/opt/pytorch-mnist/mnist.py"
- "--epochs=1"
- "--lr=${trialParameters.learningRate}"
- "--momentum=${trialParameters.momentum}"
Worker:
@@ -59,10 +61,12 @@ spec:
spec:
containers:
- name: pytorch
image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
# TODO (andreyvelich): Add tag to the image.
image: docker.io/kubeflowkatib/pytorch-mnist:latest
imagePullPolicy: Always
command:
- "python"
- "/var/mnist.py"
- "python3"
- "/opt/pytorch-mnist/mnist.py"
- "--epochs=1"
- "--lr=${trialParameters.learningRate}"
- "--momentum=${trialParameters.momentum}"