Extend Katib API for NAS jobs #327

Merged
merged 45 commits into from
Jan 29, 2019
Changes from 44 commits
89e56a3
Add fields to studyjob structure
andreyvelich Dec 14, 2018
67ce5ee
Change nasjob yaml file
andreyvelich Dec 14, 2018
7fe9b7d
Change parameter type
andreyvelich Dec 18, 2018
9b8764d
Merge remote-tracking branch 'upstream/master' into 293-extend-sj-str…
andreyvelich Dec 18, 2018
f6eb5ce
Add Parameter Type=range
andreyvelich Dec 20, 2018
743265b
Change API
andreyvelich Jan 7, 2019
3e5a371
Change api.proto
andreyvelich Jan 7, 2019
abd564e
Change input size
andreyvelich Jan 7, 2019
4342a8b
Reset API structure
andreyvelich Jan 14, 2019
f673386
Change StudyJob API structure
andreyvelich Jan 14, 2019
fdb51d1
Remove Range parameter
andreyvelich Jan 15, 2019
6edb518
Fix api.proto
andreyvelich Jan 15, 2019
de9a6d7
Fix gopkg.toml
andreyvelich Jan 15, 2019
945d3fd
Remove old nasjob file
andreyvelich Jan 15, 2019
972cf30
Fix nasjob.yaml
andreyvelich Jan 15, 2019
ac63ac0
Add custom suggestion
andreyvelich Jan 15, 2019
1ff6840
Add blank NAS suggestion
andreyvelich Jan 17, 2019
e8c9636
Add correct YAML file for NAS example
andreyvelich Jan 17, 2019
f5498e4
Fix newline
andreyvelich Jan 17, 2019
7c15d64
Change StudyID to 1
andreyvelich Jan 17, 2019
2f8dc49
Add jobType parameter in Parsing
andreyvelich Jan 17, 2019
7ed6b0a
Remove changes in manager
andreyvelich Jan 17, 2019
9546fef
Add NasConfig inside Yaml file
andreyvelich Jan 18, 2019
a18466e
Fix name in nasConfig
andreyvelich Jan 18, 2019
3318467
Merge remote-tracking branch 'upstream/master' into 293-extend-sj-str…
andreyvelich Jan 18, 2019
04cdfbf
Fix get StudyConfig in NAS
andreyvelich Jan 18, 2019
f5a5d83
Add JobType in all services
andreyvelich Jan 18, 2019
a01710c
Add job_type in bayesian_service
andreyvelich Jan 18, 2019
f1dac5c
Add pointers in NasConfig structure
andreyvelich Jan 18, 2019
d653643
Fix Pointer in API
andreyvelich Jan 18, 2019
2802c6f
Add consts for jobType
andreyvelich Jan 18, 2019
794e7cc
Move const jobType to const file
andreyvelich Jan 19, 2019
4fa52ae
Remove Range parameter
andreyvelich Jan 22, 2019
7ca5f0c
Modify YAML file for NAS jobs
andreyvelich Jan 23, 2019
437b614
Add getStudyJobType function in GRPC server
andreyvelich Jan 24, 2019
9db4a81
Add blank GetStudyJobType func in manager
andreyvelich Jan 24, 2019
8f0d206
Merge remote-tracking branch 'upstream/master' into 293-extend-sj-str…
andreyvelich Jan 24, 2019
2324741
Fix metrics collector
andreyvelich Jan 24, 2019
8766eda
Remove jobType from getStudy
andreyvelich Jan 25, 2019
c362703
Remove getStudyJobType from manager
andreyvelich Jan 25, 2019
4df7151
Add NAS RL yaml deployment
andreyvelich Jan 25, 2019
b0fb3cd
Change worker to GPU
andreyvelich Jan 26, 2019
55c924e
Clean nasrl suggestion
andreyvelich Jan 26, 2019
ac2dd76
Add -u inside training-container
andreyvelich Jan 26, 2019
4263d1c
Fix namespace in worker template
andreyvelich Jan 28, 2019
4 changes: 2 additions & 2 deletions cmd/manager/main.go
@@ -91,8 +91,8 @@ func (s *server) GetTrials(ctx context.Context, in *api_pb.GetTrialsRequest) (*a
 }

 func (s *server) GetTrial(ctx context.Context, in *api_pb.GetTrialRequest) (*api_pb.GetTrialReply, error) {
-	t, err := dbIf.GetTrial(in.TrialId)
-	return &api_pb.GetTrialReply{Trial: t}, err
+	t, err := dbIf.GetTrial(in.TrialId)
+	return &api_pb.GetTrialReply{Trial: t}, err
 }

 func (s *server) GetSuggestions(ctx context.Context, in *api_pb.GetSuggestionsRequest) (*api_pb.GetSuggestionsReply, error) {
8 changes: 8 additions & 0 deletions cmd/suggestion/nasrl/Dockerfile
@@ -0,0 +1,8 @@
FROM python:3

ADD . /usr/src/app/github.com/kubeflow/katib
WORKDIR /usr/src/app/github.com/kubeflow/katib/cmd/suggestion/nasrl
RUN pip install --no-cache-dir -r requirements.txt
ENV PYTHONPATH /usr/src/app/github.com/kubeflow/katib:/usr/src/app/github.com/kubeflow/katib/pkg/api/python

ENTRYPOINT ["python", "-u", "main.py"]
Empty file.
29 changes: 29 additions & 0 deletions cmd/suggestion/nasrl/main.py
@@ -0,0 +1,29 @@
import grpc
from concurrent import futures

import time

from pkg.api.python import api_pb2_grpc
from pkg.suggestion.nasrl_service import NasrlService
from pkg.suggestion.types import DEFAULT_PORT
from logging import getLogger, StreamHandler, INFO, DEBUG


_ONE_DAY_IN_SECONDS = 60 * 60 * 24


def serve():
    print("NAS RL Suggestion Service")
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    api_pb2_grpc.add_SuggestionServicer_to_server(NasrlService(), server)
    server.add_insecure_port(DEFAULT_PORT)
    print("Listening...")
    server.start()
    try:
        while True:
            time.sleep(_ONE_DAY_IN_SECONDS)
    except KeyboardInterrupt:
        server.stop(0)

if __name__ == "__main__":
    serve()
9 changes: 9 additions & 0 deletions cmd/suggestion/nasrl/requirements.txt
@@ -0,0 +1,9 @@
grpcio
duecredit
cloudpickle==0.5.6
numpy>=1.13.3
scikit-learn>=0.19.0
scipy>=0.19.1
forestci
protobuf
googleapis-common-protos
120 changes: 120 additions & 0 deletions examples/nasjob-example-RL.yaml
@@ -0,0 +1,120 @@
apiVersion: "kubeflow.org/v1alpha1"
kind: StudyJob
metadata:
  namespace: kubeflow
  labels:
    controller-tools.k8s.io: "1.0"
  name: nas-rl-example
spec:
  studyName: nas-rl-example
  owner: crd
  optimizationtype: maximize
  objectivevaluename: Validation-Accuracy
  optimizationgoal: 0.99
  requestcount: 3
  metricsnames:
    - accuracy
  nasConfig:
    graphConfig:
Contributor:

Please explain graphConfig and operations.
I believe graphConfig is general information for the NAS job, while operations are task-specific information (e.g. for a CNN).
Why not merge them? graphConfig could be encoded as a parameterconfig.

Member Author andreyvelich, Jan 22, 2019:

Yes, GraphConfig holds the initial general information for the directed acyclic graph (DAG). We can't merge them, because each NAS StudyJob should have a single GraphConfig but can have as many Operations as you want. Also, the current structure of ParameterConfig doesn't match GraphConfig, so we would have to extend ParameterConfig as well in that case.

      numLayers: 8
      inputSize:
        - 32
        - 32
        - 3
      outputSize:
        - 10
    operations:
      - operationType: convolution
        parameterconfigs:
          - name: filter_size
            parametertype: categorical
            feasible:
              list:
                - "3"
                - "5"
                - "7"
          - name: num_filter
            parametertype: categorical
            feasible:
              list:
                - "32"
                - "48"
                - "64"
                - "96"
                - "128"
          - name: stride
            parametertype: categorical
            feasible:
              list:
                - "1"
                - "2"
      - operationType: reduction
        parameterconfigs:
          - name: reduction_type
            parametertype: categorical
            feasible:
              list:
                - max_pooling
                - avg_pooling
          - name: pool_size
            parametertype: int
            feasible:
              min: "2"
              max: "3"
              step: "1"
  workerSpec:
    goTemplate:
      rawTemplate: |-
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: {{.WorkerID}}
          namespace: kubeflow
Member:

s/kubeflow/{{.NameSpace}}

        spec:
          template:
            spec:
              containers:
              - name: {{.WorkerID}}
                image: docker.io/deepermind/training-container-nas
                command:
                - "python3.5"
                - "-u"
                - "RunTrial.py"
                {{- with .HyperParameters}}
                {{- range .}}
                - "--{{.Name}}={{.Value}}"
                {{- end}}
                {{- end}}
                resources:
                  limits:
                    nvidia.com/gpu: 1
              restartPolicy: Never
  suggestionSpec:
    suggestionAlgorithm: "nasrl"
    suggestionParameters:
      - name: "lstm_num_cells"
        value: "64"
      - name: "lstm_num_layers"
        value: "1"
      - name: "lstm_keep_prob"
        value: "1.0"
      - name: "optimizer"
        value: "adam"
      - name: "init_learning_rate"
        value: "1e-3"
      - name: "lr_decay_start"
        value: "0"
      - name: "lr_decay_every"
        value: "1000"
      - name: "lr_decay_rate"
        value: "0.9"
      - name: "skip-target"
        value: "0.4"
      - name: "skip-weight"
        value: "0.8"
      - name: "l2_reg"
        value: "0"
      - name: "entropy_weight"
        value: "1e-4"
      - name: "baseline_decay"
        value: "0.9999"
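
The review thread in this file states that a NAS StudyJob has a single graphConfig and any number of operations. A minimal Python sketch of that shape may make the distinction easier to see. The class names here (Parameter, Operation, GraphConfig, NasConfig) are hypothetical illustrations, not Katib's generated API types, and the sketch assumes that an int range includes both its min and max. It counts how many concrete settings each operation in the example above would admit under that assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Parameter:
    # Hypothetical mirror of one `parameterconfigs` entry.
    name: str
    parametertype: str                      # "categorical" or "int" in this example
    values: Optional[List[str]] = None      # corresponds to feasible.list
    min_value: Optional[int] = None         # corresponds to feasible.min
    max_value: Optional[int] = None         # corresponds to feasible.max
    step: int = 1                           # corresponds to feasible.step

    def choices(self) -> List[str]:
        if self.parametertype == "categorical":
            return list(self.values or [])
        # Assumption: the int range is inclusive of both ends.
        return [str(v) for v in range(self.min_value, self.max_value + 1, self.step)]


@dataclass
class Operation:
    operation_type: str
    parameters: List[Parameter]

    def num_configurations(self) -> int:
        # Product of the number of choices of every parameter.
        total = 1
        for parameter in self.parameters:
            total *= len(parameter.choices())
        return total


@dataclass
class GraphConfig:
    num_layers: int
    input_size: List[int]
    output_size: List[int]


@dataclass
class NasConfig:
    graph_config: GraphConfig                                   # one per StudyJob
    operations: List[Operation] = field(default_factory=list)   # as many as needed


nas_config = NasConfig(
    graph_config=GraphConfig(num_layers=8, input_size=[32, 32, 3], output_size=[10]),
    operations=[
        Operation("convolution", [
            Parameter("filter_size", "categorical", ["3", "5", "7"]),
            Parameter("num_filter", "categorical", ["32", "48", "64", "96", "128"]),
            Parameter("stride", "categorical", ["1", "2"]),
        ]),
        Operation("reduction", [
            Parameter("reduction_type", "categorical", ["max_pooling", "avg_pooling"]),
            Parameter("pool_size", "int", min_value=2, max_value=3, step=1),
        ]),
    ],
)

counts = {op.operation_type: op.num_configurations() for op in nas_config.operations}
print(counts)
```

Keeping GraphConfig as a separate single object, rather than folding it into the per-operation parameter lists, matches the reasoning given in the reply: the graph description is shared by the whole job, while each operation carries its own parameters.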
23 changes: 23 additions & 0 deletions manifests/vizier/suggestion/reinforcementlearning/deployment.yaml
@@ -0,0 +1,23 @@
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: vizier-suggestion-nasrl
  namespace: kubeflow
  labels:
    app: vizier
    component: suggestion-nasrl
spec:
  replicas: 1
  template:
    metadata:
      name: vizier-suggestion-nasrl
      labels:
        app: vizier
        component: suggestion-nasrl
    spec:
      containers:
      - name: vizier-suggestion-nasrl
        image: docker.io/deepermind/katib-nasrl-suggestion
        ports:
        - name: api
          containerPort: 6789
17 changes: 17 additions & 0 deletions manifests/vizier/suggestion/reinforcementlearning/service.yaml
@@ -0,0 +1,17 @@
apiVersion: v1
kind: Service
metadata:
  name: vizier-suggestion-nasrl
  namespace: kubeflow
  labels:
    app: vizier
    component: suggestion-nasrl
spec:
  type: ClusterIP
  ports:
    - port: 6789
      protocol: TCP
      name: api
  selector:
    app: vizier
    component: suggestion-nasrl
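
A Service only routes to pods whose labels match its selector, and when no targetPort is set the Service port is used as the target. The sketch below is a rough consistency check between the two manifests above, with their relevant fields copied by hand into plain dictionaries. The helper names selector_matches and undeclared_ports are hypothetical, and the check only handles numeric ports; a declared containerPort is informational in Kubernetes, so a mismatch here is a hint rather than proof of a broken setup.

```python
# Fields copied by hand from deployment.yaml and service.yaml above.
pod_template_labels = {"app": "vizier", "component": "suggestion-nasrl"}
container_ports = {"api": 6789}

service_selector = {"app": "vizier", "component": "suggestion-nasrl"}
service_ports = [{"name": "api", "port": 6789, "protocol": "TCP"}]


def selector_matches(selector, pod_labels):
    """True when every selector key/value pair appears in the pod labels."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


def undeclared_ports(ports, declared):
    """Return Service ports whose target is not among the declared container ports.

    Assumption: targetPort, when present, is numeric. When it is absent,
    the Service port itself is used as the target.
    """
    declared_numbers = set(declared.values())
    return [p for p in ports if p.get("targetPort", p["port"]) not in declared_numbers]


print(selector_matches(service_selector, pod_template_labels))
print(undeclared_ports(service_ports, container_ports))
```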