Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster sanity check #4241

Merged
merged 20 commits into from
Sep 6, 2018
Merged

Conversation

danking
Copy link
Contributor

@danking danking commented Aug 30, 2018

I think this will, in practice, catch a lot of the issues we've created in the recent past with blowing memory or dataproc incompatibilities.

hail-ci-build.sh Outdated
--jar build/libs/hail-all-spark.jar \
--zip build/distributions/hail-python.zip && \
cluster submit ci-test-$SOURCE_SHA-$TARGET_SHA \
cluster-sanity-check.py
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! I feel like your work is starting to pay off.

You need to shut the cluster down. I'd do it with gcloud since that's likely to be more reliable (or both, to test cloudtools delete command?)

I'd run the whole script with set -ex and exit trap to clean up the cluster (see below).

Seems like you should also append a random string to the cluster name in case two instances of the same source-target are running. Maybe use the job pod name?

Finally, if you're cancelling jobs, we might need another way to clean up (does the pod get any warning or is it summarily axed?) Running a clean up script from the repo is one way. Labelling clusters with job/pod info and having a service that watches k8s and cleans up abandoned clusters is another. Another option is to have a single test cluster up all the time that multiple jobs run against (although this doesn't test cloudtools).

$ cat foo.sh
#!/bin/bash

set -ex

function f() {
    set +e # otherwise bar won't echo
    
    echo foo
    ls /foo/bar/baz
    echo bar
}

trap f EXIT

echo quux
ls /foo/bar/baz
echo quam
$ bash foo.sh
+ trap f EXIT
+ echo quux
quux
+ ls /foo/bar/baz
ls: cannot access '/foo/bar/baz': No such file or directory
+ f
+ set +e
+ echo foo
foo
+ ls /foo/bar/baz
ls: cannot access '/foo/bar/baz': No such file or directory
+ echo bar
bar

@danking
Copy link
Contributor Author

danking commented Aug 31, 2018

A checklist of things to make this robust:

++ cluster start ci-test-4d8a9b262c3687f33359d92afdae693c819dfb09-e9e8a40bb4f0c2337e5088c26186a4da4948bed2 --version devel --spark 2.2.0 --jar build/libs/hail-all-spark.jar --zip build/distributions/hail-python.zip
ERROR: (gcloud.dataproc.clusters.create) PERMISSION_DENIED: Request had insufficient authentication scopes.
  • be certain clusters don't stick around

I am not too concerned about the latter. We should look carefully, but it appears that, by default, processes on pods get 30s notice via TERM before they're killed. All cluster needs to do is to send google a termination request. Although the command takes forever to exit after cluster stop, this is because it waits for the cluster to shut down before returning. I regularly issue cluster stop and then force-kill the cluster command instead of waiting for the cluster to shutdown.

@cseed
Copy link
Collaborator

cseed commented Aug 31, 2018

it appears that, by default, processes on pods get 30s notice via TERM before they're killed

Perfect! I was wondering about that.

Although the command takes forever to exit after cluster stop

gcloud dataproc clusters delete has an --async option.

@danking
Copy link
Contributor Author

danking commented Aug 31, 2018

it will pass once the next cloud tools version is deployed.

hail-ci-build.sh Outdated
#!/bin/bash
set -ex

CLUSTER_NAME=ci-test-${SOURCE_SHA:0:12}-${SOURCE_SHA:0:12}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the second one supposed to be TARGET_SHA?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still feel like your should generate a (globally) unique (random) cluster name.

cseed
cseed previously requested changes Aug 31, 2018
hail-ci-build.sh Outdated
time pip install -U cloudtools
gcloud auth activate-service-account \
hail-ci-0-1@broad-ctsa.iam.gserviceaccount.com \
--key-file=/secrets/hail-ci-0-1.key && \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trailing &&

hail-ci-build.sh Outdated
#!/bin/bash
set -ex

CLUSTER_NAME=ci-test-${SOURCE_SHA:0:12}-${SOURCE_SHA:0:12}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still feel like your should generate a (globally) unique (random) cluster name.

hail-ci-build.sh Outdated
}
trap shutdown_cluster INT TERM

GRADLE_OPTS=-Xmx2048m ./gradlew testAll makeDocs archiveZip --gradle-user-home /gradle-cache && \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With set -ex, these && and trailing backslashes aren't necessary (and a bit more readable). Also, the EXIT_CODE can go, too.

@danking
Copy link
Contributor Author

danking commented Aug 31, 2018

@cseed how about ci-test-$(cat /dev/urandom | base64 | head -c 40)?

@cseed
Copy link
Collaborator

cseed commented Aug 31, 2018

base64 can contain some symbols, can the cluster names handle it? I normally use tr:

$ echo `LC_CTYPE=C tr -dc 'a-z0-9' < /dev/urandom | head -c 8`
k1awyx7o

@cseed
Copy link
Collaborator

cseed commented Aug 31, 2018

And indeed:

Cluster name 'ci-test-7ncBRTSJAgu1t8kTtltIXse5A1RwFFih0cIBma6T' must match pattern (?:[a-z](?:[-a-z0-9]{0,49}[a-z0-9])?)

And indeed, no upper case.

@danking
Copy link
Contributor Author

danking commented Sep 4, 2018

@cseed I found that I needed LC_ALL=C to appease OS X, so I included both env vars.

@danking
Copy link
Contributor Author

danking commented Sep 4, 2018

Waiting on a cloud tools change to let me set a staging bucket.

@danking
Copy link
Contributor Author

danking commented Sep 5, 2018

I have no explanation for the behavior of pip, it simply refuses to upgrade to the latest cloud tools.

+ pip search cloudtools
cloudtools (1.1.16)          - Collection of utilities for working on the Google Cloud Platform.
datawire-cloudtools (0.2.6)  - Datawire Cloud Tools
cloudseed (0.0.1)            - Cloudtools

real	0m0.867s
user	0m0.649s
sys	0m0.084s
+ pip install -U cloudtools
Collecting cloudtools
  Downloading https://files.pythonhosted.org/packages/46/78/966c9af5b88a01af73bb56486e853c00ff4865de0bf380282aa54fdec43a/cloudtools-1.1.15-py3-none-any.whl
Installing collected packages: cloudtools
Successfully installed cloudtools-1.1.15

real	0m1.718s
user	0m1.378s
sys	0m0.158s

@tpoterba
Copy link
Contributor

tpoterba commented Sep 5, 2018

I've seen this before, the PyPI databases are out of sync. You can see the latest with list but not install

@tpoterba
Copy link
Contributor

tpoterba commented Sep 5, 2018

wait an hour or so

@danking
Copy link
Contributor Author

danking commented Sep 5, 2018

hmm it seems to have finally updated

@danking
Copy link
Contributor Author

danking commented Sep 6, 2018

@cseed,

Alright, I think we've settled into a good pattern. I trap on EXIT and do all necessary clean up / copying. In the signal handler, I set +e and disable traps on INT and TERM. In the rest of the script, I use set -e and never use any &&'s.

I added a comment about this, but I'll also call out here this line:

trap "exit 42" INT TERM

I noticed that dash and sh do not execute the EXIT handler when interpretation halts due to receiving INT or TERM. The internet suggests that calling the EXIT handler when INT or TERM kills interpretation is a bash-ism. The line above explicitly calls exit which triggers the EXIT handler in sh and dash.

@danking danking merged commit d5a958d into hail-is:master Sep 6, 2018
danking added a commit to Nealelab/cloudtools that referenced this pull request Sep 7, 2018
copying some ideas from the discussion at hail-is/hail#4241
danking added a commit to Nealelab/cloudtools that referenced this pull request Sep 7, 2018
copying some ideas from the discussion at hail-is/hail#4241
danking pushed a commit that referenced this pull request May 8, 2019
* update

* update

* update

* updatE

* Create README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* update

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* update

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* added default Spark configs to init_notebook.py

* updatE

* update

* update

* add time delay to allow Jupyter to start

* add time delay to allow Jupyter to start

* revert log change

* update

* updatE

* updated README

* set maxResultSize property to unlimited

* changed default worker to n1-standard-8

* merged init_default.py functionality into init_notebook.py

* merged init_default.py functionality into init_notebook.py

* updated README

* updated README

* updated README

* updated README

* updated README

* updated README

* updated README

* updated README

* updated README

* updated README

* updated README

* fixed order of code in init script

* fixed ports argument in start-up script

* moved waiting for Jupyter code to init script

* updated alias code block

* fixed init script filename

* Google Chrome check needs to be fixed

* added gitignore

* highmem worker by default with --vep.

* added --hash option to start_cluster.py to reference older Hail builds

* Merge pull request #5 from Nealelab/dev

added --hash option to start_cluster.py to reference older Hail builds

* decoupled default conf in Jupyter notebook Spark from /etc/spark/conf/spark-defaults.conf

* typo in submit_cluster

* modified init_notebook script

* Update stop_cluster.py

* Now passes extra properties to gcloud

* added ability to specifiy custom Hail jar and zip for Jupyter notebook on startup

* Some tightening of options

* Moving into main

* Removed duplicate keyword argument

* remove duplicate argument

* Added diagnose_cluster.py

Compiles log files for a cluster to a local directory or a google bucket

```
python diagnose_cluster.py -n my-cluster -d my-cluster-diagnose/
python diagnose_cluster.py -n my-cluster -d gs://my-bucket/my-cluster-diagnose/
```

```
usage: diagnose_cluster.py [-h] --name NAME --dest DEST [--hail-log HAIL_LOG]
                           [--overwrite] [--no-diagnose] [--compress]
                           [--workers [WORKERS [WORKERS ...]]] [--take TAKE]

optional arguments:
  -h, --help            show this help message and exit
  --name NAME, -n NAME  Cluster name
  --dest DEST, -d DEST  Directory for diagnose output -- must be local
  --hail-log HAIL_LOG, -l HAIL_LOG
                        Path for hail.log file
  --overwrite           Delete dest directory before adding new files
  --no-diagnose         Do not run gcloud dataproc clusters diagnose
  --compress, -z        GZIP all files
  --workers [WORKERS [WORKERS ...]]
                        Specific workers to get log files from
  --take TAKE           Only download logs from the first N workers
```

- Runs `gcloud dataproc clusters diagnose`
- Grabs following log files from master node

```
/var/log/hive/hive-*
/var/log/google-dataproc-agent.0.log
/var/log/dataproc-initialization-script-0.log
/var/log/hadoop-mapreduce/mapred-mapred-historyserver*
/var/log/hadoop-hdfs/*-m.*
/var/log/hadoop-yarn/yarn-yarn-resourcemanager-*-m.*
/home/hail/hail.log # can be modified with command line argument
```

- Grabs following log files from workers

```
/var/log/hadoop-hdfs/hadoop-hdfs-datanode-*.*
/var/log/dataproc-startup-script.log
/var/log/hadoop-yarn/yarn-yarn-nodemanager-*.*
/var/log/hadoop-yarn/userlogs/*
```

Output directory has following structure:

```
diagnostic.tar
master/my-cluster-m/...
workers/my-cluster-w-*/...
hadoop-yarn/userlogs/application*/container*
```

* sec worker fix

* Break apart ssh options.

Saw failures with some version of gcloud/ssh.

* Exposed --metadata and fixed problem with creating directory

* Recapitulating subprocess fixes of PR #11

* Fix typo in README

* Added executable

* Updates to support multiple Hail versions and new deployment locations.

 - init_notebook is now versioned for compatibility. This commit uses
   version 2, which I've uploaded to gs://hail-common/init_notebook-2.py.
 - Hail now deploys both 0.1 and devel versions, so I added an argument to
   allow either to be used. The stable version should of course be used by
   default.
 - The init arg is now empty by default, because the init_notebook script
   should always be run (and requires the compatibility version to decide
   the correct path). It is still possible to use additional init actions.

* small fix in init_notebook; updated submit script to reflect new Hail deployment

* packaged commands under umbrella 'cluster' module

* updated diagnose

* updated readme; added --quiet flag to stop command

* updated readme with optional arguments

* Update LICENSE.txt

* make notebook default for cluster connect

* Overhaul CLI using argparse subparsers; interface change

 - More informative help messages
 - Added default args to --help output
 - Interface change: module comes before name

* Fixed HAIL_VERSION metadata variable.

* updated setup.py to reflect v1.1

* changed some instances of check_call to call to avoid redundant errors

* Remove zsh artifacts from README

* added --args option to submit script to allow passing arguments to submitted Hail scripts

* incremented to v1.1.2

* Update README.md

* Remove sleep

* removed Anaconda from notebook init; added --pkgs option to cluster start

* Fix deployment issues by bumping compatibility version

* fixed jar distribution issues

* forgot something

* Updating spark version to dataproc 1.2

* a few fixes for 2.2.0

* COMPAT version changes

* Made the os.mkdir statements safer and free from race conditions

* Fix cloudtools to work with Hail devel / 0.2 (#47)

* Update README.md (#48)

* Unify hail 0.1 and 0.2 again, fix submit (#49)

* Unify hail 0.1 and 0.2 again, fix submit

* Fixed submit help message

* Bump version

* Update init_notebook.py (#51)

* Add parsimonious (#52)

* Parameterize master memory fraction (#53)

* Parameterize master memory fraction

* Parameterize master memory fraction

* Parameterize master memory fraction

* add bokeh to imports (#54)

* Use specific version of decorator (#56)

* Update README.md (#57)

* add modify jar and zip (#59)

* * Fixed zip copying (#60)

* Added gs:// support

* rolling back google-cloud version (#62)

* moved up package installation in init script (#63)

* use beta for max-idle option (#61)

* use beta for max-idle option

* bug fix

* added Intel MKL to init script (#64)

* added Intel MKL to init script

* fix

* another fix

* Update default version to devel / spark 2.2.0; update README (#65)

* Update default version to devel / spark 2.2.0; update README

* fix

* Added initialization time-out option. (#71)

* add async option to stop (#73)

* check for errors in start, stop, submit, and list (#74)

* update version to 1.14 (#75)

* Syntax error (#76)

* fix syntax error

* bump versino

* add a bucket parameter (#78)

* add a bucket parameter

* also document deployment

* use config files to set some default properties (#77)

* do... something

* set image based on spark version

* tweak to run using paths that deploy will spit out

* fix

* fix rebase

* Set up Continuous Integration (#80)

* wip hail ci

* fix formatting

* ignore emacs temp files

* add cluster sanity checks

* Update setup.py

* Update cluster-sanity-check-0.2.py

* Fix CI Build (#81)

* Update hail-ci-build.sh

* Update hail-ci-build.sh

* add more necessary things

* fix build image and update file

* fix build image maybe

* use python2

* fix image

* Update hail-ci-build.sh

* Update hail-ci-build.sh

* Continuous Deployment (#82)

* add deploy script

* document deployment secret creation

* fix readme

* fix if check

* ensure twine is in build image

* kick ci

* set required property? apparently?

* bump to 0.2 (#79)

* add make to image (#85)

* fix deploy (#86)

* copy some lessons from hail (#84)

copying some ideas from the discussion at #4241

* Update hail-ci-deploy.sh (#87)

* fix (#88)

* fix cloudtools published check (#89)

* add warning, versioned hash lookup (#90)

* fix deploy script version checking (#92)

* Test python 3.6 and fix python 3.7 incompatibility (#91)

* test python3

* also fix async is reserved word

* checked in bad build file

* unneeded var

* shush pip

* kick ci

* update build hash

* Ignore INT and TERM in shutdown_cluster

* parse init script list (#94)

* parse init script list

* Update __init__.py

* switched devel vep to use docker init (#96)

* bump version for vep init (#98)

* deploy python2 and python3 to pypi (#93)

* Update start.py (#99)

* Update start.py

* Update start.py

* Update __init__.py (#100)

* fix python3 deploy (#101)

* Fix pkgs logic (#102)

* Adding more options to modify (#67)

* Added options to modify clusters

* Update modify.py

* Add a max-idle 40m to test clusters (#103)

* Add a max-idle 40m to test clusters

* need gcloud beta components

* Pin dependency versions (#105)

* pin dependency versions

* update the version of cloudtools

* install all packages together to ensure dependencies are calculated together

* fail when subprocess fails

* fix conda invocation

* compatibility with python2

* Revert "fail when subprocess fails"

This reverts commit 25e7c0a524823d91894b538427f179611e79f271.

* blah

* wtf

* if was backwards

* restart tests

* Improve Error Messages when Subprocesses Fail (#111)

* add and use safe_call

* fail when subprocess fails

* use safe_call

* use safe_call extensively

* simplify and make correct safe_call

* fix splat

* fix

* foo

* update verison (#113)

* Added describe to get Hail file info/schema (#112)

* Added describe to get Hail file info/schema

* f -> format

* Update setup.py

* Update __init__.py (#115)

* Fix cloudtools (#116)

* fix

* bump version

* fix (#117)

* bump ver (#118)

* fixed describe ordering for python2 (#119)

* devel => 0.2 (#121)

* add latest (#120)

* added --max-age option. (#123)

* added --max-age option.

* bump version

* update to 3.0.0 (#122)

* update to 3.0.0

* bump

* bump

* s/devel/0.2

* Fix packages again (#124)

* Fix packages again

* fix

* Add 'modify' and 'list' command docs (#125)

* Update connect.py (#126)

* Rollout fix for chrome 72 (#130)

* Add python files or folders from environment variable; zip files together (#127)

* Add python files or folders from environment variable; zip files together

* bumping version

* files -> pyfiles

* missed one

* overloaded variable

* updating VEP init script (#129)

* updating VEP init script

* Update __init__.py

* files -> pyfiles once more (#131)

* fix for jupyter/tornado incompatibility (#133)

* Adding project flag (#134)

* Adding project flag

* Adding configuration option as well

* Adding support for GRCh38 VEP (#135)

* Adding support for GRCh38 VEP

* version bump

* Fixing VEP version for 38 (#136)

* Adding support for GRCh38 VEP

* version bump

* fix for 38 VEP version

* Update __init__.py

* Disable stackdriver on cloudtools clusters (#138)

* Update default spark version (#139)

* Update default spark version

* Clean up imports

* allowing pass-through args for submit (#140)

* allowing pass-through args for submit

* bump version

* moar version

* moved cloudtools to subdirectory project for inclusion in monorepo

* moved .gitignore

* bump

* bump
@danking danking deleted the most-basic-cluster-test branch December 18, 2019 19:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants