Add add-slaves and remove-slaves commands #115

Merged: nchammas merged 22 commits into master from resize-cluster on Aug 14, 2016

Conversation

@nchammas (Owner) commented May 7, 2016

This PR adds add-slaves and remove-slaves commands, which enable the user to resize existing clusters. add-slaves will query the master for its configuration and automatically use that information to provision new slaves. The only configuration the user may want to specify is the spot price for the new slaves.
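
As an illustrative sketch (not the exact code in this PR), the new slaves' launch settings are derived from the master roughly like this, assuming a boto3 `ec2.Instance` for the master:

```python
# Illustrative sketch only (not the exact code in this PR): build launch
# parameters for new slaves from the master's own EC2 attributes, so that
# added slaves match the existing cluster. The real implementation also
# covers settings like EBS options and shutdown behavior.
import boto3

ec2 = boto3.resource('ec2')

def slave_launch_config(master_instance) -> dict:
    return {
        'ImageId': master_instance.image_id,
        'InstanceType': master_instance.instance_type,
        'KeyName': master_instance.key_name,
        'SecurityGroupIds': [sg['GroupId'] for sg in master_instance.security_groups],
        'SubnetId': master_instance.subnet_id,
    }

# e.g. slave_launch_config(ec2.Instance('i-0123456789abcdef0'))  # hypothetical instance ID
```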

The PR also fixes a number of bugs, including a surprising config problem I discovered (0c9a24b; how did it not stand out before? 🤔) that may have been causing sporadic problems. These bugs were interfering with my ability to test this PR, so I thought it would be good to fix them as part of this work.

TODO:

Open questions:

  • Does the Spark master need to be restarted
    • when slaves are removed from a running cluster? No.
    • when slaves are added to a running cluster? No.
  • Does the HDFS master need to be restarted
    • when slaves are removed from a running cluster? No?

      HDFS seems to work properly, though warnings about the dead slaves are thrown when you add new files. hdfs dfsadmin -report still shows the removed slaves, and hdfs dfsadmin -refreshNodes doesn't seem to remove them from that report. I'm not sure if this is OK. (See the decommissioning sketch just after this list.)

    • when slaves are added to a running cluster? No.
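
For reference, the way -refreshNodes is normally made to drop a DataNode is to decommission it via the excludes file first. A rough sketch of that flow (the excludes-file path is illustrative, and this is not something Flintrock does today):

```python
# Standard HDFS decommissioning flow, sketched for reference only. The
# excludes-file path is an assumption: it's whatever dfs.hosts.exclude
# points to in hdfs-site.xml on the master.
import subprocess

def decommission_datanode(hostname, excludes_file='/path/to/dfs.exclude'):
    # 1. List the host in the excludes file.
    with open(excludes_file, 'a') as f:
        f.write(hostname + '\n')
    # 2. Ask the NameNode to re-read its include/exclude lists. The DataNode
    #    then shows up as decommissioning/decommissioned in `hdfs dfsadmin -report`.
    subprocess.run(['hdfs', 'dfsadmin', '-refreshNodes'], check=True)
```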

Fixes #16.
Fixes #113.
Fixes #140.

@nchammas force-pushed the resize-cluster branch 3 times, most recently from 319f464 to 9c41184 on May 28, 2016
@nchammas referenced this pull request in devsisters/flintrock on Jul 24, 2016
@nchammas (Owner, Author) commented:

Not sure why the Travis build is failing. Am investigating here: pyinstaller/pyinstaller#2123

@nchammas (Owner, Author) commented Aug 7, 2016

I am ready to merge this PR in.

Pinging @ereed-tesla, @serialx, and @soyeonbaek-dev, since I know you are interested in this feature.

Do you have any feedback or questions about this before I merge it in? Especially regarding some of the open questions noted above.

@nchammas changed the title from "[WIP] Add add-slaves and remove-slaves commands" to "Add add-slaves and remove-slaves commands" on Aug 7, 2016
@serialx (Contributor) commented Aug 8, 2016

First, hooray for adding cluster resize! A long-awaited feature indeed! 😃

For those who don't know: we have been maintaining our own cluster-resize patches for Flintrock and have been using them for over two weeks. We mostly use Spark without HDFS, so our experience with dynamic cluster sizes is limited to Spark. We don't restart the master, and both adding and removing slaves seem to work fine. Since Spark is fault tolerant by design, it handles the changing cluster size well.

About removing slaves: the code currently just removes N slaves from the list. I think it would be better to have some kind of prioritisation, like removing spot instances first. You can see @soyeonbaek-dev's code for one approach.

@nchammas (Owner, Author) commented Aug 8, 2016

About removing slaves. ... I think it would be better to have some kind of prioritisation, like removing spot instances first.

This is a good suggestion. If you have a mix of spot and on-demand instances (which this PR now makes possible), you probably do want to remove the spot instances first.

I'll add this to the PR.
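
Roughly, something like this (a minimal sketch assuming boto3 `ec2.Instance` objects for the slaves; not necessarily the exact code that will land in the PR):

```python
def pick_slaves_to_remove(slave_instances, num_slaves):
    # Prefer removing spot instances: they report a non-empty
    # spot_instance_request_id, while on-demand instances report None,
    # so sorting on "is None" puts spot instances first.
    prioritized = sorted(
        slave_instances,
        key=lambda instance: instance.spot_instance_request_id is None)
    return prioritized[:num_slaves]
```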

@nchammas (Owner, Author) commented:

OK, that's done. I'll leave this open for a few more days and then merge it in if there's no more feedback.

@serialx (Contributor) commented Aug 11, 2016

Testing this branch revealed a bug related to configuring EC2 instance profiles. When I use the --ec2-instance-profile-name setting to create a cluster, the add-slaves command fails with the following error:

Traceback (most recent call last):
  File "/usr/local/bin/flintrock", line 9, in <module>
    load_entry_point('Flintrock==0.6.0.dev0', 'console_scripts', 'flintrock')()
  File "/usr/local/lib/python3.5/site-packages/flintrock/flintrock.py", line 1031, in main
    cli(obj={})
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/flintrock/flintrock.py", line 653, in add_slaves
    **provider_options)
  File "/usr/local/lib/python3.5/site-packages/flintrock/ec2.py", line 46, in wrapper
    res = func(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/flintrock/ec2.py", line 272, in add_slaves
    instance_initiated_shutdown_behavior=instance_initiated_shutdown_behavior)
  File "/usr/local/lib/python3.5/site-packages/flintrock/ec2.py", line 677, in _create_instances
    'EbsOptimized': ebs_optimized})['SpotInstanceRequests']
  File "/usr/local/lib/python3.5/site-packages/botocore/client.py", line 278, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.5/site-packages/botocore/client.py", line 572, in _make_api_call
    raise ClientError(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InvalidParameterValue) when calling the RequestSpotInstances operation: Value (AIPAJZTBVWIAVNULB3U6W) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name

I was able to fix this bug with a simple patch, but I don't really like how it works. Perhaps you have a better solution. :)

diff --git a/flintrock/ec2.py b/flintrock/ec2.py
index 0f3dd1a..630d8fa 100644
--- a/flintrock/ec2.py
+++ b/flintrock/ec2.py
@@ -249,7 +249,10 @@ class EC2Cluster(FlintrockCluster):
         if not self.master_instance.iam_instance_profile:
             instance_profile_name = ''
         else:
-            instance_profile_name = self.master_instance.iam_instance_profile['Id']
+            iam = boto3.resource('iam')
+            instance_profile_id = self.master_instance.iam_instance_profile['Id']
+            profiles = filter(lambda x: x.instance_profile_id == instance_profile_id, iam.instance_profiles.all())
+            instance_profile_name = list(profiles)[0].instance_profile_name
         instance_initiated_shutdown_behavior = response['InstanceInitiatedShutdownBehavior']['Value']

         self.add_slaves_check()

Edit: Oops! Wrong traceback. Changed it to the relevant one!

@nchammas (Owner, Author) commented:

Hmm, so the IAM profile ID is different from the IAM profile name? I presume iam_instance_profile['Arn'] is also not what we want?

I'll take a closer look at this tonight or tomorrow.

@serialx (Contributor) commented Aug 12, 2016

I tried the Arn approach. It also failed:

Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 170, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/serialx/workspace/flintrock/flintrock/__main__.py", line 8, in <module>
    sys.exit(main())
  File "/Users/serialx/workspace/flintrock/flintrock/flintrock.py", line 1031, in main
    cli(obj={})
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/serialx/workspace/flintrock/flintrock/flintrock.py", line 653, in add_slaves
    **provider_options)
  File "/Users/serialx/workspace/flintrock/flintrock/ec2.py", line 46, in wrapper
    res = func(*args, **kwargs)
  File "/Users/serialx/workspace/flintrock/flintrock/ec2.py", line 275, in add_slaves
    instance_initiated_shutdown_behavior=instance_initiated_shutdown_behavior)
  File "/Users/serialx/workspace/flintrock/flintrock/ec2.py", line 680, in _create_instances
    'EbsOptimized': ebs_optimized})['SpotInstanceRequests']
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/botocore/client.py", line 278, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/botocore/client.py", line 572, in _make_api_call
    raise ClientError(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InvalidParameterValue) when calling the RequestSpotInstances operation: Value (arn:aws:iam::527973214447:instance-profile/zeppelin-role) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name

I guess we need to feed it IAM instance profile names. :(

@nchammas (Owner, Author) commented:

Thanks for investigating, @serialx; it looks like you're right. The only way to get the profile's name is with the heavy-handed pull and filter. I've reported a couple of issues against Boto3 to see if there is a way this can be fixed at some point:

I've pushed a new commit that fixes this, based on your patch above. Take a look and let me know if you have any other concerns. Thanks again for finding and reporting this issue.
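
For reference, the lookup boils down to something like this (a sketch based on your patch; the merged commit may differ in its details):

```python
import boto3

def instance_profile_name_from_id(instance_profile_id: str) -> str:
    # RequestSpotInstances only accepts iamInstanceProfile.name, but the
    # master's instance only exposes the profile's Id and Arn, so list all
    # profiles and match on the Id.
    iam = boto3.resource('iam')
    for profile in iam.instance_profiles.all():
        if profile.instance_profile_id == instance_profile_id:
            return profile.instance_profile_name
    raise Exception("No instance profile found with ID: " + instance_profile_id)
```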

@nchammas merged commit ecd8782 into master on Aug 14, 2016
@nchammas deleted the resize-cluster branch on Aug 14, 2016, 18:02