Add add-slaves and remove-slaves commands #115

Merged: nchammas merged 22 commits into master from resize-cluster on Aug 14, 2016

Conversation

@nchammas (Owner) commented May 7, 2016

This PR adds add-slaves and remove-slaves commands, which enable the user to resize existing clusters. add-slaves will query the master for its configuration and automatically use that information to provision new slaves. The only configuration the user may want to specify is the spot price for the new slaves.
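
As an illustrative sketch (not the exact code in this PR), the new slaves' launch settings are derived from the master roughly like this, assuming a boto3 `ec2.Instance` for the master:

```python
# Illustrative sketch only (not the exact code in this PR): build launch
# parameters for new slaves from the master's own EC2 attributes, so that
# added slaves match the existing cluster. The real implementation also
# covers settings like EBS options and shutdown behavior.
import boto3

ec2 = boto3.resource('ec2')

def slave_launch_config(master_instance) -> dict:
    return {
        'ImageId': master_instance.image_id,
        'InstanceType': master_instance.instance_type,
        'KeyName': master_instance.key_name,
        'SecurityGroupIds': [sg['GroupId'] for sg in master_instance.security_groups],
        'SubnetId': master_instance.subnet_id,
    }

# e.g. slave_launch_config(ec2.Instance('i-0123456789abcdef0'))  # hypothetical instance ID
```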

The PR also fixes a number of bugs, including a surprising config problem I discovered (0c9a24b; how did it not stand out before? 🤔) that may have been causing sporadic problems. These bugs were interfering with my ability to test this PR, so I thought it would be good to fix them as part of this work.

TODO:

Open questions:

  • Does the Spark master need to be restarted
    • when slaves are removed from a running cluster? No.
    • when slaves are added to a running cluster? No.
  • Does the HDFS master need to be restarted
    • when slaves are removed from a running cluster? No?

      HDFS seems to work properly, though warnings about the dead slaves are thrown when you add new files. hdfs dfsadmin -report still shows the removed slaves, and hdfs dfsadmin -refreshNodes doesn't seem to remove them from that report. I'm not sure if this is OK. (See the decommissioning sketch just after this list.)

    • when slaves are added to a running cluster? No.
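
For reference, the way -refreshNodes is normally made to drop a DataNode is to decommission it via the excludes file first. A rough sketch of that flow (the excludes-file path is illustrative, and this is not something Flintrock does today):

```python
# Standard HDFS decommissioning flow, sketched for reference only. The
# excludes-file path is an assumption: it's whatever dfs.hosts.exclude
# points to in hdfs-site.xml on the master.
import subprocess

def decommission_datanode(hostname, excludes_file='/path/to/dfs.exclude'):
    # 1. List the host in the excludes file.
    with open(excludes_file, 'a') as f:
        f.write(hostname + '\n')
    # 2. Ask the NameNode to re-read its include/exclude lists. The DataNode
    #    then shows up as decommissioning/decommissioned in `hdfs dfsadmin -report`.
    subprocess.run(['hdfs', 'dfsadmin', '-refreshNodes'], check=True)
```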

Fixes #16.
Fixes #113.
Fixes #140.

@nchammas force-pushed the resize-cluster branch 3 times, most recently from 319f464 to 9c41184 on May 28, 2016
@nchammas referenced this pull request in devsisters/flintrock on Jul 24, 2016
@nchammas (Owner, Author) commented:

Not sure why the Travis build is failing. Am investigating here: pyinstaller/pyinstaller#2123

@nchammas (Owner, Author) commented Aug 7, 2016

I am ready to merge this PR in.

Pinging @ereed-tesla, @serialx, and @soyeonbaek-dev, since I know you are interested in this feature.

Do you have any feedback or questions about this before I merge it in? Especially regarding some of the open questions noted above.

@nchammas changed the title from "[WIP] Add add-slaves and remove-slaves commands" to "Add add-slaves and remove-slaves commands" on Aug 7, 2016
@serialx (Contributor) commented Aug 8, 2016

First, hooray for adding cluster resize! A long-awaited feature indeed! 😃

For those who don't know: we have been maintaining our own cluster-resize patches for Flintrock and have been using them for over two weeks. We mostly use Spark without HDFS, so our experience with dynamic cluster sizes is limited to Spark. We don't restart the master, and both adding and removing slaves seem to work fine. Since Spark is fault tolerant by design, it handles the changing cluster size well.

About removing slaves: the code currently just removes N slaves from the list. I think it would be better to have some kind of prioritisation, like removing spot instances first. You can see @soyeonbaek-dev's code for one approach.

@nchammas (Owner, Author) commented Aug 8, 2016

About removing slaves. ... I think it would be better to have some kind of prioritisation, like removing spot instances first.

This is a good suggestion. If you have a mix of spot and on-demand instances (which this PR now makes possible), you probably do want to remove the spot instances first.

I'll add this to the PR.
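
Roughly, something like this (a minimal sketch assuming boto3 `ec2.Instance` objects for the slaves; not necessarily the exact code that will land in the PR):

```python
def pick_slaves_to_remove(slave_instances, num_slaves):
    # Prefer removing spot instances: they report a non-empty
    # spot_instance_request_id, while on-demand instances report None,
    # so sorting on "is None" puts spot instances first.
    prioritized = sorted(
        slave_instances,
        key=lambda instance: instance.spot_instance_request_id is None)
    return prioritized[:num_slaves]
```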

@nchammas (Owner, Author) commented:

OK, that's done. I'll leave this open for a few more days and then merge it in if there's no more feedback.

@serialx (Contributor) commented Aug 11, 2016

Testing this branch revealed a bug related to configuring EC2 instance profiles. When I use the --ec2-instance-profile-name setting to create a cluster, the add-slaves command fails with the following error:

Traceback (most recent call last):
  File "/usr/local/bin/flintrock", line 9, in <module>
    load_entry_point('Flintrock==0.6.0.dev0', 'console_scripts', 'flintrock')()
  File "/usr/local/lib/python3.5/site-packages/flintrock/flintrock.py", line 1031, in main
    cli(obj={})
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/flintrock/flintrock.py", line 653, in add_slaves
    **provider_options)
  File "/usr/local/lib/python3.5/site-packages/flintrock/ec2.py", line 46, in wrapper
    res = func(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/flintrock/ec2.py", line 272, in add_slaves
    instance_initiated_shutdown_behavior=instance_initiated_shutdown_behavior)
  File "/usr/local/lib/python3.5/site-packages/flintrock/ec2.py", line 677, in _create_instances
    'EbsOptimized': ebs_optimized})['SpotInstanceRequests']
  File "/usr/local/lib/python3.5/site-packages/botocore/client.py", line 278, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.5/site-packages/botocore/client.py", line 572, in _make_api_call
    raise ClientError(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InvalidParameterValue) when calling the RequestSpotInstances operation: Value (AIPAJZTBVWIAVNULB3U6W) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name

I was able to fix this bug with a simple patch, but I don't really like how it works. Perhaps you have a better solution. :)

diff --git a/flintrock/ec2.py b/flintrock/ec2.py
index 0f3dd1a..630d8fa 100644
--- a/flintrock/ec2.py
+++ b/flintrock/ec2.py
@@ -249,7 +249,10 @@ class EC2Cluster(FlintrockCluster):
         if not self.master_instance.iam_instance_profile:
             instance_profile_name = ''
         else:
-            instance_profile_name = self.master_instance.iam_instance_profile['Id']
+            iam = boto3.resource('iam')
+            instance_profile_id = self.master_instance.iam_instance_profile['Id']
+            profiles = filter(lambda x: x.instance_profile_id == instance_profile_id, iam.instance_profiles.all())
+            instance_profile_name = list(profiles)[0].instance_profile_name
         instance_initiated_shutdown_behavior = response['InstanceInitiatedShutdownBehavior']['Value']

         self.add_slaves_check()

Edit: Oops! Wrong traceback. Changed it to the relevant one!

@nchammas (Owner, Author) commented:

Hmm, so the IAM profile ID is different from the IAM profile name? I presume iam_instance_profile['Arn'] is also not what we want?

I'll take a closer look at this tonight or tomorrow.

@serialx (Contributor) commented Aug 12, 2016

I tried the Arn approach. It also failed:

Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 170, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/serialx/workspace/flintrock/flintrock/__main__.py", line 8, in <module>
    sys.exit(main())
  File "/Users/serialx/workspace/flintrock/flintrock/flintrock.py", line 1031, in main
    cli(obj={})
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/serialx/workspace/flintrock/flintrock/flintrock.py", line 653, in add_slaves
    **provider_options)
  File "/Users/serialx/workspace/flintrock/flintrock/ec2.py", line 46, in wrapper
    res = func(*args, **kwargs)
  File "/Users/serialx/workspace/flintrock/flintrock/ec2.py", line 275, in add_slaves
    instance_initiated_shutdown_behavior=instance_initiated_shutdown_behavior)
  File "/Users/serialx/workspace/flintrock/flintrock/ec2.py", line 680, in _create_instances
    'EbsOptimized': ebs_optimized})['SpotInstanceRequests']
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/botocore/client.py", line 278, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/botocore/client.py", line 572, in _make_api_call
    raise ClientError(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InvalidParameterValue) when calling the RequestSpotInstances operation: Value (arn:aws:iam::527973214447:instance-profile/zeppelin-role) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name

I guess we need to feed it IAM instance profile names. :(

@nchammas (Owner, Author) commented:

Thanks for investigating, @serialx; it looks like you're right. The only way to get the profile's name is with the heavy-handed pull and filter. I've reported a couple of issues against Boto3 to see if there is a way this can be fixed at some point:

I've pushed a new commit that fixes this, based on your patch above. Take a look and let me know if you have any other concerns. Thanks again for finding and reporting this issue.
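
For reference, the lookup boils down to something like this (a sketch based on your patch; the merged commit may differ in its details):

```python
import boto3

def instance_profile_name_from_id(instance_profile_id: str) -> str:
    # RequestSpotInstances only accepts iamInstanceProfile.name, but the
    # master's instance only exposes the profile's Id and Arn, so list all
    # profiles and match on the Id.
    iam = boto3.resource('iam')
    for profile in iam.instance_profiles.all():
        if profile.instance_profile_id == instance_profile_id:
            return profile.instance_profile_name
    raise Exception("No instance profile found with ID: " + instance_profile_id)
```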

@nchammas merged commit ecd8782 into master on Aug 14, 2016
@nchammas deleted the resize-cluster branch on Aug 14, 2016, 18:02