Add add-slaves and remove-slaves commands #115
Conversation
Force-pushed from 319f464 to 9c41184.
Not sure why the Travis build is failing. Am investigating here: pyinstaller/pyinstaller#2123
Force-pushed from ff41826 to 6ca22d1.
This protects against the issue described here: boto/boto3#479
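For context, the linked issue is that `instances.filter(InstanceIds=[])` matches every instance in the region rather than none, which is dangerous when the list you pass in might be empty. A minimal sketch of the kind of guard involved (illustrative names only, not the exact Flintrock code):

```python
import boto3

def terminate_instances(instance_ids):
    """Terminate only the given instances, guarding against the boto3
    quirk where filter(InstanceIds=[]) matches *all* instances."""
    if not instance_ids:
        # An empty InstanceIds filter would match every instance in the
        # region (boto/boto3#479), so bail out early instead.
        return
    ec2 = boto3.resource('ec2')
    for instance in ec2.instances.filter(InstanceIds=instance_ids):
        instance.terminate()
```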
A fix for #129?
I am ready to merge this PR in. Pinging @ereed-tesla, @serialx, and @soyeonbaek-dev, since I know you are interested in this feature. Do you have any feedback or questions about this before I merge it in? Especially regarding some of the open questions noted above.
First, hooray for adding cluster resize! A much-awaited feature indeed! 😃 For those who don't know: we have been cooking our own cluster-resize patches for Flintrock and have been using them for over two weeks. We mainly use Spark without HDFS, so our experience with dynamic cluster sizes is limited to Spark, and we don't restart the master. It seems to work fine with both adding and removing slaves; since the system itself is fault tolerant, Spark handles this well. About removing slaves: the code simply removes N slaves from the list. I think it would be better to have some kind of prioritisation, like removing spot instances first. You can see @soyeonbaek-dev's code.
This is a good suggestion. If you have a mix of spot and on-demand instances (which this PR now makes possible), you probably do want to remove the spot instances first. I'll add this to the PR.
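Something along these lines could work (a rough sketch only, not the exact code going into the PR; it assumes boto3 EC2 `Instance` objects, whose `instance_lifecycle` attribute is `'spot'` for spot instances):

```python
def select_slaves_to_remove(slave_instances, num_slaves_to_remove):
    """Pick which slaves to remove, preferring spot instances so that
    on-demand capacity is kept around longer."""
    # instance_lifecycle is 'spot' for spot instances and empty/None
    # for on-demand instances.
    spot = [i for i in slave_instances if i.instance_lifecycle == 'spot']
    on_demand = [i for i in slave_instances if i.instance_lifecycle != 'spot']
    # Remove spot instances first, then fall back to on-demand ones.
    return (spot + on_demand)[:num_slaves_to_remove]
```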
OK, that's done. I'll leave this open for a few more days and then merge it in if there's no more feedback.
Testing this branch revealed a bug related to configuring EC2 instance profiles. When I use
I was able to fix this bug using a simple patch. But I don't really like how this works. Perhaps you have a better solution to this. :)

```diff
diff --git a/flintrock/ec2.py b/flintrock/ec2.py
index 0f3dd1a..630d8fa 100644
--- a/flintrock/ec2.py
+++ b/flintrock/ec2.py
@@ -249,7 +249,10 @@ class EC2Cluster(FlintrockCluster):
         if not self.master_instance.iam_instance_profile:
             instance_profile_name = ''
         else:
-            instance_profile_name = self.master_instance.iam_instance_profile['Id']
+            iam = boto3.resource('iam')
+            instance_profile_id = self.master_instance.iam_instance_profile['Id']
+            profiles = filter(lambda x: x.instance_profile_id == instance_profile_id, iam.instance_profiles.all())
+            instance_profile_name = list(profiles)[0].instance_profile_name
         instance_initiated_shutdown_behavior = response['InstanceInitiatedShutdownBehavior']['Value']
         self.add_slaves_check()
```

Edit: Oops! Wrong traceback. Changed to the relevant one!
Hmm, so the IAM profile ID is different from the IAM profile name? I presume passing the Arn would work instead. I'll take a closer look at this tonight or tomorrow.
I tried the Arn approach. That also failed:

I guess we need to feed it IAM profile names. :(
Thanks for investigating @serialx, it looks like you're right. The only way to get the profile name is with the heavy-handed pull and filter. I've reported a couple of issues against Boto3 to see if there is a way this can be fixed at some point:

I've pushed a new commit that fixes this based on your patch above. Take a look and let me know if you have any other concerns. Thanks again for finding and reporting this issue.
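For reference, here is the workaround from the patch above pulled out into a standalone sketch (illustrative only; the committed code may be organized differently):

```python
import boto3

def instance_profile_name_from_id(instance_profile_id):
    """Map an IAM instance profile ID to its name by listing all
    profiles, since launching new slaves needs the profile *name*
    (the ID, and apparently the ARN, are not accepted)."""
    iam = boto3.resource('iam')
    for profile in iam.instance_profiles.all():
        if profile.instance_profile_id == instance_profile_id:
            return profile.instance_profile_name
    raise Exception(
        "No IAM instance profile found with ID: " + instance_profile_id)
```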
This PR adds `add-slaves` and `remove-slaves` commands, which enable the user to resize existing clusters. `add-slaves` will query the master for its configuration and automatically use that information to provision new slaves. The only configuration the user may want to specify is the spot price for the new slaves.

The PR also fixes a number of bugs, including a surprising config problem I discovered (0c9a24b; how did it not stand out before? 🤔) that may have been causing sporadic problems. These bugs were interfering with my ability to test this PR, so I thought it would be good to fix them as part of this work.
TODO:

- Implement `add-slaves`.
- Implement `remove-slaves`.
- Update README.
- Protect all calls to `instances.filter()` from empty input. (See "instances.filter(InstanceIds=[]) returns all instances, rather than an empty set" boto/boto3#479.)
- Update help text for the new commands.
- Add acceptance tests. Will defer this till later since the full test suite already takes too long, and since testing this properly is a bit difficult.
Open questions:

- Do Spark and HDFS need to be restarted when slaves are removed from a running cluster? No? HDFS seems to work properly, though warnings are thrown when you add new files due to the dead slaves. `hdfs dfsadmin -report` still shows the removed slaves, and `hdfs dfsadmin -refreshNodes` doesn't seem to remove them from the report that `dfsadmin` shows. I'm not sure if this is OK. (See the report-checking sketch after this list.)
- Do Spark and HDFS need to be restarted when slaves are added to a running cluster? No.
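As a small, purely illustrative way to sanity-check the HDFS behaviour described above (this assumes `hdfs` is on the PATH and runs on the master; it is not part of the PR):

```python
import subprocess

def live_datanode_count():
    """Parse `hdfs dfsadmin -report` and return the number of live
    datanodes, e.g. to compare against the expected slave count after
    adding or removing slaves."""
    report = subprocess.check_output(['hdfs', 'dfsadmin', '-report']).decode()
    # The report contains a line like: "Live datanodes (3):"
    for line in report.splitlines():
        if line.startswith('Live datanodes'):
            return int(line.split('(')[1].split(')')[0])
    return 0
```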
Fixes #16.
Fixes #113.
Fixes #140.