Timeout when trying to bootstrap or ssh spot request instances #372

Closed
jeremywadsack opened this issue Jun 29, 2017 · 7 comments

@jeremywadsack

I'm running into an issue where I can launch an instance and configure it without a problem. If I try to do the same with a spot request instance it hangs and eventually times out.

      def service
        @service ||= Fog::Compute.new(
            provider:              "AWS",
            aws_access_key_id:     Settings.amazon.access_key_id,
            aws_secret_access_key: Settings.amazon.secret_access_key,
            region:                Settings.parser.region
        )
      end

      def launch(name)
        options = {
            :name                      => name,
            :tags                      => {"Name": name},
            :flavor_id                 => Settings.parser.instance_type,
            :image_id                  => Settings.parser.image,
            :private_key_path          => File.join(Rails.root, "secrets", "parser"),
            :public_key_path           => File.join(Rails.root, "secrets", "parser.pub"),
            :username                  => "ec2-user",
            :groups                    => ["quick-start-1"],
            :availability_zone         => Settings.parser.zone,
            :iam_instance_profile_name => "parser-role",
            :block_device_mapping      => [
                {
                    "DeviceName"              => "/dev/xvda",
                    "Ebs.VolumeSize"          => 40,
                    "Ebs.DeleteOnTermination" => true,
                    "Ebs.VolumeType"          => "gp2"
                }
            ]
        }

        server = service.spot_requests.bootstrap(options.merge(
            :price => "0.03",
            :type  => "one-time",
        ))
      end

Calling launch(...) raises Fog::Errors::TimeoutError from the bootstrap call.

~/.rvm/gems/ruby-2.3.1@log-processing/gems/fog-core-1.44.3/lib/fog/core/wait_for.rb:9:in `block in wait_for': The specified wait_for timeout (600 seconds) was exceeded (Fog::Errors::TimeoutError)
	from ~/.rvm/gems/ruby-2.3.1@log-processing/gems/fog-core-1.44.3/lib/fog/core/wait_for.rb:6:in `loop'
	from ~/.rvm/gems/ruby-2.3.1@log-processing/gems/fog-core-1.44.3/lib/fog/core/wait_for.rb:6:in `wait_for'
	from ~/.rvm/gems/ruby-2.3.1@log-processing/gems/fog-core-1.44.3/lib/fog/core/model.rb:74:in `wait_for'
	from ~/.rvm/gems/ruby-2.3.1@log-processing/gems/fog-aws-1.4.0/lib/fog/aws/models/compute/server.rb:219:in `setup'
	from ~/.rvm/gems/ruby-2.3.1@log-processing/gems/fog-aws-1.4.0/lib/fog/aws/models/compute/spot_requests.rb:71:in `bootstrap'

The problem appears to be that it's waiting for sshable? to return true, which never happens, so it times out.

I dug into that method and logged the errors, and it appears to time out on the connection:

54.215.70.245 #<Errno::ECONNREFUSED: Connection refused - connect(2) for 54.215.70.245:22>
54.215.70.245 #<Errno::ECONNREFUSED: Connection refused - connect(2) for 54.215.70.245:22>
54.215.70.245 #<Errno::ECONNREFUSED: Connection refused - connect(2) for 54.215.70.245:22>
54.215.70.245 #<Errno::ECONNREFUSED: Connection refused - connect(2) for 54.215.70.245:22>
54.215.70.245 #<Timeout::Error: execution expired>
54.215.70.245 #<Timeout::Error: execution expired>
54.215.70.245 #<Timeout::Error: execution expired>
54.215.70.245 #<Timeout::Error: execution expired>
54.215.70.245 #<Timeout::Error: execution expired>
54.215.70.245 #<Timeout::Error: execution expired>
54.215.70.245 #<Timeout::Error: execution expired>
54.215.70.245 #<Timeout::Error: execution expired>
54.215.70.245 #<Timeout::Error: execution expired>
54.215.70.245 #<Timeout::Error: execution expired>
54.215.70.245 #<Timeout::Error: execution expired>

While it's retrying, AWS shows the instance as ready and I'm able to ssh into the instance using the credentials.

If I change this to servers.bootstrap then it works just fine, even though the sshable? call is the same for both.

I'm running fog-aws 1.4.0.

I'm new to spot instances so it's possible that I've configured something wrong, but I'm not sure what I need to change. Looking for guidance or anything I can do to help resolve this or help provide a fix.

@geemus
Member

geemus commented Jul 6, 2017

It may just be a timing issue in terms of how/when things come up on AWS itself. I'm afraid I don't personally have much experience with spot instances though. You might consider decomposing the bootstrap method and adding a larger delay before it tries to check sshable? and/or a longer timeout, that might be enough to get you what you want. I'm certainly happy to discuss more if you have questions. Thanks!
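
A minimal sketch of that suggestion, assuming bootstrap has already been decomposed so the SSH check can be called separately. The `wait_until` helper, its parameters, and the `WaitTimeout` class are hypothetical names for illustration and are not part of fog; the values in the usage note are placeholders, not recommendations.

```ruby
# Hypothetical error class for this sketch (not part of fog).
class WaitTimeout < StandardError; end

def monotonic_now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Polls the block until it returns a truthy value.
# initial_delay: seconds to sleep before the first check
# timeout:       overall budget in seconds, counted after the initial delay
# interval:      seconds to sleep between checks
def wait_until(timeout:, interval: 1, initial_delay: 0)
  sleep(initial_delay) if initial_delay > 0
  deadline = monotonic_now + timeout
  loop do
    return true if yield
    raise WaitTimeout, "condition not met within #{timeout}s" if monotonic_now >= deadline
    sleep(interval)
  end
end
```

With a server object in hand, this could be used as `wait_until(timeout: 900, interval: 5, initial_delay: 30) { server.sshable? }`. Note that this only lengthens the outer wait; it does not change any timeout applied inside the check itself.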

@jeremywadsack
Author

@geemus Thanks for the suggestion. I had already decomposed #bootstrap for other reasons, but I dug into #sshable? and found that if I change the Timeout parameter in that method in fog-core to 11, then it works fine. If it's 10 or less, it hangs until the enclosing 600s timeout fails.

I can open a PR with that change, but I suspect the few extra seconds needed are likely to differ across cloud providers, instance types, zones, and other factors. Would it make sense to use an exponential backoff for that call?
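
To illustrate the behavior described above, here is a small self-contained sketch (not fog code) of why retrying does not help when the per-attempt timeout is fixed and shorter than the time the connection needs. `slow_handshake`, `attempt`, and `retry_fixed` are hypothetical names, and the durations are scaled down from seconds to fractions of a second so the example runs quickly.

```ruby
require "timeout"

# Hypothetical stand-in for an SSH handshake that takes `seconds` to complete.
def slow_handshake(seconds)
  sleep(seconds)
  true
end

# One attempt under a fixed per-attempt timeout; returns false when it expires.
def attempt(per_attempt_timeout, handshake_seconds)
  Timeout.timeout(per_attempt_timeout) { slow_handshake(handshake_seconds) }
rescue Timeout::Error
  false
end

# Retries with the same timeout every time, as a polling loop would.
def retry_fixed(per_attempt_timeout, handshake_seconds, tries)
  Array.new(tries) { attempt(per_attempt_timeout, handshake_seconds) }
end

# Timeout shorter than the handshake: every attempt is cut off.
p retry_fixed(0.05, 0.3, 3)
# Timeout longer than the handshake: the first attempt succeeds.
p retry_fixed(1.0, 0.3, 1)
```

Each attempt starts the handshake from scratch, so no number of retries can succeed until the per-attempt timeout itself grows.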

@geemus
Member

geemus commented Jul 12, 2017

@jeremywadsack thanks for taking the time to dig in. I imagine I set it at 10 because it was easy and seemed good enough at the time. I would tend to agree that it is brittle though (especially across providers). Exponential backoff sounds good, as long as we can keep the code for it from getting too complex. Would you be up for taking a pass at that? Thanks!

@jeremywadsack
Author

jeremywadsack commented Jul 12, 2017 via email

@geemus
Member

geemus commented Jul 12, 2017

Sure, sounds good.

jeremywadsack added a commit to keylimetoolbox/fog-core that referenced this issue Jul 19, 2017
… not respond in 8s

Issue fog/fog-aws#372

For AWS Spot Requests, instances never complete the `#setup` process,
eventually timing out through the `#wait_for` block, because the
default 8s is apparently not long enough to get the ssh connection.
Through testing, 11s seemed to work, but this is likely highly variable
across regions, instance types, images, and providers.

This change implements a 1.5x increase in timeout each time `#sshable?`
is called, capped at 60s. If a successful connection is made, then
the timeout is reset to the initial value.
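
A rough sketch of the idea in this commit message, written independently of the actual fog-core change. The `GrowingTimeout` class and its method names are hypothetical; the 8s start, 1.5 factor, and 60s cap come from the message above. This sketch assumes the timeout grows only after an attempt times out, and it lets any other exception (such as a refused connection) propagate to the caller.

```ruby
require "timeout"

# Hypothetical helper for illustration; not the fog-core implementation.
class GrowingTimeout
  attr_reader :current

  def initialize(initial: 8, factor: 1.5, cap: 60)
    @initial = initial
    @factor  = factor
    @cap     = cap
    @current = initial
  end

  # Runs the block under the current timeout.
  # On success: resets the timeout to its initial value and returns true.
  # On timeout: multiplies the timeout by the factor (up to the cap)
  # for the next call and returns false.
  def attempt
    Timeout.timeout(@current) { yield }
    @current = @initial
    true
  rescue Timeout::Error
    @current = [@current * @factor, @cap].min
    false
  end
end
```

With the defaults, successive timed-out attempts would use 8, 12, 18, 27, 40.5, and then 60 seconds, so a connection that needs about 11s would succeed on the second attempt.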
@jeremywadsack
Author

@geemus Opened a PR with a solution for fixing that. Let me know if this makes sense, any concerns, etc.

@geemus
Member

geemus commented Jul 25, 2017

@jeremywadsack thanks, it's looking pretty good (sorry for the delay, I was on vacation last week). I'm going to go ahead and close this before I forget/lose it. I'm pretty confident we'll get your PR (or something quite like it) in soon. Thanks!

@geemus geemus closed this as completed Jul 25, 2017