
Wait improvements #161

Merged · 1 commit merged into master on Jul 11, 2021
Conversation

@DavidGOrtega (Contributor) commented Jul 9, 2021

@DavidGOrtega requested a review from @0x2b3bfa0 on July 9, 2021 at 09:49

@0x2b3bfa0 (Member) left a comment

Reducing the connection timeout from 10 to 2 seconds in the call to runCommand may not have the desired effect. It will only limit the timeout of each connection, but not the delay between tries, which is being automatically handled by Terraform's exponential backoff algorithm. If you want to reduce the latter, we would need to use the raw WaitForStateContext instead. Is this change worth the maintenance overhead? I don't think people will notice the 10 second difference.
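For reference, here is a minimal sketch of what controlling the retry cadence directly with WaitForStateContext could look like, assuming the SDK's helper/resource package; the state names and refresh callback are placeholders, not existing provider code:

import (
	"context"
	"time"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource"
)

// waitForRunner is a hypothetical helper: instead of relying on the default
// exponential backoff, it polls at a fixed interval until the refresh
// callback reports the target state or the overall timeout expires.
func waitForRunner(ctx context.Context, check resource.StateRefreshFunc) error {
	conf := &resource.StateChangeConf{
		Pending:      []string{"unreachable"},
		Target:       []string{"reachable"},
		Refresh:      check,            // polled on every attempt
		Timeout:      20 * time.Second, // overall budget before giving up
		Delay:        2 * time.Second,  // wait before the first poll
		PollInterval: 2 * time.Second,  // fixed delay between polls, no backoff
	}
	_, err := conf.WaitForStateContext(ctx)
	return err
}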

In my opinion, the full instance log is not completely useless. It will only show in case of failure, and it has helped us in the past to diagnose quite a few difficult issues. Restricting the log output to the cml unit when running on raw machines (as opposed to orchestrators) would leave us without any easy way of retrieving the whole low-level logs in case of failure. Addressing your concerns from #154, you can always search for cml in the whole logs to get the runner-specific lines.

Comment on lines +214 to +224
switch cloud := d.Get("cloud").(string); cloud {
case "kubernetes":
logEvents, logError = resourceMachineLogs(ctx, d, m)
default:
logEvents, logError = utils.RunCommand("journalctl --unit cml --no-pager",
2*time.Second,
net.JoinHostPort(d.Get("instance_ip").(string), "22"),
"ubuntu",
d.Get("ssh_private").(string))
}

@0x2b3bfa0 (Member)

I'd prefer to keep a separate resourceMachineLogs function for each provider, even if some of them share the same code. The proposed changes are introducing two additional calls for non-SSH vendors, and it doesn't feel quite elegant, in my opinion.

@DavidGOrtega (Contributor, Author)

IMHO having duplicate code is not elegant at all

@DavidGOrtega (Contributor, Author)

And of course we have the resource machine, which can contain intermediate business logic.

@DavidGOrtega (Contributor, Author)

The proposed changes are introducing two additional calls for non-SSH vendors

Which two calls?

@0x2b3bfa0 (Member) commented Jul 9, 2021

IMHO having duplicate code is not elegant at all

Agreed! Nevertheless, in this case we're just duplicating a line of code for the sake of having the same methods on every provider. We aren't copying and pasting the utils.RunCommand function, just a single-line call. 🙃 The current pattern would be much cleaner after #147.

And of course we have the resource machine, which can contain intermediate business logic.

Intermediate business logic is a pretty general term here. While 66% of our supported vendors (75% very soon) are going to rely on SSH for log gathering, that percentage will considerably decrease as soon as we start adding more container orchestrators. Should we count a specific log-gathering mechanism as intermediate business logic just because it's the same for a couple of providers?

@0x2b3bfa0 (Member)

Which two calls?

For orchestrators and other non-SSH providers, we would be calling the following two blocks sequentially, hitting two identical switch statements (hello, code duplication) and making two calls: one to resourceMachineLogs and the other to provider.ResourceMachineLogs:

switch cloud := d.Get("cloud").(string); cloud {
case "kubernetes":
logEvents, logError = resourceMachineLogs(ctx, d, m)
default:
logEvents, logError = utils.RunCommand("journalctl --unit cml --no-pager",
2*time.Second,
net.JoinHostPort(d.Get("instance_ip").(string), "22"),
"ubuntu",
d.Get("ssh_private").(string))
}

func resourceMachineLogs(ctx context.Context, d *schema.ResourceData, m interface{}) (string, error) {
switch cloud := d.Get("cloud").(string); cloud {
case "kubernetes":
return kubernetes.ResourceMachineLogs(ctx, d, m)
default:
return utils.RunCommand("journalctl --no-pager",
2*time.Second,
net.JoinHostPort(d.Get("instance_ip").(string), "22"),
"ubuntu",
d.Get("ssh_private").(string))
}
}

@DavidGOrtega (Contributor, Author)

In my opinion, the full instance log is not completely useless. It will only show in case of failure, and it has helped us in the past to diagnose quite a few difficult issues. Restricting the log output to the cml unit when running on raw machines (as opposed to orchestrators) would leave us without any easy way of retrieving the whole low-level logs in case of failure. Addressing your concerns from #154, you can always search for cml in the whole logs to get the runner-specific lines.

It's not restricting the machine logs, it's restricting the runner logs, and to be honest I do not find the current logs useful at all.

Reducing the connection timeout from 10 to 2 seconds in the call to runCommand may not have the desired effect. It will only limit the timeout of each connection, but not the delay between tries, which is being automatically handled by Terraform's exponential backoff algorithm. If you want to reduce the latter, we would need to use the raw WaitForStateContext instead. Is this change worth the maintenance overhead? I don't think people will notice the 10 second difference.

I think it's totally correct. Your approach is to try to connect via SSH with a 10-second timeout. I prefer several short attempts, because the runner will be destroyed after roughly 20 seconds, so the 10-second approach gets to try at most twice. The problem is that the SSH connection timeout does not seem to be effective at actually getting a connection established within those 10 seconds.

A batch of 60 runs with both methods shows empirically that this method always succeeded, while the other failed multiple times, showing an error at line 14 of utils.js (executing terraform).

@0x2b3bfa0 (Member)

I think it's totally correct. Your approach is to try to connect via SSH with a 10-second timeout. I prefer several short attempts, because the runner will be destroyed after roughly 20 seconds, so the 10-second approach gets to try at most twice. The problem is that the SSH connection timeout does not seem to be effective at actually getting a connection established within those 10 seconds.

A batch of 60 runs with both methods shows empirically that this method always succeeded, while the other failed multiple times, showing an error at line 14 of utils.js (executing terraform).

If the old 10-second timeout had a lower success rate than the proposed 2-second timeout, it's definitely worth the change. As long as it has time to establish the connection and retrieve a few thousand lines in the worst case, everything should be fine.

@DavidGOrtega (Contributor, Author)

As long as it has time to establish the connection and retrieve a few thousand lines in the worst case,

The timeout is not for execution; the timeout is for the connection.

@DavidGOrtega (Contributor, Author)

Also, regarding the logs: I do not need more logs than what I would see if I were launching a runner locally on my machine (non-cloud).

@0x2b3bfa0 (Member) commented Jul 9, 2021

The timeout is not for execution; the timeout is for the connection.

True! I thought that we were accounting for both, but I only specified the ClientConfig one. 🤦🏼
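For context, a minimal sketch of how that connection-only timeout behaves with golang.org/x/crypto/ssh; host and signer here are placeholders, and the only point is ClientConfig.Timeout:

import (
	"net"
	"time"

	"golang.org/x/crypto/ssh"
)

// dialRunner is a hypothetical helper: ClientConfig.Timeout only bounds how
// long establishing the TCP connection may take; once connected, running the
// remote command can take as long as it needs.
func dialRunner(host string, signer ssh.Signer) (*ssh.Client, error) {
	config := &ssh.ClientConfig{
		User:            "ubuntu",
		Auth:            []ssh.AuthMethod{ssh.PublicKeys(signer)},
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
		Timeout:         2 * time.Second, // connection timeout, not execution timeout
	}
	return ssh.Dial("tcp", net.JoinHostPort(host, "22"), config)
}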

@DavidGOrtega (Contributor, Author)

Indeed, it makes no sense to apply it to everything. This way we can connect within 2 seconds and still spend roughly 18 seconds retrieving logs.

@0x2b3bfa0 (Member) commented Jul 9, 2021

It's not restricting the machine logs, it's restricting the runner logs [...]

Indeed, you're restricting the resource_cml_runner logs, not the resource_machine logs. Incidentally, the former are the only logs that we ever print on error; the latter won't be accessible by any means. 🙊

[...] and to be honest I do not find the current logs useful at all.
Also, regarding the logs: I do not need more logs than what I would see if I were launching a runner locally on my machine (non-cloud).

I still disagree on this:

  • When launching a runner locally on your machine, you're in control of your environment and know whether your system is working correctly and how to debug it.

  • When launching a runner in an ephemeral cloud instance that gets terminated on error, it would make sense to have a way of retrieving the full logs before they get lost. I don't care if it's hidden behind a debug environment variable, but there should be a way to do that, lest we spend hours of trial-and-error debugging machine-level issues.

@DavidGOrtega (Contributor, Author)

the latter won't be accessible by any means.

Exactly what we need. We only need to provide errors from the cloud (at creation) and from the runner. The rest is the user's responsibility (as the vendors do). Anyway, I'm not the only one who thinks that our logs are too much.

@DavidGOrtega (Contributor, Author) commented Jul 9, 2021

The logs for the plain machine can be displayed inside the machine resource code. We can open a ticket for that, and that's the reason I have not removed the method, which is indeed useful.

@0x2b3bfa0 (Member)

Exactly what we need. We only need to provide errors from the cloud (at creation) and from the runner. The rest is the user's responsibility (as the vendors do).

Unlike vendors, we offer custom machine images that fail from time to time and produce machine-level logs. Are we supposed to ignore errors produced at the machine level and make users responsible for them? I see some cases where that wouldn't be convenient; see #118 for an example.

Anyway, I'm not the only one who thinks that our logs are too much.

If there are more people who think the same and they have GitHub accounts, it would be nice to bring them into this conversation so we can get richer context.

The logs for the plain machine can be displayed inside the machine resource code.

Given that the resource_machine gets completed as soon as the machine is up and running, we wouldn't be able to print any meaningful log before Terraform exits.

We can open a ticket for that, and that's the reason I have not removed the method, which is indeed useful.

Agreed! What if we always print the full logs for resource_cml_runner machines to the Terraform debug output (not visible by default) and the runner-specific logs to the standard output just in case of error?
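A minimal sketch of that split, assuming a hypothetical helper inside the runner resource; the function name and error wording are illustrative, not existing code:

import (
	"log"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
)

// reportLogs is a hypothetical helper: the full machine log always goes to
// Terraform's debug stream (hidden unless TF_LOG=DEBUG is set), while the
// runner-specific journalctl output is surfaced only when the runner fails.
func reportLogs(fullLog, runnerLog string, failed bool) diag.Diagnostics {
	log.Printf("[DEBUG] full instance log:\n%s", fullLog)

	if failed {
		return diag.Errorf("runner failed; journalctl --unit cml output:\n%s", runnerLog)
	}
	return nil
}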

@DavidGOrtega (Contributor, Author)

Agreed! What if we always print the full logs for resource_cml_runner machines to the Terraform debug output (not visible by default) and the runner-specific logs to the standard output just in case of error?

Sounds good! Can you please open a ticket? We need to sort this one out because, as I said, it's failing the deployment sometimes.

@0x2b3bfa0 (Member)

Sounds good! Can you please open a ticket? We need to sort this one out because, as I said, it's failing the deployment sometimes.

Better yet, can you open a separate pull request for the SSH connection timeout and leave this for the log along with #154?

@DavidGOrtega (Contributor, Author)

Better yet, can you open a separate pull request for the SSH connection timeout and leave this for the log along with #154?

No, I won't say it's better.
That enormous amount of logs is not useful enough for the runner to justify it.

@DavidGOrtega (Contributor, Author)

open #163

@DavidGOrtega (Contributor, Author)

@0x2b3bfa0 we need to merge this

 Jul 09 14:31:16 ip-172-31-11-183 systemd[5759]: Listening on GnuPG
│ cryptographic agent and passphrase cache (access for web browsers).
│ Jul 09 14:31:16 ip-172-31-11-183 systemd[5759]: Reached target Sockets.
│ Jul 09 14:31:16 ip-172-31-11-183 systemd[5759]: Reached target Basic
│ System.
│ Jul 09 14:31:16 ip-172-31-11-183 systemd[5759]: Reached target Default.
│ Jul 09 14:31:16 ip-172-31-11-183 systemd[5759]: Startup finished in 19ms.
│ Jul 09 14:31:16 ip-172-31-11-183 systemd[1]: Started User Manager for UID
│ 1000.
│ 
╵
    at /usr/lib/node_modules/@dvcorg/cml/src/utils.js:14:27
    at ChildProcess.exithandler (child_process.js:315:5)
    at ChildProcess.emit (events.js:314:20)
    at maybeClose (internal/child_process.js:1022:16)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:287:5)
Cleaning up file based variables
00:00
ERROR: Job failed: exit code 1

@0x2b3bfa0 (Member) commented Jul 9, 2021

we need to merge this

Then, let's merge it if we need to. I'm not giving an explicit high-level approval and definitely would prefer to separate #153 from #154, but yes, it works, and availability goes first. :shipit:

@0x2b3bfa0 (Member)

@DavidGOrtega, I'm curious to see the full log you quote on #161 (comment). Can you please attach it?

@0x2b3bfa0 added the labels bug (Something isn't working), cloud-common (Applies to every cloud vendor), and resource-runner (iterative_runner TF resource) on Jul 9, 2021
@DavidGOrtega merged commit 5faf23d into master on Jul 11, 2021
@DavidGOrtega deleted the wait-improvement branch on July 11, 2021 at 18:20
Successfully merging this pull request may close these issues:

  • The error log is so long that is completely useless
  • Reduce ssh heartbeat time