
Wait improvements #161

Merged · 1 commit merged into master on Jul 11, 2021
Conversation

@DavidGOrtega (Contributor) commented Jul 9, 2021

@DavidGOrtega requested a review from @0x2b3bfa0 on July 9, 2021 at 09:49

@0x2b3bfa0 (Member) left a comment

Reducing the connection timeout from 10 to 2 seconds in the call to runCommand may not have the desired effect. It will only limit the timeout of each connection, but not the delay between tries, which is being automatically handled by Terraform's exponential backoff algorithm. If you want to reduce the latter, we would need to use the raw WaitForStateContext instead. Is this change worth the maintenance overhead? I don't think people will notice the 10 second difference.
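For reference, here is a minimal sketch of what controlling the retry cadence directly with WaitForStateContext could look like, assuming the SDK's helper/resource package; the state names and refresh callback are placeholders, not existing provider code:

import (
	"context"
	"time"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource"
)

// waitForRunner is a hypothetical helper: instead of relying on the default
// exponential backoff, it polls at a fixed interval until the refresh
// callback reports the target state or the overall timeout expires.
func waitForRunner(ctx context.Context, check resource.StateRefreshFunc) error {
	conf := &resource.StateChangeConf{
		Pending:      []string{"unreachable"},
		Target:       []string{"reachable"},
		Refresh:      check,            // polled on every attempt
		Timeout:      20 * time.Second, // overall budget before giving up
		Delay:        2 * time.Second,  // wait before the first poll
		PollInterval: 2 * time.Second,  // fixed delay between polls, no backoff
	}
	_, err := conf.WaitForStateContext(ctx)
	return err
}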

In my opinion, the full instance log is not completely useless. It will only show in case of failure, and it has helped us in the past to diagnose quite a few difficult issues. Restricting the log output to the cml unit when running on raw machines (as opposed to orchestrators) would leave us without any easy way of retrieving the whole low-level logs in case of failure. Addressing your concerns from #154, you can always search for cml in the whole logs to get the runner-specific lines.

Comment on lines +214 to +224
switch cloud := d.Get("cloud").(string); cloud {
case "kubernetes":
logEvents, logError = resourceMachineLogs(ctx, d, m)
default:
logEvents, logError = utils.RunCommand("journalctl --unit cml --no-pager",
2*time.Second,
net.JoinHostPort(d.Get("instance_ip").(string), "22"),
"ubuntu",
d.Get("ssh_private").(string))
}

@0x2b3bfa0 (Member)

I'd prefer to keep a separate resourceMachineLogs function for each provider, even if some of them share the same code. The proposed changes are introducing two additional calls for non-SSH vendors, and it doesn't feel quite elegant, in my opinion.

@DavidGOrtega (Contributor, Author)

IMHO having duplicate code is not elegant at all

@DavidGOrtega (Contributor, Author)

And of course we have the resource machine, which can contain intermediate business logic.

@DavidGOrtega (Contributor, Author)

The proposed changes are introducing two additional calls for non-SSH vendors

Which two calls?

@0x2b3bfa0 (Member) commented Jul 9, 2021

IMHO having duplicate code is not elegant at all

Agreed! Nevertheless, in this case we're just duplicating a line of code for the sake of having the same methods on every provider. We aren't copying and pasting the utils.RunCommand function, just a single-line call. 🙃 The current pattern would be much cleaner after #147.

And of course we have the resource machine, which can contain intermediate business logic.

Intermediate business logic is a pretty general term here. While 66% of our supported vendors (75% very soon) are going to rely on SSH for log gathering, that percentage will considerably decrease as soon as we start adding more container orchestrators. Should we count a specific log-gathering mechanism as intermediate business logic just because it's the same for a couple of providers?

@0x2b3bfa0 (Member)

Which two calls?

For orchestrators and other non-SSH providers, we would be calling the following two blocks sequentially, hitting two identical switch statements (hello, code duplication) and making two calls: one to resourceMachineLogs and the other to provider.ResourceMachineLogs:

switch cloud := d.Get("cloud").(string); cloud {
case "kubernetes":
logEvents, logError = resourceMachineLogs(ctx, d, m)
default:
logEvents, logError = utils.RunCommand("journalctl --unit cml --no-pager",
2*time.Second,
net.JoinHostPort(d.Get("instance_ip").(string), "22"),
"ubuntu",
d.Get("ssh_private").(string))
}

func resourceMachineLogs(ctx context.Context, d *schema.ResourceData, m interface{}) (string, error) {
switch cloud := d.Get("cloud").(string); cloud {
case "kubernetes":
return kubernetes.ResourceMachineLogs(ctx, d, m)
default:
return utils.RunCommand("journalctl --no-pager",
2*time.Second,
net.JoinHostPort(d.Get("instance_ip").(string), "22"),
"ubuntu",
d.Get("ssh_private").(string))
}
}

@DavidGOrtega (Contributor, Author)

In my opinion, the full instance log is not completely useless. It will only show in case of failure, and it has helped us in the past to diagnose quite a few difficult issues. Restricting the log output to the cml unit when running on raw machines (as opposed to orchestrators) would leave us without any easy way of retrieving the whole low-level logs in case of failure. Addressing your concerns from #154, you can always search for cml in the whole logs to get the runner-specific lines.

It's not restricting the machine logs, it's restricting the runner logs, and to be honest I do not find the current logs useful at all.

Reducing the connection timeout from 10 to 2 seconds in the call to runCommand may not have the desired effect. It will only limit the timeout of each connection, but not the delay between tries, which is being automatically handled by Terraform's exponential backoff algorithm. If you want to reduce the latter, we would need to use the raw WaitForStateContext instead. Is this change worth the maintenance overhead? I don't think people will notice the 10 second difference.

I think it's totally correct. Your approach is to try to connect via SSH with a 10-second timeout. I prefer several short attempts, because the runner will be destroyed after roughly 20 seconds, so the 10-second approach gets to try at most twice. The problem is that the SSH connection timeout does not seem to be effective at actually getting a connection established within those 10 seconds.

A batch of 60 runs with both methods shows empirically that this method always succeeded, while the other failed multiple times, showing an error at line 14 of utils.js (executing terraform).

@0x2b3bfa0 (Member)

I think it's totally correct. Your approach is to try to connect via SSH with a 10-second timeout. I prefer several short attempts, because the runner will be destroyed after roughly 20 seconds, so the 10-second approach gets to try at most twice. The problem is that the SSH connection timeout does not seem to be effective at actually getting a connection established within those 10 seconds.

A batch of 60 runs with both methods shows empirically that this method always succeeded, while the other failed multiple times, showing an error at line 14 of utils.js (executing terraform).

If the old 10-second timeout had a lower success rate than the proposed 2-second timeout, it's definitely worth the change. As long as it has time to establish the connection and retrieve a few thousand lines in the worst case, everything should be fine.

@DavidGOrtega (Contributor, Author)

As long as it has time to establish the connection and retrieve a few thousand lines in the worst case,

The timeout is not for execution; the timeout is for the connection.

@DavidGOrtega (Contributor, Author)

Also, regarding the logs: I do not need more logs than what I would see if I were launching a runner locally on my machine (non-cloud).

@0x2b3bfa0 (Member) commented Jul 9, 2021

The timeout is not for execution; the timeout is for the connection.

True! I thought that we were accounting for both, but I only specified the ClientConfig one. 🤦🏼
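For context, a minimal sketch of how that connection-only timeout behaves with golang.org/x/crypto/ssh; host and signer here are placeholders, and the only point is ClientConfig.Timeout:

import (
	"net"
	"time"

	"golang.org/x/crypto/ssh"
)

// dialRunner is a hypothetical helper: ClientConfig.Timeout only bounds how
// long establishing the TCP connection may take; once connected, running the
// remote command can take as long as it needs.
func dialRunner(host string, signer ssh.Signer) (*ssh.Client, error) {
	config := &ssh.ClientConfig{
		User:            "ubuntu",
		Auth:            []ssh.AuthMethod{ssh.PublicKeys(signer)},
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
		Timeout:         2 * time.Second, // connection timeout, not execution timeout
	}
	return ssh.Dial("tcp", net.JoinHostPort(host, "22"), config)
}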

@DavidGOrtega (Contributor, Author)

Indeed, it makes no sense to apply it to everything. This way we can connect within 2 seconds and still spend roughly 18 seconds retrieving logs.

@0x2b3bfa0 (Member) commented Jul 9, 2021

It's not restricting the machine logs, it's restricting the runner logs [...]

Indeed, you're restricting the resource_cml_runner logs, not the resource_machine logs. Incidentally, the former are the only logs that we ever print on error; the latter won't be accessible by any means. 🙊

[...] and to be honest I do not find the current logs useful at all.
Also, regarding the logs: I do not need more logs than what I would see if I were launching a runner locally on my machine (non-cloud).

I still disagree on this:

  • When launching a runner locally on your machine, you're in control of your environment and know whether your system is working correctly and how to debug it.

  • When launching a runner in an ephemeral cloud instance that gets terminated on error, it would make sense to have a way of retrieving the full logs before they get lost. I don't care if it's hidden behind a debug environment variable, but there should be a way to do that, lest we spend hours of trial-and-error debugging machine-level issues.

@DavidGOrtega (Contributor, Author)

the latter won't be accessible by any means.

Exactly what we need. We only need to provide errors from the cloud (at creation) and from the runner. The rest is the user's responsibility (as the vendors do). Anyway, I'm not the only one who thinks that our logs are too much.

@DavidGOrtega (Contributor, Author) commented Jul 9, 2021

The logs for the plain machine can be displayed inside the machine resource code. We can open a ticket for that, and that's the reason I have not removed the method, which is indeed useful.

@0x2b3bfa0 (Member)

Exactly what we need. We only need to provide errors from the cloud (at creation) and from the runner. The rest is the user's responsibility (as the vendors do).

Unlike vendors, we offer custom machine images that fail from time to time and produce machine-level logs. Are we supposed to ignore errors produced at the machine level and make users responsible for them? I see some cases where that wouldn't be convenient; see #118 for an example.

Anyway, I'm not the only one who thinks that our logs are too much.

If there are more people who think the same and they have GitHub accounts, it would be nice to bring them into this conversation so we can get richer context.

The logs for the plain machine can be displayed inside the machine resource code.

Given that the resource_machine gets completed as soon as the machine is up and running, we wouldn't be able to print any meaningful log before Terraform exits.

We can open a ticket for that, and that's the reason I have not removed the method, which is indeed useful.

Agreed! What if we always print the full logs for resource_cml_runner machines to the Terraform debug output (not visible by default) and the runner-specific logs to the standard output just in case of error?
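A minimal sketch of that split, assuming a hypothetical helper inside the runner resource; the function name and error wording are illustrative, not existing code:

import (
	"log"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
)

// reportLogs is a hypothetical helper: the full machine log always goes to
// Terraform's debug stream (hidden unless TF_LOG=DEBUG is set), while the
// runner-specific journalctl output is surfaced only when the runner fails.
func reportLogs(fullLog, runnerLog string, failed bool) diag.Diagnostics {
	log.Printf("[DEBUG] full instance log:\n%s", fullLog)

	if failed {
		return diag.Errorf("runner failed; journalctl --unit cml output:\n%s", runnerLog)
	}
	return nil
}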

@DavidGOrtega (Contributor, Author)

Agreed! What if we always print the full logs for resource_cml_runner machines to the Terraform debug output (not visible by default) and the runner-specific logs to the standard output just in case of error?

Sounds good! Can you please open a ticket? We need to sort this one out because, as I said, it's failing the deployment sometimes.

@0x2b3bfa0 (Member)

Sounds good! Can you please open a ticket? We need to sort this one out because, as I said, it's failing the deployment sometimes.

Better yet, can you open a separate pull request for the SSH connection timeout and leave this for the log along with #154?

@DavidGOrtega (Contributor, Author)

Better yet, can you open a separate pull request for the SSH connection timeout and leave this for the log along with #154?

No, I won't say it's better.
That enormous amount of logs is not useful enough for the runner to justify it.

@DavidGOrtega (Contributor, Author)

open #163

@DavidGOrtega (Contributor, Author)

@0x2b3bfa0 we need to merge this

 Jul 09 14:31:16 ip-172-31-11-183 systemd[5759]: Listening on GnuPG
│ cryptographic agent and passphrase cache (access for web browsers).
│ Jul 09 14:31:16 ip-172-31-11-183 systemd[5759]: Reached target Sockets.
│ Jul 09 14:31:16 ip-172-31-11-183 systemd[5759]: Reached target Basic
│ System.
│ Jul 09 14:31:16 ip-172-31-11-183 systemd[5759]: Reached target Default.
│ Jul 09 14:31:16 ip-172-31-11-183 systemd[5759]: Startup finished in 19ms.
│ Jul 09 14:31:16 ip-172-31-11-183 systemd[1]: Started User Manager for UID
│ 1000.
│ 
╵
    at /usr/lib/node_modules/@dvcorg/cml/src/utils.js:14:27
    at ChildProcess.exithandler (child_process.js:315:5)
    at ChildProcess.emit (events.js:314:20)
    at maybeClose (internal/child_process.js:1022:16)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:287:5)
Cleaning up file based variables
00:00
ERROR: Job failed: exit code 1

@0x2b3bfa0 (Member) commented Jul 9, 2021

we need to merge this

Then, let's merge it if we need to. I'm not giving an explicit high-level approval and definitely would prefer to separate #153 from #154, but yes, it works, and availability goes first. :shipit:

@0x2b3bfa0 (Member)

@DavidGOrtega, I'm curious to see the full log you quote on #161 (comment). Can you please attach it?

@0x2b3bfa0 added the labels bug (Something isn't working), cloud-common (Applies to every cloud vendor), and resource-runner (iterative_runner TF resource) on Jul 9, 2021
@DavidGOrtega merged commit 5faf23d into master on Jul 11, 2021
@DavidGOrtega deleted the wait-improvement branch on July 11, 2021 at 18:20
Successfully merging this pull request may close these issues:

  • The error log is so long that is completely useless
  • Reduce ssh heartbeat time