Wait improvements #161
Conversation
DavidGOrtega commented on Jul 9, 2021 (edited)
- Fixes #154 (“The error log is so long that is completely useless”): only display the service logs.
- Fixes #153 (“Reduce ssh heartbeat time”): reduce the connection timeout from 10 to 2 seconds.
Reducing the connection timeout from 10 to 2 seconds in the call to runCommand may not have the desired effect. It will only limit the timeout of each connection, but not the delay between tries, which is being automatically handled by Terraform's exponential backoff algorithm. If you want to reduce the latter, we would need to use the raw WaitForStateContext instead. Is this change worth the maintenance overhead? I don't think people will notice the 10 second difference.
In my opinion, the full instance log is not completely useless. It only shows in case of failure, and it has helped us in the past to diagnose quite a few difficult issues. Restricting the log output to the cml unit when running on raw machines (as opposed to orchestrators) would leave us without any easy way of retrieving the whole low-level logs in case of failure. Addressing your concerns from #154, you can always search for cml in the whole logs to get the runner-specific lines.
switch cloud := d.Get("cloud").(string); cloud {
case "kubernetes":
    logEvents, logError = resourceMachineLogs(ctx, d, m)
default:
    logEvents, logError = utils.RunCommand("journalctl --unit cml --no-pager",
        2*time.Second,
        net.JoinHostPort(d.Get("instance_ip").(string), "22"),
        "ubuntu",
        d.Get("ssh_private").(string))
}
I'd prefer to keep a separate resourceMachineLogs function for each provider, even if some of them share the same code. The proposed changes are introducing two additional calls for non-SSH vendors, and it doesn't feel quite elegant, in my opinion.
IMHO having duplicate code is not elegant at all
And of course we have the machine resource, which can have intermediate business logic
“The proposed changes are introducing two additional calls for non-SSH vendors”
Which two calls?
“IMHO having duplicate code is not elegant at all”

Agreed! Nevertheless, in this case we're just duplicating a line of code for the sake of having the same methods on every provider. We aren't copying and pasting the utils.RunCommand function itself, just a single-line call. 🙃 The current pattern would be much cleaner after #147.

“And of course we have the machine resource, which can have intermediate business logic”

Intermediate business logic is a pretty general term here. While 66% of our supported vendors — or 75% very soon — are going to rely on SSH for log gathering, that percentage will decrease considerably as soon as we start adding more container orchestrators. Should we count a specific log gathering mechanism as intermediate business logic just because it's the same for a couple of providers?
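For illustration, a rough sketch of the per-provider pattern being argued for here. The package layout, helper name, and the utils.RunCommand signature are assumptions based on the snippets quoted in this thread, not the provider's actual API; each SSH-backed vendor would duplicate only the single-line call:

package aws

import (
    "context"
    "net"
    "time"

    "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"

    "terraform-provider-iterative/iterative/utils"
)

// ResourceMachineLogs is the duplicated one-liner; a hypothetical azure or gcp
// package would carry an identical function, while kubernetes keeps its own
// implementation.
func ResourceMachineLogs(ctx context.Context, d *schema.ResourceData, m interface{}) (string, error) {
    return sshServiceLogs(d)
}

// sshServiceLogs holds the actual SSH/journalctl logic. In practice it would
// live once in a shared package; for this sketch it sits alongside the caller.
func sshServiceLogs(d *schema.ResourceData) (string, error) {
    return utils.RunCommand("journalctl --no-pager",
        2*time.Second,
        net.JoinHostPort(d.Get("instance_ip").(string), "22"),
        "ubuntu",
        d.Get("ssh_private").(string))
}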
“Which two calls?”

For orchestrators and other non-SSH providers, we would be calling the following two blocks sequentially, hitting two identical switch statements (hello, code duplication) and making two calls: one to resourceMachineLogs and the other to provider.ResourceMachineLogs:
terraform-provider-iterative/iterative/resource_runner.go, lines 214 to 223 at 05aa3a3:
switch cloud := d.Get("cloud").(string); cloud {
case "kubernetes":
    logEvents, logError = resourceMachineLogs(ctx, d, m)
default:
    logEvents, logError = utils.RunCommand("journalctl --unit cml --no-pager",
        2*time.Second,
        net.JoinHostPort(d.Get("instance_ip").(string), "22"),
        "ubuntu",
        d.Get("ssh_private").(string))
}
terraform-provider-iterative/iterative/resource_machine.go, lines 238 to 249 at 05aa3a3:
func resourceMachineLogs(ctx context.Context, d *schema.ResourceData, m interface{}) (string, error) {
    switch cloud := d.Get("cloud").(string); cloud {
    case "kubernetes":
        return kubernetes.ResourceMachineLogs(ctx, d, m)
    default:
        return utils.RunCommand("journalctl --no-pager",
            2*time.Second,
            net.JoinHostPort(d.Get("instance_ip").(string), "22"),
            "ubuntu",
            d.Get("ssh_private").(string))
    }
}
It's not restricting the machine logs, it's restricting the runner logs, and to be honest I do not find the current logs useful at all.
I think it's totally correct. Your approach is to try to connect via SSH with a 10-second timeout. I prefer to make several short attempts, because the runner will be destroyed after roughly 20 seconds, so the 10-second approach gets at most two tries. The problem is that the SSH connection timeout does not seem to be effective in terms of being able to connect within those 10 seconds. A batch of 60 runs with both methods shows empirically that this method always succeeded, while the other failed multiple times, showing an error at line 14 of utils.js (executing terraform).
If the old 10 second timeout had a lower success rate than the proposed 2 second timeout, it's definitely worth the change. As long as it has time to establish the connection and retrieve a few thousand lines in the worst case, everything should be fine.
The timeout is not for execution; the timeout is for the connection.
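To make that distinction concrete, here is a minimal sketch assuming an SSH client built on golang.org/x/crypto/ssh (which may or may not be what utils.RunCommand uses under the hood): the Timeout field only bounds establishing the connection, while the command itself runs without a deadline.

package sketch

import (
    "time"

    "golang.org/x/crypto/ssh"
)

// serviceLogs is a sketch only, not the provider's actual helper. It shows
// that ssh.ClientConfig.Timeout bounds dialing, while the command started on
// the session can run for as long as it needs unless a deadline is added.
func serviceLogs(address, user, privateKey string) (string, error) {
    signer, err := ssh.ParsePrivateKey([]byte(privateKey))
    if err != nil {
        return "", err
    }

    config := &ssh.ClientConfig{
        User:            user,
        Auth:            []ssh.AuthMethod{ssh.PublicKeys(signer)},
        HostKeyCallback: ssh.InsecureIgnoreHostKey(),
        Timeout:         2 * time.Second, // applies to establishing the connection only
    }

    client, err := ssh.Dial("tcp", address, config) // fails fast if the host is unreachable
    if err != nil {
        return "", err
    }
    defer client.Close()

    session, err := client.NewSession()
    if err != nil {
        return "", err
    }
    defer session.Close()

    // No timeout here: retrieving the logs can take as long as it needs.
    output, err := session.CombinedOutput("journalctl --unit cml --no-pager")
    return string(output), err
}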
Also, regarding the logs: I do not need more logs than what I would see if I were launching a runner locally on my machine (non-cloud).
True! I thought that we were accounting for both, but I only specified the connection timeout.
Indeed, it makes no sense to apply it to everything. This way we can connect within 2 seconds and still consume 18 seconds of logs.
Indeed, you're restricting the
I still disagree on this:
Exactly what we need. We only need to provide errors from the cloud (at creation) and the runner. The rest is the user's responsibility (as the vendors do). Anyway, I'm not the only one who thinks that our logs are too much.
The logs for the plain machine can be displayed inside the machine resource code. We can open a ticket for that, and that's the reason I have not removed the method, which is indeed useful.
Unlike vendors, we offer custom machine images that fail from time to time and produce machine-level logs. Are we supposed to ignore errors produced at the machine level and make users responsible for them? I see some cases where that wouldn't be convenient; see #118 for an example.
If there are more people who think the same and they have GitHub accounts, it would be nice to bring them into this conversation so we can get richer context.
Given that the
Agreed! What if we always print the full logs for
Sounds good! Can you please open a ticket? We need to sort this one out because, as I said, it's failing the deployment sometimes.
Better yet, can you open a separate pull request for the SSH connection timeout and leave this one for the logs, along with #154?
No, I won't say it's better.
open #163 |
@0x2b3bfa0 we need to merge this.
@DavidGOrtega, I'm curious to see the full log you quote on #161 (comment). Can you please attach it?