Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consume job_explanation we sometimes set in ansible-runner #12089

Closed
wants to merge 2 commits into from

Conversation

shanemcd
Copy link
Member

Replaces / updates #11376

@AlanCoding
Copy link
Member

Regarding the part of this that I wrote, what data would this get us?

https://github.com/ansible/ansible-runner/blob/d01304455c3c25669fba70d9983487ffdb85f1c5/ansible_runner/streaming.py#L101

Failed to JSON parse a line from transmit stream

Failed to extract private data directory on worker

These are pretty vague messages. If I got them by themselves, I wouldn't know what to do. I'm not against surfacing them to the user, indeed, I think that's a very good idea. However, this save will overwrite any job_explanation that was set in another part of the code or by another process. This case, in particular, is non-exclusive with getting additional error details. The most highly valued error details tends to get what comes from AWXReceptorJob, reading details from the work unit, or the work unit output.

Because of that, I have cold feet about doing this without stacking it on top of #11832, which fixed tests that were actively failing (we just forget because they get set to skip). If we can get this in a way that stacks on top of any other error details, then that's the ideal, that would be perfect.

@shanemcd
Copy link
Member Author

Right now jobs fail with an even more vague explanation: Job terminated due to error.

This at least tells us the last thing that went wrong.

# We call this to get the current values from the database, in case update_model was called
# within the threadpools inside of AWXReceptorJob. We use update_model instead of
# refresh_from_db here because it contains retry logic that is resilient to database failures.
self.instance = self.update_model(self.instance.pk)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping @sarabrajsingh, as he and I discussed error logs that came from post_run_hook due to not doing this.

Summarizing here...

Each thread has its own connection object. When connections are dropped, that object is stale. If it's not hitting one particular corner case, it will correct itself after 1 failed query. So post_run_hook errors, but then code after it starts to work again, so it's not easy to observe unless you check the logs.

@AlanCoding
Copy link
Member

AlanCoding commented May 3, 2022

I've modified adjacent code, so here's what I think a rebase should look like:

https://github.com/ansible/awx/compare/devel...AlanCoding:capture-runner-error?expand=1

EDIT: to resolve conflicts.

@AlanCoding
Copy link
Member

A replacement PR exists

@AlanCoding AlanCoding closed this Apr 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants