
Fix a bug with cancelling "attach -w" after you have run a process previously #65822

Merged
2 commits merged into llvm:main on Sep 19, 2023

Conversation

jimingham
Collaborator

The problem is that when the "attach" command is initiated, the ExecutionContext for the command has a process - it's the exited one from the previous run. But the attach wait creates a new process for the attach, and then errors out instead of interrupting when it finds that its process and the one in the command's ExecutionContext don't match.

This change ensures that if we're returning a target from GetExecutionContext, we fill the context with its current process, not some historical one.
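A rough sketch of the shape of that fix (simplified and hedged: accessor names such as GetTargetSP, GetProcessSP, and SetContext are assumed here, and this is not the committed diff):

```cpp
// Simplified sketch, not the committed patch. Assumes m_overriden_exe_contexts
// is a stack of ExecutionContext and that ExecutionContext/Target expose the
// usual GetTargetSP/GetProcessSP/SetContext-style accessors.
ExecutionContext CommandInterpreter::GetExecutionContext() {
  ExecutionContext exe_ctx;
  if (!m_overriden_exe_contexts.empty()) {
    // During the course of a command, the target may have replaced the
    // process coming in with another. Fix that here:
    exe_ctx = m_overriden_exe_contexts.top();
    lldb::TargetSP target_sp = exe_ctx.GetTargetSP();
    if (target_sp && target_sp->GetProcessSP() != exe_ctx.GetProcessSP())
      exe_ctx.SetContext(target_sp->GetProcessSP());
  } else {
    exe_ctx = m_debugger.GetSelectedExecutionContext();
  }
  return exe_ctx;
}
```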

```cpp
ExecutionContext exe_ctx;
if (!m_overriden_exe_contexts.empty()) {
  // During the course of a command, the target may have replaced the process
  // coming in with another. If fix that here:
```
Collaborator

"If fix that here" -> is there a word missing?

Collaborator Author

Ah, not quite. "If" -> "I".

Member

@bulbazord bulbazord left a comment

Idea seems good to me. I have similar concerns to Adrian, but overall I think I'm on board with this idea. I'm kind of confused why we don't destroy the old Process object after it's exited, though; wouldn't properly cleaning up after ourselves solve a lot of this issue? Or is there something I don't understand about the lifecycle of Process?

```cpp
      : m_debugger.GetSelectedExecutionContext();
ExecutionContext CommandInterpreter::GetExecutionContext() {
  ExecutionContext exe_ctx;
  if (!m_overriden_exe_contexts.empty()) {
```
Member

Suggestion/Bikeshedding: Flip the condition and you can reduce one level of indentation in this method. That's just my preference though; it doesn't make a difference either way.
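For illustration, the flipped form would look roughly like this (hypothetical shape, just to show the indentation win):

```cpp
ExecutionContext CommandInterpreter::GetExecutionContext() {
  // Early return for the common case; the override-handling code below then
  // needs one less level of indentation.
  if (m_overriden_exe_contexts.empty())
    return m_debugger.GetSelectedExecutionContext();

  ExecutionContext exe_ctx = m_overriden_exe_contexts.top();
  // ... refresh the process from the target, as in the sketch above ...
  return exe_ctx;
}
```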

lldb/source/Target/Process.cpp (review thread resolved)
lldb/test/API/commands/process/attach/TestProcessAttach.py (outdated, resolved)
@jimingham
Collaborator Author

jimingham commented Sep 9, 2023

This is in reply to Alex's comment about why we keep the process around, but the "quote reply" didn't seem to work...

After a process exits you might still want to ask it what its exit status was. Plus there might be other system accounting we've gathered for that process. So until you replace it with another process, you very well might want to ask questions of it. So we can't just destroy it as soon as it exits.
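For example, through the SB API a client can still ask an exited process how it ended (a hedged sketch, not code from this patch; the surrounding debugger and target setup is assumed):

```cpp
#include <cstdio>

#include <lldb/API/LLDB.h>

// Given a target whose process has already exited, the Process object kept by
// the target can still answer questions like "what was the exit status?".
static void PrintExitInfo(lldb::SBTarget &target) {
  lldb::SBProcess process = target.GetProcess();
  if (process.IsValid() && process.GetState() == lldb::eStateExited) {
    const char *desc = process.GetExitDescription(); // may be null
    std::printf("exited with status %d%s%s\n", process.GetExitStatus(),
                desc ? ": " : "", desc ? desc : "");
  }
}
```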

@bulbazord
Member

> This is in reply to Alex's comment about why we keep the process around, but the "quote reply" didn't seem to work...
>
> After a process exits you might still want to ask it what its exit status was. Plus there might be other system accounting we've gathered for that process. So until you replace it with another process, you very well might want to ask questions of it. So we can't just destroy it as soon as it exits.

That definitely explains why we persist the previous process even after it's finished, but it doesn't sound like there's a technical limitation to actually removing the Process from the Target after it exits. To be more specific, you could either make it so the Process object in m_process_sp could be moved out and into another member of Target, or you could just persist the data you care about instead of the entire process.

That might be an alternative to this solution but I'm not sure if that would be any better or more useful than what you're doing here. I'm mostly thinking aloud here (and trying to improve my knowledge of LLDB).
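To make the second alternative concrete, a hypothetical sketch of "persist the data you care about" (the names here are illustrative, not proposed LLDB code):

```cpp
#include <string>

#include <lldb/lldb-defines.h>
#include <lldb/lldb-types.h>

// Hypothetical: instead of keeping the whole exited Process object around,
// Target could keep a small summary of it.
struct ExitedProcessInfo {
  lldb::pid_t pid = LLDB_INVALID_PROCESS_ID;
  int exit_status = -1;
  std::string exit_description;
};
```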

Collaborator

@clayborg clayborg left a comment

A different way to approach this fix is to just fix ExecutionContext so that this can never be out of date, by removing the "m_process_sp" member variable. It is not great that we can have an execution context that has an older process, and would we ever want an ExecutionContext to be able to have a mismatched process? If we remove m_process_sp, we can always get the process from m_target_sp, and then we can never get it wrong that way, and we might avoid some of the changes in this patch. Let me know what you think, as it would be good to make sure our ExecutionContext variables can't get the wrong info.

@jimingham
Collaborator Author

jimingham commented Sep 11, 2023 via email

@jimingham
Collaborator Author

> > This is in reply to Alex's comment about why we keep the process around, but the "quote reply" didn't seem to work...
> >
> > After a process exits you might still want to ask it what its exit status was. Plus there might be other system accounting we've gathered for that process. So until you replace it with another process, you very well might want to ask questions of it. So we can't just destroy it as soon as it exits.
>
> That definitely explains why we persist the previous process even after it's finished, but it doesn't sound like there's a technical limitation to actually removing the Process from the Target after it exits. To be more specific, you could either make it so the Process object in m_process_sp could be moved out and into another member of Target, or you could just persist the data you care about instead of the entire process.
>
> That might be an alternative to this solution but I'm not sure if that would be any better or more useful than what you're doing here. I'm mostly thinking aloud here (and trying to improve my knowledge of LLDB).

I'm not sure this would make things easier. The model where you just ask info of the dead process till you make another one is very straightforward. Also, the window where this is likely to cause problems is very small, and all internal to launching and attaching, so making the handling of the old process more complex to solve this problem seems like overkill.

@clayborg
Collaborator

> It seems a bit weird to have an ExecutionContext with a thread and a stack frame but not the process they belong to. I don't know that it would really help either. Even if you removed the process so that you no longer had to check for that being invalid, if a thread or frame had been set in the execution context, you'd still have to check for them being from the wrong process. So I don't think we would make things significantly simpler by leaving out the process. I think it's easier to deal with it when we hand them out. Jim

Or we can have ExecutionContext have a single element so that it is always correct:

```cpp
ExecutionContext {
  ...
  std::shared_ptr<ExecutionContextScope> m_exe_scope_sp;
```

Since Target, Process, Thread and StackFrame all inherit from ExecutionContextScope, all we need to do is store one item. The ExecutionContextScope API has:

```cpp
  virtual lldb::TargetSP CalculateTarget() = 0;
  virtual lldb::ProcessSP CalculateProcess() = 0;
  virtual lldb::ThreadSP CalculateThread() = 0;
  virtual lldb::StackFrameSP CalculateStackFrame() = 0;
```

So then everything would always be correct.

@clayborg
Collaborator

So if we use the ExecutionContextScope and init an ExecutionContext with a target, it can calculate the process and target. If you init it with a thread, it can calculate the thread, process and target. And if you init it with a frame, it can get the frame, thread, process and target.
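A rough sketch of what that single-scope version could look like (illustrative only; the real ExecutionContext does not look like this):

```cpp
#include <memory>

#include "lldb/Target/ExecutionContextScope.h"
#include "lldb/lldb-forward.h"

// Hypothetical ExecutionContext that stores only one ExecutionContextScope;
// the other objects are derived on demand, so they can never be stale.
class ExecutionContext {
public:
  explicit ExecutionContext(
      std::shared_ptr<lldb_private::ExecutionContextScope> scope_sp)
      : m_exe_scope_sp(std::move(scope_sp)) {}

  lldb::TargetSP GetTargetSP() const {
    return m_exe_scope_sp ? m_exe_scope_sp->CalculateTarget()
                          : lldb::TargetSP();
  }
  lldb::ProcessSP GetProcessSP() const {
    return m_exe_scope_sp ? m_exe_scope_sp->CalculateProcess()
                          : lldb::ProcessSP();
  }
  lldb::ThreadSP GetThreadSP() const {
    return m_exe_scope_sp ? m_exe_scope_sp->CalculateThread()
                          : lldb::ThreadSP();
  }
  lldb::StackFrameSP GetFrameSP() const {
    return m_exe_scope_sp ? m_exe_scope_sp->CalculateStackFrame()
                          : lldb::StackFrameSP();
  }

private:
  std::shared_ptr<lldb_private::ExecutionContextScope> m_exe_scope_sp;
};
```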

@jimingham
Collaborator Author

> > It seems a bit weird to have an ExecutionContext with a thread and a stack frame but not the process they belong to. I don't know that it would really help either. Even if you removed the process so that you no longer had to check for that being invalid, if a thread or frame had been set in the execution context, you'd still have to check for them being from the wrong process. So I don't think we would make things significantly simpler by leaving out the process. I think it's easier to deal with it when we hand them out. Jim
>
> Or we can have ExecutionContext have a single element so that it is always correct:
>
> ```cpp
> ExecutionContext {
>   ...
>   std::shared_ptr<ExecutionContextScope> m_exe_scope_sp;
> ```
>
> Since Target, Process, Thread and StackFrame all inherit from ExecutionContextScope, all we need to do is store one item. The ExecutionContextScope API has:
>
> ```cpp
>   virtual lldb::TargetSP CalculateTarget() = 0;
>   virtual lldb::ProcessSP CalculateProcess() = 0;
>   virtual lldb::ThreadSP CalculateThread() = 0;
>   virtual lldb::StackFrameSP CalculateStackFrame() = 0;
> ```
>
> So then everything would always be correct.

The only time this will be incorrect in this way is if you make an execution context filling in the current target & process and then hold onto it while the process in the target changes. Internally, we only hold onto execution contexts while executing a command, and we return that to ourselves in the API I fixed. So we really don't have to fix this anywhere else to solve actual problems.

This seems like a much wider ranging change than is required to fix this bug. I'm all for rethinking the ExecutionContextScope/ExecutionContext/ExecutionContextRef nexus, but I don't think that's a trivial rethink, and I don't have the time to undertake that right now.

@clayborg
Collaborator

> This seems like a much wider ranging change than is required to fix this bug. I'm all for rethinking the ExecutionContextScope/ExecutionContext/ExecutionContextRef nexus, but I don't think that's a trivial rethink, and I don't have the time to undertake that right now.

My only point in mentioning this is that many of the changed files in your patch go away if we fix ExecutionContext.

Changes look fine to me if we don't want to change ExecutionContext, but I will let others do the final accept.

@jimingham jimingham merged commit 7265f79 into llvm:main Sep 19, 2023
1 of 3 checks passed
@jimingham jimingham deleted the cancel-attach-after-run branch September 19, 2023 18:26
@mysterymath
Contributor

It looks like this is breaking buildbots:
https://lab.llvm.org/buildbot/#/builders/68/builds/60267

The test can't find main.cpp, and it looks like the immediately preceding test issues an os.chdir. It has a hook to reset it, but maybe there's some problem with it?

@jimingham
Collaborator Author

jimingham commented Sep 19, 2023 via email

DavidSpickett added a commit that referenced this pull request Sep 20, 2023
…ocess previously (#65822)"

This reverts commit 7265f79.

The new test case is flaky on Linux AArch64 (https://lab.llvm.org/buildbot/#/builders/96)
and more flaky on Windows on Arm (https://lab.llvm.org/buildbot/#/builders/219/builds/5735).
@DavidSpickett
Collaborator

This is failing on AArch64 Linux and Windows; the test looks for some async error in the output but the bots don't say what. I'll find out what it is.

```python
break

self.dbg.DispatchInputInterrupt()
self.dbg.DispatchInputInterrupt()
```
Collaborator

Why is this done twice?

if "Cancelled async attach" in line:
found_result = True
break
self.assertTrue(found_result, "Found async error in results")
Collaborator

This fails if results is empty.

```
********************************
[]
********************************
FAIL: LLDB (/home/david.spickett/build-llvm-aarch64/bin/clang-aarch64) :: test_run_then_attach_wait_interrupt (TestProcessAttach.ProcessAttachTestCase)
<bound method SBProcess.Kill of SBProcess: pid = 0, state = exited, threads = 0, executable = ProcessAttach>: success

Restore dir to: /home/david.spickett/build-llvm-aarch64
======================================================================
FAIL: test_run_then_attach_wait_interrupt (TestProcessAttach.ProcessAttachTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/david.spickett/llvm-project/lldb/test/API/commands/process/attach/TestProcessAttach.py", line 195, in test_run_then_attach_wait_interrupt
    self.assertTrue(found_result, "Found async error in results")
AssertionError: False is not True : Found async error in results
Config=aarch64-/home/david.spickett/build-llvm-aarch64/bin/clang
----------------------------------------------------------------------
```

When the test works the results contain:

```
********************************
['error: attach failed: Cancelled async attach.\n', '\n', '... Interrupted.\n']
********************************
```

Running it in a loop, it took ~40 runs to get a failing one.

I wonder if that is because the attach happens to finish a lot faster sometimes, so there's no time to cancel it? If that's down to the OS and machine load, I'm not sure how we'd make this predictable.

The ugly option is to say if the results are empty, pass the test and assume the other 39 runs will check for the bug.

@jimingham
Collaborator Author

jimingham commented Sep 20, 2023 via email (five replies)

@DavidSpickett
Collaborator

Thanks for looking into this. Well for now at least we know that users on Linux will only see the bug 1 in N times :)

(the bots are still red but for unrelated reasons)

6 participants