
Fix a bug with cancelling "attach -w" after you have run a process previously #65822

Merged
2 commits merged into llvm:main on Sep 19, 2023

Conversation

jimingham
Collaborator

The problem is that when the "attach" command is initiated, the ExecutionContext for the command has a process - it's the exited one from the previous run. But the attach wait creates a new process for the attach, and then errors out instead of interrupting when it finds that its process and the one in the command's ExecutionContext don't match.

This change ensures that if we're returning a target from GetExecutionContext, we fill the context with its current process, not some historical one.
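A rough sketch of the shape of that fix (simplified and hedged: accessor names such as GetTargetSP, GetProcessSP, and SetContext are assumed here, and this is not the committed diff):

```cpp
// Simplified sketch, not the committed patch. Assumes m_overriden_exe_contexts
// is a stack of ExecutionContext and that ExecutionContext/Target expose the
// usual GetTargetSP/GetProcessSP/SetContext-style accessors.
ExecutionContext CommandInterpreter::GetExecutionContext() {
  ExecutionContext exe_ctx;
  if (!m_overriden_exe_contexts.empty()) {
    // During the course of a command, the target may have replaced the
    // process coming in with another. Fix that here:
    exe_ctx = m_overriden_exe_contexts.top();
    lldb::TargetSP target_sp = exe_ctx.GetTargetSP();
    if (target_sp && target_sp->GetProcessSP() != exe_ctx.GetProcessSP())
      exe_ctx.SetContext(target_sp->GetProcessSP());
  } else {
    exe_ctx = m_debugger.GetSelectedExecutionContext();
  }
  return exe_ctx;
}
```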

```cpp
ExecutionContext exe_ctx;
if (!m_overriden_exe_contexts.empty()) {
  // During the course of a command, the target may have replaced the process
  // coming in with another. If fix that here:
```
Collaborator

"If fix that here" -> is there a word missing?

Collaborator Author

Ah, not quite. "If" -> "I".

Member

@bulbazord bulbazord left a comment

Idea seems good to me. I have similar concerns to Adrian, but overall I think I'm on board with this idea. I'm kind of confused why we don't destroy the old Process object after it's exited, though; wouldn't properly cleaning up after ourselves solve a lot of this issue? Or is there something I don't understand about the lifecycle of Process?

```cpp
      : m_debugger.GetSelectedExecutionContext();
ExecutionContext CommandInterpreter::GetExecutionContext() {
  ExecutionContext exe_ctx;
  if (!m_overriden_exe_contexts.empty()) {
```
Member

Suggestion/Bikeshedding: Flip the condition and you can reduce one level of indentation in this method. That's just my preference though; it doesn't make a difference either way.
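For illustration, the flipped form would look roughly like this (hypothetical shape, just to show the indentation win):

```cpp
ExecutionContext CommandInterpreter::GetExecutionContext() {
  // Early return for the common case; the override-handling code below then
  // needs one less level of indentation.
  if (m_overriden_exe_contexts.empty())
    return m_debugger.GetSelectedExecutionContext();

  ExecutionContext exe_ctx = m_overriden_exe_contexts.top();
  // ... refresh the process from the target, as in the sketch above ...
  return exe_ctx;
}
```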

lldb/source/Target/Process.cpp (review thread resolved)
lldb/test/API/commands/process/attach/TestProcessAttach.py (outdated, resolved)
@jimingham
Collaborator Author

jimingham commented Sep 9, 2023

This is in reply to Alex's comment about why we keep the process around, but the "quote reply" didn't seem to work...

After a process exits you might still want to ask it what its exit status was. Plus there might be other system accounting we've gathered for that process. So until you replace it with another process, you very well might want to ask questions of it. So we can't just destroy it as soon as it exits.
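For example, through the SB API a client can still ask an exited process how it ended (a hedged sketch, not code from this patch; the surrounding debugger and target setup is assumed):

```cpp
#include <cstdio>

#include <lldb/API/LLDB.h>

// Given a target whose process has already exited, the Process object kept by
// the target can still answer questions like "what was the exit status?".
static void PrintExitInfo(lldb::SBTarget &target) {
  lldb::SBProcess process = target.GetProcess();
  if (process.IsValid() && process.GetState() == lldb::eStateExited) {
    const char *desc = process.GetExitDescription(); // may be null
    std::printf("exited with status %d%s%s\n", process.GetExitStatus(),
                desc ? ": " : "", desc ? desc : "");
  }
}
```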

@bulbazord
Member

> This is in reply to Alex's comment about why we keep the process around, but the "quote reply" didn't seem to work...
>
> After a process exits you might still want to ask it what its exit status was. Plus there might be other system accounting we've gathered for that process. So until you replace it with another process, you very well might want to ask questions of it. So we can't just destroy it as soon as it exits.

That definitely explains why we persist the previous process even after it's finished, but it doesn't sound like there's a technical limitation to actually removing the Process from the Target after it exits. To be more specific, you could either make it so the Process object in m_process_sp could be moved out and into another member of Target, or you could just persist the data you care about instead of the entire process.

That might be an alternative to this solution but I'm not sure if that would be any better or more useful than what you're doing here. I'm mostly thinking aloud here (and trying to improve my knowledge of LLDB).
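To make the second alternative concrete, a hypothetical sketch of "persist the data you care about" (the names here are illustrative, not proposed LLDB code):

```cpp
#include <string>

#include <lldb/lldb-defines.h>
#include <lldb/lldb-types.h>

// Hypothetical: instead of keeping the whole exited Process object around,
// Target could keep a small summary of it.
struct ExitedProcessInfo {
  lldb::pid_t pid = LLDB_INVALID_PROCESS_ID;
  int exit_status = -1;
  std::string exit_description;
};
```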

Collaborator

@clayborg clayborg left a comment

A different way to approach this fix is to just fix ExecutionContext so that this can never be out of date, by removing the "m_process_sp" member variable. It is not great that we can have an execution context that has an older process, and would we ever want an ExecutionContext to be able to have a mismatched process? If we remove m_process_sp, we can always get the process from m_target_sp, and then we can never get it wrong that way, and we might avoid some of the changes in this patch. Let me know what you think, as it would be good to make sure our ExecutionContext variables can't get the wrong info.

@jimingham
Collaborator Author

jimingham commented Sep 11, 2023 via email

@jimingham
Collaborator Author

> > This is in reply to Alex's comment about why we keep the process around, but the "quote reply" didn't seem to work...
> >
> > After a process exits you might still want to ask it what its exit status was. Plus there might be other system accounting we've gathered for that process. So until you replace it with another process, you very well might want to ask questions of it. So we can't just destroy it as soon as it exits.
>
> That definitely explains why we persist the previous process even after it's finished, but it doesn't sound like there's a technical limitation to actually removing the Process from the Target after it exits. To be more specific, you could either make it so the Process object in m_process_sp could be moved out and into another member of Target, or you could just persist the data you care about instead of the entire process.
>
> That might be an alternative to this solution but I'm not sure if that would be any better or more useful than what you're doing here. I'm mostly thinking aloud here (and trying to improve my knowledge of LLDB).

I'm not sure this would make things easier. The model where you just ask info of the dead process till you make another one is very straightforward. Also, the window where this is likely to cause problems is very small, and all internal to launching and attaching, so making the handling of the old process more complex to solve this problem seems like overkill.

@clayborg
Collaborator

> It seems a bit weird to have an ExecutionContext with a thread and a stack frame but not the process they belong to. I don't know that it would really help either. Even if you removed the process so that you no longer had to check for that being invalid, if a thread or frame had been set in the execution context, you'd still have to check for them being from the wrong process. So I don't think we would make things significantly simpler by leaving out the process. I think it's easier to deal with it when we hand them out. Jim

Or we can have ExecutionContext have a single element so that it is always correct:

```cpp
ExecutionContext {
  ...
  std::shared_ptr<ExecutionContextScope> m_exe_scope_sp;
```

Since Target, Process, Thread and StackFrame all inherit from ExecutionContextScope, all we need to do is store one item. The ExecutionContextScope API has:

```cpp
  virtual lldb::TargetSP CalculateTarget() = 0;
  virtual lldb::ProcessSP CalculateProcess() = 0;
  virtual lldb::ThreadSP CalculateThread() = 0;
  virtual lldb::StackFrameSP CalculateStackFrame() = 0;
```

So then everything would always be correct.

@clayborg
Collaborator

So if we use the ExecutionContextScope and init an ExecutionContext with a target, it can calculate the process and target. If you init it with a thread, it can calculate the thread, process and target. And if you init it with a frame, it can get the frame, thread, process and target.
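A rough sketch of what that single-scope version could look like (illustrative only; the real ExecutionContext does not look like this):

```cpp
#include <memory>

#include "lldb/Target/ExecutionContextScope.h"
#include "lldb/lldb-forward.h"

// Hypothetical ExecutionContext that stores only one ExecutionContextScope;
// the other objects are derived on demand, so they can never be stale.
class ExecutionContext {
public:
  explicit ExecutionContext(
      std::shared_ptr<lldb_private::ExecutionContextScope> scope_sp)
      : m_exe_scope_sp(std::move(scope_sp)) {}

  lldb::TargetSP GetTargetSP() const {
    return m_exe_scope_sp ? m_exe_scope_sp->CalculateTarget()
                          : lldb::TargetSP();
  }
  lldb::ProcessSP GetProcessSP() const {
    return m_exe_scope_sp ? m_exe_scope_sp->CalculateProcess()
                          : lldb::ProcessSP();
  }
  lldb::ThreadSP GetThreadSP() const {
    return m_exe_scope_sp ? m_exe_scope_sp->CalculateThread()
                          : lldb::ThreadSP();
  }
  lldb::StackFrameSP GetFrameSP() const {
    return m_exe_scope_sp ? m_exe_scope_sp->CalculateStackFrame()
                          : lldb::StackFrameSP();
  }

private:
  std::shared_ptr<lldb_private::ExecutionContextScope> m_exe_scope_sp;
};
```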

@jimingham
Collaborator Author

> > It seems a bit weird to have an ExecutionContext with a thread and a stack frame but not the process they belong to. I don't know that it would really help either. Even if you removed the process so that you no longer had to check for that being invalid, if a thread or frame had been set in the execution context, you'd still have to check for them being from the wrong process. So I don't think we would make things significantly simpler by leaving out the process. I think it's easier to deal with it when we hand them out. Jim
>
> Or we can have ExecutionContext have a single element so that it is always correct:
>
> ```cpp
> ExecutionContext {
>   ...
>   std::shared_ptr<ExecutionContextScope> m_exe_scope_sp;
> ```
>
> Since Target, Process, Thread and StackFrame all inherit from ExecutionContextScope, all we need to do is store one item. The ExecutionContextScope API has:
>
> ```cpp
>   virtual lldb::TargetSP CalculateTarget() = 0;
>   virtual lldb::ProcessSP CalculateProcess() = 0;
>   virtual lldb::ThreadSP CalculateThread() = 0;
>   virtual lldb::StackFrameSP CalculateStackFrame() = 0;
> ```
>
> So then everything would always be correct.

The only time this will be incorrect in this way is if you make an execution context filling in the current target & process and then hold onto it while the process in the target changes. Internally, we only hold onto execution contexts while executing a command, and we return that to ourselves in the API I fixed. So we really don't have to fix this anywhere else to solve actual problems.

This seems like a much wider ranging change than is required to fix this bug. I'm all for rethinking the ExecutionContextScope/ExecutionContext/ExecutionContextRef nexus, but I don't think that's a trivial rethink, and I don't have the time to undertake that right now.

@clayborg
Collaborator

> This seems like a much wider ranging change than is required to fix this bug. I'm all for rethinking the ExecutionContextScope/ExecutionContext/ExecutionContextRef nexus, but I don't think that's a trivial rethink, and I don't have the time to undertake that right now.

My only point in mentioning this is that many of the changed files in your patch go away if we fix ExecutionContext.

Changes look fine to me if we don't want to change ExecutionContext, but I will let others do the final accept.

@jimingham jimingham merged commit 7265f79 into llvm:main Sep 19, 2023
1 of 3 checks passed
@jimingham jimingham deleted the cancel-attach-after-run branch September 19, 2023 18:26
@mysterymath
Contributor

It looks like this is breaking buildbots:
https://lab.llvm.org/buildbot/#/builders/68/builds/60267

The test can't find main.cpp, and it looks like the immediately preceding test issues an os.chdir. It has a hook to reset it, but maybe there's some problem with it?

@jimingham
Collaborator Author

jimingham commented Sep 19, 2023 via email

DavidSpickett added a commit that referenced this pull request Sep 20, 2023
…ocess previously (#65822)"

This reverts commit 7265f79.

The new test case is flaky on Linux AArch64 (https://lab.llvm.org/buildbot/#/builders/96)
and more flaky on Windows on Arm (https://lab.llvm.org/buildbot/#/builders/219/builds/5735).
@DavidSpickett
Collaborator

This is failing on AArch64 Linux and Windows; the test looks for some async error in the output but the bots don't say what. I'll find out what it is.

```python
break

self.dbg.DispatchInputInterrupt()
self.dbg.DispatchInputInterrupt()
```
Collaborator

Why is this done twice?

if "Cancelled async attach" in line:
found_result = True
break
self.assertTrue(found_result, "Found async error in results")
Collaborator

This fails if results is empty.

```
********************************
[]
********************************
FAIL: LLDB (/home/david.spickett/build-llvm-aarch64/bin/clang-aarch64) :: test_run_then_attach_wait_interrupt (TestProcessAttach.ProcessAttachTestCase)
<bound method SBProcess.Kill of SBProcess: pid = 0, state = exited, threads = 0, executable = ProcessAttach>: success

Restore dir to: /home/david.spickett/build-llvm-aarch64
======================================================================
FAIL: test_run_then_attach_wait_interrupt (TestProcessAttach.ProcessAttachTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/david.spickett/llvm-project/lldb/test/API/commands/process/attach/TestProcessAttach.py", line 195, in test_run_then_attach_wait_interrupt
    self.assertTrue(found_result, "Found async error in results")
AssertionError: False is not True : Found async error in results
Config=aarch64-/home/david.spickett/build-llvm-aarch64/bin/clang
----------------------------------------------------------------------
```

When the test works the results contain:

```
********************************
['error: attach failed: Cancelled async attach.\n', '\n', '... Interrupted.\n']
********************************
```

Running it in a loop, it took ~40 runs to get a failing one.

I wonder if that is because the attach happens to finish a lot faster sometimes, so there's no time to cancel it? If that's down to the OS and machine load, I'm not sure how we'd make this predictable.

The ugly option is to say if the results are empty, pass the test and assume the other 39 runs will check for the bug.

@jimingham
Collaborator Author

jimingham commented Sep 20, 2023 via email (five replies)

@DavidSpickett
Collaborator

Thanks for looking into this. Well for now at least we know that users on Linux will only see the bug 1 in N times :)

(the bots are still red but for unrelated reasons)

6 participants