-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[tests][eventpipe] Port over Dotnet/Diagnostics to enable eventpipe tests on Android #64358
Conversation
41b89d3
to
4f4b29d
Compare
src/tests/tracing/eventpipe/common/EventPipeSessionConfiguration.cs
Outdated
Show resolved
Hide resolved
src/tests/tracing/eventpipe/common/ExposedSocketNetworkStream.cs
Outdated
Show resolved
Hide resolved
If we are going to maintain a copy of I'd suggest we create a new directory (perhaps I would also suggest that any changes specific to testing in the runtime repo that are not going to be merged back into the diagnostics repo be put in their own library that depends on the local copy of M.D.NetCore.Client. If you need access to internals of the libraries, you can use the InternalsVisibleTo property to expose them to arbitrary projects. CC @dotnet/dotnet-diag |
Why not just reference the package we publish on the dotnet-tools feed? It is a little old @hoyosjs could probably tell us why.
|
There are things we do in the tests that require us to inspect parts of the IPC protocol etc. that the library doesn't expose. We also don't want to get into a cyclic dependency where we need to update the library in dotnet/diagnostics before we can test a feature that is being added in dotnet/runtime. This leaves us room to use the internals of the library, e.g., inspecting IPC message payloads and headers, using |
The decision to keep a local version for agility was from the perspective of runtime devs. I'd say we should really get the tests to use the library now that it's matured enough. Diagnostics builds nightly at the very least, and there's no major dependencies here where I'd foresee a blocked insertion. |
Certain tests can, but some tests are doing things at a lower level than what's exposed by the library. If this is solely to light up BigEvent, EventProvider, EventSourceError, BufferSize, and similar on mobile, then we should just use the external package for those tests. Other tests should remain unchanged to make sure they're validating at the lower level. |
As previously discussed in meetings, we will need to do TCP/IP extensions to the library in order to run test on mobile platforms, we would need to extend constructors accepting TCP/IP based URI's, so instead of extending the diagnostic client in diagnostic repro, it was decided to cut the circular dependency between runtime repro and diagnostic repro and extend a local version of diagnostic client in dotnet runtime repro. Current tests are also depending on nuget reference that is an old diagnostic client, where we only have the nuget packages available but no source code. @josalem, @hoyosjs before we spend more time on this we will need to make up our minds on how to proceed (since we have been pointed in different directions). If we believe we still want to extend the diagnostic client in diagnostic repro with TCP/IP support and use that when testing, we will need to update the tests to use a new diagnostic repro builds. That means we will need support to configure TCP/IP as an alternative diagnostic protocol and since it will be accessible through a public constructor it might leak into the public (maybe we could make it protected and use reflection in dotnet runtime as an alternative). We would also need to get changes merged into the diagnostic repro and then upgrade tests in dotnet runtime to use new diagnostic client including the extended TCP/IP support. |
Is this modified version in a branch somewhere we could look at? I'm hoping to understand what changes were necessary to get the tests running. Based on the description at the top of this PR it sounded like the M.D.N.C code added here is a subset of the public API/functionality available in the official version. However if I understood that part correctly then I didn't understand why the original NuGet package would not also be workable. |
Apologies to ask for the repeat since I wasn't at the meeting, was there a plan in place for how we would release this functionality to customers, or was TCP/IP purely an internal feature for testing that customers would never have access to? |
This is currently for internal testing in dotnet runtime CI only, running EventPipe runtime tests on mobile (Android/iOS) that needs TCP/IP support in diagnostic client and should currently not be available to customers. |
Is the plan that it will eventually be available to them or they will use EventPipe on Android some other way? |
@mdh1418 can reference the changes he originally did to get diagnostic client in diagnostic repro working together with runtime tests. We need an extension to the library and current nuget package used by dotnet/runtime won't work, it is an old nuget, that we don't even have any source code hooked up to. Since we will need to enable TCP/IP support in upstream diagnostic client, merge that, upgrade dotnet/runtime to take dependency on new nuget package from diagnostic repro for tests to run on mobile in dotnet/runtime CI. @josalem suggested that we should cut the dependency between repository, put a copy of diagnostic client into dotnet/runtime repository, add the extensions needed into that copy and make dotnet/runtime CI independent of diagnostic repro in order to run the low level CoreCLR runtime tests on CI. For us, we can go either way, but if we think we should keep dependency on diagnostic repro from dotnet/runtime repro we would need to include code to enable TCP/IP in the diagnostic client, get it merged, update EventPipe tests on dotnet/runtime to use new diagnostic client (instead of old nuget that we don't have any source code for anymore). |
It depends if we think we should include TCP/IP support into all our diagnostic client tooling or not. In .net6 we only added support into dsrouter, reducing the need to do compliance work for all diagnostic client tools that continued to use named pipes or unix domain sockets only (via dsrouter in mobile scenarios). Reduced the amount of needed compliance work a lot. |
Thanks @lateralusX, now the pieces are clicking better. I can see pros and cons of both the source copy or the binary reference so I'm fine with whichever you guys decide. I agree with @josalem above that if we are going to do the source copy route then we should aim to make resyncing the repos as easy as possible - ideally it would just be a direct file copy from one repo to the other. If you want to add tcp support to the upstream version of the library that seems fine to me. I think it could be done using an internal API which could either be invoked via reflection, InternalsVisibleTo, or the client source is directly compiled into the test assembly so that there is no assembly boundary to cross. |
dotnet/diagnostics#2833 It doesn't fully work as there needs to be a way to identify Android as the running platform. I originally made it work locally by forcing the Android options when building the diagnostics repo locally, but I didnt manage to detect the Android platform as OperatingSystem used by diagnostics didnt recognize Android |
Apologies for making the plan a little murky! My main point was mostly that we should make a full copy of the diagnostics library to make maintenance easier. As for any other changes, I want to avoid the following workflow:
I want to be able to still do the following:
|
No worries and no apologies needed. Minor miscommunications happen all the time and are trivially worked out : ) |
@noahfalk @josalem @hoyosjs, so do we agree that we should continue down the path started by this PR, making a local copy of diagnostic client in dotnet/runtime repro and extend that, making it clear what is upstream and what is extension to the library in dotnet/runtime, making future sync of the code bases simple, to support mobile platforms in EventPipe runtime tests in dotnet/runtime? |
I agree, and @josalem is OOF but as best I understand he agreed too. I think both he and I were primarily worried about making the sync of source simple and we are happy as long as they stay more closely aligned. |
4f4b29d
to
cf0c067
Compare
@noahfalk @hoyosjs @lateralusX Can I have another set of 👀 to look over the changes? |
@josalem @noahfalk @hoyosjs @lateralusX There is an option of having platform-variant source code by splitting the projects into two pairs (perhaps bigevent.csproj and bigevent_legacy.csproj ?) and then marking the project with I believe our options to push this PR to the finish line are:
To me, option 1 looks the best until someone is able to test on Linux arm. |
@hoyosjs - I think you have more expertise with what is possible in our runtime tests than I. I think (1) is a fine option unless you have a better suggestion? Thanks @mdh1418 👍 I'd say give Juan 24 hours to respond, and if he is busy with other things and can't get to it then assume option 1 is good : ) |
Option 1 is ok here if we can't figure out 3. I am looking at the dumps to see what happened with 3. |
Finally managed to get enough parts for the dump: The main testing thread is here:
Which is just waiting for the reads to happen: runtime/src/tests/tracing/eventpipe/common/IpcTraceTest.cs Lines 290 to 291 in 9b58954
The faulting thread is in:
We get a SIGBUS for reading at 0xf0c04845, which is part of the ep_rt_utf16_to_utf8_string requested by eventpipe_collect_tracing_command_try_parse_config; This shuts down the PAL, and we hit the assert because there's no managed thread. This can only happen if I am not sure what caused the sigbus. [r1] and [r1]+4 are available in the memory I see that's the Sentinel provider name string.
I can't get this to reproduce locally - it repros relatively consistenty in CI though. I've requested a CI machine to try to look if it reproduces more easily there. That might be a while, so I'd say split works fine for now. I also can't explain why it only happens in Checked (I'd expect such shutdown path to happen any time, so it makes little sense unless something in checked is what's causing the SIGBUS). |
/azp run runtime-extra-platforms |
Azure Pipelines successfully started running 1 pipeline(s). |
@hoyosjs Could the signal be due to an alignment issue on that string (looks like it by just looking at the address in the callstack)? ARM architectures handle unaligned reads, but it is possible to configure them to trigger faults on unaligned data access. I have run EventPipe and dotnet-trace on ARM64 hardware without issues, and we also have pass on CI on several ARM64 configurations, maybe we are doing something on this specific ARM configuration that disable unaligned data access that in turn will trigger the fault or the specific instruction selection in this case, LDRD? Looking into the parsing rules of Looks like this has been an issue all along since we introduced the collect tracing 2 command, since the protocol fields are unaligned by design, but code would try to reuse unicode strings directly from buffer (without copy) but in the past it was probably never run on an ARM configuration that would trigger a SIGBUS on unaligned reads in the past. I believe a fix to this issue would mean making temp alloc/memcpy of the string from the buffer and then convert it to utf8 and free the temp allocated string. I also believe we don't need to perform this for all our string parsing, just the once in collect tracing 2 parsing, or it will introduce additional temporary heap allocations for all our string parsing of IPC buffers. @mdh1418 How can we leverage this PR to try out the fix? I guess we will need to get back so we have at least one test that reproduce the issue on ARM Linux, once that is done, I could push a fix towards this branch and we can see how it plays out. |
@lateralusX The projects are all duplicated, with one copy being disabled in issues.targets for Linux arm coreclr, and the other copy being disabled in issues.targets for not Linux arm coreclr. The only one not disabled in issues.targets is I just removed one of the ExcludeList Items for bigevent on Linux arm CoreCLR if you'd want to try out the fix for the underlying issue. |
Collect tracing 2 EventPipe command triggers an unaligned UTF16 string read that could cause a SIGBUS on platforms not supporting unalinged reads of UTF16 strings (so far only seen on 32-bit ARM Linux CI machine). On CoreCLR this could even cause a unalinged int read due to optimizations used in UTF8Encoding::GetByteCount.
Pushed a fix that makes sure collect trace 2 command parsing logic won't end up in reading its provider name and config filter strings from an unaligned address. This happens since collect trace 2 payload added a bool (1 byte) in its payload and the parsing code would read the configuration like it did for regular collect trace command (reusing UTF16 string directly from payload). Due to the structure of the payload in collect trace 2 we will need to make a temporary copy of the payload strings into allocated memory (ending up in a correctly aligned buffer) and then do the conversion from UTF16 to UTF8 using a correctly aligned UTF16 buffer. |
Looks like this is mainly triggered by ARM 32-bit optimization of: runtime/src/coreclr/pal/src/locale/utf8.cpp Line 2523 in 4da6b9a
On 32-bit ARM, word size is 32-bit and it looks like its probably optimize this code: runtime/src/coreclr/pal/src/locale/utf8.cpp Line 2749 in 4da6b9a
|
@mdh1418 Looks like fixing the unaligned read made CoreCLR Linux ARM test lane to pass on CI, https://github.com/dotnet/runtime/pull/64358/checks?check_run_id=5358747816. So if that fix take care of your Linux ARM issues in this PR, then maybe we should remove the workaround to special handle ARM Linux using the old IPC client, cb2cacd |
Thanks @lateralusX !! I reverted the few commits that special cased duplicated the test projects for Linux arm CoreCLR. We'll see how CI goes 🤞 |
/azp run runtime-extra-platforms |
Azure Pipelines successfully started running 1 pipeline(s). |
@@ -306,6 +314,50 @@ ipc_message_flatten ( | |||
ep_exit_error_handler (); | |||
} | |||
|
|||
static | |||
bool | |||
ipc_message_try_parse_string_utf16_t_byte_array ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lateralusX Is this mostly meant as a targeted fix? Feels odd we are only safekeeping this one case. I wonder if we can statically test for alignment on these reads or check dynamically if the platform supports it and perform copies when necessary. This also feels like a potential pain point for our portable builds (pretty much would for reads to always be aligned). The fix itself (copying) LGTM, just feels like we might be chasing bugs in this area for a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, fix was evaluating if it was the root cause of the ARM 32-bit issue seen on CoreCLR. Second I didn't want to switch to copy semantics in all cases, since it has been like that for a long time (inherited from C++ codebase) and its an optimization that can be used in most cases of current parsing logic (all except in the collect trace 2). Switching to copy semantics would also change the ownership rules of parsed UTF16 strings meaning we would need to do more adjustments in parsing logic artifacts that currently don't have ownership of the UTF16 strings. I added a runtime assert to make sure the version not doing a byte copy, ds_ipc_message_try_parse_string_utf16_t
, will enforce that the passed in buffer is property aligned, in order to catch any future unaligned use. I also looked through all existing calls to ds_ipc_message_try_parse_string_utf16_t
and they all currently parsed on aligned buffer address (except the fixed collect trace 2 parsing logic) The above function, ipc_message_try_parse_string_utf16_t_byte_array
is also made static, so only called from within that source. Doing a copy if necessary logic would even complicate the ownership rules of these strings more.
Since the alignment is based on the parsing logic implemented in functions like eventpipe_collect_tracing2_command_try_parse_payload
there is no static test, but as I said above, I added a runtime validation on debug builds, so as long as we have tests covering this (that we apparently didn't have until we updated the Diagnostic Client), we should hit assert on debug builds, if we add new payloads that could trigger the same issue in future.
The actual SIGBUS happening in this case is also rather interesting since you could argue that it is a compiler optimization bug. ARM supports unaligned reads for word and halfword instructions, but not for double word. The code does two word size reads inside UTF, that would have worked if optimizer didn't bundle them in to a LDRD instruction that will need address to be aligned, so optimization moved from a program that would have worked to one that triggers a SIGBUS.
So to sum up, current fix take care of the scenario currently hit with unaligned UTF16 string in IPC message payload. I added an assert to prevent future incorrect use of UTF16 string parsing function using unaligned addresses. Alternative would be to abandon idea to reuse UTF16 string from payload and always take copy, totally doable, but will add more changes to all current *CommandPayload structs, since they will need to switch to owning copy of UTF16 string and make sure they are freed as part of their fini/free implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now the test exclusions and everything looks sane to me. Thanks for pushing through all the oddness here :)
The failure on Thanks for all of your help! @hoyosjs @noahfalk @lateralusX |
In effort to enable diagnostics tests for mobile (specifically Android) in #60879, it was realized that some changes to
Microsoft.Diagnostics.NETCore.Client
are needed in order for the eventpipe runtime tests to work (bigevent, buffersize, eventsourceerror). However, some of the changes (including a need to update the runtime version used in Dotnet/Diagnostics to detect Android as the platform) are expected to lead to a significant amount of changes for everything to build properly. As such, we look to port over the barebone necessities from Dotnet/Diagnostics in order to get the eventpipe runtime tests working after #60879.This PR has the following changes/commits:
Ports over the complete set of files from Dotnet/Diagnostics/src/Microsoft.Diagnostics.NETCore.Client/
Modify some files for TCPIP support and wrap with a runtime defined preprocessor directive so that the changes would not surface in the upstream (Dotnet/Diagnostics/src/Microsoft.Diagnostics.NETCore.Client), disable CS1591, CS8073, and CS0162 warnings in the previously mentioned preprocessor directive, Avoids CS0108 by wrapping with
NET6_0
preprocessor directive,Removes Microsoft.Diagnostics.NETCore.Client package dependency from tests as the local files are an exact copy and avoids package version conflict.
Ports over a couple of IpcTraceTest changes from Dotnet/Diagnostics/src/eventpipe/common/IpcTraceTest.cs, including DiagnosticsClient adaptation needed for diagnostics to work on Android.
Hooks in TCP/IP support into the eventpipe tests through IpcTraceTest by leveraging the DiagnosticsClient internals made accessible to
src/tests/tracing/eventpipe/common/common.csproj
throughAssemblyInfo.cs
Disables eventpipe tests requiring subprocesses on Android as there is no current functionality to run multiple apps at the same time.
Updates the 5 eventpipe tests (bigevent, buffersize, eventsourceerror, processinfo, processinfo2) that now run on Android, and updates remaining tests (rundownvalidation, providervalidation, gcdump, debuginfo/tester) to use local Microsoft.Diagnostics.NETCore.Client source files.
Makes sure collect trace 2 command parsing logic won't end up in reading its provider name and config filter strings from an unaligned address
Testing
The diagnostics tests pass locally on Android x64, Windows x64, MacCatalyst x64.