[mono][eventpipe] Write primitive types and UTF-16 strings in little endian order #68648
Conversation
Thank you so much for this work adding big endian support in Mono EventPipe! I have a few things on my backlog to complete first, but I will review this PR next week; I just did a quick look at it and overall it looks good.
Overall it looks great! Thank you for putting effort into this, extending EventPipe to support big endian architectures. Left some comments/questions/ideas in the PR.
pack_list.append(
" success &= write_buffer((const uint8_t *)%s, %s, &buffer, &offset, &size, &fixedBuffer);" %
(paramName, size))
t = winTypeToFixedWidthType(parameter.winType)
I'm a little concerned that we will get additional overhead due to this on all Mono platforms. Hopefully the compiler will do a good job inlining write_buffer_x in the loop and hopefully it will use an intrinsic for memcpy, but we end up with many more calls that would need to be optimized. Could an option for the array be to have one write_buffer call with the array type as an argument, so you would only need to implement it on big endian platforms? An alternative is to use different generation depending on the target platform, so the below would just be a specialization for big endian platforms, while the majority keeps the existing behavior?
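To make the second option (platform-dependent generation) concrete, here is a rough sketch in C, illustrative only and not the actual generator output; the helper names and signatures are assumptions based on the snippet above:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed helper signatures, mirroring the generated calls shown above. */
bool write_buffer (const uint8_t *src, size_t len, uint8_t **buffer, size_t *offset, size_t *size, bool *fixed_buffer);
bool write_buffer_uint32_t (uint32_t value, uint8_t **buffer, size_t *offset, size_t *size, bool *fixed_buffer);

/* Sketch of what the generated code could look like for a uint32_t array parameter. */
static bool
write_uint32_array (const uint32_t *values, uint32_t count, uint8_t **buffer, size_t *offset, size_t *size, bool *fixed_buffer)
{
	bool success = true;
#if BIGENDIAN
	/* Specialization: per-element writes so each value can be byte swapped. */
	for (uint32_t i = 0; i < count; ++i)
		success &= write_buffer_uint32_t (values [i], buffer, offset, size, fixed_buffer);
#else
	/* Little endian targets keep the existing single bulk copy. */
	success &= write_buffer ((const uint8_t *)values, count * sizeof (uint32_t), buffer, offset, size, fixed_buffer);
#endif
	return success;
}
```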
Thank you very much for the detailed review :)
My intention of having only one endian-independent implementation was that it will obviously be better tested than having two. Having a look at the assembly of a release build, I see that for `EventPipeWriteEventMethodDCEndILToNativeMap` as well as `EventPipeWriteEventMonoProfilerJitDone_V1` all calls to `write_buffer_x` and `memcpy` are inlined (on x64 and s390x); however, this does not include the call to `write_buffer` from `eventpipehelpers.c`, of course. Thus, in order to keep the overhead minimal and not introduce any potential performance regression, I restored the old behavior for little-endian architectures.

Note: while having a further look at all the `write_buffer_x` functions, I changed `write_buffer_uintptr_t` slightly for the case of `sizeof (uintptr_t) == 4`.
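As a rough sketch of the kind of `write_buffer_uintptr_t` adjustment mentioned above (names and signatures assumed, not the exact patch), the value can be routed through the 32-bit helper when pointers are 4 bytes wide, so only that many bytes are swapped and written:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed little-endian write helpers for fixed-width integers. */
bool write_buffer_uint32_t (uint32_t value, uint8_t **buffer, size_t *offset, size_t *size, bool *fixed_buffer);
bool write_buffer_uint64_t (uint64_t value, uint8_t **buffer, size_t *offset, size_t *size, bool *fixed_buffer);

/* Sketch: dispatch on the native pointer width. */
static bool
write_buffer_uintptr_t (uintptr_t value, uint8_t **buffer, size_t *offset, size_t *size, bool *fixed_buffer)
{
	if (sizeof (uintptr_t) == 4)
		return write_buffer_uint32_t ((uint32_t)value, buffer, offset, size, fixed_buffer);
	return write_buffer_uint64_t ((uint64_t)value, buffer, offset, size, fixed_buffer);
}
```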
pack_list.append(
" success &= write_buffer((const uint8_t *)%s, %s, &buffer, &offset, &size, &fixedBuffer);" %
(paramName, size))
pack_list.append(" const uint8_t *valuePtr = %s;" % paramName)
I'm a little concerned that we will get additional overhead due to this on all Mono platforms. Hopefully the compiler will do a good job inlining write_buffer_x in the loop and hopefully it will use an intrinsic for memcpy, but we end up with many more calls that would need to be optimized. Could an option be to use different generation depending on the target platform, so the below would just be a specialization for big endian platforms, while the majority keeps the existing behavior?
See comment above.
src/coreclr/scripts/genEventPipe.py
Outdated
@@ -851,6 +877,9 @@ def getMonoEventPipeImplFilePrefix():
size_t *size,
bool *fixed_buffer)
{
#if BIGENDIAN
Can we utilize Mono's existing macros for this, like GUINT16_TO_LE and friends (but making EventPipe-specific versions in ep.h), or use the already defined EP_SWAPBYTES used elsewhere in this PR?
I removed `EP_SWAPBYTES` and introduced `EP_UINT16_TO_LE` and similar in `ep.h`.
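For illustration, such macros could look roughly like the following (a sketch only; as discussed further down, the actual definitions ended up in the runtime-specific headers, and the helper here is invented for the example):

```c
#include <stdint.h>

/* Helper for the sketch; the real code may use runtime-provided swap primitives instead. */
static inline uint16_t
swap16 (uint16_t value)
{
	return (uint16_t)((value >> 8) | (value << 8));
}

#if BIGENDIAN
#define EP_UINT16_TO_LE(value) (swap16 (value))
#else
#define EP_UINT16_TO_LE(value) (value)
#endif
```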
src/native/eventpipe/ep-block.c
Outdated
@@ -909,12 +898,10 @@ ep_sequence_point_block_init (
EP_SERIALIZATION_FORMAT_NETTRACE_V4) != NULL);

const ep_timestamp_t timestamp = ep_sequence_point_get_timestamp (sequence_point);
memcpy (sequence_point_block->block.write_pointer, &timestamp, sizeof (timestamp));
sequence_point_block->block.write_pointer += sizeof (timestamp);
ep_write_buffer_int64_t (&sequence_point_block->block.write_pointer, timestamp);
Just for clearer intent, maybe it would be good to have an ep_write_buffer_timestamp instead of using the underlying int64_t?
Added function `ep_write_buffer_timestamp` which serves as a wrapper for `ep_write_buffer_int64_t`.
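Roughly (a sketch with assumed signatures; `ep_timestamp_t` is an `int64_t`-based type in EventPipe), the wrapper adds no behavior and only documents intent at the call site:

```c
#include <stdint.h>

typedef int64_t ep_timestamp_t;

/* Assumed underlying helper: writes an int64_t in little endian and advances the buffer pointer. */
void ep_write_buffer_int64_t (uint8_t **buffer, int64_t value);

/* Sketch of the wrapper; it only gives timestamps a clearer name. */
static inline void
ep_write_buffer_timestamp (uint8_t **buffer, ep_timestamp_t value)
{
	ep_write_buffer_int64_t (buffer, value);
}
```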
src/native/eventpipe/ep-file.c
Outdated
ep_fast_serializer_write_buffer (fast_serializer, (const uint8_t *)&file->file_open_system_time, sizeof (file->file_open_system_time));
ep_fast_serializer_write_buffer (fast_serializer, (const uint8_t *)&file->file_open_timestamp, sizeof (file->file_open_timestamp));
ep_fast_serializer_write_buffer (fast_serializer, (const uint8_t *)&file->timestamp_frequency, sizeof (file->timestamp_frequency));
ep_fast_serializer_write_uint16_t (fast_serializer, file->file_open_system_time.year);
Maybe implement an ep_fast_serializer_write_system_time and then only serialize the individual fields on big endian platforms, keeping ep_fast_serializer_write_buffer on little endian platforms.
Done
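A sketch of what that helper could look like (the struct layout and signatures are assumptions based on the fields visible in the diff, not the exact implementation): the struct is split into fields only where the byte order actually has to change.

```c
#include <stdint.h>

/* Assumed minimal shape of the system time struct serialized by ep-file.c. */
typedef struct {
	uint16_t year, month, day_of_week, day, hour, minute, second, milliseconds;
} ep_system_time_t;

typedef struct _FastSerializer FastSerializer;
void ep_fast_serializer_write_buffer (FastSerializer *fast_serializer, const uint8_t *buffer, uint32_t buffer_len);
void ep_fast_serializer_write_uint16_t (FastSerializer *fast_serializer, uint16_t value);

/* Sketch: field-by-field writes on big endian, one bulk write on little endian. */
static void
ep_fast_serializer_write_system_time (FastSerializer *fast_serializer, const ep_system_time_t *time)
{
#if BIGENDIAN
	ep_fast_serializer_write_uint16_t (fast_serializer, time->year);
	ep_fast_serializer_write_uint16_t (fast_serializer, time->month);
	ep_fast_serializer_write_uint16_t (fast_serializer, time->day_of_week);
	ep_fast_serializer_write_uint16_t (fast_serializer, time->day);
	ep_fast_serializer_write_uint16_t (fast_serializer, time->hour);
	ep_fast_serializer_write_uint16_t (fast_serializer, time->minute);
	ep_fast_serializer_write_uint16_t (fast_serializer, time->second);
	ep_fast_serializer_write_uint16_t (fast_serializer, time->milliseconds);
#else
	ep_fast_serializer_write_buffer (fast_serializer, (const uint8_t *)time, (uint32_t)sizeof (*time));
#endif
}
```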
ep_fast_serializer_write_uint16_t (fast_serializer, file->file_open_system_time.second);
ep_fast_serializer_write_uint16_t (fast_serializer, file->file_open_system_time.milliseconds);
ep_fast_serializer_write_int64_t (fast_serializer, file->file_open_timestamp);
ep_fast_serializer_write_int64_t (fast_serializer, file->timestamp_frequency);
If you go with an ep_write_buffer_timestamp function it could be used here as well.
Just to be clear: I only used it for `file->file_open_timestamp` and left `file->timestamp_frequency` as is. Not sure whether an extra function is worthwhile for that or not?
timestamp_frequency is typed as int64_t so that is fine to serialize as int64_t.
src/native/eventpipe/ep.h
Outdated
ep_write_buffer_string_utf16_t (uint8_t **buf, const ep_char16_t *str, size_t len)
{
#if BIGENDIAN
ep_char16_t *buf_str = (ep_char16_t *)*buf;
Looks like we have similar logic spread out through the PR to do similar things, maybe we should share one common implementation handling utf16_t strings?
I see three functions used for UTF-16 string handling:

bool write_buffer_string_utf8_to_utf16_t (const ep_char8_t *value, uint8_t **buffer, size_t *offset, size_t *size, bool *fixed_buffer);
void metadata_generator_write_string_to_buffer (uint8_t *buffer, size_t buffer_len, size_t *offset, const ep_char16_t *value);
void ep_write_buffer_string_utf16_t (uint8_t **buf, const ep_char16_t *str, size_t len);

The first one does a string conversion which the latter two don't. Furthermore, all `metadata_generator_write_X_to_buffer` functions do not alter the pointer but change the offset, in contrast to `ep_write_buffer_X`, which alters the buffer pointer. Thus the API differs quite a bit, even though the body more or less only consists of `memcpy`. So I'm not sure what a single implementation should look like. Did you have anything in mind?
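For reference, a compact sketch of the pointer-advancing variant from the snippet above (the surrounding types are simplified assumptions): each code unit is swapped on big endian, and the string is copied as-is on little endian.

```c
#include <stdint.h>
#include <string.h>

typedef uint16_t ep_char16_t; /* simplified for the sketch */

/* Sketch: write len UTF-16 code units in little endian order and advance the destination pointer. */
static void
ep_write_buffer_string_utf16_t (uint8_t **buf, const ep_char16_t *str, size_t len)
{
#if BIGENDIAN
	ep_char16_t *buf_str = (ep_char16_t *)*buf;
	for (size_t i = 0; i < len; ++i)
		buf_str [i] = (ep_char16_t)((str [i] >> 8) | (str [i] << 8));
#else
	memcpy (*buf, str, len * sizeof (ep_char16_t));
#endif
	*buf += len * sizeof (ep_char16_t);
}
```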
Not really, just saw that we did string conversions in several locations, but if there is nothing to reuse between the constructions, then you can keep it as is.
I have a few thoughts on this:
- Echoing Johan's concerns about potential overhead for little-endian systems.
- The shared code paths, e.g., writing blocks, headers, metadata, etc. to the stream, should be little-endian for sure, but attempting to convert all the payloads seems like a Sisyphean task. It will probably be simpler to just allow payloads to be big endian and pay the price at the reader end to parse the payloads differently.
- To that end, we would need a mechanism for identifying to the reader that the incoming stream has big-endian payloads. This could be as simple as the reader accepting the magic NETTRACE bytes in little-endian or big-endian order and then adjusting its parsing accordingly. If we do things that way, CoreCLR would also work for big-endian archs. Let me validate CoreCLR support for big-endian archs before giving this feedback.

Lastly, out of curiosity, have you tried inspecting these big-endian files on a big-endian machine? Specifically, have you tried doing something like `dotnet-trace convert --format speedscope ...` and then opening that file elsewhere?
src/mono/mono/eventpipe/ep-rt-mono.c
Outdated
uint32_t payload_data =
#if BIGENDIAN
__builtin_bswap32 (data->payload_data);
#else
data->payload_data;
#endif
ep_write_sample_profile_event (sampling_thread, sampling_event, &adapter, &data->stack_contents, (uint8_t *)&payload_data, sizeof (payload_data));
Rather than do these payload conversions here, it may be simpler to identify a big-endian arch at the reader end and handle it there. Doing it here will require you to find every event, including user defined ones, with a payload and convert the contents to little-endian before it gets cast to a byte array. You'll either only have a few events with little-endian payloads, or you'll spend a lot more time converting everything and re-engineering all the code paths to swap bytes around.
I think this is a question of scope. If @stefan-sf-ibm knows he won't care about EventSource generated events then the current approach handles it. @stefan-sf-ibm if you do care about those events then I could imagine a few ways to deal with it:
a. Do similar endian swapping logic in the EventSource serializer and consider all events manually written via WriteEventCore as out of scope
b. Write some code that will parse event metadata to understand the schema of the payload and then endian swap all the fields (this handles the WriteEventCore case not handled in (a))
c. Make a breaking change to the file format allowing for big endian encodings and update format readers to accommodate it.
(b) adds more perf overhead, (c) probably creates a situation where tools based on sufficiently new TraceEvent support big endian data but older tools or non-TraceEvent based tools don't (I think we're up to at least three 3rd party implementations).
My inclination is to favor runtime complexity over file format complexity (favor the current PR approach or the a/b options over the c option), assuming that big endian devices remain a tiny fraction of overall usage.
I think that the current PR approach and the scope that it covers (payloads generated through https://github.com/dotnet/runtime/blob/main/src/coreclr/vm/ClrEtwAll.man and emitted by the Mono runtime) is a good start. I guess it mainly covers the initial TARGET_S390X needs; AFAIK that is the only big endian architecture currently running Mono out of dotnet/runtime (I don't think CoreCLR supports any big endian platforms). If more support for custom payloads is needed in the future for big endian platforms, I believe we could revisit this, but the current PR will at least make sure all Mono runtime emitted events are readable by existing tools.
Please see my comment above regarding the overhead for little-endian architectures.

Ideally, for the nettrace format we would have something similar to the Common Trace Format used by LTTng, where one can define the byte order globally or even per integer (probably not important for us). This would have the least overhead while taking a trace, which is the performance-critical part. My experience with EventPipe is limited to the last couple of weeks, where I basically only got the implementation running. So I'm not sure how important all this is and how it manifests in practice, e.g. skipping events under high load only because of byte swaps and their implementation side effects such as non-inlined function calls. I'm also not sure how likely it is that the current format changes and in what time frame something like that could happen. If there is anything I can do I would be happy to jump in. Anyhow, in order to gain some experience this PR might suit as a starter where all Mono events are handled. Though, in the long run I'm still kind of worried about user events.

@josalem the TraceEvent library is also on my todo list. A nettrace file created with or without this PR is not convertible on a big-endian machine; with this PR it is at least convertible on a little-endian machine. However, I just gave the sample provider a chance yesterday and traced an example application, and the trace looks broken to me, i.e., the flamegraph contains no function names. It looks like I have missed something which I need to investigate ... setting this PR to WIP.
Make sure you get the rundown events into the .nettrace files and that they are correct, since they are needed for the symbolication of callstacks in tools; they contain the mapping between the method ids in callstacks and the function names/signatures. You can look at all events included in the nettrace file in PerfView, and for some it will also display the payload, but not for all.
Force-pushed from c32ce09 to 0c7732f
src/native/eventpipe/ep.h
Outdated
@@ -8,6 +8,22 @@
#include "ep-stack-contents.h"
#include "ep-rt.h"

#if FEATURE_CORECLR
Better to put this into each runtime's ep-rt-mono|coreclr.h header; that's where we keep runtime specializations. You can look at some other macros with different implementations on each runtime to understand the pattern.
Force-pushed from 0c7732f to 3b1924f
StackBlocks: The reason for missing all functions in the flame graph was that the IPs of StackBlocks missed a byte swap. Fixed by adjusting …

UTF-16 strings: Some of the UTF-16 strings generated by … The converse, i.e., conversion from UTF-16 to UTF-8, needs some adjustment, too. Function … Currently the naming scheme differs for the UTF-8 to UTF-16 conversion, where …

DS_VAL{16,32,64}: I'm wondering whether it is fine to replace all … At least the current implementation is missing casts. For example, in here the parameter is of type …

I will do some more testing in the following days. Last but not least, I just have to say that TraceEvent is an awesome library which helped a lot in debugging the binary trace file :)
Force-pushed from 3b1924f to b4af718
As previously mentioned, I removed the macros …

uint64_t pid = DS_VAL64 (ep_rt_current_process_get_id ());

Using attribute … Besides that, I fixed a couple of missing conversions in diagnostics. Since all my local tests are successful now, I am removing the draft flag.
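As a hypothetical illustration of why missing casts matter here (the macro below is invented for this example and is not the actual diagnostics-server definition): a shift-based 64-bit swap applied to a narrower argument shifts past the operand width unless the argument is widened first.

```c
#include <stdint.h>

/* Invented, simplified swap macro: only the outermost byte pair is shown. */
#define HYPOTHETICAL_SWAP64(x) (((x) << 56) | ((x) >> 56))

uint64_t
example (uint32_t narrow_id)
{
	/* Without the cast, (x) << 56 would shift a 32-bit operand by more than its
	   width (undefined behavior) and (x) >> 56 would always be zero; casting to
	   the serialized 64-bit width first keeps the macro well defined. */
	return HYPOTHETICAL_SWAP64 ((uint64_t)narrow_id);
}
```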
Force-pushed from b4af718 to 427868d
While rebasing in order to fix some merge conflicts, new events were introduced which require string handling, e.g. …
Force-pushed from 427868d to f6be7d6
@lateralusX I did a couple of further tests and so far they all passed. Can we merge this?
@stefan-sf-ibm I looked through the changes again and I believe they look fine. @noahfalk Any objections to merging this? The initial concerns have been addressed (so little endian systems should have close to zero overhead) and it will make sure native runtime events work on big endian systems, enabling use of PerfView on nettrace files captured on IBM S390X. It will not do anything with managed events since it is currently not hooked into the serializer mechanism of those events, but based on the above discussion, at least getting the native runtime events working with existing tooling adds value for big endian platforms, justifying the extension.
That is OK, the DS namespace already has a relationship to the EP headers; what we should avoid is the other way around.
Merging this will also fix the …
No objections from me, sorry to keep you waiting, I was out sick the last couple of days.
/azp run runtime-extra-platforms
Azure Pipelines successfully started running 1 pipeline(s).
No objections from me either
Failures in runtime-extra-platforms look unrelated.
/azp run runtime
Azure Pipelines successfully started running 1 pipeline(s).
I could also try a rebase if that is fine with you. I remember that with a two-weeks-old mainline version I ran into some weird problems, which seem to be fixed in the meantime.
@stefan-sf-ibm, yes please rebase and we can see if we get a better pass rate on runtime-extra-platforms as well.
Force-pushed from f6be7d6 to 4aa5926
/azp run runtime-extra-platforms
Azure Pipelines successfully started running 1 pipeline(s).
LGTM! Thanks for adding support for big endian platforms in EventPipe native runtime events!
By convention of the EventPipe file format, every primitive type is of little-endian order. Currently there is no distinction of the host endianness, which means that while firing events primitive types are written in big-endian order on s390x. This, of course, renders the generated nettrace file unreadable, e.g. by tools like PerfView (especially if those files are read on little-endian machines). This patch fixes this by always writing primitive types in little-endian order, which involves byte swaps on big-endian machines. In order to keep a good debug experience those byte swaps are done late.
The event functions generated by genEventPipe.py already utilize specialized serialization functions for primitive types, e.g. `write_buffer_uint32_t`, which have been extended with byte swaps accordingly. In order to deal with events which contain arrays and (arrays of) structures, the generated functions have been adapted as follows: arrays are iterated and each element is applied to a specialized function, e.g. …; similarly, for structures each element is written by a specialized function, e.g. …

Following the idea of genEventPipe.py, where for each primitive type `T` a specialized serialization function `write_buffer_T` exists, this patch introduces the same for the block and fast serializer. For the former, new functions of the form `ep_write_buffer_T` have been introduced, and for the latter, calls to `ep_fast_serializer_write_buffer` are replaced by specialized functions, e.g. `ep_fast_serializer_write_uint32_t`, where the byte swap happens.
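A minimal sketch of the `ep_write_buffer_T` pattern (signature assumed from the ep-block.c usage earlier in the thread; the swap builtin matches the one already used in this PR):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: store the value in little endian order and advance the block write pointer. */
static void
ep_write_buffer_uint32_t (uint8_t **buffer, uint32_t value)
{
#if BIGENDIAN
	value = __builtin_bswap32 (value); /* swap late, just before the bytes hit the stream */
#endif
	memcpy (*buffer, &value, sizeof (value));
	*buffer += sizeof (value);
}
```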