Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[mono][eventpipe] Write primitive types and UTF-16 strings in little endian order #68648

Merged
merged 2 commits into from
Jul 4, 2022

Conversation

stefan-sf-ibm
Copy link
Contributor

By convention of the EventPipe file format every primitive type is of little-endian order. Currently there is no distinction of the host endianness which means that while firing events primitive types are written in big-endian order on s390x. This, of course, renders the generated nettrace file unreadable e.g. by tools like perfview (especially if those files are read on little-endian machines). This patch fixes this by writing primitive types always in little-endian order which involves byte swaps on big-endian machines. In order to keep a good debug experience those byte swaps are done late.

The generated event functions by genEventPipe.py already utilize for primitive types specialized serialization functions as e.g. write_buffer_uint32_t which have been extended with byte swaps accordingly. In order to deal with events which contain arrays and (arrays of) structures the generated functions have been adapted as follows.

Arrays are iterated and each element is applied to a specialized function as e.g.

uint32_t EventPipeWriteEventMethodDCEndILToNativeMap(
/* ... */
    for (uint32_t i = 0; i < CountOfMapEntries; ++i) {
        success &= write_buffer_uint32_t (ILOffsets[i], &buffer, &offset, &size, &fixedBuffer);
    }
    for (uint32_t i = 0; i < CountOfMapEntries; ++i) {
        success &= write_buffer_uint32_t (NativeOffsets[i], &buffer, &offset, &size, &fixedBuffer);
    }

Similar for structures each element is written by a specialized function as e.g.

uint32_t EventPipeWriteEventMonoProfilerJitDone_V1(
/* ... */
    const uint8_t *valuePtr = Values;
    for (uint32_t i = 0; i < Count; ++i) {
        uint64_t value_uint64_t;
        uint8_t value_uint8_t;
        memcpy (&value_uint8_t, valuePtr, sizeof (value_uint8_t));
        valuePtr += sizeof (uint8_t);
        success &= write_buffer_uint8_t (value_uint8_t, &buffer, &offset, &size, &fixedBuffer);
        memcpy (&value_uint64_t, valuePtr, sizeof (value_uint64_t));
        valuePtr += sizeof (uint64_t);
        success &= write_buffer_uint64_t (value_uint64_t, &buffer, &offset, &size, &fixedBuffer);
    }

Following the idea of genEventPipe.py where for each primitive type T a specialized serialization function write_buffer_T exists this patch introduces the same for block and fast serializer. For the former new functions of the form ep_write_buffer_T have been introduced and for the latter calls to ep_fast_serializer_write_buffer are replaced by specialized functions as e.g. ep_fast_serializer_write_uint32_t where the byte swap happens.

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Apr 28, 2022
@lateralusX
Copy link
Member

lateralusX commented Apr 29, 2022

Thank you so much for this work adding big endian support in Mono EventPipe! I have a little things on my backlog to complete first, but will review this PR during next week, but just did a quick look at it and overall it looks good.

Copy link
Member

@lateralusX lateralusX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it looks great! Thank you for putting effort into this, extending EventPipe to support big endian architectures. Left some comments/questions/ideas in the PR.

pack_list.append(
" success &= write_buffer((const uint8_t *)%s, %s, &buffer, &offset, &size, &fixedBuffer);" %
(paramName, size))
t = winTypeToFixedWidthType(parameter.winType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little concerned that we will get additional overhead due to this on all Mono platforms, hopefully the compiler will do a good job inlining write_buffer_x in the loop and hopefully it will use intrinsic for memcpy, but we end up with many more calls that would need to be optimized. Could and option for the array to be one call wirte_buffer with the array type as argument and then you could only implement it on big endian platform? Alternative is to use different generation depending on target platform, so the below will just be a specialization for big endian platform, where the majority will keep existing behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for the detailed review :)

My intention of having only one endian-independent implementation was that this will be better tested than having two obviously. Having a look at the assembly of a release build I see that for EventPipeWriteEventMethodDCEndILToNativeMap as well as EventPipeWriteEventMonoProfilerJitDone_V1 all calls to write_buffer_x and memcpy are inlined (on x64 and s390x), however, this does not include the call to write_buffer from eventpipehelpers.c of course. Thus, in order to keep the overhead minimal and not introduce any potential performance regression I restored the old behavior for little-endian architectures.

Note, while having a further look at all write_buffer_x functions I changed write_buffer_uintptr_t slightly in case of sizeof (uintptr_t) == 4.

pack_list.append(
" success &= write_buffer((const uint8_t *)%s, %s, &buffer, &offset, &size, &fixedBuffer);" %
(paramName, size))
pack_list.append(" const uint8_t *valuePtr = %s;" % paramName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little concerned that we will get additional overhead due to this on all Mono platforms, hopefully the compiler will do a good job inlining write_buffer_x in the loop and hopefully it will use intrinsic for memcpy, but we end up with many more calls that would need to be optimized. Could and option be to use different generation depending on target platform, so the below will just be a specialization for big endian platform, where the majority will keep existing behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above.

@@ -851,6 +877,9 @@ def getMonoEventPipeImplFilePrefix():
size_t *size,
bool *fixed_buffer)
{
#if BIGENDIAN
Copy link
Member

@lateralusX lateralusX May 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we utilize Mono existing macros for this, like GUINT16_TO_LE and friends (but making EventPipe specifics in ep.h) or use already defined EP_SWAPBYTES used elsewhere in this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed EP_SWAPBYTES and introduced EP_UINT16_TO_LE and similar in ep.h.

@@ -909,12 +898,10 @@ ep_sequence_point_block_init (
EP_SERIALIZATION_FORMAT_NETTRACE_V4) != NULL);

const ep_timestamp_t timestamp = ep_sequence_point_get_timestamp (sequence_point);
memcpy (sequence_point_block->block.write_pointer, &timestamp, sizeof (timestamp));
sequence_point_block->block.write_pointer += sizeof (timestamp);
ep_write_buffer_int64_t (&sequence_point_block->block.write_pointer, timestamp);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for more clear intent, maybe it would be good to have a ep_write_buffer_timestamp instead of using underlying int64_t?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added function ep_write_buffer_timestamp which serves as a wrapper for ep_write_buffer_int64_t.

ep_fast_serializer_write_buffer (fast_serializer, (const uint8_t *)&file->file_open_system_time, sizeof (file->file_open_system_time));
ep_fast_serializer_write_buffer (fast_serializer, (const uint8_t *)&file->file_open_timestamp, sizeof (file->file_open_timestamp));
ep_fast_serializer_write_buffer (fast_serializer, (const uint8_t *)&file->timestamp_frequency, sizeof (file->timestamp_frequency));
ep_fast_serializer_write_uint16_t (fast_serializer, file->file_open_system_time.year);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe implement a ep_fast_serializer_write_system_time and then only do serialize of individual fields on big endian platform and keep ep_fast_serializer_write_buffer on little endian platforms.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

ep_fast_serializer_write_uint16_t (fast_serializer, file->file_open_system_time.second);
ep_fast_serializer_write_uint16_t (fast_serializer, file->file_open_system_time.milliseconds);
ep_fast_serializer_write_int64_t (fast_serializer, file->file_open_timestamp);
ep_fast_serializer_write_int64_t (fast_serializer, file->timestamp_frequency);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you go with a ep_write_buffer_timestamp function it could be used here as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear. I only used it for file->file_open_timestamp and left file->timestamp_frequency as is. Not sure whether an extra function is worthwhile for that or not?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timestamp_frequency is typed as int64_t so that is fine to serialize as int64_t.

ep_write_buffer_string_utf16_t (uint8_t **buf, const ep_char16_t *str, size_t len)
{
#if BIGENDIAN
ep_char16_t *buf_str = (ep_char16_t *)*buf;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we have similar logic spread out through the PR to do similar things, maybe we should share one common implementation handling utf16_t strings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see three functions used for UTF-16 string handling:

bool write_buffer_string_utf8_to_utf16_t (const ep_char8_t *value, uint8_t **buffer, size_t *offset, size_t *size, bool *fixed_buffer);
void metadata_generator_write_string_to_buffer (uint8_t *buffer, size_t buffer_len, size_t *offset, const ep_char16_t *value);
void ep_write_buffer_string_utf16_t (uint8_t **buf, const ep_char16_t *str, size_t len);

The first one does a string conversion which the latter two don't. Furthermore, all metadata_generator_write_X_to_buffer do not alter the pointer but change the offset in contrast to ep_write_buffer_X which alter the buffer pointer. Thus the API differs quite a bit. Despite that the body more or less only consists of memcpy. So I'm not sure how one implementation should look like. Did you have anything in mind?

Copy link
Member

@lateralusX lateralusX May 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, just saw that we did string conversions in several locations, but if there is nothing to reuse between the constructions, then you can keep it as is.

@lateralusX
Copy link
Member

//CC @josalem, @noahfalk

Copy link
Contributor

@josalem josalem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few thoughts on this:

  • Echoing Johan's concerns about potential overhead for little-endian systems.
  • The shared code paths, e.g., writing blocks, headers, metadata, etc. to the stream, should be little-endian for sure, but attempting to convert all the payloads seems like a Sisyphean task. It will probably be simpler to just allow payloads to be big endian and pay the price at the reader end to parse the payloads differently.
  • To that end, we would need a mechanism for identifying to the reader that the incoming stream was big-endian payloads. This could be as simple as the reader accepting the magic NETTRACE bytes in little-endian or big-endian order and then adjusting its parsing accordingly.
  • If we do things that way, CoreCLR would also getwork for big-endian archs. Let me validate CoreCLR support for big-endian archs before giving this feedback.

Lastly, out of curiosity, have you tried inspecting these big-endian files on a big-endian machine? Specifically, have you tried doing something like dotnet-trace convert --fornat speedscope ... and then opening that file elsewhere?

Comment on lines 2779 to 2780
uint32_t payload_data =
#if BIGENDIAN
__builtin_bswap32 (data->payload_data);
#else
data->payload_data;
#endif
ep_write_sample_profile_event (sampling_thread, sampling_event, &adapter, &data->stack_contents, (uint8_t *)&payload_data, sizeof (payload_data));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than do these payload conversions here, it may be simpler to identify a big-endian arch at the reader end and handle it there. Doing it here will require you to find every event, including user defined ones, with a payload and convert the contents to little-endian before it gets cast to a byte array. You'll either only have a few events with little-endian payloads, or you'll spend a lot more time converting everything and re-engineering all the code paths to swap bytes around.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a question of scope. If @stefan-sf-ibm knows he won't care about EventSource generated events then the current approach handles it. @stefan-sf-ibm if you do care about those events then I could imagine a few ways to deal with it:
a. Do similar endian swapping loging in EventSource serializer and consider all events manually written via WriteEventCore as out of scope
b. Write some code that will parse event metadata to understand the schema of the payload and then endian swap all the fields (this handles the WriteEventCore case not handled in (a)
c. Make a breaking change to the file format allowing for big endian encodings and update format readers to accomodate it.

(b) adds more perf overhead, (c) probably creates a situation where tools based on sufficiently new TraceEvent support big endian data but older tools or non-TraceEvent based tools don't (I think we're up to at least three 3rd party implementations).

My inclination is to favor runtime complexity over file format complexity (favor the current PR approach or a/b options over the c option) assuming that big endian devices remain a tiny fraction of overall usage.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the current PR approach and the scope that it covers, payloads generated through https://github.com/dotnet/runtime/blob/main/src/coreclr/vm/ClrEtwAll.man and emitted by Mono runtime, is a good start, I guess it mainly covers the initial TARGET_S390X needs, AFIK, that is the only big endian architecture currently running Mono out of dotnet/runtime (don't think CoreCLR supports any big endian platforms). If more support for custom payloads are needed in the future for big endian platforms, I believe we could revisit this, but current PR will at least make sure all Mono runtime emitted events are readable by existing tools.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my comment regarding the overhead for little-endian architectures in the above comment.

Ideally we would have for nettrace format something similar like in the Common Trace Format used by LTTng where one can define globally or even per integer (probably not important for us) the byte order. This would have the least overhead while taking a trace which is performance critical. My experience with EventPipe is limited to the last couple of weeks where I basically only got the implementation running. So I'm not sure how important all this is and how it manifests in practice like skipping events due to high load only because of byte swaps and its side-effects from the implementation like non-inlined function calls and so on. I'm also not sure how likely it is to change the current format and in which time frame something like that could happen. If there is anything I can do I would be happy to jump in. Anyhow, in order to gain some experience this PR might suite as a starter where all Mono events are handled. Though, in the long run I'm still kind of worried about user events.

@josalem the TraceEvent library is also on my todo list. A nettrace file created with and without this PR is not convertible on a big-endian machine. With this PR it is at least convertible on a little-endian machine. However, I just gave the sample provider a chance yesterday and traced an example application and the trace looks broken to me, i.e., the flamegraph contains no function names. It looks like as I have missed something which I need to investigate ... setting this PR to WIP.

Copy link
Member

@lateralusX lateralusX May 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure you get the rundown events into the .nettrace files and that they are correct since they are needed for the symbolication of callstacks it tools, they have the mapping between method id in callstacks and the function names/signatures. You can look at all events included in the nettrace file in perfview, and on some, it will also display the payload, but not for all.

@stefan-sf-ibm stefan-sf-ibm marked this pull request as draft May 10, 2022 07:09
@@ -8,6 +8,22 @@
#include "ep-stack-contents.h"
#include "ep-rt.h"

#if FEATURE_CORECLR
Copy link
Member

@lateralusX lateralusX May 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to but this into each runtimes ep-rt-mono|coreclr.h header, that's where we keep runtime specializations. You can look at some other macros with different implementation on each runtime to understand the pattern.

@stefan-sf-ibm
Copy link
Contributor Author

StackBlocks

The reason for missing all functions in the flame graph was that IPs of StackBlocks missed a byte swap. Fixed by adjusting ep_stack_contents_append and introducing EP_UINTPTR_TO_LE.

UTF-16 strings

Some of the UTF-16 strings generated by ep_rt_utf8_to_utf16_string were properly converted in the end to UTF-16LE e.g. via metadata_generator_write_string_to_buffer or ep_write_buffer_string_utf16_t but some were not. For the payload (command line, os information, arch information) of the event ProcessInfo of the Microsoft-DotNETCore-EventPipe provider those were still written in UTF-16BE. In order to fix this I went with doing the byte swap during conversion from UTF-8 to UTF-16 and introduced g_utf8_to_utf16le. This should also give a better performance since a UTF-16 string doesn't need to be reiterated. Furthermore, this should ensure that all UTF-16 strings which are allocated in the eventpipe namespace are little endian so we have a consistent layout of UTF-16 strings.

The converse, i.e., conversion from UTF-16 to UTF-8, needs some adjustment, too. Function ep_rt_utf16_to_utf8_string is used in two different contexts (ep-rt-mono.c, ds-eventpipe-protocol.c) where one time the input UTF-16 string is in host-endian order and the other time in little-endian order. In order to deal with both I added function ep_rt_utf16le_to_utf8_string. Thus following Mono's implementation which uses utf16, utf16be, utf16le for host-endian, big-endian, little-endian, respectively.

Currently the naming scheme differs for UTF-8 to UTF-16 conversion where ep_rt_utf8_to_utf16_string should actually be named ep_rt_utf8_to_utf16le_string. If you are fine I would like to rename this function.

DS_VAL{16,32,64}

I'm wondering whether it is fine to replace all DS_VAL{16,32,64} macros defined in ds-types.h by EP_UINT{16,32,64}_TO_LE? Would imply that ep-rt.h must be included in the diagnostics namespace. Not sure whether this is acceptable or not.

At least the current implementation is missing casts. For example, in here the parameter is of type uint32_t which is not upcasted to an uint64_t by DS_VAL64 which results in a bad shift. Thus, if EP_UINT{16,32,64}_TO_LE cannot be used I would like to replace those with GUINT{16,32,64}_TO_LE.

I will do some more testing in the following days.

Last but not least I just have to say that EventTrace is an awesome library which helped a lot in order to debug the binary trace file :)

@stefan-sf-ibm stefan-sf-ibm marked this pull request as ready for review June 7, 2022 16:08
@stefan-sf-ibm stefan-sf-ibm requested a review from vargaz as a code owner June 7, 2022 16:08
@stefan-sf-ibm
Copy link
Contributor Author

As previously mentioned I removed the macros DS_VAL{16,32,64} and streamlined them with EP_{U,}INT{16,32,64}_TO_LE into proper C functions ep_rt_val_uint32_t as for example. I went for C functions instead of macros in order to rule out that arguments are evaluated multiple times like done here

uint64_t pid = DS_VAL64 (ep_rt_current_process_get_id ());

Using attribute EP_ALWAYS_INLINE should ensure that there is no performance difference between a function and macro.

Beside that I fixed a couple of missing conversions in diagnostics. Since all my local tests are successful, now, removing draft flag.

@stefan-sf-ibm
Copy link
Contributor Author

While rebasing in order to fix some merge conflicts new events where introduced which require string handling as e.g. EventPipeWriteEventBulkType. Added support for those, too.

@stefan-sf-ibm
Copy link
Contributor Author

@lateralusX I did a couple of further tests and so far they all passed. Can we merge this?

@lateralusX
Copy link
Member

lateralusX commented Jun 28, 2022

@lateralusX I did a couple of further tests and so far they all passed. Can we merge this?

@stefan-sf-ibm I looked through the changes again I believe they look fine.

@noahfalk Any objections merging this? The initial concerns have been addressed (so little endian systems should have close to zero overhead) and it will make sure native runtime events works on big endian systems enable use of perfview on nettrace files captured on IBM S390X. It will not do anything with managed events since there is currently not hooked into the serializer mechanism of those events, but based on above discussion, at least getting the native runtime events working with existing tooling adds value for big endian platforms justifying doing the extension.

@lateralusX
Copy link
Member

Would imply that ep-rt.h must be included in the diagnostics namespace. Not sure whether this is acceptable or not.

That is OK, DS namespace already have relationship -> EP headers, what we should avoid is the other way around.

@uweigand
Copy link
Contributor

Merging this will also fix the ThreadPoolMinMaxThreadsEventTest regression on s390x discussed in #71173.

@noahfalk
Copy link
Member

No objections from me, sorry to keep you waiting I was out sick the last couple days.
cc @davmason

@lateralusX
Copy link
Member

/azp run runtime-extra-platforms

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@davmason
Copy link
Member

No objections from me either

@lateralusX
Copy link
Member

Failures in runtime-extra-platforms looks unrelated.

@lateralusX
Copy link
Member

/azp run runtime

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@stefan-sf-ibm
Copy link
Contributor Author

I could also try a rebase if that is fine for you. I remember using a two weeks old mainline version I ran into some weird problems which seem to be fixed meanwhile.

@lateralusX
Copy link
Member

@stefan-sf-ibm, yes please rebase and we can see if we get better pass rate on runtime-extra-platforms as well.

@lateralusX
Copy link
Member

/azp run runtime-extra-platforms

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@lateralusX lateralusX self-requested a review July 4, 2022 18:55
Copy link
Member

@lateralusX lateralusX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for adding support for big endian platform in EventPipe native runtime events!

@lateralusX lateralusX merged commit 136dfcd into dotnet:main Jul 4, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Aug 3, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-Tracing-coreclr community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants