Attempting to read outside of stack #276

Closed
aristotelhs opened this issue Jun 11, 2023 · 43 comments

Labels
bug Something isn't working

Comments

@aristotelhs
Contributor

Hello,
We are trying to use ddprof for our continuous profiling. Currently we deploy ddprof as a sidecar container that profiles the main process. The main process is written in Julia and uses the Julia JIT to execute the end-user code.
Checking the results in the web interface, we can see that some frames are missing. Running ddprof in debug mode, we notice a lot of error messages like:

<DEBUG>Jun 11 12:07:19.028064 ddprof[143]: Attempting to read outside of stack 0x7ffd39f95258 from 47, (0x7ffd39f8d0d0, 0x7ffd39f950d0)[0x7f418fb376b8, 0x7f418fb3f6b8]
<DEBUG>Jun 11 12:07:19.028067 ddprof[143]: Attempting to read outside of stack 0x7ffd39f95260 from 47, (0x7ffd39f8d0d0, 0x7ffd39f950d0)[0x7f418fb376b8, 0x7f418fb3f6b8]
<DEBUG>Jun 11 12:07:19.028070 ddprof[143]: Attempting to read outside of stack 0x7ffd39f95268 from 47, (0x7ffd39f8d0d0, 0x7ffd39f950d0)[0x7f418fb376b8, 0x7f418fb3f6b8]
<DEBUG>Jun 11 12:07:19.028073 ddprof[143]: Attempting to read outside of stack 0x7ffd39f95278 from 47, (0x7ffd39f8d0d0, 0x7ffd39f950d0)[0x7f418fb376b8, 0x7f418fb3f6b8]
<DEBUG>Jun 11 12:07:19.083331 ddprof[143]: Attempting to read outside of stack 0x7ffd39f92878 from 47, (0x7ffd39f8a730, 0x7ffd39f92730)[0x7f418faff790, 0x7f418fb07790]
<DEBUG>Jun 11 12:07:19.083354 ddprof[143]: Attempting to read outside of stack 0x7ffd39f928a0 from 47, (0x7ffd39f8a730, 0x7ffd39f92730)[0x7f418faff790, 0x7f418fb07790]
<DEBUG>Jun 11 12:07:19.083359 ddprof[143]: Attempting to read outside of stack 0x7ffd39f92880 from 47, (0x7ffd39f8a730, 0x7ffd39f92730)[0x7f418faff790, 0x7f418fb07790]
<DEBUG>Jun 11 12:07:19.083362 ddprof[143]: Attempting to read outside of stack 0x7ffd39f92888 from 47, (0x7ffd39f8a730, 0x7ffd39f92730)[0x7f418faff790, 0x7f418fb07790]

Also, in the last part of the log we get errors like this:

<DEBUG>Jun 11 12:28:33.665010 ddprof[143]: [DSO] Couldn't find DSO for [47](0x7f08633164ae). backpopulate
<DEBUG>Jun 11 12:28:33.665038 ddprof[143]: [UW] (PID47) DSO not found at 0x7f08633164ae (depth#15)
<DEBUG>Jun 11 12:28:33.665043 ddprof[143]: Error frame (depth#16)
<DEBUG>Jun 11 12:28:33.665047 ddprof[143]: [DSO] Couldn't find DSO for [47](0x7f0863316631). backpopulate
<DEBUG>Jun 11 12:28:33.665049 ddprof[143]: [UW] (PID47) DSO not found at 0x7f0863316631 (depth#16)
<DEBUG>Jun 11 12:28:33.665052 ddprof[143]: Error frame (depth#17)
<DEBUG>Jun 11 12:28:33.665676 ddprof[143]: [DSO] Couldn't find DSO for [47](0x7f08633164ae). backpopulate
<DEBUG>Jun 11 12:28:33.665684 ddprof[143]: [UW] (PID47) DSO not found at 0x7f08633164ae (depth#24)
<DEBUG>Jun 11 12:28:33.665686 ddprof[143]: Error frame (depth#25)

I also include the pid maps for the process.
47-map.txt

Is there a way to tell ddprof to continue unwinding the stack even when an address falls outside the stack segment? There are legitimate reasons for this to happen due to code generation (and the addresses point to executable mappings in the maps file).

@r1viollet r1viollet added the bug Something isn't working label Jun 12, 2023
@r1viollet
Collaborator

r1viollet commented Jun 12, 2023

Thanks for these logs and maps.

Unwinding (minor issue)

Your suggestion is what we already apply: we try to continue the unwinding even if we do not find the appropriate DSO. When this happens we fall back to frame-pointer unwinding.
We have a mechanism to avoid re-parsing the DSO files too often, which is hurting us here: it is what causes the failures to find the appropriate DSO. However, the DSO object should not be needed if we are symbolizing through JIT information.
This is something we can improve: we currently try to match DSOs before looking into the JIT information, which causes useless lookups. I will give a higher priority to finding runtime symbols in the JIT dump instead of checking the maps first.
That said, I'm still surprised this is happening, considering we listen to mmap events (so we should not need to parse the maps this often).

Out-of-bounds read (this explains the issue you are describing)

We don't capture the full stack in a sample; we only take 32 kilobytes.
Here we are reaching the end of our sample and cannot continue. We could increase the sample size to improve this, but there is a tradeoff (size of the ring buffer, time to copy the samples...). We should discuss how important it is for you to have the full stacks: your feedback is welcome on this part.
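
For context, a minimal sketch (an assumption about the underlying mechanism, not ddprof's actual code) of how a fixed per-sample stack capture is requested through the Linux perf_event_open(2) interface: the 32 kilobytes above corresponds to the sample_stack_user field, which is why enlarging it grows every sample written to the ring buffer. All names and constants below are illustrative.

#include <cstdint>
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

// Sketch only: open a CPU-clock sampler that copies `stack_bytes` of user
// stack with every sample (32768 would match the value discussed above).
static int open_cpu_sampler(pid_t pid, int cpu, uint32_t stack_bytes) {
  struct perf_event_attr attr = {};
  attr.size = sizeof(attr);
  attr.type = PERF_TYPE_SOFTWARE;
  attr.config = PERF_COUNT_SW_CPU_CLOCK;
  attr.freq = 1;
  attr.sample_freq = 99;                 // sampling frequency in Hz (illustrative)
  attr.sample_type = PERF_SAMPLE_TID | PERF_SAMPLE_REGS_USER | PERF_SAMPLE_STACK_USER;
  attr.sample_regs_user = 0xff;          // arch-specific register mask (placeholder)
  attr.sample_stack_user = stack_bytes;  // bytes of user stack copied per sample
  attr.exclude_kernel = 1;
  return (int)syscall(SYS_perf_event_open, &attr, pid, cpu, -1, 0);
}

The tradeoff mentioned above is visible here: every sample carries up to stack_bytes of payload, so doubling it roughly doubles ring-buffer traffic and copy time.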

@aristotelhs
Contributor Author

If I read the first comment correctly, even when I get this error message about an address being outside the stack range, you continue the unwinding using FP. (We have compiled Julia with frame pointers, so this should work as expected.)

The out-of-bounds read is in a different place than the second part of the log messages. So how did you correlate the out-of-bounds reads with the size of the sample?

For us it is important to have the full stacks, and we do end up with deep stacks because our DSL is compiled to Julia and then JITted. If we cannot go deeper in the stack, we might lose some of the code paths that lead to a specific execution (@NHDaly might be better placed to give insight on this). If the stack unwinding is exhausted within the 32 KB sample, are the relevant stack traces still sent to DD, or are they dropped? (Trying to see how important the error messages are.)

@NHDaly

NHDaly commented Jun 13, 2023

+1 it's hard for us to answer without understanding those details.

If the samples are dropped, that's bad and we won't like it.
If the samples are sent and somehow can indicate that they're truncated, that would be more acceptable. Would they be truncated at the leaves or the roots? Or rather, which part would be preserved?

Obviously we'd most prefer to have the full stacks, but we could probably make do with partial stacks if it's our only option. ... I'm not sure if it would be enough of a detriment to prevent us from using the service or not. We'd have to see.

@r1viollet
Collaborator

Our aim is never to drop samples. The stack would be truncated instead; the aim is for the profile to remain statistically correct.

To explain the sample-size setting better: we take samples by capturing the state of the stack. The larger the sample, the further up the stack we can unwind. Larger samples would have more impact on your runtime (to be measured). We could make this into a setting.

@r1viollet
Collaborator

Question around what parts are being kept: the leaves are kept and the roots are dropped (so frames closer to your main are dropped, whereas leaf functions like memcpy are kept). But the sample as a whole is kept.

We capture a sample with the state in which the code is being executed and then we make our way up to the calling functions.
When we are unable to complete the unwinding, you will see an [incomplete] frame explaining that we were not able to fully unwind.

@NHDaly

NHDaly commented Jun 14, 2023

aha, thanks, we had wondered what that meant. Makes sense! So [incomplete] indicates some number of frames that were dropped?


Higher samples would have more impact on your runtime (to be measured). We could make this into a setting.

Yes, a setting sounds great! I think our preference would likely be:

  • dramatically lower the sample frequency (we currently have it set to 19 Hz, I think(?), rather than the 100 Hz default)
  • but increase the sample size so that we can keep the entire frame

With those two knobs, we should be able to keep the perf impact neutral, while retaining full profile information.

Right now, our application performance tuning questions are much more in the "why did this take 10 seconds" domain, than in the "why did this take 10 microseconds", so we strongly prefer full profiles over fine-grained samples.

@NHDaly

NHDaly commented Jun 14, 2023

But, with all of this said: the above discussion still doesn't cover the scenario we were seeing, which seemed to be presenting as dropped samples, right?

Is your expectation that nothing we've shown indicates this is happening, @r1viollet?

@aristotelhs: can you run an experiment where you trigger high CPU load on an engine, and then check to see whether or not the profile reflects that high CPU usage as expected?

@r1viollet
Collaborator

r1viollet commented Jun 15, 2023

Is your expectation that nothing we've shown indicates this is happening, @r1viollet?

Yes, this is correct.

The only scenario in which we would be losing samples is if we are unable to keep up with the number of samples or if we are not scheduled to process them. I can check this if you connect the profiler to a statsd socket (our internal stats are not billed in any way). You would need to mount the agent's statsd socket in the container and provide it to the profiler:

--internal_stats /var/run/datadog-agent/statsd.sock

edit: It could be simpler to just enable the notice log level (-l notice). This line surfaces the lost samples:

datadog.profiling.native.event.lost: 0

@aristotelhs
Contributor Author

So [Incomplete] is when we reach the maximum depth of the stack unwinding. From a previous message I had thought that [Incomplete] appeared because a version mismatch in the libraries made unwinding fail; or can it be both? In our case I can see that almost all traces are [Incomplete].

We run ddprof with debug logging (-l debug), so when I run the load I get this output:

<NOTICE>Jun 18 09:41:06.537646 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 09:42:05.464464 ddprof[143]: datadog.profiling.native.event.lost: 8
<NOTICE>Jun 18 09:43:04.463343 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 09:44:03.568167 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 09:45:02.454830 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 09:46:01.468762 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 09:47:00.567233 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 09:47:59.553123 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 09:48:58.544933 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 09:49:57.527206 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 09:50:56.510943 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 09:51:55.459613 ddprof[143]: datadog.profiling.native.event.lost: 6
<NOTICE>Jun 18 09:52:54.543147 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 09:53:53.533412 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 09:54:52.553264 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 09:55:51.493461 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 09:56:50.503270 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 09:57:49.525502 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 09:58:48.547727 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 09:59:47.484624 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 10:00:46.546879 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 10:01:45.461844 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 10:02:44.513070 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 10:03:43.508614 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 10:04:42.537522 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 10:05:41.467133 ddprof[143]: datadog.profiling.native.event.lost: 0
<NOTICE>Jun 18 10:06:40.551488 ddprof[143]: datadog.profiling.native.event.lost: 0

However, in the test run I am able to see some profiles coming from the test setup (not sure if something is missing). I will leave it running so that we can have a better look with @NHDaly.

@r1viollet
Collaborator

r1viollet commented Jun 19, 2023

That looks like a healthy amount of lost events. The number of samples processed should also be in the logs.

I would recommend the notice setting, which will give you general metrics on how the profiler is behaving while not suffering from some of the performance impact of the debug logs.
[Incomplete] is when we don't detect a frame that matches what we would expect from a successful unwinding. Here are the symbols we expect at the end of the unwinding:

"__clone", "__clone3","_exit","main","runtime.goexit.abi0","runtime.systemstack.abi0","_start","start_thread"

This can be due to the unwinding reaching the end of the buffer we sampled, or to missing information (such as missing unwinding or symbol information). Incomplete frames are usually a minor inconvenience (as the leaf nodes are the most important) and should not prevent you from interpreting performance results.
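
As an illustration of the heuristic described above (an assumed reconstruction, not ddprof's source code), the check boils down to comparing the outermost unwound symbol against the listed roots:

#include <array>
#include <string_view>

// Known entry-point symbols quoted above; reaching one of them means the
// unwinding walked all the way to the root of the thread.
constexpr std::array<std::string_view, 8> kRootSymbols = {
    "__clone", "__clone3", "_exit", "main",
    "runtime.goexit.abi0", "runtime.systemstack.abi0", "_start", "start_thread"};

// Sketch: a stack counts as complete if its outermost frame matches a known
// root; otherwise the profile shows an [Incomplete] frame on top of it.
bool is_complete_stack(std::string_view outermost_symbol) {
  for (std::string_view root : kRootSymbols) {
    if (outermost_symbol == root) return true;
  }
  return false;
}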

@NHDaly

NHDaly commented Jun 23, 2023

Thanks for the support so far. This is really helpful!

We spent some time looking into this today, and we do have some strange findings still.

Here is one example of the inconsistent CPU reporting between the metrics and the profiles:

The pod has 4 cores. The metrics dashboard reports somewhere between 1 and 3 cores in use during those 5 minutes.
But the ddprof profile reports only up to 1.0 core in use.
Shouldn't the ddprof profile get all the way up to 3+ cores, too?

Is something happening that is normalizing the profiles by number of cores? This is not expected, from my understanding.

Thanks in advance for the support!

@NHDaly

NHDaly commented Jun 23, 2023

All the uploads during that window report the same (logs):

<NOTICE>Jun 23 14:56:47.839272 ddprof[40905]: datadog.profiling.native.event.lost: 0

@NHDaly

NHDaly commented Jun 23, 2023

I do see this debug message over and over - not sure if it's an issue?:

<DEBUG>Jun 23 16:14:44.640740 ddprof[40905]: Stopped at 7ea7f1489f43 - dso PID[7] 7ea7f1400000-7ea7f14fffff 0 ()(T-Anonymous)(x)(ID#-1) - error no error ()
<DEBUG>Jun 23 16:14:44.640759 ddprof[40905]: Break out of unwinding (possible infinite loop)

@NHDaly

NHDaly commented Jun 23, 2023

Finally, I wanted to point out that often the samples are pretty uselessly short right now, e.g.:
Screenshot 2023-06-23 at 10 28 26 AM

Do you think we could increase the buffer length, as you suggested? Also, can you remind me what Anonymous means? It seems to correlate with the JITted frames.

@aristotelhs
Contributor Author

Regarding the discussion about [Incomplete]: I notice that [Incomplete] appears in all of our traces. So after the top-level entry point, which is the executable, the next frame is [Incomplete]. This means our traces end on something different from the symbols that ddprof expects.

@NHDaly

NHDaly commented Jun 26, 2023

+1 all of our profiles have incomplete at the top.

@r1viollet
Collaborator

Thanks for the support so far. This is really helpful!

We spent some time looking into this today, and we do have some strange findings still.

....

The pod has 4 cores. The metrics dashboard reports somewhere between 1 and 3 cores in use during those 5 minutes. But the ddprof profile reports only up to 1.0 core in use. Shouldn't the ddprof profile get all the way up to 3+ cores, too?

Is something happening that is normalizing the profiles by number of cores? This is not expected, from my understanding.

Thanks in advance for the support!

We should be reporting the full CPU activity. Here is one of our services running at ~4 CPUs per minute.
image

I cannot reproduce this on my side. Would you have any hints as to how I could create an example exposing such an issue? Are you confident that the activity on that pod is only explained by this process?

@r1viollet
Collaborator

To recap what I understand. What we are sure about:

  • From some of the logs you added above, the symbolization flow can be improved.
    I made a small PR to improve the flow.

  • The size of samples is too small; we should make this configurable.
    This is the most likely cause of most of the unwinding issues we are seeing. You might need to change some ulimit settings to make the new sample size work: we use pinned memory to create the buffers, and this limit is a kernel setting.

The remaining item that I find strange is the CPU accounting not matching what you would expect it to be.
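
On the ulimit point above, the pinned-memory limit in question is RLIMIT_MEMLOCK (what ulimit -l reports, in KiB). A hedged sketch of how a process could check and raise it, assuming it has the privilege to do so; the function name is made up for illustration:

#include <sys/resource.h>

// Sketch: make sure the locked-memory limit is at least `wanted_bytes`.
// Raising the hard limit requires privilege (e.g. CAP_SYS_RESOURCE); in a
// container this is usually configured through the runtime's ulimit options
// or by granting IPC_LOCK, as in the error message further down this thread.
bool ensure_memlock_limit(rlim_t wanted_bytes) {
  struct rlimit rl;
  if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) return false;
  if (rl.rlim_cur >= wanted_bytes) return true;                // already large enough
  rl.rlim_cur = wanted_bytes;
  if (rl.rlim_max < wanted_bytes) rl.rlim_max = wanted_bytes;  // needs privilege
  return setrlimit(RLIMIT_MEMLOCK, &rl) == 0;
}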

@aristotelhs
Contributor Author

Thanks for the reply.
The way we run ddprof is that we inject a ddprof container which, when it starts, monitors the Julia process by running ddprof -p $PID_OF_JULIA_PROCESS.
The same pod contains the Julia process, ddprof itself, and a small container used to expose a socket from the host. The CPU limits exist only on the Julia container, not on the others.
When you look at the native-prof-analyzer process and group by Thread-ID on the right side of the window, how many does it show? For our Julia process, when we group by Thread-ID we only get one Thread-ID back.

Which are the new ulimits that we must add? Should we wait for the patch to be merged before we try?

@NHDaly

NHDaly commented Jun 29, 2023

Are you confident that the activity on that POD is only explained by this process ?

I am confident, yes. We are running julia with multiple threads, and we know for sure that it can utilize multiple cores.

Would you have any hints as to how I could create an example exposing such an issue ?

Maybe we can get together, and pair on this sometime? It would be great to get your support if that's something you're able to do. I think we can get you the permissions so you can trigger workloads on our side to support your debugging!

@NHDaly

NHDaly commented Jun 29, 2023

Thanks both! +1 to the discussion above

@r1viollet
Collaborator

Thanks for the reply. The way we run ddprof is that we inject a ddprof container which, when it starts, monitors the Julia process by running ddprof -p $PID_OF_JULIA_PROCESS.

Thanks for this clarification. There seems to be an issue with how we follow threads started before we instrument the PID (so a general issue with PID mode). I'll look into this. Using wrapper mode would fix it. We also have a user-switch option if you want a different user between ddprof and the process being profiled.

Which are the new ulimits that we must add? Should we wait for the patch to be merged before we try?

I'll deliver the patch with a setting for the buffer size and we will come back to this. 👍
Happy to pair if we get stuck! Though I think I mostly have some amount of coding to do right now 😄

@aristotelhs
Contributor Author

Quick question: if we deploy ddprof as a daemonset, will it suffer from the same issue regarding threads, or does it use a different code path? (I have not done the test yet, but it is good to have this in mind.)

@r1viollet
Collaborator

No, the daemonset using global mode won't suffer from this. This only relates to PID mode.

@r1viollet
Collaborator

A new option is available in the pre-release version. It increases the size of the stack samples we take. You can find the binaries here. I will soon release a new version of ddprof.
The command line should look like this:

./ddprof --stack_sample_size=65528 -l notice --preset cpu_only julia main.jl

Can you confirm that this improves the stack traces?

@r1viollet
Collaborator

r1viollet commented Jul 12, 2023

The stack size configuration was released in v0.13.0.
There is also a truncation metric in the logs that we will be able to monitor.

@aristotelhs
Contributor Author

For the discussion on the incomplete profiles: Julia seems to have a start_task symbol that is used to label top-level tasks, so this might improve some traces.
I will upgrade to v0.13.0 to see the changes. We tried the container-id option, which seems to be able to isolate the container, but we cannot group on both container-id and pid.
The container-id is useful, but we need an easy way to map pod -> container-ids, since most people are used to thinking in terms of pods rather than container-ids.
Thanks for the new release.
FYI, we run ddprof as a daemonset alongside our apps.

@aristotelhs
Contributor Author

aristotelhs commented Jul 20, 2023

The logs from the new containers are the following:

<ERROR>Jul 20 18:36:55.904327 ddprof[1710354]: Could not mmap memory for watcher #0: Operation not permitted. Please increase kernel limits on pinned memory (ulimit -l). OR associate the IPC_LOCK capability to this process.
<ERROR>Jul 20 18:36:55.904339 ddprof[1710354]: PERFMMAP: error in mmap operations at /go/src/github.com/DataDog/apm-reliability/ddprof-build/ddprof/src/pevent_lib.cc:188
<ERROR>Jul 20 18:36:55.904343 ddprof[1710354]: Forward error at /go/src/github.com/DataDog/apm-reliability/ddprof-build/ddprof/src/pevent_lib.cc:227 - PERFMMAP: error in mmap operations
<NOTICE>Jul 20 18:36:55.904369 ddprof[1710354]: Retrying attachment without user override

These logs are with the IPC_LOCK capability added to the containers. I have added --stack_sample_size=65528 to the running command. Is this parameter ignored, then?

@r1viollet
Collaborator

Filtering:

Thanks for this feedback. I have forwarded it. This requires rethinking some of the filtering mechanisms.
I will get back to you on these improvements.

Mmap log failure:

This is mostly a confusing log. We will retry mmaps with different users (because limitations are shared across containers for a given user). The first user failed, and we are now trying with the second user (still with a larger buffer). I'll look at how to make this less confusing.

@r1viollet
Collaborator

As mentioned, the log is misleading; the parameter is still taken into account. Could you clarify whether the flamegraphs are deep enough?

@aristotelhs
Contributor Author

Hello, using this parameter we do get a better outcome for the stack depth, as can be seen in the attached picture.
image
The [Incomplete] part has shrunk significantly. However, that part still looks strange, because I can also see frames that do not look deep enough, as shown in the next picture:
image
The frames rooted in ijl_gc_* functions do not seem deep enough to explain being [Incomplete]. I am not sure if this is a case of being rooted in some "fake" or special stack frames that result in the unwinding showing as incomplete. (cc @vchuravy)
The anonymous frames are a very small percentage in this run (~0.25%), and they improved after adding some debuginfo packages to our builds. Is there an easy way to see which debug symbols are missing, or whether there is a mismatch on the C libraries, so we can remove them?
image
One thing that I noticed in the logs is the following set of lines:

<NOTICE>Aug 04 10:33:57.087393 ddprof[2853012]: datadog.profiling.native.symbols.jit.reads: 2
<NOTICE>Aug 04 10:33:57.087394 ddprof[2853012]: datadog.profiling.native.symbols.jit.failed_lookups: 377
<NOTICE>Aug 04 10:33:57.087395 ddprof[2853012]: datadog.profiling.native.symbols.jit.symbol_count: 0

Does that mean that ddprof is unable to read the jitdump files for these processes? In the traces I do see symbols in the format julia_sortNOT._18923, which look like jitted functions.

We run ddprof as a pod in the cluster nodes and we monitor all the processes. The command line arguments are the following: ddprof -l notice --stack_sample_size=65528 -e sCPU,period=19 -g

@aristotelhs
Contributor Author

Running some extra loads we get the following profile.

image

It seems the jitted code is not shown and those stack frames are missing. It looks like all the Anonymous frames that exist in the profile are the jitted code. I am not sure why they appear much less deep than they actually are (I would expect the correct depth, with the function name shown as anonymous at each level, but it seems that an aggregation is happening).

@NHDaly

NHDaly commented Aug 9, 2023

^ I don't think this is quite right. I don't think we know whether the Anonymous frames are the jitted code or not, but we know for sure that stacks with the jitted frames are completely missing.

We ran the above workload in both Julia's native profiler and in ddprof, and you can see that whole samples are missing from the ddprof profile, which seem to correspond to samples with jitted frames:
Screenshot 2023-08-04 at 11 30 51 AM
Screenshot 2023-08-04 at 11 30 45 AM

In particular, the main column on the left, the run_job column, is completely absent.


And regarding the question of rooting: notice that the sort columns are rooted in start_task in the Julia native profile, and they're reasonably short, so I don't see why those should end at [Incomplete].

@NHDaly

NHDaly commented Aug 9, 2023

^ Until the missing stacks can be addressed, this is still not useful to us. Without them, it's missing a large chunk of the work that is going on on the engine.

@r1viollet
Collaborator

Thanks for the logs and screen captures.

The most important issue seems to be that the JITDump lookups are failing. I am working on a reproducer and will see how to fix this.
You highlighted other things that might need some attention:

  • Unwinding through signal handlers (looks like Julia has a segv handler)
  • Stack depth still not being deep enough.

Though we should come back to that once we have proper JIT symbols in the flamegraphs.
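
For background on what these lookups parse, here is the jitdump file header as laid out in the Linux perf jitdump specification (the struct name and comments are illustrative, not ddprof's own types):

#include <cstdint>

// jitdump file header per the perf jitdump specification.
struct JitDumpFileHeader {
  uint32_t magic;       // 'JiTD'
  uint32_t version;
  uint32_t total_size;  // size of this header in bytes
  uint32_t elf_mach;    // ELF machine of the jitted code
  uint32_t pad1;
  uint32_t pid;         // the <pid> in jit-<pid>.dump
  uint64_t timestamp;
  uint64_t flags;
};
// Records follow, each starting with { uint32 id; uint32 total_size; uint64 timestamp; }.
// JIT_CODE_LOAD records then carry the function name and code range that let a
// profiler symbolize otherwise-anonymous executable mappings.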

@r1viollet
Collaborator

r1viollet commented Aug 24, 2023

I think this can relate to how the jitdump files are specified in the maps:

.debug/jit/llvm-IR-jit-20230611-ec3bc0/jit-47.dump

I am assuming this path is relative to the current working directory.
I was expecting absolute paths, like:

/home/r1viollet/.debug/jit/llvm-IR-jit-20230824-dc3310/jit-1499363.dump

I will add a patch that takes this into account.

@r1viollet
Collaborator

r1viollet commented Aug 24, 2023

As a workaround, can you try specifying the path to the jitdump folder explicitly?
This variable should be set on the Julia process (not the profiler):

export JITDUMPDIR=<path_where_the_debug_folder_will_be_created>

Reference to the LLVM sources:
https://github.com/llvm/llvm-project/blob/main/llvm/lib/ExecutionEngine/PerfJITEvents/PerfJITEventListener.cpp#L297
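
Roughly what the linked LLVM code does, as a simplified paraphrase (not the verbatim implementation; the dated llvm-IR-jit-* subdirectory and error handling are omitted):

#include <cstdlib>
#include <string>
#include <unistd.h>

// Simplified: pick the base directory from JITDUMPDIR, fall back to the home
// directory, then to the current directory, and place the per-process
// jit-<pid>.dump under <base>/.debug/jit/. The file is also mmap'ed by the
// process, which is how it ends up visible in /proc/<pid>/maps.
std::string jitdump_path() {
  const char *base = std::getenv("JITDUMPDIR");
  if (!base || !*base) base = std::getenv("HOME");
  std::string dir = std::string(base && *base ? base : ".") + "/.debug/jit";
  return dir + "/jit-" + std::to_string(getpid()) + ".dump";
}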

@aristotelhs
Contributor Author

If I read the code correctly, this changes the directory where Julia (through LLVM) creates the debug dir. In our case we fall into the home-directory case (https://github.com/llvm/llvm-project/blob/main/llvm/lib/ExecutionEngine/PerfJITEvents/PerfJITEventListener.cpp#L306), so it just adds the . in front.
If we change JITDUMPDIR in the Julia process, how would ddprof catch that and find the directory?

@aristotelhs
Contributor Author

If there is a docker image with the version of ddprof with the above patch, I can try to install that and test it in our infra.

@r1viollet
Collaborator

From my tests with JITDUMPDIR, an absolute path was generated (you can check this in /proc/<julia_process_pid>/maps; the line with the .dump file is either absolute or relative).

I will have a patch version you can try by end of day.

@aristotelhs
Contributor Author

I am not sure this is the case, because I just spun up a new engine, and when I checked the maps file I do get the full path:

bash-5.1$ grep dump /proc/1/maps
7f8e53749000-7f8e5374a000 r-xp 00000000 00:ff 4644907                    /home/user/.debug/jit/llvm-IR-jit-20230805-bbab0f/jit-1.dump

I also created a new ddprof binary with the patches in r1viollet/jitdump_relative_path. As I read it, if ddprof could not find the jitdump file it should emit a WARN log that it is unable to open the jitdump file, which is not what we see in our infra.

@r1viollet
Collaborator

Indeed, I added a warning (in the latest patch). It occurs if we were able to find the .dump mapping.
The following release contains the patch.

Nice work running your own builds 👏.
We can coordinate on the Datadog public Slack, if that is OK, to debug this together; I am not sure why we would not find this.

@r1viollet
Collaborator

The latest 0.14.1 release includes a patch that allows the --global option to match the jitdump information.
@aristotelhs was able to confirm the improvement.

We also noticed an issue with the unwinding in the Julia segv handler (used by the runtime within the garbage collector). We will discuss this in a separate issue.

Thanks for the help investigating this issue.
