Attempting to read outside of stack #276
Thanks for these logs and maps.
Unwinding (minor issue): your suggestion is what we apply. We try to continue the unwinding even if we are not finding the appropriate DSO; when this happens we rely on frame pointer unwinding.
Out of bound read (this explains the issue you are describing): we don't ask for the full stack as a sample. We only take 32 kilobytes.
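To make that fallback concrete, here is a minimal sketch of frame-pointer unwinding over a copied stack buffer. It illustrates the general technique only, not ddprof's actual code; the x86-64 layout, the `fp_unwind` name, and the 512-frame cap are assumptions.

```c
/* Minimal sketch of frame-pointer unwinding over a copied stack buffer.
 * Assumptions (not ddprof's code): x86-64, frame pointers preserved,
 * `stack_copy` holds `stack_size` bytes copied from address `stack_start`
 * of the target's stack, and rbp/rip come from the sampled register state. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

void fp_unwind(const uint8_t *stack_copy, uint64_t stack_start,
               uint64_t stack_size, uint64_t rbp, uint64_t rip) {
    for (int depth = 0; depth < 512; ++depth) {
        printf("frame %3d: pc=0x%016" PRIx64 "\n", depth, rip);

        /* The saved frame pointer and return address live at [rbp] and [rbp+8]. */
        if (rbp < stack_start || rbp + 16 > stack_start + stack_size) {
            /* The chain walked past the sampled bytes: the trace is truncated
             * here, which is where a marker such as [Incomplete] shows up. */
            break;
        }
        uint64_t off = rbp - stack_start;
        uint64_t next_rbp, ret_addr;
        memcpy(&next_rbp, stack_copy + off, sizeof(next_rbp));
        memcpy(&ret_addr, stack_copy + off + 8, sizeof(ret_addr));
        if (ret_addr == 0 || next_rbp <= rbp)   /* corrupt or final frame */
            break;
        rbp = next_rbp;
        rip = ret_addr;
    }
}
```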
If I read the first comment correctly, even if I get this error message about the address being out of stack range, you continue the unwinding using FP. (We have compiled Julia to use FP so this should work as expected.) The out of bounds read is in a different place than the second part of the log messages, so how did you correlate the out of bounds with the size of the sample? For us it is important to have the full stacks, and it seems that we do end up with deep stacks because there is a process of compiling our DSL to Julia and then to JIT. Not being able to go deeper in the stack, we might lose some of the code paths that lead to a specific execution. (@NHDaly might be better placed to give insight on this.) If in the 32kb samples the stack unwinding is exhausted, are the relevant stack traces sent to DD, or are they dropped? (Trying to see how important the error messages are.)
+1, it's hard for us to answer without understanding those details. If the samples are dropped, that's bad and we won't like it. Obviously we'd most prefer to have the full stacks, but we could probably make do with partial stacks if that's our only option. ... I'm not sure if it would be enough of a detriment to prevent us from using the service or not. We'd have to see.
Our aim is never to drop samples; the frame would be truncated. The aim is for the profile to remain statistically correct. To explain the sample-size setting better: we take samples by capturing the state of the stack. The larger the sample, the higher we can get in the stack. Larger samples would have more impact on your runtime (to be measured). We could make this into a setting.
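For context, Linux perf events let a sampler ask the kernel to copy a fixed-size slice of the user stack alongside each sample, which is what the size above refers to. The sketch below is illustrative only: the 32 KiB value mirrors the figure mentioned in the discussion, while the function name, period, and register mask are assumptions rather than ddprof's actual configuration.

```c
/* Sketch: asking the kernel to copy a fixed-size slice of the user stack
 * with each CPU-time sample. The 32 KiB figure mirrors the value discussed
 * above; everything else is illustrative, not ddprof's configuration. */
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int open_cpu_sampler(pid_t pid, int cpu) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_SOFTWARE;
    attr.config = PERF_COUNT_SW_CPU_CLOCK;   /* sample on CPU time */
    attr.sample_period = 10 * 1000 * 1000;   /* every ~10 ms of CPU time (ns) */
    attr.sample_type = PERF_SAMPLE_REGS_USER | PERF_SAMPLE_STACK_USER;
    attr.sample_regs_user = 0xfffull;        /* arch-specific register mask (illustrative) */
    attr.sample_stack_user = 32 * 1024;      /* bytes of user stack copied per sample */
    attr.exclude_kernel = 1;

    /* group_fd = -1, flags = 0 */
    return (int)syscall(SYS_perf_event_open, &attr, (int)pid, cpu, -1, 0UL);
}
```

A larger `sample_stack_user` lets the unwinder reach further toward the roots, at the cost of more data copied and processed per sample.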
Question around what parts are being kept: the leaves are kept, the roots are dropped (so frames closer to your entry point would be lost). We capture a sample with the state in which the code is being executed and then we make our way up to the calling functions.
Aha, thanks, we had wondered what that meant. Makes sense!
Yes, a setting sounds great! I think our preference would likely be:
- a larger stack sample size, so that full stacks are retained, and
- a lower sampling frequency, to offset the extra overhead.
With those two knobs, we should be able to keep the perf impact neutral, while retaining full profile information. Right now, our application performance tuning questions are much more in the "why did this take 10 seconds" domain, than in the "why did this take 10 microseconds", so we strongly prefer full profiles over fine-grained samples. |
But, with all of this said: the above discussion still doesn't cover the scenario we were seeing, which seemed to present as dropped samples, right? Is your expectation that nothing we've shown indicates this is happening, @r1viollet? @aristotelhs: can you run an experiment where you trigger high CPU load on an engine, and then check whether or not the profile reflects that high CPU usage as expected?
Yes, this is correct. The only scenario in which we would be losing samples is if we are unable to keep up with the number of samples or if we are not scheduled to process them. I can check this if you connect the profiler to a statsd socket (our internal stats are not billed in any way). You would need to mount the agent's statsd socket in the container and provide it to the profiler.
edit: It could be simpler to just enable the notice level of logs (-l notice).
We run ddprof with debug logging (-l debug), so when I run the load I get this output:
However, in the test run that I am doing I am able to see some profiles coming from the test setup. (Not sure if something is missing.) I will leave it running so that we can have a better look with @NHDaly.
That looks like a healthy amount of lost events. The number of samples processed should also be in the logs; I would recommend the notice log level for that.
This can be due to the unwinding reaching the end of the buffer we sampled, or to missing information (such as missing unwinding or symbol information). Incomplete frames are usually a minor inconvenience (as the leaf nodes are the most important) and should not prevent you from interpreting performance results.
Thanks for the support so far. This is really helpful! We spent some time looking into this today, and we do have some strange findings still. Here is one example of the inconsistent CPU reporting between the metrics and the profiles:
The pod has 4 cores. The metrics dashboard reports somewhere between 1 and 3 cores in use during those 5 minutes. Is something happening that is normalizing the profiles by number of cores? This is not expected, from my understanding. Thanks in advance for the support! |
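For what it's worth, here is a rough way to sanity-check the "cores in use" number from a CPU-time profile: the total sampled CPU time over the window divided by the wall-clock window length should approximate average core usage. Everything in the snippet except the 5-minute window is a made-up, hypothetical number.

```c
/* Back-of-the-envelope check of average core usage from a CPU-time profile.
 * All numbers are hypothetical; only the 5-minute window comes from the
 * discussion above. */
#include <stdio.h>

int main(void) {
    double sample_period_s   = 0.010;   /* assumed: one sample per 10 ms of CPU time */
    double samples_in_window = 60000;   /* hypothetical sample count over the window */
    double window_s          = 300;     /* the 5-minute window mentioned above */

    double cpu_seconds = samples_in_window * sample_period_s;   /* 600 s of CPU time */
    double avg_cores   = cpu_seconds / window_s;                /* ~2 cores on average */
    printf("average cores in use ~= %.2f\n", avg_cores);
    return 0;
}
```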
All the uploads during that window report the same (logs):
I do see this debug message over and over - not sure if it's an issue:
Regarding the discussion about Incomplete: I notice that [Incomplete] exists in all of our traces. After the top-level entry point, which is the executable, the next top-level frame is [Incomplete]. This means that our traces are rooted in something different from what ddprof expects.
+1, all of our profiles have [Incomplete] at the top.
To recap what I understand, here is what we are sure about:
The remaining item that I find strange is the CPU accounting not matching what you would expect it to be. |
Thanks for the reply. What are the new ulimits that we must add? Should we wait for the patch to be merged before we try?
I am confident, yes. We are running Julia with multiple threads, and we know for sure that it can utilize multiple cores.
Maybe we can get together and pair on this sometime? It would be great to get your support if that's something you're able to do. I think we can get you the permissions to trigger workloads on our side to support your debugging!
Thanks both! +1 to the discussion above |
Thanks for this clarification. There seems to be an issue with how we follow threads started before we instrument the PID (so a general issue with PID mode). I'll look into this. Using wrapper mode would fix this. We also have a user-switch option if you want to have a different user for ddprof and the process being profiled.
I'll deliver the patch with a setting for the buffer size and then we can come back to this. 👍
Quick question: if we deploy ddprof as a daemonset, will it suffer from the same issue regarding threads, or does it use a different code path? (I haven't done the test yet, but it is good to have this in mind.)
No, the daemonset using global mode won't suffer from this. This only relates to PID mode. |
A new option is available in the pre-release version. It will increase the size of the samples we take. You can find the binaries here. I will soon release a new version of ddprof.
Can you confirm that this improves the stack traces?
The stack size configuration was released in v0.13.0.
Regarding the discussion on the incomplete profiles: Julia seems to have the start_task symbol that is used to label top-level tasks, so this might improve some traces.
The logs from the new containers are the following:
The logs are with the IPC_LOCK capability added to the containers. I have added the |
Filtering: thanks for this feedback, I have forwarded it. This requires rethinking some of the filtering mechanisms.
Mmap failure log: this is mostly a confusing log. We retry mmaps with different users (because limits are shared across containers for a given user). The first user failed, and we then try with the second user (still with a larger buffer). I'll look at how to make this less confusing.
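As background on why the user matters here: perf ring buffers are, roughly speaking, charged against per-user locked-memory accounting unless the profiler holds CAP_IPC_LOCK (which lines up with the IPC_LOCK note above). The snippet below is just an illustrative helper for inspecting that limit inside a container; it is not part of ddprof.

```c
/* Illustrative helper: print the locked-memory limit that perf buffer
 * allocations are charged against when CAP_IPC_LOCK is not available.
 * Not part of ddprof; just a way to inspect the environment. */
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("getrlimit(RLIMIT_MEMLOCK)");
        return 1;
    }
    printf("RLIMIT_MEMLOCK soft=%llu bytes, hard=%llu bytes\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);
    return 0;
}
```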
As mentioned, the log is misleading; the parameter is still taken into account. Could you clarify whether the flamegraphs are deep enough?
Hello, using this parameter we do get a better outcome for the stack depth, as can be seen in the attached picture.
Does that mean that ddprof is unable to read the jitdump files for the processes? In the traces I do see symbols in the format julia_sortNOT._18923, which look like jitted functions. We run ddprof as a pod on the cluster nodes and we monitor all the processes. The command line arguments are the following:
^ Until the missing stacks can be addressed, this is still not useful to us. Without them, it's missing a large chunk of the work that is going on in the engine.
Thanks for the logs and screen captures. The most important issue seems to be that the JITDump lookups are failing. I am working on a reproducer and will see how to fix this.
Though we should come back to that once we have proper JIT symbols in the flamegraphs. |
I think this can relate to how the jitdump files are specified in the maps.
I am assuming this is the current working directory.
I will add a patch that takes this into account. |
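For illustration, one way such a patch could resolve a relative jitdump path found in /proc/<pid>/maps is to read the target's /proc/<pid>/cwd symlink and join the two. This is a sketch of the idea with a hypothetical helper name, not the actual ddprof change.

```c
/* Sketch: resolve a (possibly relative) jit-<pid>.dump path from the maps
 * file against the target process's working directory. An illustration of
 * the idea, not the actual ddprof patch. */
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/* Writes an absolute path into `out` (of size `out_len`); returns 0 on success. */
int resolve_jitdump_path(pid_t pid, const char *map_path, char *out, size_t out_len) {
    if (map_path[0] == '/') {                    /* already absolute */
        snprintf(out, out_len, "%s", map_path);
        return 0;
    }
    char link[64], cwd[PATH_MAX];
    snprintf(link, sizeof(link), "/proc/%d/cwd", (int)pid);
    ssize_t n = readlink(link, cwd, sizeof(cwd) - 1);
    if (n < 0)
        return -1;                               /* process gone or no permission */
    cwd[n] = '\0';
    snprintf(out, out_len, "%s/%s", cwd, map_path);
    return 0;
}
```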
As a workaround, can you try specifying the path to the jitdump folder explicitly?
Reference to the LLVM sources: |
If I read the code correctly, this changes the directory where Julia (through LLVM) would create the debug dir. In our case we fall into the home-dir case (https://github.com/llvm/llvm-project/blob/main/llvm/lib/ExecutionEngine/PerfJITEvents/PerfJITEventListener.cpp#L306), so we just add the . in front.
If there is a Docker image with the version of ddprof that includes the above patch, I can try to install that and test it in our infra.
From my tests, I will have a patch version you can try by end of day.
I am not sure this is the case, because I just spun up a new engine, and when I checked the maps file I do get the full path:
I also created a new ddprof binary with the patches in |
Indeed, I added a warning (from the latest patch) that occurs if we were able to find the .dump mapping. Still, you were able to run your own version 👏.
The latest 0.14.1 release includes a patch that allows the jitdump files to be found. We also noticed an issue with the unwinding in the Julia segv handler (used by the runtime within the garbage collector); we will discuss this in a separate issue. Thanks for the help investigating this issue.
Hello,
we are trying to use ddprof for our continuous profiling. Currently we deploy ddprof as a sidecar container that profiles the main process. The main process is written in Julia and uses the Julia JIT to execute the end code.
Checking the result in the web interface, we can see that some frames are missing. Running ddprof in debug mode, we notice a lot of error messages like:
Also, in the last part we get errors like this:
I also include the pid maps for the process.
47-map.txt
Is there a way to tell ddprof to continue unwinding the stack even in cases where the address is not in a stack segment? There are legitimate reasons for this to happen due to code generation (and the address points to executable maps in the file).
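For reference, here is a sketch of the kind of check being asked about: classifying an address against /proc/<pid>/maps to distinguish "outside the sampled stack" from "inside an executable (e.g. JIT) mapping". The `lookup_mapping` helper is illustrative only and makes no claim about how ddprof decides this internally.

```c
/* Illustrative only: find the /proc/<pid>/maps entry covering `addr`, so a
 * caller can tell whether it is [stack], a DSO, or an anonymous/JIT region.
 * `perms` must hold at least 8 bytes and `path` at least 512 bytes. */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

int lookup_mapping(pid_t pid, unsigned long addr, char *perms, char *path) {
    char maps_path[64], line[1024];
    snprintf(maps_path, sizeof(maps_path), "/proc/%d/maps", (int)pid);
    FILE *f = fopen(maps_path, "r");
    if (!f)
        return 0;
    int found = 0;
    while (fgets(line, sizeof(line), f)) {
        unsigned long lo, hi;
        char p[8] = "", file[512] = "";
        /* Lines look like: start-end perms offset dev inode [path] */
        if (sscanf(line, "%lx-%lx %7s %*s %*s %*s %511s", &lo, &hi, p, file) < 3)
            continue;
        if (addr >= lo && addr < hi) {
            strcpy(perms, p);
            strcpy(path, file);
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```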