Different statistics from mem_trace and opcode_hist #68

Open
mahmoodn opened this issue Sep 30, 2021 · 2 comments
mahmoodn commented Sep 30, 2021

Hi
I have noticed that some statistics are counted differently by different tools. For example, in one of the CUDA SDK samples, 6_Advanced/cdpQuadtree, I tried opcode_hist and mem_trace and saw that the numbers of STG and LDG instructions differ. Here is the opcode_hist output for the STG/LDG-related instructions:

GPU device GeForce RTX 3080 has compute capabilities (SM 8.6)
kernel 0 - void thrust::cuda_cub::core::_kernel_agent...
  STG.E = 32
kernel 1 - void thrust::cuda_cub::core::_kernel_agent...
  STG.E = 32
kernel 2 - void thrust::cuda_cub::core::_kernel_agent...
  STG.E = 32
kernel 3 - void thrust::cuda_cub::core::_kernel_agent...
  STG.E = 32
kernel 4 - void thrust::cuda_cub::core::_kernel_agent...
  STG.E = 64
kernel 5 - void build_quadtree_kernel<128>...
  LDG.E = 4950
  LDG.E.64 = 18594
  STG.E = 3816
  STG.E.64 = 2268
Results: OK

Next, I ran mem_trace, redirected the output to a file, and counted the instruction instances:

$ grep STG.E mem_trace.txt | wc -l
868
$ grep STG.E.64 mem_trace.txt | wc -l
252
$ grep LDG.E.64 mem_trace.txt | wc -l
2066

I also see that the memory traces for the kernel_agent kernels are not printed; the output lacks their STG.E lines.
Can you confirm that? Any thoughts?

UPDATE:

I see that for LDG.E.64, 18594/2066 is 9, and for STG.E.64, 2268/252 is also 9. I don't know what the interpretation of that 9 is...

ovilla (Collaborator) commented Sep 30, 2021

It is possible there is a bug somewhere (and we will check for it), but the two tools are very different.

  • opcode_hist counts at the basic-block level and ignores whether each instruction within the basic block is predicated on or off.

  • mem_trace instead prints memory instructions (at instruction level) and does not print any line if the instruction is entirely predicated off.

You could say mem_trace is more accurate (and slower), while opcode_hist provides an upper bound on what is executed (but it is very fast, as it only instruments once per basic block). But again, the tools are doing very different things.
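
To make the difference concrete, here is a simplified sketch of the two instrumentation styles (illustrative pseudocode in the spirit of the tools, not their exact code; the names count_block and trace_mem are made up):

    /* opcode_hist style: one injected call per basic block, adding the
     * static number of occurrences of an opcode in that block, whether
     * or not individual instructions end up predicated off. */
    __device__ unsigned long long hist_bucket;

    extern "C" __device__ __noinline__ void count_block(int occurrences) {
        atomicAdd(&hist_bucket, (unsigned long long)occurrences);
    }

    /* mem_trace style: one injected call per memory instruction; `pred`
     * is this thread's guard predicate, so a predicated-off thread
     * produces no trace record at all. */
    extern "C" __device__ __noinline__ void trace_mem(int pred, unsigned long long addr) {
        if (!pred) {
            return;
        }
        /* ... send (opcode id, addr, ...) to the host through a channel ... */
    }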

For instance, predicated-off instructions are logically executed (with a zeroed mask) and thus could be of interest for something like opcode_hist; this is not true if you only care about seeing memory references.

In general, the tools are examples of how to use nvbit rather than tools meant to provide specific profiling functionality; for that, NVIDIA's supported profiling tools are already excellent.

Nvbit is more about giving you the freedom to instrument exactly what you need, and you should modify and adapt those tools to your needs.
For instance, do you care about predicated-off instructions? Do you care about a thread-level count or a warp-level count? And so on.
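
A counting function along the lines of the instr_count example can do either of the last two, depending on a flag. A simplified sketch (it assumes 1D thread blocks for the lane computation):

    __device__ unsigned long long counter;

    extern "C" __device__ __noinline__ void count_instr(int pred, int count_warp_level) {
        unsigned int active = __activemask();
        /* threads that are active AND predicated on for this instruction */
        unsigned int pred_on = __ballot_sync(active, pred);
        int lane = threadIdx.x & 31;          /* assumes 1D blocks */
        if (lane == __ffs(active) - 1) {      /* lowest active lane updates */
            if (count_warp_level) {
                /* warp level: +1 if any thread in the warp executes it */
                if (pred_on != 0)
                    atomicAdd(&counter, 1ULL);
            } else {
                /* thread level: add the number of executing threads */
                atomicAdd(&counter, (unsigned long long)__popc(pred_on));
            }
        }
    }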


mahmoodn commented Oct 1, 2021

mem_trace instead prints memory instructions (at instruction level) and does not print any line if the instruction is entirely predicated off.

You mean this part in mem_trace?

    /* if thread is predicated off, return */
    if (!pred) {
        return;
    }

I commented out that part and reran mem_trace, hoping to see the memory traces of the kernels that were missing in the original post. I still don't see them, so I don't know what the effect of that condition is.
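
If I read the tool correctly, that check is only the device-side half; on the host side, the tool passes the guard predicate into the injected call when it instruments each memory instruction, roughly like this (my simplified sketch, not the exact mem_trace code):

    /* for each instruction of the kernel being instrumented */
    for (auto instr : nvbit_get_instrs(ctx, func)) {
        if (!instr->isLoad() && !instr->isStore())
            continue;
        nvbit_insert_call(instr, "instrument_mem", IPOINT_BEFORE);
        nvbit_add_call_arg_guard_pred_val(instr); /* becomes `pred` above */
        nvbit_add_call_arg_mref_addr64(instr);    /* the memory address */
    }

So commenting out the device-side check should only add lines for predicated-off threads; it does not change which instructions or kernels get instrumented. That would mean the missing kernel_agent traces have some other cause.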
