Errors with CLI analysis #103

Closed
gsitaram opened this issue Mar 31, 2023 · 7 comments

@gsitaram

Can this error be worked around?

$ omniperf analyze -p workloads/current/mi200 

--------
Analyze
--------

/opt/omniperf/bin/omniperf_analyze/utils/parser.py:164: RuntimeWarning: invalid value encountered in scalar remainder
  return a % b
Traceback (most recent call last):
  File "/opt/omniperf/bin/omniperf", line 828, in <module>
    main()
  File "/opt/omniperf/bin/omniperf", line 808, in main
    analyze(args)
  File "/opt/omniperf/bin/omniperf_analyze/omniperf_analyze.py", line 284, in analyze
    run_cli(args, runs)
  File "/opt/omniperf/bin/omniperf_analyze/omniperf_analyze.py", line 198, in run_cli
    parser.load_table_data(
  File "/opt/omniperf/bin/omniperf_analyze/utils/parser.py", line 704, in load_table_data
    eval_metric(
  File "/opt/omniperf/bin/omniperf_analyze/utils/parser.py", line 487, in eval_metric
    ammolite__build_in[key] = eval(compile(s, "<string>", "eval"))
  File "<string>", line 2, in <module>
  File "/opt/omniperf/bin/omniperf_analyze/utils/parser.py", line 143, in to_int
    return int(a)
ValueError: cannot convert float NaN to integer

Omniperf version I am using:

$ omniperf --version
----------------------------------------
Omniperf version: 1.0.8-PR1 (release)
Git revision:     ac10ad2
----------------------------------------
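
The last frame of the traceback above can be reproduced in isolation: Python's built-in `int()` refuses NaN. The sketch below shows that behaviour, plus a hypothetical guarded variant. `to_int_unguarded` mirrors the one-line body shown in the traceback, while `safe_to_int` is an illustrative name and an assumption, not Omniperf's actual code.

```python
import math


def to_int_unguarded(a):
    # Same one-line body as the to_int frame in the traceback.
    return int(a)


def safe_to_int(a):
    # Hypothetical guard (an assumption, not Omniperf's code):
    # return None for NaN or inf instead of raising.
    if isinstance(a, float) and not math.isfinite(a):
        return None
    return int(a)


message = None
try:
    to_int_unguarded(float("nan"))
except ValueError as err:
    message = str(err)

print(message)
print(safe_to_int(float("nan")))
print(safe_to_int(3.7))
```

Returning `None` is only one possible choice; a caller could equally substitute a sentinel or skip the metric.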
@gsitaram
Author

gsitaram commented Apr 4, 2023

We saw this error with another workload today. Any insight would be appreciated.

@coleramos425
Collaborator

coleramos425 commented Apr 4, 2023

Hi Gina. Thank you for reporting this issue. Further investigation of your workload (specifically workloads/current/mi200/pmc_perf.csv) has uncovered multiple dispatches where GRBM_GUI_ACTIVE is 0. I'll have to confer with a hardware expert, but I believe this should always be non-zero.

This issue arises when the metric's Python expression is evaluated: with GRBM_GUI_ACTIVE at 0, the division yields a non-finite value (NaN or inf) instead of a number.
https://github.com/AMDResearch/omniperf/blob/9770396fa8d75e2d72ead30890cb9d232ff6ea4a/src/omniperf_analyze/utils/parser.py#L481-L492
This snowballs into a larger issue: metrics that divide by GRBM_GUI_ACTIVE begin to report inf, because floating-point division by zero inside the eval()'d expression yields inf rather than raising an error.

$ ./src/omniperf analyze -p workloads/current/mi200/ -b 2.1.8 -g

--------
Analyze
--------

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
raw pmc df info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17845 entries, 0 to 17844
Columns: 1128 entries, ('SQ_IFETCH_LEVEL', 'Index') to ('pmc_perf', 'CompleteNs')
dtypes: float64(104), int64(1005), object(18), uint64(1)
memory usage: 153.6+ MB
None
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
filtered pmc df info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17845 entries, 0 to 17844
Columns: 1128 entries, ('SQ_IFETCH_LEVEL', 'Index') to ('pmc_perf', 'CompleteNs')
dtypes: float64(104), int64(1005), object(18), uint64(1)
memory usage: 153.6+ MB
None
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Expression:
Value = 
to_avg(((100 * raw_pmc_df.get('pmc_perf').get("SQ_ACTIVE_INST_SCA")) / (raw_pmc_df.get('pmc_perf').get("GRBM_GUI_ACTIVE") * ammolite__numCU)))

Inputs:
Var  ammolite__numCU : 104

Output:
inf
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Expression:
Peak = 
100

Inputs:

Output:
100
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Expression:
PoP = 
to_avg(((100 * raw_pmc_df.get('pmc_perf').get("SQ_ACTIVE_INST_SCA")) / (raw_pmc_df.get('pmc_perf').get("GRBM_GUI_ACTIVE") * ammolite__numCU)))

Inputs:
Var  ammolite__numCU : 104

Output:
inf
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--------------------------------------------------------------------------------
2. System Speed-of-Light
╒═════════╤═══════════╤═════════╤════════╤════════╤═══════╕
│ Index   │ Metric    │   Value │ Unit   │   Peak │   PoP │
╞═════════╪═══════════╪═════════╪════════╪════════╪═══════╡
│ 2.1.8   │ SALU Util │     inf │ Pct    │    100 │   inf │
╘═════════╧═══════════╧═════════╧════════╧════════╧═══════╛
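
One way to read the inf in the table above: a single dispatch with GRBM_GUI_ACTIVE == 0 is enough to drive the whole average to inf. The sketch below rests on several assumptions: `to_avg` is taken to be an arithmetic mean, the counter values are made up, and `ieee_div` is a hypothetical helper that mimics NumPy/pandas float division by hand, since plain Python floats raise `ZeroDivisionError` instead.

```python
import math
from statistics import fmean

NUM_CU = 104  # value of ammolite__numCU shown in the debug output


def ieee_div(num, den):
    # Mimic NumPy/pandas float division: x/0 -> +/-inf, 0/0 -> nan.
    if den == 0:
        if num == 0:
            return math.nan
        return math.copysign(math.inf, num)
    return num / den


def salu_util(sq_active_inst_sca, grbm_gui_active):
    # Per-dispatch form of the "Value" expression printed above.
    return [
        ieee_div(100 * s, g * NUM_CU)
        for s, g in zip(sq_active_inst_sca, grbm_gui_active)
    ]


# Made-up counter values; the third dispatch has GRBM_GUI_ACTIVE == 0.
sca = [5200, 10400, 300]
gui = [1000, 2000, 0]

per_dispatch = salu_util(sca, gui)
print(per_dispatch)
print(fmean(per_dispatch))  # assumed stand-in for to_avg
```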

Before implementing a full-fledged patch I'd like to understand why rocprof is reporting these numbers. At the very least we will update the code to throw a warning if an illogical GRBM_GUI_ACTIVE value is detected.
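
A minimal sketch of what such a warning check could look like, using only the standard library. This is not the patch that landed in Omniperf: the function names are hypothetical, the sample rows are made up, and the assumption is that pmc_perf.csv has an `Index` column alongside `GRBM_GUI_ACTIVE`.

```python
import csv
import io
import warnings


def zero_gui_active_dispatches(csv_text, column="GRBM_GUI_ACTIVE"):
    # Return the "Index" of every dispatch whose counter is exactly 0.
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["Index"] for row in reader if float(row[column]) == 0]


def warn_if_illogical(csv_text):
    # Emit one warning that lists all offending dispatches.
    bad = zero_gui_active_dispatches(csv_text)
    if bad:
        warnings.warn(
            f"{len(bad)} dispatch(es) report GRBM_GUI_ACTIVE == 0: "
            + ", ".join(bad),
            RuntimeWarning,
        )
    return bad


# Made-up sample in the assumed shape of pmc_perf.csv.
SAMPLE = """Index,KernelName,GRBM_GUI_ACTIVE
0,kernel_a,1000
1,kernel_b,0
2,kernel_a,2000
3,kernel_b,0
"""

bad = warn_if_illogical(SAMPLE)
print(bad)
```

Running the check before any metric is evaluated would let the tool report the bad dispatches up front instead of failing deep inside the expression evaluator.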

@coleramos425 coleramos425 self-assigned this Apr 4, 2023
@coleramos425 coleramos425 added this to the v1.0.8 milestone Apr 4, 2023
@PaulMullowney

I am seeing this in a situation where I am analyzing the top N kernels, 1 by 1. A small subset of the kernels are showing this error. Why wouldn't I see the error for all kernels?

@coleramos425
Collaborator

@PaulMullowney we know this error is triggered when arithmetic encounters a dispatch where (GRBM_GUI_ACTIVE == 0). My guess is that some kernels in your workload never hit this condition, so when you filter down to those kernels the error goes away.
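
To check that guess on a given workload, one could count the zero-valued dispatches per kernel. The sketch below is illustrative only: the function name is hypothetical, the sample rows are made up, and it assumes the CSV carries a `KernelName` column next to `GRBM_GUI_ACTIVE`.

```python
import csv
import io
from collections import Counter


def kernels_with_zero_gui_active(csv_text):
    # Count, per kernel, how many dispatches report GRBM_GUI_ACTIVE == 0.
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if float(row["GRBM_GUI_ACTIVE"]) == 0:
            counts[row["KernelName"]] += 1
    return dict(counts)


# Made-up sample: only kernel_b has dispatches with a zero counter.
SAMPLE = """Index,KernelName,GRBM_GUI_ACTIVE
0,kernel_a,1000
1,kernel_b,0
2,kernel_a,2000
3,kernel_b,0
4,kernel_c,1500
"""

affected = kernels_with_zero_gui_active(SAMPLE)
print(affected)
```

Kernels missing from the result have no zero-valued dispatches, which would explain why analyzing them one by one does not trigger the error.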

As discussed in Teams chat, we have a few tests planned to help clarify why (GRBM_GUI_ACTIVE == 0) is being reported. One of them is to separate the pmc_perf.txt input file line by line to rule out any merge issues...

I'll follow up in the next few days after running these tests

@coleramos425
Collaborator

> One of which will be separating pmc_perf.txt input file line by line to rule out any merge issues...
> I'll follow up in the next few days after running these tests

Update:
The workaround mentioned above is now implemented in dev. I'm reaching out to Paul to see whether this solves his issue.

coleramos425 added a commit that referenced this issue May 25, 2023
Signed-off-by: coleramos425 <colramos@amd.com>
@coleramos425
Collaborator

We've updated the Omniperf code so that any time this GRBM_GUI_ACTIVE issue occurs, we throw a helpful warning and fail gracefully.

Our custom merge utility didn't fix the original issue, so we've passed it to the rocprof team. Awaiting response...
https://ontrack-internal.amd.com/browse/SWDEV-402481

Pushing issue to a future milestone

@coleramos425 coleramos425 modified the milestones: v1.0.8, v.1.1.0 May 30, 2023
@coleramos425 coleramos425 modified the milestones: v.1.0.9, v1.1.0 Aug 2, 2023
feizheng10 pushed a commit to feizheng10/omniperf that referenced this issue Dec 6, 2023
Signed-off-by: coleramos425 <colramos@amd.com>
Signed-off-by: fei.zheng <fei.zheng@amd.com>
feizheng10 pushed a commit to feizheng10/omniperf that referenced this issue Dec 20, 2023
Signed-off-by: coleramos425 <colramos@amd.com>
Signed-off-by: fei.zheng <fei.zheng@amd.com>
coleramos425 added a commit that referenced this issue Mar 5, 2024
Signed-off-by: colramos-amd <colramos@amd.com>
@coleramos425
Collaborator

coleramos425 commented Mar 5, 2024

While the underlying issue seems to still be present in rocprofiler:
https://ontrack-internal.amd.com/browse/SWDEV-402481

Omniperf will catch the bug and throw a warning via the above commits. Closing issue.


3 participants