Add MI300 details to docs #446

Draft · wants to merge 3 commits into base: amd-staging
Conversation

@peterjunpark (Contributor) commented Oct 9, 2024

demo build: https://advanced-micro-devices-demo--446.com.readthedocs.build/projects/omniperf/en/446/

Performance model

Pipeline descriptions

VALU

  • Add MI300 to list of products with MFMA units here
  • Update note at bottom of section to include MI300 in list of accelerators with 8 waveslots / SIMD here

AGPRs

  • Add MI300 to MI200 list here

Pipeline metrics

  • Need to add new MFMA instruction metrics for MI300 here
    • And FLOPs for the same here
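When drafting the FLOPs metrics, it may help to have the underlying arithmetic written down. A minimal sketch, assuming an MxNxK MFMA tile (the tile sizes below are illustrative assumptions, not the final MI300 metric definitions):

```python
# Hypothetical sketch: one MFMA computing D = A*B + C over an
# MxNxK tile performs M*N*K multiply-adds, i.e. 2*M*N*K FLOPs.
def mfma_flops(m: int, n: int, k: int, instruction_count: int) -> int:
    """Total FLOPs for a number of MFMA instructions on an MxNxK tile."""
    return 2 * m * n * k * instruction_count

# An assumed 32x32x8 tile issued 1000 times:
print(mfma_flops(32, 32, 8, 1000))  # 16384000
```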

L1

  • Update the L1 cache-line size (here) to 128B for MI300+: here

UTCL1

  • MI300 fixes the bug where hit-on-miss isn't counted: update here

TA instruction counts

  • On MI300, we now theoretically use the scratch* instructions for stack/spill access, which invalidates much of this section. We need to figure out how to rework it.

Scalar / Instruction cache

  • Need to update size and how many CUs it's shared between here
          - 64KB / shared between CUs on MI300

L2

  • L2 is no longer coherence point for MI300+
    • L2<->EA request flow diagram needs to be updated for MI300
            - Essentially, we need to add a 128B read request line and figure out how to represent this on the diagram
  • Update channel count in text for MI300 here
          - 16 channels per XCC, still 256B interleaved
  • Update Streaming requests text to also include MI300
  • Update probe requests text for MI300
          - Likely more involved, need to write some tests to see what triggers these here
  • Update note at bottom of section to include MI300 here
          - [ ] 128B cache-line there as well
  • L2-Fabric Write and Atomic Bandwidth
          - All atomics are now counted as such on MI300, because they are not cached in L2 and must go to MALL
          - Same with:
                - HBM Write and Atomic Traffic
                - Remote Write and Atomic Traffic
                - Atomic Traffic
                - Uncached Write and Atomic Traffic
  • Detailed transaction metrics: here
          - Need to add 128B read request metric to table
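The channel-count note above (16 channels per XCC, 256B interleaved) could be sketched as a simple address-to-channel mapping. This ignores any real hardware hashing, so treat it as illustrative only:

```python
# Hypothetical sketch of 256B interleaving across 16 L2 channels
# per XCC (figures from the notes above; the true hardware mapping
# may use a more complex hash than plain round-robin).
CHANNELS_PER_XCC = 16
INTERLEAVE_BYTES = 256

def l2_channel(address: int) -> int:
    """Channel index for an address under round-robin interleaving."""
    return (address // INTERLEAVE_BYTES) % CHANNELS_PER_XCC

# Consecutive 256B blocks rotate through the 16 channels:
print([l2_channel(a) for a in (0, 256, 512, 4096)])  # [0, 1, 2, 0]
```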

Memory type

  • Need to update table for MI300, may need a better way to represent this as fine-grained/coarse-grained isn't super relevant there anymore.

New concepts

  • Need to discuss XCC / NPS / partitioning modes somewhere. There's no super logical place to do so, but we might do this in the definitions or as a separate part of the performance model.
  • The key points for Omniperf are that:
          - [ ] Number of CUs depends on # of XCCs active in the current partitioning mode
          - [ ] Number of HBM channels per partition (and thus: the achievable L2<->EA bandwidth) depends on the NPS mode
  • Need to discuss MALL as coherence point somewhere
  • Neither of the above need to be in significant detail, IMO
  • Neither of these have specific metrics tied to them, but are important to understand how we're presenting data
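The two key points above could eventually be illustrated with a small numeric sketch like the following (the helper names and the example counts are hypothetical, not values confirmed by this PR):

```python
# Hypothetical sketch of the partitioning arithmetic described above.
def active_cus(cus_per_xcc: int, active_xccs: int) -> int:
    """CUs visible to a workload = per-XCC CU count x active XCCs."""
    return cus_per_xcc * active_xccs

def hbm_channels_per_partition(total_channels: int, nps_mode: int) -> int:
    """Under NPS-<n>, HBM channels split evenly across n partitions."""
    return total_channels // nps_mode

# Illustrative numbers only (not asserted as MI300 specs):
print(active_cus(38, 8))                   # 304
print(hbm_channels_per_partition(128, 4))  # 32
```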

References

  • Should add MI300 / CDNA3 ISA Guide

@peterjunpark added the documentation label on Oct 9, 2024
start adding MI300 content

Signed-off-by: Peter Park <peter.park@amd.com>