Skip to content

Commit

Permalink
Merge tag 'drm-next-2025-01-17' of https://gitlab.freedesktop.org/drm…
Browse files Browse the repository at this point in the history
…/kernel

Pull drm updates from Dave Airlie:
 "There are two external interactions of note, the msm tree pull in some
  opp tree, hopefully the opp tree arrives from the same git tree
  however it normally does.

  There is also a new cgroup controller for device memory, that is used
  by drm, so is merging through my tree. This will hopefully help open
  up gpu cgroup usage a bit more and move us forward.

  There is a new accelerator driver for the AMD XDNA Ryzen AI NPUs.

  Then the usual xe/amdgpu/i915/msm leaders and lots of changes and
  refactors across the board:

  core:
   - device memory cgroup controller added
   - Remove driver date from drm_driver
   - Add drm_printer based hex dumper
   - drm memory stats docs update
   - scheduler documentation improvements

  new driver:
   - amdxdna - Ryzen AI NPU support

  connector:
   - add a mutex to protect ELD
   - make connector setup two-step

  panels:
   - Introduce backlight quirks infrastructure
   - New panels: KDB KD116N2130B12, Tianma TM070JDHG34-00,
   - Multi-Inno Technology MI1010Z1T-1CP11

  bridge:
   - ti-sn65dsi83: Add ti,lvds-vod-swing optional properties
   - Provide default implementation of atomic_check for HDMI bridges
   - it605: HDCP improvements, MCCS Support

  xe:
   - make OA buffer size configurable
   - GuC capture fixes
   - add ufence and g2h flushes
   - restore system memory GGTT mappings
   - ioctl fixes
   - SRIOV PF scheduling priority
   - allow fault injection
   - lots of improvements/refactors
   - Enable GuC's WA_DUAL_QUEUE for newer platforms
   - IRQ related fixes and improvements

  i915:
   - More accurate engine busyness metrics with GuC submission
   - Ensure partial BO segment offset never exceeds allowed max
   - Flush GuC CT receive tasklet during reset preparation
   - Some DG2 refactor to fix DG2 bugs when operating with certain CPUs
   - Fix DG1 power gate sequence
   - Enabling uncompressed 128b/132b UHBR SST
   - Handle hdmi connector init failures, and no HDMI/DP cases
   - More robust engine resets on Haswell and older

  i915/xe display:
   - HDCP fixes for Xe3Lpd
   - New GSC FW ARL-H/ARL-U
   - support 3 VDSC engines 12 slices
   - MBUS joining sanitisation
   - reconcile i915/xe display power mgmt
   - Xe3Lpd fixes
   - UHBR rates for Thunderbolt

  amdgpu:
   - DRM panic support
   - track BO memory stats at runtime
   - Fix max surface handling in DC
   - Cleaner shader support for gfx10.3 dGPUs
   - fix drm buddy trim handling
   - SDMA engine reset updates
   - Fix doorbell ttm cleanup
   - RAS updates
   - ISP updates
   - SDMA queue reset support
   - Rework DPM powergating interfaces
   - Documentation updates and cleanups
   - DCN 3.5 updates
   - Use a pm notifier to more gracefully handle VRAM eviction on
     suspend or hibernate
   - Add debugfs interfaces for forcing scheduling to specific engine
     instances
   - GG 9.5 updates
   - IH 4.4 updates
   - Make missing optional firmware less noisy
   - PSP 13.x updates
   - SMU 13.x updates
   - VCN 5.x updates
   - JPEG 5.x updates
   - GC 12.x updates
   - DC FAMS updates

  amdkfd:
   - GG 9.5 updates
   - Logging improvements
   - Shader debugger fixes
   - Trap handler cleanup
   - Cleanup includes
   - Eviction fence wq fix

  msm:
   - MDSS:
      - properly described UBWC registers
      - added SM6150 (aka QCS615) support
   - DPU:
      - added SM6150 (aka QCS615) support
      - enabled wide planes if virtual planes are enabled (by using two
        SSPPs for a single plane)
      - added CWB hardware blocks support
   - DSI:
      - added SM6150 (aka QCS615) support
   - GPU:
      - Print GMU core fw version
      - GMU bandwidth voting for a740 and a750
      - Expose uche trap base via uapi
      - UAPI error reporting

  rcar-du:
   - Add r8a779h0 Support

  ivpu:
   - Fix qemu crash when using passthrough

  nouveau:
   - expose GSP-RM logging buffers via debugfs

  panfrost:
   - Add MT8188 Mali-G57 MC3 support

  rockchip:
   - Gamma LUT support

  hisilicon:
   - new HIBMC support

  virtio-gpu:
   - convert to helpers
   - add prime support for scanout buffers

  v3d:
   - Add DRM_IOCTL_V3D_PERFMON_SET_GLOBAL

  vc4:
   - Add support for BCM2712

  vkms:
   - line-per-line compositing algorithm to improve performance

  zynqmp:
   - Add DP audio support

  mediatek:
   - dp: Add sdp path reset
   - dp: Support flexible length of DP calibration data

  etnaviv:
   - add fdinfo memory support
   - add explicit reset handling"

* tag 'drm-next-2025-01-17' of https://gitlab.freedesktop.org/drm/kernel: (1070 commits)
  drm/bridge: fix documentation for the hdmi_audio_prepare() callback
  doc/cgroup: Fix title underline length
  drm/doc: Include new drm-compute documentation
  cgroup/dmem: Fix parameters documentation
  cgroup/dmem: Select PAGE_COUNTER
  kernel/cgroup: Remove the unused variable climit
  drm/display: hdmi: Do not read EDID on disconnected connectors
  drm/tests: hdmi: Add connector disablement test
  drm/connector: hdmi: Do atomic check when necessary
  drm/amd/display: 3.2.316
  drm/amd/display: avoid reset DTBCLK at clock init
  drm/amd/display: improve dpia pre-train
  drm/amd/display: Apply DML21 Patches
  drm/amd/display: Use HW lock mgr for PSR1
  drm/amd/display: Revised for Replay Pseudo vblank control
  drm/amd/display: Add a new flag for replay low hz
  drm/amd/display: Remove unused read_ono_state function from Hwss module
  drm/amd/display: Do not elevate mem_type change to full update
  drm/amd/display: Do not wait for PSR disable on vbl enable
  drm/amd/display: Remove unnecessary eDP power down
  ...
  • Loading branch information
torvalds committed Jan 22, 2025
2 parents c0e7590 + 951a6bf commit 96c8470
Show file tree
Hide file tree
Showing 1,178 changed files with 50,401 additions and 14,781 deletions.
281 changes: 281 additions & 0 deletions Documentation/accel/amdxdna/amdnpu.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,281 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. include:: <isonum.txt>

=========
AMD NPU
=========

:Copyright: |copy| 2024 Advanced Micro Devices, Inc.
:Author: Sonal Santan <sonal.santan@amd.com>

Overview
========

AMD NPU (Neural Processing Unit) is a multi-user AI inference accelerator
integrated into AMD client APU. NPU enables efficient execution of Machine
Learning applications like CNN, LLM, etc. NPU is based on
`AMD XDNA Architecture`_. NPU is managed by **amdxdna** driver.


Hardware Description
====================

AMD NPU consists of the following hardware components:

AMD XDNA Array
--------------

AMD XDNA Array comprises of 2D array of compute and memory tiles built with
`AMD AI Engine Technology`_. Each column has 4 rows of compute tiles and 1
row of memory tile. Each compute tile contains a VLIW processor with its own
dedicated program and data memory. The memory tile acts as L2 memory. The 2D
array can be partitioned at a column boundary creating a spatially isolated
partition which can be bound to a workload context.

Each column also has dedicated DMA engines to move data between host DDR and
memory tile.

AMD Phoenix and AMD Hawk Point client NPU have a 4x5 topology, i.e., 4 rows of
compute tiles arranged into 5 columns. AMD Strix Point client APU have 4x8
topology, i.e., 4 rows of compute tiles arranged into 8 columns.

Shared L2 Memory
----------------

The single row of memory tiles create a pool of software managed on chip L2
memory. DMA engines are used to move data between host DDR and memory tiles.
AMD Phoenix and AMD Hawk Point NPUs have a total of 2560 KB of L2 memory.
AMD Strix Point NPU has a total of 4096 KB of L2 memory.

Microcontroller
---------------

A microcontroller runs NPU Firmware which is responsible for command processing,
XDNA Array partition setup, XDNA Array configuration, workload context
management and workload orchestration.

NPU Firmware uses a dedicated instance of an isolated non-privileged context
called ERT to service each workload context. ERT is also used to execute user
provided ``ctrlcode`` associated with the workload context.

NPU Firmware uses a single isolated privileged context called MERT to service
management commands from the amdxdna driver.

Mailboxes
---------

The microcontroller and amdxdna driver use a privileged channel for management
tasks like setting up of contexts, telemetry, query, error handling, setting up
user channel, etc. As mentioned before, privileged channel requests are
serviced by MERT. The privileged channel is bound to a single mailbox.

The microcontroller and amdxdna driver use a dedicated user channel per
workload context. The user channel is primarily used for submitting work to
the NPU. As mentioned before, a user channel requests are serviced by an
instance of ERT. Each user channel is bound to its own dedicated mailbox.

PCIe EP
-------

NPU is visible to the x86 host CPU as a PCIe device with multiple BARs and some
MSI-X interrupt vectors. NPU uses a dedicated high bandwidth SoC level fabric
for reading or writing into host memory. Each instance of ERT gets its own
dedicated MSI-X interrupt. MERT gets a single instance of MSI-X interrupt.

The number of PCIe BARs varies depending on the specific device. Based on their
functions, PCIe BARs can generally be categorized into the following types.

* PSP BAR: Expose the AMD PSP (Platform Security Processor) function
* SMU BAR: Expose the AMD SMU (System Management Unit) function
* SRAM BAR: Expose ring buffers for the mailbox
* Mailbox BAR: Expose the mailbox control registers (head, tail and ISR
registers etc.)
* Public Register BAR: Expose public registers

On specific devices, the above-mentioned BAR type might be combined into a
single physical PCIe BAR. Or a module might require two physical PCIe BARs to
be fully functional. For example,

* On AMD Phoenix device, PSP, SMU, Public Register BARs are on PCIe BAR index 0.
* On AMD Strix Point device, Mailbox and Public Register BARs are on PCIe BAR
index 0. The PSP has some registers in PCIe BAR index 0 (Public Register BAR)
and PCIe BAR index 4 (PSP BAR).

Process Isolation Hardware
--------------------------

As explained before, XDNA Array can be dynamically divided into isolated
spatial partitions, each of which may have one or more columns. The spatial
partition is setup by programming the column isolation registers by the
microcontroller. Each spatial partition is associated with a PASID which is
also programmed by the microcontroller. Hence multiple spatial partitions in
the NPU can make concurrent host access protected by PASID.

The NPU FW itself uses microcontroller MMU enforced isolated contexts for
servicing user and privileged channel requests.


Mixed Spatial and Temporal Scheduling
=====================================

AMD XDNA architecture supports mixed spatial and temporal (time sharing)
scheduling of 2D array. This means that spatial partitions may be setup and
torn down dynamically to accommodate various workloads. A *spatial* partition
may be *exclusively* bound to one workload context while another partition may
be *temporarily* bound to more than one workload contexts. The microcontroller
updates the PASID for a temporarily shared partition to match the context that
has been bound to the partition at any moment.

Resource Solver
---------------

The Resource Solver component of the amdxdna driver manages the allocation
of 2D array among various workloads. Every workload describes the number
of columns required to run the NPU binary in its metadata. The Resource Solver
component uses hints passed by the workload and its own heuristics to
decide 2D array (re)partition strategy and mapping of workloads for spatial and
temporal sharing of columns. The FW enforces the context-to-column(s) resource
binding decisions made by the Resource Solver.

AMD Phoenix and AMD Hawk Point client NPU can support 6 concurrent workload
contexts. AMD Strix Point can support 16 concurrent workload contexts.


Application Binaries
====================

A NPU application workload is comprised of two separate binaries which are
generated by the NPU compiler.

1. AMD XDNA Array overlay, which is used to configure a NPU spatial partition.
The overlay contains instructions for setting up the stream switch
configuration and ELF for the compute tiles. The overlay is loaded on the
spatial partition bound to the workload by the associated ERT instance.
Refer to the
`Versal Adaptive SoC AIE-ML Architecture Manual (AM020)`_ for more details.

2. ``ctrlcode``, used for orchestrating the overlay loaded on the spatial
partition. ``ctrlcode`` is executed by the ERT running in protected mode on
the microcontroller in the context of the workload. ``ctrlcode`` is made up
of a sequence of opcodes named ``XAie_TxnOpcode``. Refer to the
`AI Engine Run Time`_ for more details.


Special Host Buffers
====================

Per-context Instruction Buffer
------------------------------

Every workload context uses a host resident 64 MB buffer which is memory
mapped into the ERT instance created to service the workload. The ``ctrlcode``
used by the workload is copied into this special memory. This buffer is
protected by PASID like all other input/output buffers used by that workload.
Instruction buffer is also mapped into the user space of the workload.

Global Privileged Buffer
------------------------

In addition, the driver also allocates a single buffer for maintenance tasks
like recording errors from MERT. This global buffer uses the global IOMMU
domain and is only accessible by MERT.


High-level Use Flow
===================

Here are the steps to run a workload on AMD NPU:

1. Compile the workload into an overlay and a ``ctrlcode`` binary.
2. Userspace opens a context in the driver and provides the overlay.
3. The driver checks with the Resource Solver for provisioning a set of columns
for the workload.
4. The driver then asks MERT to create a context on the device with the desired
columns.
5. MERT then creates an instance of ERT. MERT also maps the Instruction Buffer
into ERT memory.
6. The userspace then copies the ``ctrlcode`` to the Instruction Buffer.
7. Userspace then creates a command buffer with pointers to input, output, and
instruction buffer; it then submits command buffer with the driver and goes
to sleep waiting for completion.
8. The driver sends the command over the Mailbox to ERT.
9. ERT *executes* the ``ctrlcode`` in the instruction buffer.
10. Execution of the ``ctrlcode`` kicks off DMAs to and from the host DDR while
AMD XDNA Array is running.
11. When ERT reaches end of ``ctrlcode``, it raises an MSI-X to send completion
signal to the driver which then wakes up the waiting workload.


Boot Flow
=========

amdxdna driver uses PSP to securely load signed NPU FW and kick off the boot
of the NPU microcontroller. amdxdna driver then waits for the alive signal in
a special location on BAR 0. The NPU is switched off during SoC suspend and
turned on after resume where the NPU FW is reloaded, and the handshake is
performed again.


Userspace components
====================

Compiler
--------

Peano is an LLVM based open-source compiler for AMD XDNA Array compute tile
available at:
https://github.com/Xilinx/llvm-aie

The open-source IREE compiler supports graph compilation of ML models for AMD
NPU and uses Peano underneath. It is available at:
https://github.com/nod-ai/iree-amd-aie

Usermode Driver (UMD)
---------------------

The open-source XRT runtime stack interfaces with amdxdna kernel driver. XRT
can be found at:
https://github.com/Xilinx/XRT

The open-source XRT shim for NPU is can be found at:
https://github.com/amd/xdna-driver


DMA Operation
=============

DMA operation instructions are encoded in the ``ctrlcode`` as
``XAIE_IO_BLOCKWRITE`` opcode. When ERT executes ``XAIE_IO_BLOCKWRITE``, DMA
operations between host DDR and L2 memory are effected.


Error Handling
==============

When MERT detects an error in AMD XDNA Array, it pauses execution for that
workload context and sends an asynchronous message to the driver over the
privileged channel. The driver then sends a buffer pointer to MERT to capture
the register states for the partition bound to faulting workload context. The
driver then decodes the error by reading the contents of the buffer pointer.


Telemetry
=========

MERT can report various kinds of telemetry information like the following:

* L1 interrupt counter
* DMA counter
* Deep Sleep counter
* etc.


References
==========

- `AMD XDNA Architecture <https://www.amd.com/en/technologies/xdna.html>`_
- `AMD AI Engine Technology <https://www.xilinx.com/products/technology/ai-engine.html>`_
- `Peano <https://github.com/Xilinx/llvm-aie>`_
- `Versal Adaptive SoC AIE-ML Architecture Manual (AM020) <https://docs.amd.com/r/en-US/am020-versal-aie-ml>`_
- `AI Engine Run Time <https://github.com/Xilinx/aie-rt/tree/release/main_aig>`_
11 changes: 11 additions & 0 deletions Documentation/accel/amdxdna/index.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
.. SPDX-License-Identifier: GPL-2.0-only
=====================================
accel/amdxdna NPU driver
=====================================

The accel/amdxdna driver supports the AMD NPU (Neural Processing Unit).

.. toctree::

amdnpu
1 change: 1 addition & 0 deletions Documentation/accel/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ Compute Accelerators
:maxdepth: 1

introduction
amdxdna/index
qaic/index

.. only:: subproject and html
Expand Down
58 changes: 51 additions & 7 deletions Documentation/admin-guide/cgroup-v2.rst
Original file line number Diff line number Diff line change
Expand Up @@ -64,13 +64,14 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou
5-6. Device
5-7. RDMA
5-7-1. RDMA Interface Files
5-8. HugeTLB
5.8-1. HugeTLB Interface Files
5-9. Misc
5.9-1 Miscellaneous cgroup Interface Files
5.9-2 Migration and Ownership
5-10. Others
5-10-1. perf_event
5-8. DMEM
5-9. HugeTLB
5.9-1. HugeTLB Interface Files
5-10. Misc
5.10-1 Miscellaneous cgroup Interface Files
5.10-2 Migration and Ownership
5-11. Others
5-11-1. perf_event
5-N. Non-normative information
5-N-1. CPU controller root cgroup process behaviour
5-N-2. IO controller root cgroup process behaviour
Expand Down Expand Up @@ -2626,6 +2627,49 @@ RDMA Interface Files
mlx4_0 hca_handle=1 hca_object=20
ocrdma1 hca_handle=1 hca_object=23

DMEM
----

The "dmem" controller regulates the distribution and accounting of
device memory regions. Because each memory region may have its own page size,
which does not have to be equal to the system page size, the units are always bytes.

DMEM Interface Files
~~~~~~~~~~~~~~~~~~~~

dmem.max, dmem.min, dmem.low
A readwrite nested-keyed file that exists for all the cgroups
except root that describes current configured resource limit
for a region.

An example for xe follows::

drm/0000:03:00.0/vram0 1073741824
drm/0000:03:00.0/stolen max

The semantics are the same as for the memory cgroup controller, and are
calculated in the same way.

dmem.capacity
A read-only file that describes maximum region capacity.
It only exists on the root cgroup. Not all memory can be
allocated by cgroups, as the kernel reserves some for
internal use.

An example for xe follows::

drm/0000:03:00.0/vram0 8514437120
drm/0000:03:00.0/stolen 67108864

dmem.current
A read-only file that describes current resource usage.
It exists for all the cgroup except root.

An example for xe follows::

drm/0000:03:00.0/vram0 12550144
drm/0000:03:00.0/stolen 8650752

HugeTLB
-------

Expand Down
9 changes: 9 additions & 0 deletions Documentation/core-api/cgroup.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
==================
Cgroup Kernel APIs
==================

Device Memory Cgroup API (dmemcg)
=================================
.. kernel-doc:: kernel/cgroup/dmem.c
:export:

1 change: 1 addition & 0 deletions Documentation/core-api/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -109,6 +109,7 @@ more memory-management documentation in Documentation/mm/index.rst.
dma-isa-lpc
swiotlb
mm-api
cgroup
genalloc
pin_user_pages
boot-time-mm
Expand Down
Loading

0 comments on commit 96c8470

Please sign in to comment.