diff --git a/.github/CHANGELOG.md b/.github/CHANGELOG.md index e3e3790be1..513a198a24 100644 --- a/.github/CHANGELOG.md +++ b/.github/CHANGELOG.md @@ -2,6 +2,9 @@ ### New features since last release +* Add documentation updates for the `lightning_gpu` backend. + [(#525)] (https://github.com/PennyLaneAI/pennylane-lightning/pull/525) + * Add `SparseHamiltonian` support for Lightning-Qubit and Lightning-GPU. [(#526)] (https://github.com/PennyLaneAI/pennylane-lightning/pull/526) @@ -28,10 +31,10 @@ ### Breaking changes -* Add `tests_gpu.yml` workflow to test the Lightning-Kokkos backend with CUDA-12. +* Add `tests_gpu.yml` workflow to test the Lightning-Kokkos backend with CUDA-12. [(#494)](https://github.com/PennyLaneAI/pennylane-lightning/pull/494) -* Implement `LM::GeneratorDoubleExcitation`, `LM::GeneratorDoubleExcitationMinus`, `LM::GeneratorDoubleExcitationPlus` kernels. L-Qubit default kernels are now strictly from the `LM` implementation, which requires less memory and is faster for large state vectors. +* Implement `LM::GeneratorDoubleExcitation`, `LM::GeneratorDoubleExcitationMinus`, `LM::GeneratorDoubleExcitationPlus` kernels. L-Qubit default kernels are now strictly from the `LM` implementation, which requires less memory and is faster for large state vectors. [(#512)](https://github.com/PennyLaneAI/pennylane-lightning/pull/512) * Add workflows validating compatibility between PennyLane and Lightning's most recent stable releases and development (latest) versions. diff --git a/.github/workflows/tests_linux_x86_mpi_gpu.yml b/.github/workflows/tests_linux_x86_mpi_gpu.yml index e18a8f13e0..9d3edfe913 100644 --- a/.github/workflows/tests_linux_x86_mpi_gpu.yml +++ b/.github/workflows/tests_linux_x86_mpi_gpu.yml @@ -14,7 +14,7 @@ on: push: branches: - main - pull_request: + #pull_request: env: COVERAGE_FLAGS: "--cov=pennylane_lightning --cov-report=term-missing --cov-report=xml:./coverage.xml --no-flaky-report -p no:warnings --tb=native" diff --git a/.readthedocs.yml b/.readthedocs.yml index 00a2f7e4b8..e4d85ee56b 100644 --- a/.readthedocs.yml +++ b/.readthedocs.yml @@ -21,8 +21,11 @@ build: - libopenblas-base - libopenblas-dev - graphviz + - nvidia-cuda-toolkit jobs: pre_install: - echo "setuptools~=66.0\npip~=22.0" >> ci_build_requirements.txt post_install: - - PL_BACKEND="lightning_kokkos" pip install -e . -vv + - rm -rf ./build && PL_BACKEND="lightning_kokkos" python setup.py bdist_wheel + - rm -rf ./build && PL_BACKEND="lightning_gpu" python setup.py build_ext --define="PL_DISABLE_CUDA_SAFETY=1" && PL_BACKEND="lightning_gpu" python setup.py bdist_wheel + - python -m pip install ./dist/*.whl diff --git a/README.rst b/README.rst index 3e128524fd..627fe32383 100644 --- a/README.rst +++ b/README.rst @@ -41,60 +41,103 @@ The Lightning plugin ecosystem provides fast state-vector simulators written in learning, automatic differentiation, and optimization of hybrid quantum-classical computations. PennyLane supports Python 3.9 and above. -.. header-end-inclusion-marker-do-not-remove +Features +******** +PennyLane-Lightning high performance simulators include the following backends: -Features -======== +* ``lightning.qubit``: is a fast state-vector simulator written in C++. +* ``lightning.gpu``: is a state-vector simulator based on the `NVIDIA cuQuantum SDK `_. It notably implements a distributed state-vector simulator based on MPI. +* ``lightning.kokkos``: is a state-vector simulator written with `Kokkos `_. 
It can exploit the inherent parallelism of modern processing units supporting the `OpenMP `_, `CUDA `_ or `HIP `_ programming models. + +.. header-end-inclusion-marker-do-not-remove + +The following table summarizes the supported platforms and the primary installation mode: + ++-----------+---------+--------+-------------+----------------+-----------------+----------------+ +| | L-Qubit | L-GPU | L-GPU (MPI) | L-Kokkos (OMP) | L-Kokkos (CUDA) | L-Kokkos (HIP) | ++===========+=========+========+=============+================+=================+================+ +| Linux x86 | pip | pip | source | pip | source | source | ++-----------+---------+--------+-------------+----------------+-----------------+----------------+ +| Linux ARM | pip | source | | pip | source | source | ++-----------+---------+--------+-------------+----------------+-----------------+----------------+ +| Linux PPC | pip | source | | pip | source | source | ++-----------+---------+--------+-------------+----------------+-----------------+----------------+ +| MacOS x86 | pip | | | pip | | | ++-----------+---------+--------+-------------+----------------+-----------------+----------------+ +| MacOS ARM | pip | | | pip | | | ++-----------+---------+--------+-------------+----------------+-----------------+----------------+ +| Windows | pip | | | | | | ++-----------+---------+--------+-------------+----------------+-----------------+----------------+ -* Combine Lightning's high performance simulators with PennyLane's - automatic differentiation and optimization. .. installation_LQubit-start-inclusion-marker-do-not-remove +Lightning-Qubit installation +**************************** -Lightning Qubit installation -============================ +PyPI wheels (pip) +================= -Lightning Qubit can be installed using ``pip``: +Lightning plugins can be installed using ``pip`` as follows .. code-block:: console $ pip install pennylane-lightning -To build Lightning from source you can run +The above command will install the Lightning-Qubit plugin (the default since it is most broadly supported). +In order to install the Lightning-GPU and Lightning-Kokkos (OpenMP) backends, you can respectively use the following commands: .. code-block:: console - $ pip install pybind11 pennylane-lightning --no-binary :all: + $ pip install pennylane-lightning[gpu] + $ pip install pennylane-lightning[kokkos] + + +Install from source +=================== + +To build Lightning plugins from source you can run + +.. code-block:: console + + $ PL_BACKEND=${PL_BACKEND} pip install pybind11 pennylane-lightning --no-binary :all: + +where ``${PL_BACKEND}`` can be ``lightning_qubit`` (default), ``lightning_gpu`` or ``lightning_kokkos``. +The `pybind11 `_ library is required to bind the C++ functionality to Python. A C++ compiler such as ``g++``, ``clang++``, or ``MSVC`` is required. On Debian-based systems, this can be installed via ``apt``: .. code-block:: console - $ sudo apt install g++ + $ sudo apt -y update && + $ sudo apt install g++ libomp-dev +where ``libomp-dev`` is included to also install OpenMP. On MacOS, we recommend using the latest version of ``clang++`` and ``libomp``: .. code-block:: console $ brew install llvm libomp -The `pybind11 `_ library is also used for binding the -C++ functionality to Python. +The Lightning-GPU backend has several dependencies (e.g. ``CUDA``, ``custatevec-cu11``, etc.), and hence we recommend referring to `Lightning-GPU `_ section below. 
+Similarly, for Lightning-Kokkos it is recommended to configure and install Kokkos independently as prescribed in the `Lightning-Kokkos `_ section below. -Alternatively, for development and testing, you can install by cloning the repository: +Development installation +======================== + +For development and testing, you can install by cloning the repository: .. code-block:: console $ git clone https://github.com/PennyLaneAI/pennylane-lightning.git $ cd pennylane-lightning $ pip install -r requirements.txt - $ pip install -e . + $ PL_BACKEND=${PL_BACKEND} pip install -e . -vv Note that subsequent calls to ``pip install -e .`` will use cached binaries stored in the -``build`` folder. Run ``make clean`` if you would like to recompile. +``build`` folder. Run ``make clean`` if you would like to recompile from scratch. You can also pass ``cmake`` options with ``CMAKE_ARGS`` as follows: @@ -109,26 +152,35 @@ or with ``build_ext`` and the ``--define`` flag as follows: $ python3 setup.py build_ext -i --define="ENABLE_OPENMP=OFF;ENABLE_BLAS=OFF" $ python3 setup.py develop +where ``-D`` must not be included before ``;``-separated options. -Testing -------- +Compile MSVC (Windows) +====================== -To test that the plugin is working correctly you can test the Python code within the cloned -repository: +Lightning-Qubit can be compiled on Windows using the +`Microsoft Visual C++ `_ compiler. +You need `cmake `_ and appropriate Python environment +(e.g. using `Anaconda `_). + +We recommend using ``[x64 (or x86)] Native Tools Command Prompt for VS [version]`` to compile the library. +Be sure that ``cmake`` and ``python`` can be called within the prompt. .. code-block:: console - $ make test-python + $ cmake --version + $ python --version -while the C++ code can be tested with +Then a common command will work. .. code-block:: console - $ make test-cpp + $ pip install -r requirements.txt + $ pip install -e . +Note that OpenMP and BLAS are disabled on this platform. -CMake Support -------------- +CMake support +============= One can also build the plugin using CMake: @@ -137,184 +189,213 @@ One can also build the plugin using CMake: $ cmake -S. -B build $ cmake --build build -To test the C++ code: +Supported options are -.. code-block:: console +- ``-DENABLE_WARNINGS:BOOL=ON`` +- ``-DENABLE_NATIVE:BOOL=ON`` (for ``-march=native``) +- ``-DENABLE_BLAS:BOOL=ON`` +- ``-DENABLE_OPENMP:BOOL=ON`` +- ``-DENABLE_CLANG_TIDY:BOOL=ON`` - $ mkdir build && cd build - $ cmake -DBUILD_TESTS=ON -DCMAKE_BUILD_TYPE=Debug .. - $ make +Testing +======= -Other supported options are +To test that a plugin is working correctly, test the Python code with: -- ``-DENABLE_WARNINGS=ON`` -- ``-DENABLE_NATIVE=ON`` (for ``-march=native``) -- ``-DENABLE_BLAS=ON`` -- ``-DENABLE_OPENMP=ON`` -- ``-DENABLE_CLANG_TIDY=ON`` +.. code-block:: console -Compile on Windows with MSVC ----------------------------- + $ make test-python device=${PL_DEVICE} -You can also compile Lightning on Windows using -`Microsoft Visual C++ `_ compiler. -You need `cmake `_ and appropriate Python environment -(e.g. using `Anaconda `_). +where ``${PL_DEVICE}`` can be ``lightning.qubit`` (default), ``lightning.gpu`` or ``lightning.kokkos``. +These differ from ``${PL_BACKEND}`` by replacing the underscore by a dot. +The C++ code can be tested with +.. code-block:: console -We recommend to use ``[x64 (or x86)] Native Tools Command Prompt for VS [version]`` for compiling the library. -Be sure that ``cmake`` and ``python`` can be called within the prompt. 
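+   # PL_BACKEND is one of lightning_qubit (default), lightning_gpu or lightning_kokkos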
+ $ PL_BACKEND=${PL_BACKEND} make test-cpp +.. installation_LQubit-end-inclusion-marker-do-not-remove -.. code-block:: console +.. installation_LGPU-start-inclusion-marker-do-not-remove - $ cmake --version - $ python --version +.. _lightning-gpu: -Then a common command will work. +Lightning-GPU installation +************************** -.. code-block:: console +Lightning-GPU can be installed using ``pip``: - $ pip install -r requirements.txt - $ pip install -e . +.. code-block:: console -Note that OpenMP and BLAS are disabled in this setting. + pip install pennylane-lightning[gpu] +Lightning-GPU requires the `cuQuantum SDK `_ (only the `cuStateVec `_ library is required). +The SDK may be installed within the Python environment ``site-packages`` directory using ``pip`` or ``conda`` or the SDK library path appended to the ``LD_LIBRARY_PATH`` environment variable. +Please see the `cuQuantum SDK `_ install guide for more information. -.. installation_LQubit-end-inclusion-marker-do-not-remove +Install L-GPU from source +========================= +To install Lightning-GPU from the package sources using the direct SDK path, Lightning-Qubit should be install before Lightning-GPU: -.. installation_LKokkos-start-inclusion-marker-do-not-remove +.. code-block:: console -Lightning Kokkos installation -============================= + git clone https://github.com/PennyLaneAI/pennylane-lightning.git + cd pennylane-lightning + pip install -r requirements.txt + PL_BACKEND="lightning_qubit" pip install -e . -vv -For linux systems, `lightning.kokkos` and be readily installed with an OpenMP backend by providing the optional ``[kokkos]`` tag: +Then the `cuStateVec `_ library can be installed and set a ``CUQUANTUM_SDK`` environment variable. .. code-block:: console - $ pip install pennylane-lightning[kokkos] + python -m pip install wheel custatevec-cu11 + export CUQUANTUM_SDK=$(python -c "import site; print( f'{site.getsitepackages()[0]}/cuquantum/lib')") -This can be explicitly installed through PyPI as: +The Lightning-GPU can then be installed with ``pip``: .. code-block:: console - $ pip install pennylane-lightning-kokkos + PL_BACKEND="lightning_gpu" python -m pip install -e . +To simplify the build, we recommend using the containerized build process described in section `Docker support `_. -Building from source --------------------- +Install L-GPU with MPI +====================== -As Kokkos enables support for many different HPC-targetted hardware platforms, `lightning.kokkos` can be built to support any of these platforms when building from source. +Building Lightning-GPU with MPI also requires the ``NVIDIA cuQuantum SDK`` (currently supported version: `custatevec-cu11 `_), ``mpi4py`` and ``CUDA-aware MPI`` (Message Passing Interface). +``CUDA-aware MPI`` allows data exchange between GPU memory spaces of different nodes without the need for CPU-mediated transfers. +Both the ``MPICH`` and ``OpenMPI`` libraries are supported, provided they are compiled with CUDA support. +The path to ``libmpi.so`` should be found in ``LD_LIBRARY_PATH``. +It is recommended to install the ``NVIDIA cuQuantum SDK`` and ``mpi4py`` Python package within ``pip`` or ``conda`` inside a virtual environment. +Please consult the `cuQuantum SDK `_ , `mpi4py `_, +`MPICH `_, or `OpenMPI `_ install guide for more information. -We suggest first installing Kokkos with the wanted configuration following the instructions found in the `Kokkos documentation `_. -Next, append the install location to ``CMAKE_PREFIX_PATH``. 
-If an installation is not found, our builder will clone and install it during the build process. - -The simplest way to install PennyLane-Lightning-Kokkos (OpenMP backend) is using ``pip``. +Before installing Lightning-GPU with MPI support using the direct SDK path, please ensure Lightning-Qubit, ``CUDA-aware MPI`` and ``custatevec`` are installed and the environment variable ``CUQUANTUM_SDK`` is set properly. +Then Lightning-GPU with MPI support can then be installed with ``pip``: .. code-block:: console - CMAKE_ARGS="-DKokkos_ENABLE_OPENMP=ON" PL_BACKEND="lightning_kokkos" python -m pip install . + CMAKE_ARGS="-DENABLE_MPI=ON" PL_BACKEND="lightning_gpu" python -m pip install -e . -or for an editable ``pip`` installation with: -.. code-block:: console +Test L-GPU with MPI +=================== - CMAKE_ARGS="-DKokkos_ENABLE_OPENMP=ON" PL_BACKEND="lightning_kokkos" python -m pip install -e . - -Alternatively, you can install the Python interface with: +You may test the Python layer of the MPI enabled plugin as follows: .. code-block:: console - CMAKE_ARGS="-DKokkos_ENABLE_OPENMP=ON" PL_BACKEND="lightning_kokkos" python setup.py build_ext - python setup.py bdist_wheel - pip install ./dist/PennyLane*.whl --force-reinstall + mpirun -np 2 python -m pytest mpitests --tb=short -To build the plugin directly with CMake: +The C++ code is tested with .. code-block:: console - cmake -B build -DKokkos_ENABLE_OPENMP=ON -DPLKOKKOS_BUILD_TESTS=ON -DPL_BACKEND=lightning_kokkos -G Ninja - cmake --build build + rm -rf ./BuildTests + cmake . -BBuildTests -DBUILD_TESTS=1 -DBUILD_TESTS=1 -DENABLE_MPI=ON -DCUQUANTUM_SDK= + cmake --build ./BuildTests --verbose + cd ./BuildTests + for file in *runner_mpi ; do mpirun -np 2 ./BuildTests/$file ; done; -The supported backend options are "SERIAL", "OPENMP", "THREADS", "HIP" and "CUDA" and the corresponding build options are ``-DKokkos_ENABLE_XXX=ON``, where ``XXX`` needs be replaced by the backend name, for instance ``OPENMP``. -One can activate simultaneously one serial, one parallel CPU host (e.g. "OPENMP", "THREADS") and one parallel GPU device backend (e.g. "HIP", "CUDA"), but not two of any category at the same time. -For "HIP" and "CUDA", the appropriate software stacks are required to enable compilation and subsequent use. -Similarly, the CMake option ``-DKokkos_ARCH_{...}=ON`` must also be specified to target a given architecture. -A list of the architectures is found on the `Kokkos wiki `_. -Note that "THREADS" backend is not recommended since `Kokkos `_ does not guarantee its safety. +.. installation_LGPU-end-inclusion-marker-do-not-remove +.. installation_LKokkos-start-inclusion-marker-do-not-remove -Testing -======= +.. _lightning-kokkos: -To test with the ROCm stack using a manylinux2014 container we must first mount the repository into the container: +Lightning-Kokkos installation +***************************** -.. code-block:: console +On linux systems, `lightning.kokkos` with the OpenMP backend can be installed by providing the optional ``[kokkos]`` tag: - docker run -v `pwd`:/io -it quay.io/pypa/manylinux2014_x86_64 bash +.. code-block:: console -Next, within the container, we install the ROCm software stack: + $ pip install pennylane-lightning[kokkos] -.. 
code-block:: console +Install L-Kokkos from source +============================ - yum install -y https://repo.radeon.com/amdgpu-install/21.40.2/rhel/7.9/amdgpu-install-21.40.2.40502-1.el7.noarch.rpm - amdgpu-install --usecase=hiplibsdk,rocm --no-dkms +As Kokkos enables support for many different HPC-targeted hardware platforms, `lightning.kokkos` can be built to support any of these platforms when building from source. -We next build the test suite, with a given AMD GPU target in mind, as listed `here `_. +We suggest first installing Kokkos with the wanted configuration following the instructions found in the `Kokkos documentation `_. +For example, the following will build Kokkos for NVIDIA A100 cards .. code-block:: console - cd /io - export PATH=$PATH:/opt/rocm/bin/ - export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/lib - export CXX=/opt/rocm/hip/bin/hipcc - cmake -B build -DCMAKE_CXX_COMPILER=/opt/rocm/hip/bin/hipcc -DKokkos_ENABLE_HIP=ON -DPLKOKKOS_BUILD_TESTS=ON -DKokkos_ARCH_VEGA90A=ON - cmake --build build --verbose + cmake -S . -B build -G Ninja \ + -DCMAKE_BUILD_TYPE=RelWithDebug \ + -DCMAKE_INSTALL_PREFIX=/opt/kokkos/4.1.00/AMPERE80 \ + -DCMAKE_CXX_STANDARD=20 \ + -DBUILD_SHARED_LIBS:BOOL=ON \ + -DBUILD_TESTING:BOOL=OFF \ + -DKokkos_ENABLE_SERIAL:BOOL=ON \ + -DKokkos_ENABLE_CUDA:BOOL=ON \ + -DKokkos_ARCH_AMPERE80:BOOL=ON \ + -DKokkos_ENABLE_EXAMPLES:BOOL=OFF \ + -DKokkos_ENABLE_TESTS:BOOL=OFF \ + -DKokkos_ENABLE_LIBDL:BOOL=OFF + cmake --build build && cmake --install build + echo export CMAKE_PREFIX_PATH=/opt/kokkos/4.1.00/AMPERE80:\$CMAKE_PREFIX_PATH -We may now leave the container, and run the built test suite on a machine with access to the targeted GPU. +Next, append the install location to ``CMAKE_PREFIX_PATH``. +Note that the C++20 standard is required (``-DCMAKE_CXX_STANDARD=20`` option), and hence CUDA v12 is required for the CUDA backend. +If an installation is not found, our builder will clone and install it during the build process. -For a system with access to the ROCm stack outside of a manylinux container, an editable ``pip`` installation can be built and installed as: +The simplest way to install Lightning-Kokkos (OpenMP backend) through ``pip``. .. code-block:: console - CMAKE_ARGS="-DKokkos_ENABLE_HIP=ON -DKokkos_ARCH_VEGA90A=ON" PL_BACKEND="lightning_kokkos" python -m pip install -e . + CMAKE_ARGS="-DKokkos_ENABLE_OPENMP=ON" PL_BACKEND="lightning_kokkos" python -m pip install . -.. installation_LKokkos-end-inclusion-marker-do-not-remove +To build the plugin directly with CMake as above: -Please refer to the `plugin documentation `_ as -well as to the `PennyLane documentation `_ for further reference. +.. code-block:: console + cmake -B build -DKokkos_ENABLE_OPENMP=ON -DPL_BACKEND=lightning_kokkos -G Ninja + cmake --build build -GPU support ------------ +The supported backend options are "SERIAL", "OPENMP", "THREADS", "HIP" and "CUDA" and the corresponding build options are ``-DKokkos_ENABLE_XXX=ON``, where ``XXX`` needs be replaced by the backend name, for instance ``OPENMP``. +One can activate simultaneously one serial, one parallel CPU host (e.g. "OPENMP", "THREADS") and one parallel GPU device backend (e.g. "HIP", "CUDA"), but not two of any category at the same time. +For "HIP" and "CUDA", the appropriate software stacks are required to enable compilation and subsequent use. +Similarly, the CMake option ``-DKokkos_ARCH_{...}=ON`` must also be specified to target a given architecture. +A list of the architectures is found on the `Kokkos wiki `_. 
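+For example, an editable install targeting AMD GPUs through the HIP backend (here the ``VEGA90A`` architecture) can be built as follows; adjust the ``Kokkos_ARCH`` option to your hardware:
+
+.. code-block:: console
+
+    CMAKE_ARGS="-DKokkos_ENABLE_HIP=ON -DKokkos_ARCH_VEGA90A=ON" PL_BACKEND="lightning_kokkos" python -m pip install -e .
+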
+Note that "THREADS" backend is not recommended since `Kokkos `_ does not guarantee its safety. -For GPU support, `PennyLane-Lightning-GPU `_ -can be installed by providing the optional ``[gpu]`` tag: +.. installation_LKokkos-end-inclusion-marker-do-not-remove -.. code-block:: console +Please refer to the `plugin documentation `_ as +well as to the `PennyLane documentation `_ for further reference. - $ pip install pennylane-lightning[gpu] +.. docker-start-inclusion-marker-do-not-remove -For more information, please refer to the PennyLane Lightning GPU `documentation `_. +.. _docker-support: -Docker Support --------------- +Docker support +************** -One can also build the Lightning image using Docker: +Docker images for the various backends are found on the +`PennyLane Docker Hub `_ page, where there is also a detailed description about PennyLane Docker support. +Briefly, one can build the Docker Lightning images using: .. code-block:: console $ git clone https://github.com/PennyLaneAI/pennylane-lightning.git $ cd pennylane-lightning - $ docker build -t lightning/base -f docker/Dockerfile . + $ docker build -f docker/Dockerfile --target ${TARGET} . + +where ``${TARGET}`` is one of the following -Please refer to the `PennyLane installation `_ for detailed description about PennyLane Docker support. +* ``wheel-lightning-qubit`` +* ``wheel-lightning-gpu`` +* ``wheel-lightning-kokkos-openmp`` +* ``wheel-lightning-kokkos-cuda`` +* ``wheel-lightning-kokkos-rocm`` +.. docker-end-inclusion-marker-do-not-remove Contributing -============ +************ We welcome contributions - simply fork the repository of this plugin, and then make a `pull request `_ containing your contribution. @@ -333,9 +414,8 @@ The Python code is statically analyzed with `Pylint `_) to run both of these on `git commit`. Please make your best effort to comply with `black` and `pylint` before using disabling pragmas (e.g. `# pylint: disable=missing-function-docstring`). - Authors -======= +******* Lightning is the work of `many contributors `_. @@ -348,9 +428,8 @@ If you are doing research using PennyLane and Lightning, please cite `our paper .. support-start-inclusion-marker-do-not-remove - Support -======= +******* - **Source Code:** https://github.com/PennyLaneAI/pennylane-lightning - **Issue Tracker:** https://github.com/PennyLaneAI/pennylane-lightning/issues @@ -362,22 +441,24 @@ by asking a question in the forum. .. support-end-inclusion-marker-do-not-remove .. license-start-inclusion-marker-do-not-remove - License -======= +******* -The PennyLane lightning plugin is **free** and **open source**, released under +The Lightning plugins are **free** and **open source**, released under the `Apache License, Version 2.0 `_. +The Lightning-GPU plugin makes use of the NVIDIA cuQuantum SDK headers to +enable the device bindings to PennyLane, which are held to their own respective license. .. license-end-inclusion-marker-do-not-remove .. acknowledgements-start-inclusion-marker-do-not-remove Acknowledgements -================ +**************** PennyLane Lightning makes use of the following libraries and tools, which are under their own respective licenses: - **pybind11:** https://github.com/pybind/pybind11 - **Kokkos Core:** https://github.com/kokkos/kokkos +- **NVIDIA cuQuantum:** https://developer.nvidia.com/cuquantum-sdk -.. acknowledgements-end-inclusion-marker-do-not-remove \ No newline at end of file +.. 
acknowledgements-end-inclusion-marker-do-not-remove diff --git a/doc/code/__init__.rst b/doc/code/__init__.rst index bf68bf024a..1e4eb7d3c8 100644 --- a/doc/code/__init__.rst +++ b/doc/code/__init__.rst @@ -1,5 +1,5 @@ -pennylane_lightning -=================== +Python API +========== This section contains the API documentation for the Lightning packages. @@ -18,6 +18,10 @@ This section contains the API documentation for the Lightning packages. :description: API documentation for the lightning_qubit package :link: ../lightning_qubit/package.html +.. title-card:: + :name: lightning_gpu + :description: API documentation for the lightning_gpu package + :link: ../lightning_gpu/package.html .. title-card:: :name: lightning_kokkos @@ -33,4 +37,5 @@ This section contains the API documentation for the Lightning packages. :hidden: ../lightning_qubit/package + ../lightning_gpu/package ../lightning_kokkos/package diff --git a/doc/docker.rst b/doc/docker.rst new file mode 100644 index 0000000000..85ae81ba73 --- /dev/null +++ b/doc/docker.rst @@ -0,0 +1,3 @@ +.. include:: ../README.rst + :start-after: docker-start-inclusion-marker-do-not-remove + :end-before: docker-end-inclusion-marker-do-not-remove diff --git a/doc/index.rst b/doc/index.rst index c9316bd782..f48d86c567 100644 --- a/doc/index.rst +++ b/doc/index.rst @@ -14,7 +14,7 @@ Lightning plugins Devices -------- +******* The Lightning ecosystem provides the following devices: @@ -23,6 +23,11 @@ The Lightning ecosystem provides the following devices: :description: A fast state-vector qubit simulator written in C++ :link: lightning_qubit/device.html +.. title-card:: + :name: 'lightning.gpu' + :description: A heterogeneous backend state-vector simulator with NVIDIA cuQuantum library support. + :link: lightning_gpu/device.html + .. title-card:: :name: 'lightning.kokkos' :description: A heterogeneous backend state-vector simulator with Kokkos library support. @@ -39,6 +44,7 @@ The Lightning ecosystem provides the following devices: :hidden: installation + docker support .. toctree:: @@ -47,6 +53,7 @@ The Lightning ecosystem provides the following devices: :hidden: lightning_qubit/device + lightning_gpu/device lightning_kokkos/device .. toctree:: diff --git a/doc/installation.rst b/doc/installation.rst index d89f62a24c..c0b056f5c0 100644 --- a/doc/installation.rst +++ b/doc/installation.rst @@ -8,6 +8,10 @@ Each device in the Lightning ecosystem is a separate Python package. Select the :description: Guidelines to installing and testing the Lightning Qubit device. :link: ./lightning_qubit/installation.html +.. title-card:: + :name: Lightning GPU + :description: Guidelines to installing and testing the Lightning GPU device + :link: ./lightning_gpu/installation.html .. title-card:: :name: Lightning Kokkos @@ -23,4 +27,5 @@ Each device in the Lightning ecosystem is a separate Python package. Select the :hidden: lightning_qubit/installation + lightning_gpu/installation lightning_kokkos/installation diff --git a/doc/lightning_gpu/device.rst b/doc/lightning_gpu/device.rst new file mode 100644 index 0000000000..49ad3acf37 --- /dev/null +++ b/doc/lightning_gpu/device.rst @@ -0,0 +1,284 @@ +Lightning GPU device +====================== + +The ``lightning.gpu`` device is an extension of PennyLane's built-in ``lightning.qubit`` device. +It extends the CPU-focused Lightning simulator to run using the NVIDIA cuQuantum SDK, enabling GPU-accelerated simulation of quantum state-vector evolution. + +A ``lightning.gpu`` device can be loaded using: + +.. 
code-block:: python
+
+    import pennylane as qml
+    dev = qml.device("lightning.gpu", wires=2)
+
+If the NVIDIA cuQuantum libraries are available, the above device will allow all operations to be performed on a CUDA-capable GPU of generation SM 7.0 (Volta) and greater. If the libraries are not correctly installed or not available on the path, the device will fall back to ``lightning.qubit`` and perform all simulations on the CPU.
+
+The ``lightning.gpu`` device also directly supports quantum circuit gradients using the adjoint differentiation method. This can be enabled at the PennyLane QNode level with:
+
+.. code-block:: python
+
+    @qml.qnode(dev, diff_method="adjoint")
+    def circuit(params):
+        ...
+
+Check out the :doc:`/lightning_gpu/installation` guide for more information.
+
+Supported operations and observables
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+**Supported operations:**
+
+.. raw:: html
+
+
+ +.. autosummary:: + :nosignatures: + + ~pennylane.BasisState + ~pennylane.CNOT + ~pennylane.ControlledPhaseShift + ~pennylane.ControlledQubitUnitary + ~pennylane.CPhase + ~pennylane.CRot + ~pennylane.CRX + ~pennylane.CRY + ~pennylane.CRZ + ~pennylane.CSWAP + ~pennylane.CY + ~pennylane.CZ + ~pennylane.DiagonalQubitUnitary + ~pennylane.DoubleExcitation + ~pennylane.DoubleExcitationMinus + ~pennylane.DoubleExcitationPlus + ~pennylane.ECR + ~pennylane.Hadamard + ~pennylane.Identity + ~pennylane.IsingXX + ~pennylane.IsingXY + ~pennylane.IsingYY + ~pennylane.IsingZZ + ~pennylane.ISWAP + ~pennylane.MultiControlledX + ~pennylane.MultiRZ + ~pennylane.OrbitalRotation + ~pennylane.PauliX + ~pennylane.PauliY + ~pennylane.PauliZ + ~pennylane.PhaseShift + ~pennylane.PSWAP + ~pennylane.QFT + ~pennylane.QubitCarry + ~pennylane.QubitStateVector + ~pennylane.QubitSum + ~pennylane.QubitUnitary + ~pennylane.Rot + ~pennylane.RX + ~pennylane.RY + ~pennylane.RZ + ~pennylane.S + ~pennylane.SingleExcitation + ~pennylane.SingleExcitationMinus + ~pennylane.SingleExcitationPlus + ~pennylane.SISWAP + ~pennylane.SQISW + ~pennylane.SWAP + ~pennylane.SX + ~pennylane.T + ~pennylane.Toffoli + +.. raw:: html + +
+ +**Supported observables:** + +.. raw:: html + +
+ +.. autosummary:: + :nosignatures: + + ~pennylane.ops.op_math.Exp + ~pennylane.Hadamard + ~pennylane.Hamiltonian + ~pennylane.Hermitian + ~pennylane.Identity + ~pennylane.PauliX + ~pennylane.PauliY + ~pennylane.PauliZ + ~pennylane.ops.op_math.Prod + ~pennylane.Projector + ~pennylane.SparseHamiltonian + ~pennylane.ops.op_math.SProd + ~pennylane.ops.op_math.Sum + +.. raw:: html + +
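+As a quick illustration, a circuit built from the operations above can be executed and differentiated on ``lightning.gpu`` like on any other PennyLane device (a minimal sketch; the gates, observable and parameter value are arbitrary choices):
+
+.. code-block:: python
+
+    import pennylane as qml
+    from pennylane import numpy as np
+
+    dev = qml.device("lightning.gpu", wires=2)
+
+    @qml.qnode(dev, diff_method="adjoint")
+    def circuit(theta):
+        # RX and CNOT appear in the list of supported operations
+        qml.RX(theta, wires=0)
+        qml.CNOT(wires=[0, 1])
+        # a product of PauliZ terms, covered by the supported observables
+        return qml.expval(qml.PauliZ(0) @ qml.PauliZ(1))
+
+    theta = np.array(0.5, requires_grad=True)
+    print(circuit(theta), qml.grad(circuit)(theta))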
+
+
+
+**Parallel adjoint differentiation support:**
+
+The ``lightning.gpu`` device directly supports the `adjoint differentiation method `__, and enables parallelization over the requested observables. This supports direct control of observable batching, which can be used to run concurrent calculations across multiple available GPUs.
+
+If you are computing a large number of expectation values, or if you are using a large number of wires on your device, it may be best to evenly divide the number of expectation value calculations across all available GPUs. This will reduce the overall memory cost of the observables per GPU, at the cost of additional compute time. Assuming `m` observables, and `n` GPUs, the default behaviour is to pre-allocate all storage for the `m` observables on a single GPU. To divide the workload amongst many GPUs, initialize a ``lightning.gpu`` device with the ``batch_obs=True`` keyword argument, as:
+
+.. code-block:: python
+
+    import pennylane as qml
+    dev = qml.device("lightning.gpu", wires=20, batch_obs=True)
+
+With the above, each GPU will see at most `m/n` observables to process, reducing the preallocated memory footprint.
+
+Additionally, there can be situations where even with the above distribution, and limited GPU memory, the overall problem does not fit on the requested GPU devices. You can further reduce the concurrent allocations on available GPUs by providing an integer value to the `batch_obs` keyword. For example, to batch-evaluate observables with at most 1 observable allocation per GPU, define the device as:
+
+.. code-block:: python
+
+    import pennylane as qml
+    dev = qml.device("lightning.gpu", wires=27, batch_obs=1)
+
+Each problem is unique, so it can often be best to choose the default behaviour up-front, and tune with the above only if necessary.
+
+**Multi-GPU/multi-node support:**
+
+The ``lightning.gpu`` device allows users to leverage the computational power of many GPUs sitting on separate nodes for running large-scale simulations.
+Provided that the NVIDIA ``cuQuantum`` libraries, a ``CUDA-aware MPI`` library and ``mpi4py`` are properly installed and the path to ``libmpi.so`` is
+added to the ``LD_LIBRARY_PATH`` environment variable, the following requirements should be met to enable multi-node and multi-GPU simulations:
+
+1. The ``mpi`` keyword argument should be set as ``True`` when initializing a ``lightning.gpu`` device.
+2. Both the total number of MPI processes and the number of MPI processes per node must be powers of 2 (for example 2, 4, 8, 16). Each MPI process is responsible for managing one GPU.
+
+The workflow for the multi-node/multi-GPU feature is as follows:
+
+.. code-block:: python
+
+    from mpi4py import MPI
+    import pennylane as qml
+    dev = qml.device('lightning.gpu', wires=8, mpi=True)
+    @qml.qnode(dev)
+    def circuit_mpi():
+        qml.PauliX(wires=[0])
+        return qml.state()
+    local_state_vector = circuit_mpi()
+
+Currently, a ``lightning.gpu`` device with the MPI multi-GPU backend supports all the ``gate operations`` and ``observables`` that a ``lightning.gpu`` device with a single GPU/node backend supports.
+
+By default, each MPI process will return the overall simulation results, except for the ``qml.state()`` and ``qml.prob()`` methods, for which each MPI process only returns its local part of the simulation
+results in order to avoid buffer overflow. It is the user's responsibility to ensure correct data collection for those two methods.
Here are examples of collecting +the local simulation results for ``qml.state()`` and ``qml.prob()`` methods: + +The workflow for collecting local state vector (using the ``qml.state()`` method) to ``rank 0`` is as follows: + +.. code-block:: python + + from mpi4py import MPI + import pennylane as qml + comm = MPI.COMM_WORLD + rank = comm.Get_rank() + dev = qml.device('lightning.gpu', wires=8, mpi=True) + @qml.qnode(dev) + def circuit_mpi(): + qml.PauliX(wires=[0]) + return qml.state() + local_state_vector = circuit_mpi() + #rank 0 will collect the local state vector + state_vector = comm.gather(local_state_vector, root=0) + if rank == 0: + print(state_vector) + +The workflow for collecting local probability (using the ``qml.prob()`` method) to ``rank 0`` is as follows: + +.. code-block:: python + + from mpi4py import MPI + import pennylane as qml + import numpy as np + + comm = MPI.COMM_WORLD + rank = comm.Get_rank() + dev = qml.device('lightning.gpu', wires=8, mpi=True) + prob_wires = [0, 1] + + @qml.qnode(dev) + def mpi_circuit(): + qml.Hadamard(wires=1) + return qml.probs(wires=prob_wires) + + local_probs = mpi_circuit() + + #For data collection across MPI processes. + recv_counts = comm.gather(len(local_probs),root=0) + if rank == 0: + probs = np.zeros(2**len(prob_wires)) + else: + probs = None + + comm.Gatherv(local_probs,[probs,recv_counts],root=0) + if rank == 0: + print(probs) + +Then the python script can be executed with the following command: + +.. code-block:: console + + $ mpirun -np 4 python yourscript.py + +Furthermore, users can optimize the performance of their applications by allocating the appropriate amount of GPU memory for MPI operations with the ``mpi_buf_size`` keyword argument. To allocate ``n`` mebibytes (MiB, `2^20` bytes) of GPU memory for MPI operations, initialize a ``lightning.gpu`` device with the ``mpi_buf_size=n`` keyword argument, as follows: + +.. code-block:: python + + from mpi4py import MPI + import pennylane as qml + n = 8 + dev = qml.device("lightning.gpu", wires=20, mpi=True, mpi_buf_size=n) + +Note the value of ``mpi_buf_size`` should also be a power of ``2``. Remember to carefully manage the ``mpi_buf_size`` parameter, taking into account the available GPU memory and the memory +requirements of the local state vector, to prevent memory overflow issues and ensure optimal performance. By default (``mpi_buf_size=0``), the GPU memory allocated for MPI operations +will match the size of the local state vector, with a limit of ``64 MiB``. Please be aware that a runtime warning will occur if the local GPU memory buffer for MPI operations exceeds +the GPU memory allocated to the local state vector. + +**Multi-GPU/multi-node support for adjoint method:** + +The ``lightning.gpu`` device with the multi-GPU/multi-node backend also directly supports the `adjoint differentiation method `__. Instead of batching observables across the multiple GPUs available within a node, the state vector is distributed among the available GPUs with the multi-GPU/multi-node backend. +By default, the adjoint method with MPI support follows the performance-oriented implementation of the single GPU backend. This means that a separate ``bra`` is created for each observable and the ``ket`` is updated only once for each operation, regardless of the number of observables. + +The workflow for the default adjoint method with MPI support is as follows: + +.. 
code-block:: python + + from mpi4py import MPI + import pennylane as qml + from pennylane import numpy as np + + comm = MPI.COMM_WORLD + rank = comm.Get_rank() + n_wires = 20 + n_layers = 2 + + dev = qml.device('lightning.gpu', wires= n_wires, mpi=True) + @qml.qnode(dev, diff_method="adjoint") + def circuit_adj(weights): + qml.StronglyEntanglingLayers(weights, wires=list(range(n_wires))) + return qml.math.hstack([qml.expval(qml.PauliZ(i)) for i in range(n_wires)]) + + if rank == 0: + params = np.random.random(qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_wires)) + else: + params = None + + params = comm.bcast(params, root=0) + jac = qml.jacobian(circuit_adj)(params) + +If users aim to handle larger system sizes with limited hardware resources, the memory-optimized adjoint method with MPI support is more appropriate. The memory-optimized adjoint method with MPI support employs a single ``bra`` object that is reused for all observables. +This approach results in a notable reduction in the required GPU memory when dealing with a large number of observables. However, it's important to note that the reduction in memory requirement may come at the expense of slower execution due to the multiple ``ket`` updates per gate operation. + +To enable the memory-optimized adjoint method with MPI support, ``batch_obs`` should be set as ``True`` and the workflow follows: + +.. code-block:: python + + dev = qml.device('lightning.gpu', wires= n_wires, mpi=True, batch_obs=True) + +For the adjoint method, each MPI process will provide the overall simulation results. \ No newline at end of file diff --git a/doc/lightning_gpu/installation.rst b/doc/lightning_gpu/installation.rst new file mode 100644 index 0000000000..9754aae396 --- /dev/null +++ b/doc/lightning_gpu/installation.rst @@ -0,0 +1,3 @@ +.. include:: ../../README.rst + :start-after: installation_LGPU-start-inclusion-marker-do-not-remove + :end-before: installation_LGPU-end-inclusion-marker-do-not-remove \ No newline at end of file diff --git a/doc/lightning_gpu/package.rst b/doc/lightning_gpu/package.rst new file mode 100644 index 0000000000..6630d64cd8 --- /dev/null +++ b/doc/lightning_gpu/package.rst @@ -0,0 +1,19 @@ +lightning_gpu +================ + +.. automodapi:: pennylane_lightning.lightning_gpu + :no-heading: + :include-all-objects: + +.. raw:: html + +
+
+ +Directly importing the device class: +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python3 + + from pennylane_lightning.lightning_gpu import LightningGPU + diff --git a/doc/lightning_qubit/development/index.rst b/doc/lightning_qubit/development/index.rst index 7eb4a918e4..90489e166f 100644 --- a/doc/lightning_qubit/development/index.rst +++ b/doc/lightning_qubit/development/index.rst @@ -20,5 +20,5 @@ Lightning Qubit .. toctree:: :hidden: - avx_kernels/index add_gate_kernel + avx_kernels/index diff --git a/doc/requirements.txt b/doc/requirements.txt index 3acc41f79a..f6d4fac3d1 100644 --- a/doc/requirements.txt +++ b/doc/requirements.txt @@ -6,3 +6,5 @@ pybind11 sphinx sphinx-automodapi pennylane-sphinx-theme +custatevec-cu11 +wheel diff --git a/pennylane_lightning/lightning_gpu/lightning_gpu.py b/pennylane_lightning/lightning_gpu/lightning_gpu.py index 98de0e9512..177275ec13 100644 --- a/pennylane_lightning/lightning_gpu/lightning_gpu.py +++ b/pennylane_lightning/lightning_gpu/lightning_gpu.py @@ -204,9 +204,17 @@ def _mebibytesToBytes(mebibytes): } class LightningGPU(LightningBase): # pylint: disable=too-many-instance-attributes - """PennyLane-Lightning-GPU device. + """PennyLane Lightning GPU device. + + A GPU-backed Lightning device using NVIDIA cuQuantum SDK. + + Use of this device requires pre-built binaries or compilation from source. Check out the + :doc:`/lightning_gpu/installation` guide for more details. + Args: wires (int): the number of wires to initialize the device with + mpi (bool): enable MPI support. MPI support will be enabled if ``mpi`` is set as``True``. + mpi_buf_size (int): size of GPU memory (in MiB) set for MPI operation and its default value is 64 MiB. sync (bool): immediately sync with host-sv after applying operations c_dtype: Datatypes for statevector representation. Must be one of ``np.complex64`` or ``np.complex128``. shots (int): How many times the circuit should be evaluated (or sampled) to estimate @@ -216,7 +224,7 @@ class LightningGPU(LightningBase): # pylint: disable=too-many-instance-attribut batch_obs (Union[bool, int]): determine whether to use multiple GPUs within the same node or not """ - name = "PennyLane plugin for GPU-backed Lightning device using NVIDIA cuQuantum SDK" + name = "Lightning GPU PennyLane plugin" short_name = "lightning.gpu" operations = allowed_operations @@ -283,6 +291,7 @@ def __init__( self._create_basis_state(0) def _mpi_init_helper(self, num_wires): + """Set up MPI checks.""" if not MPI_SUPPORT: raise ImportError("MPI related APIs are not found.") # initialize MPIManager and config check in the MPIManager ctor @@ -545,6 +554,7 @@ def apply_lightning(self, operations): # pylint: disable=unused-argument def apply(self, operations, rotations=None, **kwargs): + """Applies a list of operations to the state tensor.""" # State preparation is currently done in Python if operations: # make sure operations[0] exists if isinstance(operations[0], StatePrep): @@ -635,6 +645,12 @@ def _init_process_jacobian_tape(self, tape, starting_state, use_device_state): return self._gpu_state def adjoint_jacobian(self, tape, starting_state=None, use_device_state=False): + """Implements the adjoint method outlined in + `Jones and Gacon `__ to differentiate an input tape. + + After a forward pass, the circuit is reversed by iteratively applying adjoint + gates to scan backwards through the circuit. + """ if self.shots is not None: warn( "Requested adjoint differentiation to be computed with finite shots." 
@@ -697,7 +713,42 @@ def adjoint_jacobian(self, tape, starting_state=None, use_device_state=False): # pylint: disable=inconsistent-return-statements, line-too-long, missing-function-docstring def vjp(self, measurements, grad_vec, starting_state=None, use_device_state=False): - """Generate the processing function required to compute the vector-Jacobian products of a tape.""" + """Generate the processing function required to compute the vector-Jacobian products + of a tape. + + This function can be used with multiple expectation values or a quantum state. + When a quantum state is given, + + .. code-block:: python + + vjp_f = dev.vjp([qml.state()], grad_vec) + vjp = vjp_f(tape) + + computes :math:`w = (w_1,\\cdots,w_m)` where + + .. math:: + + w_k = \\langle v| \\frac{\\partial}{\\partial \\theta_k} | \\psi_{\\pmb{\\theta}} \\rangle. + + Here, :math:`m` is the total number of trainable parameters, + :math:`\\pmb{\\theta}` is the vector of trainable parameters and + :math:`\\psi_{\\pmb{\\theta}}` is the output quantum state. + + Args: + measurements (list): List of measurement processes for vector-Jacobian product. + Now it must be expectation values or a quantum state. + grad_vec (tensor_like): Gradient-output vector. Must have shape matching the output + shape of the corresponding tape, i.e. number of measurements if the return + type is expectation or :math:`2^N` if the return type is statevector + starting_state (tensor_like): post-forward pass state to start execution with. + It should be complex-valued. Takes precedence over ``use_device_state``. + use_device_state (bool): use current device state to initialize. + A forward pass of the same circuit should be the last thing the device + has executed. If a ``starting_state`` is provided, that takes precedence. + + Returns: + The processing function required to compute the vector-Jacobian products of a tape. + """ if self.shots is not None: warn( "Requested adjoint differentiation to be computed with finite shots." @@ -742,6 +793,7 @@ def processing_fn(tape): # pylint: disable=attribute-defined-outside-init def sample(self, observable, shot_range=None, bin_size=None, counts=False): + """Return samples of an observable.""" if observable.name != "PauliZ": self.apply_lightning(observable.diagonalizing_gates()) self._samples = self.generate_samples() @@ -763,6 +815,19 @@ def generate_samples(self): # pylint: disable=protected-access, missing-function-docstring def expval(self, observable, shot_range=None, bin_size=None): + """Expectation value of the supplied observable. + + Args: + observable: A PennyLane observable. + shot_range (tuple[int]): 2-tuple of integers specifying the range of samples + to use. If not specified, all samples are used. + bin_size (int): Divides the shot range into bins of size ``bin_size``, and + returns the measurement statistic separately over each bin. If not + provided, the entire shot range is treated as a single bin. + + Returns: + Expectation value of the observable + """ if self.shots is not None: # estimate the expectation value samples = self.sample(observable, shot_range=shot_range, bin_size=bin_size) @@ -814,6 +879,15 @@ def expval(self, observable, shot_range=None, bin_size=None): return self.measurements.expval(observable.name, observable_wires) def probability_lightning(self, wires=None): + """Return the probability of each computational basis state. + + Args: + wires (Iterable[Number, str], Number, str, Wires): wires to return + marginal probabilities for. 
Wires not provided are traced out of the system. + + Returns: + array[float]: list of the probabilities + """ # translate to wire labels used by device observable_wires = self.map_wires(wires) # Device returns as col-major orderings, so perform transpose on data for bit-index shuffle for now. @@ -825,6 +899,19 @@ def probability_lightning(self, wires=None): # pylint: disable=missing-function-docstring def var(self, observable, shot_range=None, bin_size=None): + """Variance of the supplied observable. + + Args: + observable: A PennyLane observable. + shot_range (tuple[int]): 2-tuple of integers specifying the range of samples + to use. If not specified, all samples are used. + bin_size (int): Divides the shot range into bins of size ``bin_size``, and + returns the measurement statistic separately over each bin. If not + provided, the entire shot range is treated as a single bin. + + Returns: + Variance of the observable + """ if self.shots is not None: # estimate the var # Lightning doesn't support sampling yet @@ -858,7 +945,7 @@ def var(self, observable, shot_range=None, bin_size=None): class LightningGPU(LightningBaseFallBack): # pragma: no cover # pylint: disable=missing-class-docstring, too-few-public-methods - name = "PennyLane plugin for GPU-backed Lightning device using NVIDIA cuQuantum SDK: [No binaries found - Fallback: default.qubit]" + name = "Lightning GPU PennyLane plugin: [No binaries found - Fallback: default.qubit]" short_name = "lightning.gpu" def __init__(self, wires, *, c_dtype=np.complex128, **kwargs): diff --git a/requirements-dev.txt b/requirements-dev.txt index 642a74ad27..a9602d9073 100644 --- a/requirements-dev.txt +++ b/requirements-dev.txt @@ -9,4 +9,5 @@ pytest-mock pre-commit>=2.19.0 black==23.7.0 clang-format==14 -pylint \ No newline at end of file +custatevec-cu11 +pylint diff --git a/requirements.txt b/requirements.txt index af606a496a..ef5c73ca83 100644 --- a/requirements.txt +++ b/requirements.txt @@ -5,3 +5,4 @@ pybind11 pytest pytest-cov pytest-mock +custatevec-cu11 \ No newline at end of file