From 5a72c02e6c9a6aacc95a2cb6bfa641874e34a18c Mon Sep 17 00:00:00 2001
From: Rafey Iqbal Rahman <59226057+RafeyIqbalRahman@users.noreply.github.com>
Date: Fri, 5 Mar 2021 22:48:16 +0500
Subject: [PATCH 1/7] Fix grammar, capitalization, text inconsistencies (#900)
Co-authored-by: Christopher J. Wood
Co-authored-by: Matthew Treinish
---
CONTRIBUTING.md | 169 ++++++++++++++++++++++++------------------------
1 file changed, 84 insertions(+), 85 deletions(-)
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 910cf58085..a3db828f52 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -1,14 +1,14 @@
# Contributing
First read the overall project contributing guidelines. These are all
-included in the qiskit documentation:
+included in the Qiskit documentation:
https://qiskit.org/documentation/contributing_to_qiskit.html
## Contributing to Qiskit Aer
-In addition to the general guidelines there are specific details for
-contributing to aer, these are documented below.
+In addition to the general guidelines, there are specific details for
+contributing to Aer. These are documented below.
### Pull request checklist
@@ -23,21 +23,21 @@ please ensure that:
*docstring* accordingly.
3. If it makes sense for your change that you have added new tests that
cover the changes.
-4. Ensure that if your change has an end user facing impact (new feature,
- deprecation, removal etc) that you have added a reno release note for that
+4. Ensure that if your change has an end user-facing impact (new feature,
+ deprecation, removal, etc.), you have added a reno release note for that
change and that the PR is tagged for the changelog.
### Changelog generation
The changelog is automatically generated as part of the release process
automation. This works through a combination of the git log and the pull
-request. When a release is tagged and pushed to github the release automation 
When a release is tagged and pushed to GitHub, the release automation
bot looks at all commit messages from the git log for the release. It takes
the PR numbers from the git log (assuming a squash merge) and checks if that
PR had a `Changelog:` label on it. If there is a label it will add the git
commit message summary line from the git log for the release to the changelog.
-If there are multiple `Changelog:` tags on a PR the git commit message summary
+If there are multiple `Changelog:` tags on a PR, the git commit message summary
line from the git log will be used for each changelog category tagged.
The current categories for each label are as follows:
@@ -52,22 +52,22 @@ The current categories for each label are as follows:
### Release Notes
-When making any end user facing changes in a contribution we have to make sure
+When making any end user-facing changes in a contribution, we have to make sure
we document that when we release a new version of qiskit-aer. The expectation
-is that if your code contribution has user facing changes that you will write
+is that if your code contribution has user-facing changes, you will write
the release documentation for these changes. This documentation must explain
what was changed, why it was changed, and how users can either use or adapt
-to the change. The idea behind release documentation is that when a naive
+to the change. The idea behind the release documentation is that when a naive
user with limited internal knowledge of the project is upgrading from the
previous release to the new one, they should be able to read the release notes,
-understand if they need to update their program which uses qiskit, and how they
+understand if they need to update their program which uses Qiskit, and how they
would go about doing that. It ideally should explain why they need to make this
change too, to provide the necessary context. 
-To make sure we don't forget a release note or if the details of user facing
-changes over a release cycle we require that all user facing changes include
-documentation at the same time as the code. To accomplish this we use the
-[reno](https://docs.openstack.org/reno/latest/) tool which enables a git based
+To make sure we don't forget a release note, and that the details of user-facing
+changes are not lost over a release cycle, we require that all user-facing changes include
+documentation at the same time as the code. To accomplish this, we use the
+[reno](https://docs.openstack.org/reno/latest/) tool which enables a git-based
workflow for writing and compiling release notes.
#### Adding a new release note
@@ -77,21 +77,21 @@ installed with::
pip install -U reno
-Once you have reno installed you can make a new release note by running in
+Once you have reno installed, you can make a new release note by running in
your local repository checkout's root::
reno new short-description-string
where short-description-string is a brief string (with no spaces) that describes
what's in the release note. This will become the prefix for the release note
-file. Once that is run it will create a new yaml file in releasenotes/notes.
+file. Once that is run, it will create a new yaml file in releasenotes/notes.
Then open that yaml file in a text editor and write the release note. The basic
structure of a release note is restructured text in yaml lists under category
keys. You add individual items under each category and they will be grouped
automatically by release when the release notes are compiled. A single file
can have as many entries in it as needed, but to avoid potential conflicts
-you'll want to create a new file for each pull request that has user facing
-changes. When you open the newly created file it will be a full template of
+you'll want to create a new file for each pull request that has user-facing
+changes. 
When you open the newly created file, it will be a full template of the different categories with a description of a category as a single entry in each category. You'll want to delete all the sections you aren't using and update the contents for those you are. For example, the end result should @@ -132,19 +132,19 @@ deprecations: You can also look at other release notes for other examples. You can use any restructured text feature in them (code sections, tables, -enumerated lists, bulleted list, etc) to express what is being changed as -needed. In general you want the release notes to include as much detail as +enumerated lists, bulleted list, etc.) to express what is being changed as +needed. In general, you want the release notes to include as much detail as needed so that users will understand what has changed, why it changed, and how they'll have to update their code. -After you've finished writing your release notes you'll want to add the note +After you've finished writing your release notes, you'll want to add the note file to your commit with `git add` and commit them to your PR branch to make sure they're included with the code in your PR. ##### Linking to issues -If you need to link to an issue or other github artifact as part of the release -note this should be done using an inline link with the text being the issue +If you need to link to an issue or other GitHub artifact as part of the release +note, this should be done using an inline link with the text being the issue number. For example you would write a release note with a link to issue 12345 as: @@ -158,12 +158,12 @@ fixes: #### Generating the release notes -After release notes have been added if you want to see what the full output of -the release notes. 
In general the output from reno that we'll get is a rst +After release notes have been added, if you want to see the full output of +the release notes, you'll get the output as an rst (ReStructuredText) file that can be compiled by -[sphinx](https://www.sphinx-doc.org/en/master/). To generate the rst file you -use the ``reno report`` command. If you want to generate the full aer release -notes for all releases (since we started using reno during 0.9) you just run:: +[sphinx](https://www.sphinx-doc.org/en/master/). To generate the rst file, you +use the ``reno report`` command. If you want to generate the full Aer release +notes for all releases (since we started using reno during 0.9), you just run:: reno report @@ -172,7 +172,7 @@ it has been tagged:: reno report --version 0.5.0 -At release time ``reno report`` is used to generate the release notes for the +At release time, ``reno report`` is used to generate the release notes for the release and the output will be submitted as a pull request to the documentation repository's [release notes file]( https://github.com/Qiskit/qiskit/blob/master/docs/release_notes.rst) @@ -180,18 +180,18 @@ https://github.com/Qiskit/qiskit/blob/master/docs/release_notes.rst) #### Building release notes locally Building The release notes are part of the standard qiskit-aer documentation -builds. To check what the rendered html output of the release notes will look -like for the current state of the repo you can run: `tox -edocs` which will +builds. To check what the rendered HTML output of the release notes will look +like for the current state of the repo, you can run: `tox -edocs` which will build all the documentation into `docs/_build/html` and the release notes in particular will be located at `docs/_build/html/release_notes.html` ### Development Cycle The development cycle for qiskit-aer is all handled in the open using -the project boards in Github for project management. 
We use milestones -in Github to track work for specific releases. The features or other changes -that we want to include in a release will be tagged and discussed in Github. -As we're preparing a new release we'll document what has changed since the +the project boards in GitHub for project management. We use milestones +in GitHub to track work for specific releases. The features or other changes +that we want to include in a release will be tagged and discussed in GitHub. +As we're preparing a new release, we'll document what has changed since the previous version in the release notes. ### Branches @@ -211,7 +211,7 @@ merged to it are bugfixes. ### Release cycle -When it is time to release a new minor version of qiskit-aer we will: +When it is time to release a new minor version of qiskit-aer, we will: 1. Create a new tag with the version number and push it to github 2. Change the `master` version to the next release version. @@ -222,7 +222,7 @@ the following steps: 1. Create a stable branch for the new minor version from the release tag on the `master` branch 2. Build and upload binary wheels to pypi -3. Create a github release page with a generated changelog +3. Create a GitHub release page with a generated changelog 4. Generate a PR on the meta-repository to bump the Aer version and meta-package version. @@ -275,7 +275,7 @@ You're now ready to build from source! Follow the instructions for your platform ### Linux -Qiskit is officially supported on Red Hat, CentOS, Fedora and Ubuntu distributions, as long as you can install a GCC version that is C++14 compatible and the few dependencies we need. +Qiskit is officially supported on Red Hat, CentOS, Fedora, and Ubuntu distributions, as long as you can install a GCC version that is C++14 compatible and a few dependencies we need. 
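Since the only hard toolchain requirement on Linux is a C++14-capable GCC, it can be worth confirming your default compiler qualifies before installing anything else. The snippet below is an illustrative pre-flight check, not part of the official instructions: it compiles a one-liner that uses a C++14-only feature (a generic lambda) with whatever `g++` is on your `PATH`:

```python
# Illustrative check (an assumption of this guide's editor, not the Aer docs):
# does the default g++ accept -std=c++14?
import os
import shutil
import subprocess
import tempfile

# A generic lambda ([](auto x){...}) is valid C++14 but not C++11.
SNIPPET = "int main() { auto ident = [](auto x) { return x; }; return ident(0); }\n"

def gcc_supports_cxx14(compiler: str = "g++") -> bool:
    """Return True if `compiler` exists and compiles a C++14-only snippet."""
    if shutil.which(compiler) is None:
        return False  # compiler not installed at all
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "check.cpp")
        with open(src, "w") as f:
            f.write(SNIPPET)
        proc = subprocess.run(
            [compiler, "-std=c++14", src, "-o", os.path.join(tmp, "check")],
            capture_output=True,
        )
        return proc.returncode == 0

if __name__ == "__main__":
    print("C++14 OK" if gcc_supports_cxx14() else "install a newer GCC first")
```

If the check fails, install or select a newer GCC before continuing with the distribution-specific dependencies below.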
#### Dependencies
@@ -310,7 +310,7 @@ Ubuntu
$ sudo apt install libopenblas-dev
-And of course, `git` is required in order to build from repositories
+And of course, `git` is required to build from repositories.
CentOS/Red Hat
@@ -328,17 +328,17 @@ Ubuntu
There are two ways of building `Aer` simulators, depending on your goal:
-1. Build a python extension that works with Terra.
+1. Build a Python extension that works with Terra.
2. Build a standalone executable.
**Python extension**
-As any other python package, we can install from source code by just running:
+As with any other Python package, we can install from source code by just running:
qiskit-aer$ pip install .
This will build and install `Aer` with the default options which is probably suitable for most of the users.
-There's another pythonic approach to build and install software: build the wheels distributable file.
+There's another Pythonic approach to building and installing software: building a distributable wheel file.
qiskit-aer$ python ./setup.py bdist_wheel
@@ -374,9 +374,9 @@ the `dist/` directory, so next step is installing it:
**Standalone Executable**
-If we want to build a standalone executable, we have to use *CMake* directly.
+If you want to build a standalone executable, you have to use *CMake* directly.
The preferred way *CMake* is meant to be used, is by setting up an "out of
-source" build. So in order to build our standalone executable, we have to follow
+source" build. So in order to build your standalone executable, you have to follow
these steps:
qiskit-aer$ mkdir out
@@ -396,8 +396,8 @@ option):
**Advanced options**
Because the standalone version of `Aer` doesn't need Python at all, the build system is
-based on CMake, just like most of other C++ projects. So in order to pass all the different
-options we have on `Aer` to CMake we use it's native mechanism:
+based on CMake, just like most other C++ projects. 
So to pass all the different +options we have on `Aer` to CMake, we use its native mechanism: qiskit-aer/out$ cmake -DCMAKE_CXX_COMPILER=g++-9 -DAER_BLAS_LIB_PATH=/path/to/my/blas .. @@ -421,17 +421,17 @@ You further need to have *Xcode Command Line Tools* installed on macOS: There are two ways of building `Aer` simulators, depending on your goal: -1. Build a python extension that works with Terra; +1. Build a Python extension that works with Terra; 2. Build a standalone executable. **Python extension** -As any other python package, we can install from source code by just running: +As any other Python package, we can install from source code by just running: qiskit-aer$ pip install . This will build and install `Aer` with the default options which is probably suitable for most of the users. -There's another pythonic approach to build and install software: build the wheels distributable file. +There's another Pythonic approach to build and install software: build the wheels distributable file. qiskit-aer$ python ./setup.py bdist_wheel @@ -467,9 +467,9 @@ the `dist/` directory, so next step is installing it: **Standalone Executable** -If we want to build a standalone executable, we have to use **CMake** directly. +If you want to build a standalone executable, you have to use **CMake** directly. The preferred way **CMake** is meant to be used, is by setting up an "out of -source" build. So in order to build our standalone executable, we have to follow +source" build. So in order to build your standalone executable, you have to follow these steps: qiskit-aer$ mkdir out @@ -488,8 +488,8 @@ option): ***Advanced options*** Because the standalone version of `Aer` doesn't need Python at all, the build system is -based on CMake, just like most of other C++ projects. So in order to pass all the different -options we have on `Aer` to CMake we use it's native mechanism: +based on CMake, just like most of other C++ projects. 
So to pass all the different +options we have on `Aer` to CMake, we use its native mechanism: qiskit-aer/out$ cmake -DCMAKE_CXX_COMPILER=g++-9 -DAER_BLAS_LIB_PATH=/path/to/my/blas .. @@ -499,7 +499,7 @@ options we have on `Aer` to CMake we use it's native mechanism: #### Dependencies -On Windows, you must have *Anaconda3* installed. We recommend also installing +On Windows, you must have *Anaconda3* installed. We also recommend installing *Visual Studio 2017 Community Edition* or *Visual Studio 2019 Community Edition*. >*Anaconda 3* can be installed from their web: @@ -518,19 +518,19 @@ create an Anaconda virtual environment or activate it if you already have create We only support *Visual Studio* compilers on Windows, so if you have others installed in your machine (MinGW, TurboC) you have to make sure that the path to the *Visual Studio* tools has precedence over others so that the build system can get the correct one. -There's a (recommended) way to force the build system to use the one you want by using CMake `-G` parameter. Will talk +There's a (recommended) way to force the build system to use the one you want by using CMake `-G` parameter. We will talk about this and other parameters later. #### Build **Python extension** -As any other python package, we can install from source code by just running: +As any other Python package, we can install from source code by just running: (QiskitDevEnv) qiskit-aer > pip install . This will build and install `Aer` with the default options which is probably suitable for most of the users. -There's another pythonic approach to build and install software: build the wheels distributable file. +There's another Pythonic approach to build and install software: build the wheels distributable file. 
(QiskitDevEnv) qiskit-aer > python ./setup.py bdist_wheel @@ -566,9 +566,9 @@ the `dist/` directory, so next step is installing it: **Standalone Executable** -If we want to build a standalone executable, we have to use **CMake** directly. +If you want to build a standalone executable, you have to use **CMake** directly. The preferred way **CMake** is meant to be used, is by setting up an "out of -source" build. So in order to build our standalone executable, we have to follow +source" build. So in order to build our standalone executable, you have to follow these steps: (QiskitDevEnv) qiskit-aer> mkdir out @@ -587,8 +587,8 @@ option): ***Advanced options*** Because the standalone version of `Aer` doesn't need Python at all, the build system is -based on CMake, just like most of other C++ projects. So in order to pass all the different -options we have on `Aer` to CMake we use it's native mechanism: +based on CMake, just like most of other C++ projects. So to pass all the different +options we have on `Aer` to CMake, we use its native mechanism: (QiskitDevEnv) qiskit-aer\out> cmake -G "Visual Studio 15 2017" -DAER_BLAS_LIB_PATH=c:\path\to\my\blas .. @@ -596,11 +596,11 @@ options we have on `Aer` to CMake we use it's native mechanism: ### Building with GPU support Qiskit Aer can exploit GPU's horsepower to accelerate some simulations, specially the larger ones. -GPU access is supported via CUDA® (NVIDIA® chipset), so in order to build with GPU support we need +GPU access is supported via CUDA® (NVIDIA® chipset), so to build with GPU support, you need to have CUDA® >= 10.1 preinstalled. See install instructions [here](https://developer.nvidia.com/cuda-toolkit-archive) Please note that we only support GPU acceleration on Linux platforms at the moment. 
-Once CUDA® is properly installed, we only need to set a flag so the build system knows what to do: +Once CUDA® is properly installed, you only need to set a flag so the build system knows what to do: ``` AER_THRUST_BACKEND=CUDA @@ -610,8 +610,8 @@ For example, qiskit-aer$ python ./setup.py bdist_wheel -- -DAER_THRUST_BACKEND=CUDA -If we want to specify the CUDA® architecture instead of letting the build system -auto detect it, we can use the AER_CUDA_ARCH flag (can also be set as an ENV variable +If you want to specify the CUDA® architecture instead of letting the build system +auto detect it, you can use the AER_CUDA_ARCH flag (can also be set as an ENV variable with the same name, although the flag takes precedence). For example: qiskit-aer$ python ./setup.py bdist_wheel -- -DAER_THRUST_BACKEND=CUDA -DAER_CUDA_ARCH="5.2" @@ -800,7 +800,7 @@ pass them right after ``-D`` CMake argument. Example: qiskit-aer/out$ cmake -DUSEFUL_FLAG=Value .. ``` -In the case of building the Qiskit python extension, you have to pass these flags after writing +In the case of building the Qiskit Python extension, you have to pass these flags after writing ``--`` at the end of the python command line, eg: ``` @@ -820,8 +820,7 @@ These are the flags: * AER_BLAS_LIB_PATH Tells CMake the directory to look for the BLAS library instead of the usual paths. - If no BLAS library is found under that directory, CMake will raise an error and stop. - + If no BLAS library is found under that directory, CMake will raise an error and terminate. It can also be set as an ENV variable with the same name, although the flag takes precedence. Values: An absolute path. @@ -847,8 +846,8 @@ These are the flags: * AER_THRUST_BACKEND - We use Thrust library for GPU support through CUDA. If we want to build a version of `Aer` with GPU acceleration, we need to install CUDA and set this variable to the value: "CUDA". 
- There are other values that will use different CPU methods depending on the kind of backend we want to use: + We use Thrust library for GPU support through CUDA. If you want to build a version of `Aer` with GPU acceleration, you need to install CUDA and set this variable to the value: "CUDA". + There are other values that will use different CPU methods depending on the kind of backend you want to use: - "OMP": For OpenMP support - "TBB": For Intel Threading Building Blocks @@ -858,7 +857,7 @@ These are the flags: * AER_CUDA_ARCH - This flag allows us we to specify the CUDA architecture instead of letting the build system auto detect it. + This flag allows you to specify the CUDA architecture instead of letting the build system auto detect it. It can also be set as an ENV variable with the same name, although the flag takes precedence. Values: Auto | Common | All | List of valid CUDA architecture(s). @@ -908,13 +907,13 @@ These are the flags: ## Tests -Code contribution are expected to include tests that provide coverage for the +Code contributions are expected to include tests that provide coverage for the changes being made. We have two types of tests in the codebase: Qiskit Terra integration tests and Standalone integration tests. -For Qiskit Terra integration tests, you first need to build and install the Qiskit python extension, and then run `unittest` Python framework. +For Qiskit Terra integration tests, you first need to build and install the Qiskit Python extension, and then run `unittest` Python framework. ``` qiskit-aer$ pip install . @@ -923,7 +922,7 @@ qiskit-aer$ stestr run Manual for `stestr` can be found [here](https://stestr.readthedocs.io/en/latest/MANUAL.html#). -The integration tests for Qiskit python extension are included in: `test/terra`. +The integration tests for Qiskit Python extension are included in: `test/terra`. ## C++ Tests @@ -952,17 +951,17 @@ corresponding tests to verify this compatibility. 
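To illustrate the shape of the tests mentioned above, here is a deliberately toy example of a module that `unittest` (and therefore `stestr`) can discover. The class and test names are invented and it does not depend on qiskit-aer at all, so treat it as a sketch of the structure used under `test/terra`, not as a real Aer test:

```python
# Hypothetical miniature of a Terra-style integration test: the counts
# dictionary below is hard-coded where a real test would call
# result.get_counts() after a simulator run, but the assertion style matches.
import unittest

class TestCountsExample(unittest.TestCase):
    def test_counts_sum_to_shots(self):
        shots = 1000
        counts = {"00": 503, "11": 497}  # stand-in for a Bell-circuit result
        self.assertEqual(sum(counts.values()), shots)

    def test_only_correlated_outcomes(self):
        counts = {"00": 503, "11": 497}
        # A noiseless Bell state should only ever produce "00" or "11".
        self.assertTrue(set(counts) <= {"00", "11"})
```

Running `stestr run` (or `python -m unittest discover`) from the repository root picks such modules up automatically by name.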
## Debug
-We have to build in debug mode if we want to start a debugging session with tools like `gdb` or `lldb`.
-In order to create a Debug build for all platforms, we just need to pass a parameter while invoking the build to
+You have to build in debug mode if you want to start a debugging session with tools like `gdb` or `lldb`.
+To create a Debug build for all platforms, you just need to pass a parameter while invoking the build to
create the wheel file:
qiskit-aer$> python ./setup.py bdist_wheel --build-type=Debug
-If you want to debug the standalone executable, then the parameter changes to:
+If you want to debug the standalone executable, the parameter changes to:
qiskit-aer/out$> cmake -DCMAKE_BUILD_TYPE=Debug
-There are three different build configurations: `Release`, `Debug`, and `Release with Debug Symbols`, which parameters are:
+There are three different build configurations: `Release`, `Debug`, and `Release with Debug Symbols`, whose parameters are:
`Release`, `Debug`, `RelWithDebInfo` respectively.
We recommend building in verbose mode and dumping all the output to a file so it's easier to inspect possible build issues:
@@ -976,7 +975,7 @@ On Windows:
qiskit-aer> set VERBOSE=1
qiskit-aer> python ./setup.py bdist_wheel --build-type=Debug 1> build.log 2>&1
-We encourage to always send the whole `build.log` file when reporting a build issue, otherwise we will ask for it :)
+We encourage you to always send the whole `build.log` file when reporting a build issue; otherwise we will ask for it :)
**Stepping through the code**
Standalone version doesn't require anything special, just use your debugger like always:
qiskit-aer/out/Debug$ gdb qasm_simulator
Stepping through the code of a Python extension is another story, trickier, but possible. 
This is because Python interpreters -usually load Python extensions dynamically, so we need to start debugging the python interpreter and set our breakpoints ahead of time, before any of our python extension symbols are loaded into the process. +usually load Python extensions dynamically, so we need to start debugging the Python interpreter and set our breakpoints ahead of time, before any of our Python extension symbols are loaded into the process. -Once built and installed we have to run the debugger with the python interpreter: +Once built and installed, we have to run the debugger with the Python interpreter: $ lldb python @@ -1004,9 +1003,9 @@ Then we have to set our breakpoints: Breakpoint 1: no locations (pending). WARNING: Unable to resolve breakpoint to any actual locations. -Here the message is clear, it can't find the function: `AER::controller_execute` because our python extension hasn't been loaded yet - by the python interpreter, so it's "on-hold" hoping to find the function later in the execution. -Now we can run the python interpreter and pass the arguments (the python file to execute): +Here the message is clear, it can't find the function: `AER::controller_execute` because our Python extension hasn't been loaded yet + by the Python interpreter, so it's "on-hold" hoping to find the function later in the execution. +Now we can run the Python interpreter and pass the arguments (the python file to execute): (lldb) r test_qiskit_program.py Process 24896 launched: '/opt/anaconda3/envs/aer37/bin/python' (x86_64) From feb9fb26fb8eb37097de20163e53eea06ecb24f2 Mon Sep 17 00:00:00 2001 From: Amir Ebrahimi Date: Fri, 5 Mar 2021 11:31:37 -0800 Subject: [PATCH 2/7] Update README.md to mention Linux-only GPU support (#1095) Co-authored-by: Christopher J. 
Wood Co-authored-by: Matthew Treinish --- README.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 822563289e..dce7fad753 100755 --- a/README.md +++ b/README.md @@ -20,7 +20,7 @@ To install from source, follow the instructions in the [contribution guidelines] ## Installing GPU support -In order to install and run the GPU supported simulators, you need CUDA® 10.1 or newer previously installed. +In order to install and run the GPU supported simulators on Linux, you need CUDA® 10.1 or newer previously installed. CUDA® itself would require a set of specific GPU drivers. Please follow CUDA® installation procedure in the NVIDIA® [web](https://www.nvidia.com/drivers). If you want to install our GPU supported simulators, you have to install this other package: @@ -33,6 +33,11 @@ This will overwrite your current `qiskit-aer` package installation giving you the same functionality found in the canonical `qiskit-aer` package, plus the ability to run the GPU supported simulators: statevector, density matrix, and unitary. +**Note**: This package is only available on x86_64 Linux. For other platforms +that have CUDA support you will have to build from source. You can refer to +the [contributing guide](https://github.com/Qiskit/qiskit-aer/blob/master/CONTRIBUTING.md#building-with-gpu-support) +for instructions on doing this. + ## Simulating your first quantum program with Qiskit Aer Now that you have Qiskit Aer installed, you can start simulating quantum circuits with noise. Here is a basic example: From 7fe4f9eadc835705c25c13c753b8550b199efa44 Mon Sep 17 00:00:00 2001 From: "Christopher J. 
Wood" Date: Mon, 8 Mar 2021 15:50:25 -0500 Subject: [PATCH 3/7] Fix expval tests (#1173) --- .../qasm_simulator/qasm_save_expval.py | 20 +++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/test/terra/backends/qasm_simulator/qasm_save_expval.py b/test/terra/backends/qasm_simulator/qasm_save_expval.py index ed615f143a..e96c6b720f 100644 --- a/test/terra/backends/qasm_simulator/qasm_save_expval.py +++ b/test/terra/backends/qasm_simulator/qasm_save_expval.py @@ -46,9 +46,9 @@ def test_save_expval_stabilizer_pauli(self, pauli): # Stabilizer test circuit state_circ = qi.random_clifford(2, seed=SEED).to_circuit() - oper = qi.Pauli(pauli) + oper = qi.Operator(qi.Pauli(pauli)) state = qi.Statevector(state_circ) - target = state.expectation_value(oper).real.round(10) + target = state.expectation_value(oper).real # Snapshot circuit opts = self.BACKEND_OPTS.copy() @@ -78,7 +78,7 @@ def test_save_expval_var_stabilizer_pauli(self, pauli): # Stabilizer test circuit state_circ = qi.random_clifford(2, seed=SEED).to_circuit() - oper = qi.Pauli(pauli) + oper = qi.Operator(qi.Pauli(pauli)) state = qi.Statevector(state_circ) expval = state.expectation_value(oper).real variance = state.expectation_value(oper ** 2).real - expval ** 2 @@ -178,9 +178,9 @@ def test_save_expval_nonstabilizer_pauli(self, pauli): # Stabilizer test circuit state_circ = QuantumVolume(2, 1, seed=SEED) - oper = qi.Pauli(pauli) + oper = qi.Operator(qi.Pauli(pauli)) state = qi.Statevector(state_circ) - target = state.expectation_value(oper).real.round(10) + target = state.expectation_value(oper).real # Snapshot circuit opts = self.BACKEND_OPTS.copy() @@ -209,7 +209,7 @@ def test_save_expval_var_nonstabilizer_pauli(self, pauli): # Stabilizer test circuit state_circ = QuantumVolume(2, 1, seed=SEED) - oper = qi.Pauli(pauli) + oper = qi.Operator(qi.Pauli(pauli)) state = qi.Statevector(state_circ) expval = state.expectation_value(oper).real variance = state.expectation_value(oper ** 2).real 
- expval ** 2
@@ -244,7 +244,7 @@ def test_save_expval_nonstabilizer_hermitian(self, qubits):
state_circ = QuantumVolume(3, 1, seed=SEED)
oper = qi.random_hermitian(4, traceless=True, seed=SEED)
state = qi.Statevector(state_circ)
- target = state.expectation_value(oper, qubits).real.round(10)
+ target = state.expectation_value(oper, qubits).real
# Snapshot circuit
opts = self.BACKEND_OPTS.copy()
@@ -305,7 +305,7 @@ def test_save_expval_cptp_pauli(self, pauli):
opts = self.BACKEND_OPTS.copy()
if opts.get('method') in SUPPORTED_METHODS:
- oper = qi.Pauli(pauli)
+ oper = qi.Operator(qi.Pauli(pauli))
# CPTP channel test circuit
channel = qi.random_quantum_channel(4, seed=SEED)
@@ -313,7 +313,7 @@ def test_save_expval_cptp_pauli(self, pauli):
state_circ.append(channel, range(2))
state = qi.DensityMatrix(state_circ)
- target = state.expectation_value(oper).real.round(10)
+ target = state.expectation_value(oper).real
# Snapshot circuit
circ = transpile(state_circ, self.SIMULATOR)
@@ -337,7 +337,7 @@ def test_save_expval_var_cptp_pauli(self, pauli):
opts = self.BACKEND_OPTS.copy()
if opts.get('method') in SUPPORTED_METHODS:
- oper = qi.Pauli(pauli)
+ oper = qi.Operator(qi.Pauli(pauli))
# CPTP channel test circuit
channel = qi.random_quantum_channel(4, seed=SEED)
From d69f7e921f120a7db62705f4cf0498eaee9b02dc Mon Sep 17 00:00:00 2001
From: "Christopher J. 
Wood" Date: Tue, 9 Mar 2021 01:43:24 -0500 Subject: [PATCH 4/7] Fix extended stabilizer method basis gates (#1175) --- qiskit/providers/aer/backends/qasm_simulator.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/qiskit/providers/aer/backends/qasm_simulator.py b/qiskit/providers/aer/backends/qasm_simulator.py index 56a8ba75fd..e9f0eefdab 100644 --- a/qiskit/providers/aer/backends/qasm_simulator.py +++ b/qiskit/providers/aer/backends/qasm_simulator.py @@ -524,8 +524,8 @@ def _method_configuration(method=None): config.custom_instructions = sorted(['roerror', 'snapshot', 'save_statevector', 'save_expval', 'save_expval_var']) config.basis_gates = sorted([ - 'cx', 'cz', 'id', 'x', 'y', 'z', 'h', 's', 'sdg', 'sx', 'swap', - 'u0', 'u1', 'p', 'ccx', 'ccz', 'delay' + 'cx', 'cz', 'id', 'x', 'y', 'z', 'h', 's', 'sdg', 'sx', + 'swap', 'u0', 't', 'tdg', 'u1', 'p', 'ccx', 'ccz', 'delay' ] + config.custom_instructions) return config From 3d2575ae5fa2cd584210c3a4938b8afe4f3adb75 Mon Sep 17 00:00:00 2001 From: "Christopher J. Wood" Date: Tue, 9 Mar 2021 01:46:41 -0500 Subject: [PATCH 5/7] Update CODEOWNERS (#1174) --- .github/CODEOWNERS | 18 ++++++++---------- 1 file changed, 8 insertions(+), 10 deletions(-) diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index ec0b63bb2b..8734413ddc 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -8,15 +8,13 @@ # Generic rule for the repository. 
This pattern is actually the one that will # apply unless specialized by a later rule -* @chriseclectic @vvilpas @atilag +* @chriseclectic @vvilpas # Individual folders on root directory -/qiskit @chriseclectic @atilag @vvilpas -/cmake @atilag @vvilpas -/doc @chriseclectic @atilag @vvilpas -/examples @chriseclectic @atilag @vvilpas -/contrib @chriseclectic @atilag @vvilpas -/test @chriseclectic @atilag @vvilpas -/src @chriseclectic @atilag @vvilpas - -# AER specific folders +/qiskit @chriseclectic @vvilpas @mtreinish +/test @chriseclectic @vvilpas @mtreinish +/doc @chriseclectic @vvilpas @mtreinish +/releasenotes @chriseclectic @vvilpas @mtreinish +/cmake @vvilpas +/contrib @chriseclectic @vvilpas @hhorii +/src @chriseclectic @vvilpas @hhorii From acd216d040c0d9ec1161c82331820841cb13386f Mon Sep 17 00:00:00 2001 From: Jun Doi Date: Wed, 10 Mar 2021 19:17:46 +0900 Subject: [PATCH 6/7] Fixes of multi-chunk State implementation (#1149) Co-authored-by: Victor Villar Co-authored-by: Christopher J. 
Wood --- CONTRIBUTING.md | 3 + src/controllers/controller.hpp | 55 +++ src/controllers/qasm_controller.hpp | 94 ++-- src/controllers/statevector_controller.hpp | 38 +- src/controllers/unitary_controller.hpp | 36 +- .../density_matrix/densitymatrix.hpp | 27 ++ .../density_matrix/densitymatrix_state.hpp | 4 +- .../densitymatrix_state_chunk.hpp | 425 ++++++++++++------ .../density_matrix/densitymatrix_thrust.hpp | 63 +++ src/simulators/state.hpp | 2 +- src/simulators/state_chunk.hpp | 102 +++-- src/simulators/statevector/chunk/chunk.hpp | 2 + .../statevector/chunk/chunk_container.hpp | 3 - .../chunk/device_chunk_container.hpp | 5 +- .../chunk/host_chunk_container.hpp | 3 + .../statevector/qubitvector_thrust.hpp | 10 +- .../statevector/statevector_state.hpp | 4 +- .../statevector/statevector_state_chunk.hpp | 236 +++++++++- src/simulators/unitary/unitary_state.hpp | 4 +- .../unitary/unitary_state_chunk.hpp | 129 ++++-- src/transpile/cacheblocking.hpp | 2 +- .../backends/qasm_simulator/qasm_chunk.py | 136 ++++++ ...est_qasm_simulator_density_matrix_chunk.py | 74 +++ .../test_qasm_simulator_density_matrix_mpi.py | 84 ---- ... 
test_qasm_simulator_statevector_chunk.py} | 49 +- 25 files changed, 1113 insertions(+), 477 deletions(-) create mode 100644 test/terra/backends/qasm_simulator/qasm_chunk.py create mode 100644 test/terra/backends/test_qasm_simulator_density_matrix_chunk.py delete mode 100644 test/terra/backends/test_qasm_simulator_density_matrix_mpi.py rename test/terra/backends/{test_qasm_simulator_statevector_mpi.py => test_qasm_simulator_statevector_chunk.py} (56%) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index a3db828f52..44a59da025 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -681,7 +681,10 @@ This technique allows applying quantum gates to each chunk independently without Before the actual simulation, we apply transpilation to remap the input circuits to the equivalent circuits that has all the quantum gates on the lower qubits than the chunk's number of qubits. And the (noiseless) swap gates are inserted to exchange data. +Please refer to this paper (https://arxiv.org/abs/2102.02957) for a more detailed description of the algorithm and the implementation of parallel simulation. + So to simulate by using multiple GPUs or multiple nodes on the cluster, following configurations should be set to backend options.
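As a rough illustration of how the chunking options described here interact: the option names `blocking_enable` and `blocking_qubits` are the ones this section of CONTRIBUTING.md introduces, but `num_chunks` is a purely hypothetical helper sketching the arithmetic (the density-matrix method would use twice the qubit count):

```python
def num_chunks(num_qubits, blocking_qubits):
    """Number of chunks a statevector is split into under cache blocking."""
    if blocking_qubits >= num_qubits:
        return 1  # the whole state fits in a single chunk
    return 2 ** (num_qubits - blocking_qubits)

# Backend options as named in this section's configuration list
backend_options = {
    "blocking_enable": True,  # turn on multi-chunk (cache-blocked) simulation
    "blocking_qubits": 23,    # each chunk holds 2**23 amplitudes
}

# A 30-qubit statevector with 23-qubit chunks is split into 128 chunks,
# which can then be distributed over GPUs or MPI processes.
print(num_chunks(30, backend_options["blocking_qubits"]))  # -> 128
```

The inserted (noiseless) swap gates mentioned above are what move amplitudes between these chunks when a gate touches a qubit outside the chunk boundary.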
+(If there is not enough memory to simulate the input circuit, Qiskit Aer automatically sets the following options, but it is recommended to set them explicitly) - blocking_enable diff --git a/src/controllers/controller.hpp b/src/controllers/controller.hpp index 8babb798ef..f20e0e3e3f 100755 --- a/src/controllers/controller.hpp +++ b/src/controllers/controller.hpp @@ -51,6 +51,7 @@ #include "noise/noise_model.hpp" #include "transpile/basic_opts.hpp" #include "transpile/truncate_qubits.hpp" +#include "transpile/cacheblocking.hpp" namespace AER { namespace Base { @@ -216,8 +217,19 @@ class Controller { set_distributed_parallelization(const std::vector<Circuit> &circuits, const std::vector<Noise::NoiseModel> &noise); + virtual bool multiple_chunk_required(const Circuit &circuit, + const Noise::NoiseModel &noise) const; + void save_exception_to_results(Result &result,const std::exception &e); + + //setting cache blocking transpiler + Transpile::CacheBlocking transpile_cache_blocking(const Circuit& circ, + const Noise::NoiseModel& noise, + const json_t& config, + const size_t complex_size,bool is_matrix) const; + + // Get system memory size size_t get_system_memory_mb(); size_t get_gpu_memory_mb(); @@ -274,6 +286,8 @@ class Controller { //process information (MPI) int myrank_ = 0; int num_processes_ = 1; + + uint_t cache_block_qubit_ = 0; }; //========================================================================= @@ -348,6 +362,11 @@ void Controller::set_config(const json_t &config) { JSON::get_value(accept_distributed_results_, "accept_distributed_results", config); } + //enable multiple qregs if cache blocking is enabled + cache_block_qubit_ = 0; + if(JSON::check_key("blocking_qubits", config)){ + JSON::get_value(cache_block_qubit_,"blocking_qubits", config); + } } void Controller::clear_config() { @@ -535,6 +554,21 @@ uint_t Controller::get_distributed_num_processes(bool par_shots) const } } +bool Controller::multiple_chunk_required(const Circuit &circ, + const Noise::NoiseModel &noise) const +{
+ if(circ.num_qubits < 3) + return false; + + if(num_process_per_experiment_ > 1 || Controller::get_min_memory_mb() < required_memory_mb(circ, noise)) + return true; + + if(cache_block_qubit_ >= 2 && cache_block_qubit_ < circ.num_qubits) + return true; + + return false; +} + size_t Controller::get_system_memory_mb() { size_t total_physical_memory = 0; #if defined(__linux__) || defined(__APPLE__) @@ -654,6 +688,27 @@ void Controller::save_exception_to_results(Result &result,const std::exception & } } +Transpile::CacheBlocking Controller::transpile_cache_blocking(const Circuit& circ, + const Noise::NoiseModel& noise, + const json_t& config, + const size_t complex_size,bool is_matrix) const +{ + Transpile::CacheBlocking cache_block_pass; + + cache_block_pass.set_config(config); + if(!cache_block_pass.enabled()){ + //if blocking is not set by config, automatically set if required + if(multiple_chunk_required(circ,noise)){ + int nplace = num_process_per_experiment_; + if(num_gpus_ > 0) + nplace *= num_gpus_; + cache_block_pass.set_blocking(circ.num_qubits, get_min_memory_mb() << 20, nplace, complex_size,is_matrix); + } + } + + return cache_block_pass; +} + //------------------------------------------------------------------------- // Qobj execution //------------------------------------------------------------------------- diff --git a/src/controllers/qasm_controller.hpp b/src/controllers/qasm_controller.hpp index ba903aa45e..a408b2e83d 100755 --- a/src/controllers/qasm_controller.hpp +++ b/src/controllers/qasm_controller.hpp @@ -215,11 +215,6 @@ class QasmController : public Base::Controller { const Operations::OpSet &opset, const json_t& config) const; - - Transpile::CacheBlocking transpile_cache_blocking(const Circuit& circ, - const Noise::NoiseModel& noise, - const json_t& config) const; - //---------------------------------------------------------------- // Run circuit helpers //---------------------------------------------------------------- @@ -306,9 +301,6 @@ 
class QasmController : public Base::Controller { // Controller-level parameter for CH method bool extended_stabilizer_measure_sampling_ = false; - - //using multiple chunks - bool multiple_qregs_ = false; }; //========================================================================= @@ -381,11 +373,6 @@ void QasmController::set_config(const json_t& config) { "QasmController: initial_statevector is not a unit vector"); } } - - //enable multiple qregs if cache blocking is enabled - if(JSON::check_key("blocking_enable", config)){ - JSON::get_value(multiple_qregs_,"blocking_enable", config); - } } void QasmController::clear_config() { @@ -407,7 +394,7 @@ void QasmController::run_circuit(const Circuit& circ, // Validate circuit for simulation method switch (simulation_method(circ, noise, true)) { case Method::statevector: { - if(multiple_qregs_){ + if(Base::Controller::multiple_chunk_required(circ,noise)){ if (simulation_precision_ == Precision::double_precision) { // Double-precision Statevector simulation return run_circuit_helper>>( @@ -440,7 +427,7 @@ void QasmController::run_circuit(const Circuit& circ, "QasmController: method statevector_gpu is not supported on this " "system"); #else - if(multiple_qregs_ || (parallel_shots_ > 1 || parallel_experiments_ > 1)){ + if(Base::Controller::multiple_chunk_required(circ,noise) || (parallel_shots_ > 1 || parallel_experiments_ > 1)){ if (simulation_precision_ == Precision::double_precision) { // Double-precision Statevector simulation return run_circuit_helper< @@ -478,7 +465,7 @@ void QasmController::run_circuit(const Circuit& circ, "QasmController: method statevector_thrust is not supported on this " "system"); #else - if(multiple_qregs_){ + if(Base::Controller::multiple_chunk_required(circ,noise)){ if (simulation_precision_ == Precision::double_precision) { // Double-precision Statevector simulation return run_circuit_helper< @@ -511,7 +498,7 @@ void QasmController::run_circuit(const Circuit& circ, #endif } case 
Method::density_matrix: { - if(multiple_qregs_){ + if(Base::Controller::multiple_chunk_required(circ,noise)){ if (simulation_precision_ == Precision::double_precision) { // Double-precision density matrix simulation return run_circuit_helper< @@ -548,7 +535,7 @@ void QasmController::run_circuit(const Circuit& circ, "QasmController: method density_matrix_gpu is not supported on this " "system"); #else - if(multiple_qregs_ || (parallel_shots_ > 1 || parallel_experiments_ > 1)){ + if(Base::Controller::multiple_chunk_required(circ,noise) || (parallel_shots_ > 1 || parallel_experiments_ > 1)){ if (simulation_precision_ == Precision::double_precision) { // Double-precision density matrix simulation return run_circuit_helper< @@ -586,7 +573,7 @@ void QasmController::run_circuit(const Circuit& circ, "this " "system"); #else - if(multiple_qregs_){ + if(Base::Controller::multiple_chunk_required(circ,noise)){ if (simulation_precision_ == Precision::double_precision) { // Double-precision density matrix simulation return run_circuit_helper< @@ -938,42 +925,6 @@ Transpile::Fusion QasmController::transpile_fusion(Method method, return fusion_pass; } -Transpile::CacheBlocking QasmController::transpile_cache_blocking(const Circuit& circ, - const Noise::NoiseModel& noise, - const json_t& config) const -{ - Transpile::CacheBlocking cache_block_pass; - - cache_block_pass.set_config(config); - if(!cache_block_pass.enabled()){ - //if blocking is not set by config, automatically set if required - if(Base::Controller::num_process_per_experiment_ > 1 || Base::Controller::get_min_memory_mb() < required_memory_mb(circ, noise)){ - int nplace = Base::Controller::num_process_per_experiment_; - if(Base::Controller::num_gpus_ > 0) - nplace *= Base::Controller::num_gpus_; - - size_t complex_size = (simulation_precision_ == Precision::single_precision) ? 
sizeof(std::complex) : sizeof(std::complex); - - switch (simulation_method(circ, noise, false)) { - case Method::statevector: - case Method::statevector_thrust_cpu: - case Method::statevector_thrust_gpu: - cache_block_pass.set_blocking(circ.num_qubits, Base::Controller::get_min_memory_mb() << 20, nplace, complex_size,false); - break; - case Method::density_matrix: - case Method::density_matrix_thrust_cpu: - case Method::density_matrix_thrust_gpu: - cache_block_pass.set_blocking(circ.num_qubits, Base::Controller::get_min_memory_mb() << 20, nplace, complex_size,true); - break; - default: - throw std::runtime_error("QasmController: No enough memory to simulate this method on the sysytem"); - } - } - } - - return cache_block_pass; -} - void QasmController::set_parallelization_circuit( const Circuit& circ, const Noise::NoiseModel& noise_model) { @@ -1148,9 +1099,19 @@ void QasmController::run_circuit_helper(const Circuit& circ, auto fusion_pass = transpile_fusion(method, opt_circ.opset(), config); fusion_pass.optimize_circuit(opt_circ, dummy_noise, state.opset(), result); - auto cache_block_pass = transpile_cache_blocking(opt_circ,noise,config); + bool is_matrix = false; + if(method == Method::density_matrix || method == Method::density_matrix_thrust_gpu || method == Method::density_matrix_thrust_cpu) + is_matrix = true; + auto cache_block_pass = transpile_cache_blocking(opt_circ,noise,config,(simulation_precision_ == Precision::single_precision) ? 
sizeof(std::complex) : sizeof(std::complex),is_matrix); cache_block_pass.optimize_circuit(opt_circ, dummy_noise, state.opset(), result); + uint_t block_bits = 0; + if(cache_block_pass.enabled()) + block_bits = cache_block_pass.block_bits(); + + //allocate qubit register + state.allocate(Base::Controller::max_qubits_,block_bits); + // Run simulation run_multi_shot(opt_circ, shots, state, initial_state, method, result, rng); } @@ -1179,9 +1140,6 @@ void QasmController::run_multi_shot(const Circuit& circ, // Implement measure sampler auto pos = circ.first_measure_pos; // Position of first measurement op - //allocate qubit register - state.allocate(Base::Controller::max_qubits_); - // Run circuit instructions before first measure std::vector ops(circ.ops.begin(), circ.ops.begin() + pos); @@ -1197,9 +1155,6 @@ void QasmController::run_multi_shot(const Circuit& circ, // Add measure sampling metadata result.metadata.add(true, "measure_sampling"); } else { - //allocate qubit register - state.allocate(Base::Controller::max_qubits_); - // Perform standard execution if we cannot apply the // measurement sampling optimization while (shots-- > 0) { @@ -1225,10 +1180,10 @@ void QasmController::run_circuit_with_sampled_noise(const Circuit& circ, measure_pass.set_config(config); Noise::NoiseModel dummy_noise; - auto cache_block_pass = transpile_cache_blocking(circ,noise,config); - - //allocate qubit register - state.allocate(Base::Controller::max_qubits_); + bool is_matrix = false; + if(method == Method::density_matrix || method == Method::density_matrix_thrust_gpu || method == Method::density_matrix_thrust_cpu) + is_matrix = true; + auto cache_block_pass = transpile_cache_blocking(circ,noise,config,(simulation_precision_ == Precision::single_precision) ? 
sizeof(std::complex) : sizeof(std::complex),is_matrix); // Sample noise using circuit method while (shots-- > 0) { @@ -1238,6 +1193,13 @@ void QasmController::run_circuit_with_sampled_noise(const Circuit& circ, fusion_pass.optimize_circuit(noise_circ, dummy_noise, state.opset(), result); cache_block_pass.optimize_circuit(noise_circ, dummy_noise, state.opset(), result); + uint_t block_bits = 0; + if(cache_block_pass.enabled()) + block_bits = cache_block_pass.block_bits(); + + //allocate qubit register + state.allocate(Base::Controller::max_qubits_,block_bits); + run_single_shot(noise_circ, state, initial_state, result, rng); } } diff --git a/src/controllers/statevector_controller.hpp b/src/controllers/statevector_controller.hpp index b851632c31..db5c9a9cfe 100755 --- a/src/controllers/statevector_controller.hpp +++ b/src/controllers/statevector_controller.hpp @@ -124,9 +124,6 @@ class StatevectorController : public Base::Controller { // Precision of statevector Precision precision_ = Precision::double_precision; - //using multiple chunks - bool multiple_qregs_ = false; - }; //========================================================================= @@ -182,11 +179,6 @@ void StatevectorController::set_config(const json_t& config) { precision_ = Precision::single_precision; } } - - //enable multiple qregs if cache blocking is enabled - if(JSON::check_key("blocking_enable", config)){ - JSON::get_value(multiple_qregs_,"blocking_enable", config); - } } void StatevectorController::clear_config() { @@ -215,7 +207,7 @@ void StatevectorController::run_circuit( switch (method_) { case Method::automatic: case Method::statevector_cpu: { - if(multiple_qregs_){ + if(Base::Controller::multiple_chunk_required(circ,noise)){ if (precision_ == Precision::double_precision) { // Double-precision Statevector simulation return run_circuit_helper>>( @@ -240,7 +232,7 @@ void StatevectorController::run_circuit( } case Method::statevector_thrust_gpu: { #ifdef AER_THRUST_CUDA - 
if(multiple_qregs_){ + if(Base::Controller::multiple_chunk_required(circ,noise)){ if (precision_ == Precision::double_precision) { // Double-precision Statevector simulation return run_circuit_helper< @@ -275,7 +267,7 @@ void StatevectorController::run_circuit( } case Method::statevector_thrust_cpu: { #ifdef AER_THRUST_CPU - if(multiple_qregs_){ + if(Base::Controller::multiple_chunk_required(circ,noise)){ if (precision_ == Precision::double_precision) { // Double-precision Statevector simulation return run_circuit_helper< @@ -353,34 +345,32 @@ void StatevectorController::run_circuit_helper( result.set_config(config); // Optimize circuit - const std::vector* op_ptr = &circ.ops; Transpile::Fusion fusion_pass; - Transpile::CacheBlocking cache_block_pass; - fusion_pass.set_config(config); - cache_block_pass.set_config(config); - fusion_pass.set_parallelization(parallel_state_update_); - Circuit opt_circ; + Circuit opt_circ = circ; // copy circuit + Noise::NoiseModel dummy_noise; // dummy object for transpile pass if (fusion_pass.active && circ.num_qubits >= fusion_pass.threshold) { - opt_circ = circ; // copy circuit - Noise::NoiseModel dummy_noise; // dummy object for transpile pass fusion_pass.optimize_circuit(opt_circ, dummy_noise, state.opset(), result); - cache_block_pass.optimize_circuit(opt_circ, dummy_noise, state.opset(), result); - op_ptr = &opt_circ.ops; } - // Run single shot collecting measure data or snapshots - state.allocate(Base::Controller::max_qubits_); + Transpile::CacheBlocking cache_block_pass = transpile_cache_blocking(opt_circ,dummy_noise,config,(precision_ == Precision::single_precision) ? 
sizeof(std::complex) : sizeof(std::complex),false); + cache_block_pass.optimize_circuit(opt_circ, dummy_noise, state.opset(), result); + uint_t block_bits = 0; + if(cache_block_pass.enabled()) + block_bits = cache_block_pass.block_bits(); + state.allocate(Base::Controller::max_qubits_,block_bits); + + // Run single shot collecting measure data or snapshots if (initial_state_.empty()) { state.initialize_qreg(circ.num_qubits); } else { state.initialize_qreg(circ.num_qubits, initial_state_); } state.initialize_creg(circ.num_memory, circ.num_registers); - state.apply_ops(*op_ptr, result, rng); + state.apply_ops(opt_circ.ops, result, rng); Base::Controller::save_count_data(result, state.creg()); // Add final state to the data diff --git a/src/controllers/unitary_controller.hpp b/src/controllers/unitary_controller.hpp index 935ca69dc6..f54f52d5b2 100755 --- a/src/controllers/unitary_controller.hpp +++ b/src/controllers/unitary_controller.hpp @@ -113,10 +113,6 @@ class UnitaryController : public Base::Controller { // Precision of a unitary matrix Precision precision_ = Precision::double_precision; - - //using multiple chunks - bool multiple_qregs_ = false; - }; //========================================================================= @@ -172,11 +168,6 @@ void UnitaryController::set_config(const json_t &config) { precision_ = Precision::single_precision; } } - - //enable multiple qregs if cache blocking is enabled - if(JSON::check_key("blocking_enable", config)){ - JSON::get_value(multiple_qregs_,"blocking_enable", config); - } } void UnitaryController::clear_config() { @@ -207,7 +198,7 @@ void UnitaryController::run_circuit(const Circuit &circ, switch (method_) { case Method::automatic: case Method::unitary_cpu: { - if(multiple_qregs_){ + if(Base::Controller::multiple_chunk_required(circ,noise)){ if (precision_ == Precision::double_precision) { // Double-precision unitary simulation return run_circuit_helper< @@ -236,7 +227,7 @@ void UnitaryController::run_circuit(const 
Circuit &circ, } case Method::unitary_thrust_gpu: { #ifdef AER_THRUST_CUDA - if(multiple_qregs_){ + if(Base::Controller::multiple_chunk_required(circ,noise)){ if (precision_ == Precision::double_precision) { // Double-precision unitary simulation return run_circuit_helper< @@ -270,7 +261,7 @@ void UnitaryController::run_circuit(const Circuit &circ, } case Method::unitary_thrust_cpu: { #ifdef AER_THRUST_CPU - if(multiple_qregs_){ + if(Base::Controller::multiple_chunk_required(circ,noise)){ if (precision_ == Precision::double_precision) { // Double-precision unitary simulation return run_circuit_helper< @@ -354,25 +345,26 @@ void UnitaryController::run_circuit_helper( result.metadata.add(state.name(), "method"); // Optimize circuit - const std::vector* op_ptr = &circ.ops; Transpile::Fusion fusion_pass; - Transpile::CacheBlocking cache_block_pass; fusion_pass.threshold /= 2; // Halve default threshold for unitary simulator fusion_pass.set_config(config); - cache_block_pass.set_config(config); fusion_pass.set_parallelization(parallel_state_update_); - Circuit opt_circ; + Circuit opt_circ = circ; // copy circuit + Noise::NoiseModel dummy_noise; // dummy object for transpile pass if (fusion_pass.active && circ.num_qubits >= fusion_pass.threshold) { - opt_circ = circ; // copy circuit - Noise::NoiseModel dummy_noise; // dummy object for transpile pass fusion_pass.optimize_circuit(opt_circ, dummy_noise, state.opset(), result); - cache_block_pass.optimize_circuit(opt_circ, dummy_noise, state.opset(), result); - op_ptr = &opt_circ.ops; } + Transpile::CacheBlocking cache_block_pass = transpile_cache_blocking(opt_circ,dummy_noise,config,(precision_ == Precision::single_precision) ? 
sizeof(std::complex) : sizeof(std::complex),true); + cache_block_pass.optimize_circuit(opt_circ, dummy_noise, state.opset(), result); + + uint_t block_bits = 0; + if(cache_block_pass.enabled()) + block_bits = cache_block_pass.block_bits(); + state.allocate(Base::Controller::max_qubits_,block_bits); + // Run single shot collecting measure data or snapshots - state.allocate(Base::Controller::max_qubits_); if (initial_unitary_.empty()) { state.initialize_qreg(circ.num_qubits); @@ -380,7 +372,7 @@ void UnitaryController::run_circuit_helper( state.initialize_qreg(circ.num_qubits, initial_unitary_); } state.initialize_creg(circ.num_memory, circ.num_registers); - state.apply_ops(*op_ptr, result, rng); + state.apply_ops(opt_circ.ops, result, rng); Base::Controller::save_count_data(result, state.creg()); // Add final state unitary to the data diff --git a/src/simulators/density_matrix/densitymatrix.hpp b/src/simulators/density_matrix/densitymatrix.hpp index a013296702..2e2b9ad833 100755 --- a/src/simulators/density_matrix/densitymatrix.hpp +++ b/src/simulators/density_matrix/densitymatrix.hpp @@ -131,6 +131,7 @@ class DensityMatrix : public UnitaryMatrix { // Return Pauli expectation value double expval_pauli(const reg_t &qubits, const std::string &pauli,const complex_t initial_phase=1.0) const; + double expval_pauli_non_diagonal_chunk(const reg_t &qubits, const std::string &pauli,const complex_t initial_phase=1.0) const; protected: @@ -400,6 +401,32 @@ double DensityMatrix::expval_pauli(const reg_t &qubits, std::move(lambda), size_t(0), nrows >> 1)); } +template +double DensityMatrix::expval_pauli_non_diagonal_chunk(const reg_t &qubits, + const std::string &pauli,const complex_t initial_phase) const +{ + uint_t x_mask, z_mask, num_y, x_max; + std::tie(x_mask, z_mask, num_y, x_max) = QV::pauli_masks_and_phase(qubits, pauli); + + // Size of density matrix + const size_t nrows = BaseMatrix::rows_; + + auto phase = std::complex(initial_phase); + QV::add_y_phase(num_y, phase); 
+ + auto lambda = [&](const int_t i, double &val_re, double &val_im)->void { + (void)val_im; // unused + auto idx_mat = i ^ x_mask + nrows * i; + auto val = std::real(phase * BaseVector::data_[idx_mat]); + if (z_mask && (AER::Utils::popcount(i & z_mask) & 1)) { + val = - val; + } + val_re += val; + }; + return std::real(BaseVector::apply_reduction_lambda( + std::move(lambda), size_t(0), nrows)); +} + //----------------------------------------------------------------------- // Z-measurement outcome probabilities //----------------------------------------------------------------------- diff --git a/src/simulators/density_matrix/densitymatrix_state.hpp b/src/simulators/density_matrix/densitymatrix_state.hpp index 19bf2b43f8..25c6b80322 100644 --- a/src/simulators/density_matrix/densitymatrix_state.hpp +++ b/src/simulators/density_matrix/densitymatrix_state.hpp @@ -129,7 +129,7 @@ class State : public Base::State { virtual std::vector sample_measure(const reg_t &qubits, uint_t shots, RngEngine &rng) override; - virtual void allocate(uint_t num_qubits) override; + virtual void allocate(uint_t num_qubits,uint_t block_bits) override; //----------------------------------------------------------------------- // Additional methods @@ -359,7 +359,7 @@ const stringmap_t State::snapshotset_( // Initialization //------------------------------------------------------------------------- template -void State::allocate(uint_t num_qubits) +void State::allocate(uint_t num_qubits,uint_t block_bits) { BaseState::qreg_.chunk_setup(num_qubits*2,num_qubits*2,0,1); } diff --git a/src/simulators/density_matrix/densitymatrix_state_chunk.hpp b/src/simulators/density_matrix/densitymatrix_state_chunk.hpp index 2a625d7d13..31128fc989 100644 --- a/src/simulators/density_matrix/densitymatrix_state_chunk.hpp +++ b/src/simulators/density_matrix/densitymatrix_state_chunk.hpp @@ -27,36 +27,34 @@ #include "densitymatrix_thrust.hpp" #endif -//#include "densitymatrix_state.h" - namespace AER { namespace 
DensityMatrixChunk { +using OpType = Operations::OpType; + // OpSet of supported instructions const Operations::OpSet StateOpSet( // Op types - {Operations::OpType::gate, Operations::OpType::measure, - Operations::OpType::reset, Operations::OpType::snapshot, - Operations::OpType::barrier, Operations::OpType::bfunc, - Operations::OpType::roerror, Operations::OpType::matrix, - Operations::OpType::diagonal_matrix, Operations::OpType::kraus, - Operations::OpType::superop, Operations::OpType::save_expval, - Operations::OpType::save_expval_var}, + {OpType::gate, OpType::measure, + OpType::reset, OpType::snapshot, + OpType::barrier, OpType::bfunc, + OpType::roerror, OpType::matrix, + OpType::diagonal_matrix, OpType::kraus, + OpType::superop, OpType::save_expval, + OpType::save_expval_var, OpType::save_densmat, + OpType::save_probs, OpType::save_probs_ket, + OpType::save_amps_sq + }, // Gates {"U", "CX", "u1", "u2", "u3", "u", "cx", "cy", "cz", "swap", "id", "x", "y", "z", "h", "s", "sdg", "t", "tdg", "ccx", "r", "rx", "ry", "rz", "rxx", "ryy", "rzz", "rzx", "p", "cp", "cu1", "sx", "x90", "delay", "pauli"}, // Snapshots - {"memory", "register", "probabilities", + {"density_matrix", "memory", "register", "probabilities", "probabilities_with_variance", "expectation_value_pauli", "expectation_value_pauli_with_variance"}); -// Allowed gates enum class -enum class Gates { - u1, u2, u3, r, rx,ry, rz, id, x, y, z, h, s, sdg, sx, t, tdg, - cx, cy, cz, swap, rxx, ryy, rzz, rzx, ccx, cp, pauli -}; //========================================================================= // DensityMatrix State subclass @@ -115,8 +113,9 @@ class State : public Base::StateChunk { void initialize_omp(); auto move_to_matrix(); - + auto copy_to_matrix(); protected: + auto apply_to_matrix(bool copy = false); //----------------------------------------------------------------------- // Apply instructions @@ -170,10 +169,28 @@ class State : public Base::StateChunk { // Save data instructions 
//----------------------------------------------------------------------- + // Save the current density matrix or reduced density matrix + void apply_save_density_matrix(const Operations::Op &op, + ExperimentResult &result, + bool last_op = false); + + // Helper function for computing expectation value + void apply_save_probs(const Operations::Op &op, + ExperimentResult &result); + + // Helper function for saving amplitudes squared + void apply_save_amplitudes_sq(const Operations::Op &op, + ExperimentResult &result); + // Helper function for computing expectation value virtual double expval_pauli(const reg_t &qubits, const std::string& pauli) override; + // Return the reduced density matrix for the simulator + cmatrix_t reduced_density_matrix(const reg_t &qubits, bool last_op = false); + cmatrix_t reduced_density_matrix_helper(const reg_t &qubits, + const reg_t &qubits_sorted); + //----------------------------------------------------------------------- // Measurement Helpers //----------------------------------------------------------------------- @@ -230,8 +247,6 @@ class State : public Base::StateChunk { ExperimentResult &result, bool variance); - // Return the reduced density matrix for the simulator - cmatrix_t reduced_density_matrix(const reg_t &qubits, const reg_t& qubits_sorted); //----------------------------------------------------------------------- // Single-qubit gate helpers @@ -276,7 +291,7 @@ void State::initialize_qreg(uint_t num_qubits) if(BaseState::chunk_bits_ == BaseState::num_qubits_){ for(i=0;i::initialize_qreg(uint_t num_qubits) #pragma omp parallel for if(BaseState::chunk_omp_parallel_) private(i) for(i=0;inum_qubits_ == this->chunk_bits_){ BaseState::qregs_[i].initialize(); } @@ -309,7 +324,7 @@ void State::initialize_qreg(uint_t num_qubits, int_t iChunk; if(BaseState::chunk_bits_ == BaseState::num_qubits_){ for(iChunk=0;iChunk::initialize_qreg(uint_t num_qubits, #pragma omp parallel for if(BaseState::chunk_omp_parallel_) private(iChunk) 
for(iChunk=0;iChunk> (BaseState::num_qubits_/2 - BaseState::chunk_bits_/2); - local_row_offset <<= (BaseState::chunk_bits_/2); - local_col_offset <<= (BaseState::chunk_bits_/2); + uint_t irow_chunk = ((iChunk + BaseState::global_chunk_index_) >> ((BaseState::num_qubits_ - BaseState::chunk_bits_))) << (BaseState::chunk_bits_); + uint_t icol_chunk = ((iChunk + BaseState::global_chunk_index_) & ((1ull << ((BaseState::num_qubits_ - BaseState::chunk_bits_)))-1)) << (BaseState::chunk_bits_); //copy part of state for this chunk uint_t i,row,col; cvector_t tmp(1ull << BaseState::chunk_bits_); for(i=0;i<(1ull << BaseState::chunk_bits_);i++){ - uint_t row = i & ((1ull << (BaseState::chunk_bits_/2))-1); - uint_t col = i >> (BaseState::chunk_bits_/2); - tmp[i] = input[local_row_offset + row + ((local_col_offset + col) << (BaseState::num_qubits_/2))]; + uint_t icol = i & ((1ull << (BaseState::chunk_bits_))-1); + uint_t irow = i >> (BaseState::chunk_bits_); + tmp[i] = input[icol_chunk + icol + ((irow_chunk + irow) << (BaseState::num_qubits_))]; } - BaseState::qregs_[iChunk].set_num_qubits(BaseState::chunk_bits_/2); + BaseState::qregs_[iChunk].set_num_qubits(BaseState::chunk_bits_); BaseState::qregs_[iChunk].initialize_from_vector(tmp); } } @@ -350,7 +363,7 @@ void State::initialize_qreg(uint_t num_qubits, int_t iChunk; if(BaseState::chunk_bits_ == BaseState::num_qubits_){ for(iChunk=0;iChunk::initialize_qreg(uint_t num_qubits, #pragma omp parallel for if(BaseState::chunk_omp_parallel_) private(iChunk) for(iChunk=0;iChunk> (BaseState::num_qubits_/2 - BaseState::chunk_bits_/2); - local_row_offset <<= (BaseState::chunk_bits_/2); - local_col_offset <<= (BaseState::chunk_bits_/2); + uint_t irow_chunk = ((iChunk + BaseState::global_chunk_index_) >> ((BaseState::num_qubits_ - BaseState::chunk_bits_))) << (BaseState::chunk_bits_); + uint_t icol_chunk = ((iChunk + BaseState::global_chunk_index_) & ((1ull << ((BaseState::num_qubits_ - BaseState::chunk_bits_)))-1)) << 
(BaseState::chunk_bits_); //copy part of state for this chunk uint_t i,row,col; cvector_t tmp(1ull << BaseState::chunk_bits_); for(i=0;i<(1ull << BaseState::chunk_bits_);i++){ - uint_t row = i & ((1ull << (BaseState::chunk_bits_/2))-1); - uint_t col = i >> (BaseState::chunk_bits_/2); - tmp[i] = state[local_row_offset + row + ((local_col_offset + col) << (BaseState::num_qubits_/2))]; + uint_t icol = i & ((1ull << (BaseState::chunk_bits_))-1); + uint_t irow = i >> (BaseState::chunk_bits_); + tmp[i] = state[icol_chunk + icol + ((irow_chunk + irow) << (BaseState::num_qubits_))]; } - BaseState::qregs_[iChunk].set_num_qubits(BaseState::chunk_bits_/2); + BaseState::qregs_[iChunk].set_num_qubits(BaseState::chunk_bits_); BaseState::qregs_[iChunk].initialize_from_vector(tmp); } } @@ -391,7 +402,7 @@ void State::initialize_qreg(uint_t num_qubits, int_t iChunk; if(BaseState::chunk_bits_ == BaseState::num_qubits_){ for(iChunk=0;iChunk::initialize_qreg(uint_t num_qubits, #pragma omp parallel for if(BaseState::chunk_omp_parallel_) private(iChunk) for(iChunk=0;iChunk> (BaseState::num_qubits_/2 - BaseState::chunk_bits_/2); - local_row_offset <<= (BaseState::chunk_bits_/2); - local_col_offset <<= (BaseState::chunk_bits_/2); + uint_t irow_chunk = ((iChunk + BaseState::global_chunk_index_) >> ((BaseState::num_qubits_ - BaseState::chunk_bits_))) << (BaseState::chunk_bits_); + uint_t icol_chunk = ((iChunk + BaseState::global_chunk_index_) & ((1ull << ((BaseState::num_qubits_ - BaseState::chunk_bits_)))-1)) << (BaseState::chunk_bits_); //copy part of state for this chunk uint_t i,row,col; cvector_t tmp(1ull << BaseState::chunk_bits_); for(i=0;i<(1ull << BaseState::chunk_bits_);i++){ - uint_t row = i & ((1ull << (BaseState::chunk_bits_/2))-1); - uint_t col = i >> (BaseState::chunk_bits_/2); - tmp[i] = state[local_row_offset + row + ((local_col_offset + col) << (BaseState::num_qubits_/2))]; + uint_t icol = i & ((1ull << (BaseState::chunk_bits_))-1); + uint_t irow = i >> 
(BaseState::chunk_bits_); + tmp[i] = state[icol_chunk + icol + ((irow_chunk + irow) << (BaseState::num_qubits_))]; } - BaseState::qregs_[iChunk].set_num_qubits(BaseState::chunk_bits_/2); + BaseState::qregs_[iChunk].set_num_qubits(BaseState::chunk_bits_); BaseState::qregs_[iChunk].initialize_from_vector(tmp); } } @@ -437,32 +446,94 @@ auto State::move_to_matrix() if(BaseState::num_global_chunks_ == 1){ return BaseState::qregs_[0].move_to_matrix(); } - else{ - int_t iChunk; - auto state = BaseState::qregs_[0].vector(); + return apply_to_matrix(false); +} + +template +auto State::copy_to_matrix() +{ + if(BaseState::num_global_chunks_ == 1){ + return BaseState::qregs_[0].copy_to_matrix(); + } + return apply_to_matrix(true); +} + +template +auto State::apply_to_matrix(bool copy) +{ + int_t iChunk; + uint_t size = 1ull << (BaseState::chunk_bits_*2); + uint_t mask = (1ull << (BaseState::chunk_bits_)) - 1; + uint_t num_threads = BaseState::qregs_[0].get_omp_threads(); + auto matrix = BaseState::qregs_[0].copy_to_matrix(); + + if(BaseState::distributed_rank_ == 0){ //TO DO check memory availability - state.resize(BaseState::num_local_chunks_ << BaseState::chunk_bits_); + matrix.resize(1ull << (BaseState::num_qubits_),1ull << (BaseState::num_qubits_)); -#pragma omp parallel for if(BaseState::chunk_omp_parallel_) private(iChunk) - for(iChunk=1;iChunk> ((BaseState::num_qubits_ - BaseState::chunk_bits_))) << (BaseState::chunk_bits_); + uint_t icol_chunk = ((iChunk) & ((1ull << ((BaseState::num_qubits_ - BaseState::chunk_bits_)))-1)) << (BaseState::chunk_bits_); +#pragma omp parallel for if(num_threads > 1) num_threads(num_threads) + for(i=0;i> (BaseState::chunk_bits_); + uint_t icol = i & mask; + matrix(icol_chunk+icol,irow_chunk+irow) = recv(icol,irow); } } +#endif + for(iChunk=0;iChunk> ((BaseState::num_qubits_ - BaseState::chunk_bits_))) << (BaseState::chunk_bits_); + uint_t icol_chunk = ((iChunk + BaseState::global_chunk_index_) & ((1ull << ((BaseState::num_qubits_ - 
BaseState::chunk_bits_)))-1)) << (BaseState::chunk_bits_); + if(copy){ + auto tmp = BaseState::qregs_[iChunk].copy_to_matrix(); +#pragma omp parallel for if(num_threads > 1) num_threads(num_threads) + for(i=0;i> (BaseState::chunk_bits_); + uint_t icol = i & mask; + matrix(icol_chunk+icol,irow_chunk+irow) = tmp(icol,irow); + } + } + else{ + auto tmp = BaseState::qregs_[iChunk].move_to_matrix(); +#pragma omp parallel for if(num_threads > 1) num_threads(num_threads) + for(i=0;i> (BaseState::chunk_bits_); + uint_t icol = i & mask; + matrix(icol_chunk+icol,irow_chunk+irow) = tmp(icol,irow); + } + } + } + } + else{ #ifdef AER_MPI - BaseState::gather_state(state); + //send matrices to process 0 + for(iChunk=0;iChunk::apply_op(const int_t iChunk,const Operations::Op &op, case Operations::OpType::superop: BaseState::qregs_[iChunk].apply_superop_matrix(op.qubits, Utils::vectorize_matrix(op.mats[0])); break; - case Operations::OpType::kraus: - apply_kraus(op.qubits, op.mats); - break; case Operations::OpType::save_expval: case Operations::OpType::save_expval_var: BaseState::apply_save_expval(op, result); break; + case Operations::OpType::save_densmat: + apply_save_density_matrix(op, result, final_ops); + break; + case Operations::OpType::save_probs: + case Operations::OpType::save_probs_ket: + apply_save_probs(op, result); + break; + case Operations::OpType::save_amps_sq: + apply_save_amplitudes_sq(op, result); + break; default: throw std::invalid_argument("DensityMatrix::State::invalid instruction \'" + op.name + "\'."); @@ -561,26 +639,26 @@ void State::apply_chunk_swap(const reg_t &qubits) uint_t q0,q1; q0 = qubits[0]; q1 = qubits[1]; - if(qubits[0] >= BaseState::chunk_bits_/2){ - q0 += BaseState::chunk_bits_/2; + if(qubits[0] >= BaseState::chunk_bits_){ + q0 += BaseState::chunk_bits_; } - if(qubits[1] >= BaseState::chunk_bits_/2){ - q1 += BaseState::chunk_bits_/2; + if(qubits[1] >= BaseState::chunk_bits_){ + q1 += BaseState::chunk_bits_; } reg_t qs0 = {{q0, q1}}; 
BaseState::apply_chunk_swap(qs0); - if(qubits[0] >= BaseState::chunk_bits_/2){ - q0 += (BaseState::num_qubits_ - BaseState::chunk_bits_)/2; + if(qubits[0] >= BaseState::chunk_bits_){ + q0 += (BaseState::num_qubits_ - BaseState::chunk_bits_); } else{ - q0 += BaseState::chunk_bits_/2; + q0 += BaseState::chunk_bits_; } - if(qubits[1] >= BaseState::chunk_bits_/2){ - q1 += (BaseState::num_qubits_ - BaseState::chunk_bits_)/2; + if(qubits[1] >= BaseState::chunk_bits_){ + q1 += (BaseState::num_qubits_ - BaseState::chunk_bits_); } else{ - q1 += BaseState::chunk_bits_/2; + q1 += BaseState::chunk_bits_; } reg_t qs1 = {{q0, q1}}; BaseState::apply_chunk_swap(qs1); @@ -590,9 +668,56 @@ void State::apply_chunk_swap(const reg_t &qubits) // Implementation: Save data //========================================================================= -template -double State::expval_pauli(const reg_t &qubits, - const std::string& pauli) +template +void State::apply_save_probs(const Operations::Op &op, + ExperimentResult &result) { + auto probs = measure_probs(op.qubits); + if (op.type == Operations::OpType::save_probs_ket) { + BaseState::save_data_average(result, op.string_params[0], + Utils::vec2ket(probs, json_chop_threshold_, 16), + op.save_type); + } else { + BaseState::save_data_average(result, op.string_params[0], + std::move(probs), op.save_type); + } +} + +template +void State::apply_save_amplitudes_sq(const Operations::Op &op, + ExperimentResult &result) +{ + if (op.int_params.empty()) { + throw std::invalid_argument("Invalid save_amplitudes_sq instructions (empty params)."); + } + const int_t size = op.int_params.size(); + int_t iChunk; + rvector_t amps_sq(size,0); +#pragma omp parallel for if(BaseState::chunk_omp_parallel_) private(iChunk) + for(iChunk=0;iChunk> ((BaseState::num_qubits_ - BaseState::chunk_bits_)); + icol = (BaseState::global_chunk_index_ + iChunk) - (irow << ((BaseState::num_qubits_ - BaseState::chunk_bits_))); + if(irow != icol) + continue; + +#pragma omp parallel 
for if (size > pow(2, omp_qubit_threshold_) && \ + BaseState::threads_ > 1) \ + num_threads(BaseState::threads_) + for (int_t i = 0; i < size; ++i) { + if(op.int_params[i] >= (irow << BaseState::chunk_bits_) && op.int_params[i] < ((irow+1) << BaseState::chunk_bits_)) + amps_sq[i] = BaseState::qregs_[iChunk].probability(op.int_params[i] - (irow << BaseState::chunk_bits_)); + } + } +#ifdef AER_MPI + BaseState::reduce_sum(amps_sq); +#endif + BaseState::save_data_average(result, op.string_params[0], + std::move(amps_sq), op.save_type); +} + +template +double State::expval_pauli(const reg_t &qubits, + const std::string& pauli) { reg_t qubits_in_chunk; reg_t qubits_out_chunk; @@ -604,7 +729,7 @@ double State::expval_pauli(const reg_t &qubits, //get inner/outer chunk pauli string n = pauli.size(); for(i=0;i::expval_pauli(const reg_t &qubits, } } - int_t nrows = 1ull << ((BaseState::num_qubits_ - BaseState::chunk_bits_)/2); + int_t nrows = 1ull << ((BaseState::num_qubits_ - BaseState::chunk_bits_)); if(qubits_out_chunk.size() > 0){ //there are bits out of chunk std::complex phase = 1.0; @@ -625,10 +750,10 @@ double State::expval_pauli(const reg_t &qubits, uint_t x_mask, z_mask, num_y, x_max; std::tie(x_mask, z_mask, num_y, x_max) = AER::QV::pauli_masks_and_phase(qubits_out_chunk, pauli_out_chunk); - z_mask >>= (BaseState::chunk_bits_/2); + z_mask >>= (BaseState::chunk_bits_); if(x_mask != 0){ - x_mask >>= (BaseState::chunk_bits_/2); - x_max -= (BaseState::chunk_bits_/2); + x_mask >>= (BaseState::chunk_bits_); + x_max -= (BaseState::chunk_bits_); AER::QV::add_y_phase(num_y,phase); @@ -641,10 +766,10 @@ double State::expval_pauli(const reg_t &qubits, uint_t iChunk = (irow ^ x_mask) + irow * nrows; if(BaseState::chunk_index_begin_[BaseState::distributed_rank_] <= iChunk && BaseState::chunk_index_end_[BaseState::distributed_rank_] > iChunk){ //on this process - double sign = 1.0; - if (z_mask && (AER::Utils::popcount(iChunk & z_mask) & 1)) - sign = -1.0; - expval += sign * 
BaseState::qregs_[iChunk-BaseState::global_chunk_index_].expval_pauli(qubits_in_chunk, pauli_in_chunk,phase); + double sign = 2.0; + if (z_mask && (AER::Utils::popcount(irow & z_mask) & 1)) + sign = -2.0; + expval += sign * BaseState::qregs_[iChunk-BaseState::global_chunk_index_].expval_pauli_non_diagonal_chunk(qubits_in_chunk, pauli_in_chunk,phase); } } } @@ -654,9 +779,9 @@ double State::expval_pauli(const reg_t &qubits, uint_t iChunk = i * (nrows+1); if(BaseState::chunk_index_begin_[BaseState::distributed_rank_] <= iChunk && BaseState::chunk_index_end_[BaseState::distributed_rank_] > iChunk){ //on this process double sign = 1.0; - if (z_mask && (AER::Utils::popcount((i + BaseState::global_chunk_index_) & z_mask) & 1)) + if (z_mask && (AER::Utils::popcount(i & z_mask) & 1)) sign = -1.0; - expval += sign * BaseState::qregs_[iChunk-BaseState::global_chunk_index_].expval_pauli(qubits_in_chunk, pauli_in_chunk); + expval += sign * BaseState::qregs_[iChunk-BaseState::global_chunk_index_].expval_pauli(qubits_in_chunk, pauli_in_chunk,1.0); } } } @@ -666,7 +791,7 @@ double State::expval_pauli(const reg_t &qubits, for(i=0;i iChunk){ //on this process - expval += BaseState::qregs_[iChunk-BaseState::global_chunk_index_].expval_pauli(qubits, pauli); + expval += BaseState::qregs_[iChunk-BaseState::global_chunk_index_].expval_pauli(qubits, pauli,1.0); } } } @@ -677,6 +802,16 @@ double State::expval_pauli(const reg_t &qubits, return expval; } +template +void State::apply_save_density_matrix(const Operations::Op &op, + ExperimentResult &result, + bool last_op) +{ + BaseState::save_data_average(result, op.string_params[0], + reduced_density_matrix(op.qubits, last_op), + op.save_type); +} + //========================================================================= // Implementation: Snapshots //========================================================================= @@ -717,10 +852,10 @@ void State::apply_snapshot(const Operations::Op &op, snapshot_pauli_expval(op, result, true); 
} break; /* TODO - case DensityMatrix::Snapshots::expval_matrix: { + case Snapshots::expval_matrix: { snapshot_matrix_expval(op, data, false); } break; - case DensityMatrix::Snapshots::expval_matrix_var: { + case Snapshots::expval_matrix_var: { snapshot_matrix_expval(op, data, true); } break; */ @@ -775,10 +910,19 @@ template void State::snapshot_density_matrix(const Operations::Op &op, ExperimentResult &result, bool last_op) +{ + result.legacy_data.add_average_snapshot("density_matrix", op.string_params[0], + BaseState::creg_.memory_hex(), + reduced_density_matrix(op.qubits, last_op), false); +} + + +template +cmatrix_t State::reduced_density_matrix(const reg_t& qubits, bool last_op) { cmatrix_t reduced_state; // Check if tracing over all qubits - if (op.qubits.empty()) { + if (qubits.empty()) { reduced_state = cmatrix_t(1, 1); std::complex sum = 0.0; @@ -790,30 +934,26 @@ void State::snapshot_density_matrix(const Operations::Op &op, #endif reduced_state[0] = sum; } else { - - auto qubits_sorted = op.qubits; + auto qubits_sorted = qubits; std::sort(qubits_sorted.begin(), qubits_sorted.end()); - if ((op.qubits.size() == BaseState::qregs_[0].num_qubits()) && (op.qubits == qubits_sorted)) { + if ((qubits.size() == BaseState::num_qubits_) && (qubits == qubits_sorted)) { if (last_op) { reduced_state = move_to_matrix(); } else { - reduced_state = move_to_matrix(); + reduced_state = copy_to_matrix(); } } else { - reduced_state = reduced_density_matrix(op.qubits, qubits_sorted); + reduced_state = reduced_density_matrix_helper(qubits, qubits_sorted); } } - - result.legacy_data.add_average_snapshot("density_matrix", op.string_params[0], - BaseState::creg_.memory_hex(), - std::move(reduced_state), false); + return reduced_state; } - - + template -cmatrix_t State::reduced_density_matrix(const reg_t& qubits, const reg_t& qubits_sorted) { - +cmatrix_t State::reduced_density_matrix_helper(const reg_t &qubits, + const reg_t &qubits_sorted) +{ // Get superoperator qubits const 
reg_t squbits = BaseState::qregs_[0].superop_qubits(qubits); const reg_t squbits_sorted = BaseState::qregs_[0].superop_qubits(qubits_sorted); @@ -832,12 +972,12 @@ cmatrix_t State::reduced_density_matrix(const reg_t& qubits, const re auto vmat = BaseState::qregs_[0].vector(); //TO DO check memory availability - vmat.resize(BaseState::num_local_chunks_ << BaseState::chunk_bits_); + vmat.resize(BaseState::num_local_chunks_ << (BaseState::chunk_bits_*2)); #pragma omp parallel for if(BaseState::chunk_omp_parallel_) private(iChunk) for(iChunk=1;iChunk::apply_gate(const uint_t iChunk, const Operations::Op &op) case DensityMatrix::Gates::rzx: BaseState::qregs_[iChunk].apply_unitary_matrix(op.qubits, Linalg::VMatrix::rzx(op.params[0])); break; + case DensityMatrix::Gates::pauli: + apply_pauli(op.qubits, op.string_params[0]); + break; default: // We shouldn't reach here unless there is a bug in gateset throw std::invalid_argument("DensityMatrix::State::invalid gate instruction \'" + @@ -1049,7 +1192,7 @@ rvector_t State::measure_probs(const reg_t &qubits) const reg_t qubits_out_chunk; for(i=0;i::measure_probs(const reg_t &qubits) const #pragma omp parallel for if(BaseState::chunk_omp_parallel_) private(i,j,k) for(i=0;i> ((BaseState::num_qubits_ - BaseState::chunk_bits_)/2); - icol = (BaseState::global_chunk_index_ + i) - (irow << ((BaseState::num_qubits_ - BaseState::chunk_bits_)/2)); + irow = (BaseState::global_chunk_index_ + i) >> ((BaseState::num_qubits_ - BaseState::chunk_bits_)); + icol = (BaseState::global_chunk_index_ + i) - (irow << ((BaseState::num_qubits_ - BaseState::chunk_bits_))); if(irow == icol){ //diagonal chunk auto chunkSum = BaseState::qregs_[i].probabilities(qubits); @@ -1076,12 +1219,12 @@ rvector_t State::measure_probs(const reg_t &qubits) const int idx = 0; int i_in = 0; for(k=0;k> i_in) & 1) << k); i_in++; } else{ - if((((i + BaseState::global_chunk_index_) << (BaseState::chunk_bits_/2)) >> qubits[k]) & 1){ + if((((i + BaseState::global_chunk_index_) 
<< (BaseState::chunk_bits_)) >> qubits[k]) & 1){ idx += 1ull << k; } } @@ -1116,8 +1259,8 @@ std::vector State::sample_measure(const reg_t &qubits, #pragma omp parallel for if(BaseState::chunk_omp_parallel_) private(i) for(i=0;i> ((BaseState::num_qubits_ - BaseState::chunk_bits_)/2); - icol = (BaseState::global_chunk_index_ + i) - (irow << ((BaseState::num_qubits_ - BaseState::chunk_bits_)/2)); + irow = (BaseState::global_chunk_index_ + i) >> ((BaseState::num_qubits_ - BaseState::chunk_bits_)); + icol = (BaseState::global_chunk_index_ + i) - (irow << ((BaseState::num_qubits_ - BaseState::chunk_bits_))); if(irow == icol) //only diagonal chunk has probabilities chunkSum[i] = std::real( BaseState::qregs_[i].trace() ); else @@ -1150,29 +1293,33 @@ std::vector State::sample_measure(const reg_t &qubits, //get rnds positions for each chunk #pragma omp parallel for if(BaseState::chunk_omp_parallel_) private(i,j) for(i=0;i vIdx; - std::vector vRnd; - - //find rnds in this chunk - nIn = 0; - for(j=0;j= chunkSum[i] + globalSum && rnds[j] < chunkSum[i+1] + globalSum){ - vRnd.push_back(rnds[j] - (globalSum + chunkSum[i])); - vIdx.push_back(j); - nIn++; - } + uint_t irow,icol; + irow = (BaseState::global_chunk_index_ + i) >> ((BaseState::num_qubits_ - BaseState::chunk_bits_)); + icol = (BaseState::global_chunk_index_ + i) - (irow << ((BaseState::num_qubits_ - BaseState::chunk_bits_))); + if(irow != icol) + continue; + + uint_t nIn; + std::vector vIdx; + std::vector vRnd; + + //find rnds in this chunk + nIn = 0; + for(j=0;j= chunkSum[i] + globalSum && rnds[j] < chunkSum[i+1] + globalSum){ + vRnd.push_back(rnds[j] - (globalSum + chunkSum[i])); + vIdx.push_back(j); + nIn++; } + } - if(nIn > 0){ - auto chunkSamples = BaseState::qregs_[i].sample_measure(vRnd); - uint_t irow; - irow = (BaseState::global_chunk_index_ + i) >> ((BaseState::num_qubits_ - BaseState::chunk_bits_)/2); + if(nIn > 0){ + auto chunkSamples = BaseState::qregs_[i].sample_measure(vRnd); + uint_t irow; + irow = 
(BaseState::global_chunk_index_ + i) >> ((BaseState::num_qubits_ - BaseState::chunk_bits_)); - for(j=0;j State::sample_measure(const reg_t &qubits, std::vector all_samples; all_samples.reserve(shots); for (int_t val : allbit_samples) { - reg_t allbit_sample = Utils::int2reg(val, 2, BaseState::num_qubits_/2); + reg_t allbit_sample = Utils::int2reg(val, 2, BaseState::num_qubits_); reg_t sample; sample.reserve(qubits.size()); for (uint_t qubit : qubits) { @@ -1291,7 +1438,7 @@ void State::measure_reset_update(const reg_t &qubits, template void State::apply_kraus(const reg_t &qubits, - const std::vector &kmats) + const std::vector &kmats) { int_t i; // Convert to Superoperator diff --git a/src/simulators/density_matrix/densitymatrix_thrust.hpp b/src/simulators/density_matrix/densitymatrix_thrust.hpp index 850df29dc8..3767649b39 100755 --- a/src/simulators/density_matrix/densitymatrix_thrust.hpp +++ b/src/simulators/density_matrix/densitymatrix_thrust.hpp @@ -143,6 +143,7 @@ class DensityMatrixThrust : public UnitaryMatrixThrust { // Return the expectation value of an N-qubit Pauli matrix. // The Pauli is input as a length N string of I,X,Y,Z characters. 
 double expval_pauli(const reg_t &qubits, const std::string &pauli,const complex_t initial_phase=1.0) const;
+  double expval_pauli_non_diagonal_chunk(const reg_t &qubits, const std::string &pauli,const complex_t initial_phase=1.0) const;
 
 protected:
   // Construct a vectorized superoperator from a vectorized matrix
@@ -888,6 +889,68 @@ double DensityMatrixThrust<data_t>::expval_pauli(const reg_t &qubits,
         expval_pauli_XYZ_func_dm<data_t>(x_mask, z_mask, x_max, phase, BaseMatrix::rows_) );
 }
+template <typename data_t>
+class expval_pauli_XYZ_func_dm_non_diagonal : public GateFuncBase<data_t>
+{
+protected:
+  uint_t x_mask_;
+  uint_t z_mask_;
+  thrust::complex<double> phase_;
+  uint_t rows_;
+public:
+  expval_pauli_XYZ_func_dm_non_diagonal(uint_t x,uint_t z,uint_t x_max,std::complex<double> p,uint_t stride)
+  {
+    rows_ = stride;
+    x_mask_ = x;
+    z_mask_ = z;
+    phase_ = p;
+  }
+
+  uint_t size(int num_qubits)
+  {
+    return rows_;
+  }
+
+  __host__ __device__ double operator()(const uint_t &i) const
+  {
+    thrust::complex<data_t>* vec;
+    thrust::complex<data_t> q0;
+    double ret = 0.0;
+    uint_t idx_mat;
+
+    vec = this->data_;
+
+    idx_mat = i ^ x_mask_ + rows_ * i;
+
+    q0 = vec[idx_mat];
+    q0 = phase_ * q0;
+    ret = q0.real();
+    if(z_mask_ != 0){
+      if(pop_count_kernel(i & z_mask_) & 1)
+        ret = -ret;
+    }
+    return ret;
+  }
+  const char* name(void)
+  {
+    return "expval_pauli_XYZ";
+  }
+};
+
+template <typename data_t>
+double DensityMatrixThrust<data_t>::expval_pauli_non_diagonal_chunk(const reg_t &qubits,
+                        const std::string &pauli,const complex_t initial_phase) const
+{
+  uint_t x_mask, z_mask, num_y, x_max;
+  std::tie(x_mask, z_mask, num_y, x_max) = pauli_masks_and_phase(qubits, pauli);
+
+  // Compute the overall phase of the operator.
+ // This is (-1j) ** number of Y terms modulo 4 + auto phase = std::complex(initial_phase); + add_y_phase(num_y, phase); + return BaseVector::apply_function_sum( + expval_pauli_XYZ_func_dm_non_diagonal(x_mask, z_mask, x_max, phase, BaseMatrix::rows_) ); +} //----------------------------------------------------------------------- // Z-measurement outcome probabilities //----------------------------------------------------------------------- diff --git a/src/simulators/state.hpp b/src/simulators/state.hpp index 3b9c5fe10e..f7485e56df 100644 --- a/src/simulators/state.hpp +++ b/src/simulators/state.hpp @@ -128,7 +128,7 @@ class State { const = 0; //memory allocation (previously called before inisitalize_qreg) - virtual void allocate(uint_t num_qubits) {} + virtual void allocate(uint_t num_qubits,uint_t block_bits) {} // Return the expectation value of a N-qubit Pauli operator // If the simulator does not support Pauli expectation value this should diff --git a/src/simulators/state_chunk.hpp b/src/simulators/state_chunk.hpp index b8ace7a198..61674a7610 100644 --- a/src/simulators/state_chunk.hpp +++ b/src/simulators/state_chunk.hpp @@ -118,7 +118,7 @@ class StateChunk { bool final_ops = false); //memory allocation (previously called before inisitalize_qreg) - virtual void allocate(uint_t num_qubits); + virtual void allocate(uint_t num_qubits,uint_t block_bits); // Initializes the State to the default state. 
// Typically this is the n-qubit all |0> state @@ -319,6 +319,11 @@ class StateChunk { void send_chunk(uint_t local_chunk_index, uint_t global_chunk_index); void recv_chunk(uint_t local_chunk_index, uint_t global_chunk_index); + template + void send_data(data_t* pSend, uint_t size, uint_t myid,uint_t pairid); + template + void recv_data(data_t* pRecv, uint_t size, uint_t myid,uint_t pairid); + //reduce values over processes void reduce_sum(rvector_t& sum) const; void reduce_sum(complex_t& sum) const; @@ -433,13 +438,14 @@ void StateChunk::set_distribution(uint_t nprocs) } template -void StateChunk::allocate(uint_t num_qubits) +void StateChunk::allocate(uint_t num_qubits,uint_t block_bits) { int_t i; uint_t nchunks; int max_bits = num_qubits; num_qubits_ = num_qubits; + block_bits_ = block_bits; if(block_bits_ > 0){ chunk_bits_ = block_bits_; @@ -451,11 +457,7 @@ void StateChunk::allocate(uint_t num_qubits) chunk_bits_ = num_qubits_; } - //scale for density/unitary matrix simulators - chunk_bits_ *= qubit_scale(); - num_qubits_ *= qubit_scale(); - - num_global_chunks_ = 1ull << (num_qubits_ - chunk_bits_); + num_global_chunks_ = 1ull << ((num_qubits_ - chunk_bits_)*qubit_scale()); chunk_index_begin_.resize(distributed_procs_); chunk_index_end_.resize(distributed_procs_); @@ -469,8 +471,8 @@ void StateChunk::allocate(uint_t num_qubits) qregs_.resize(num_local_chunks_); - chunk_omp_parallel_ = false; gpu_optimization_ = false; + chunk_omp_parallel_ = false; if(qregs_[0].name().find("gpu") != std::string::npos){ if(chunk_bits_ < num_qubits_){ chunk_omp_parallel_ = true; //CUDA backend requires thread parallelization of chunk loop @@ -481,7 +483,7 @@ void StateChunk::allocate(uint_t num_qubits) nchunks = num_local_chunks_; for(i=0;i::block_diagonal_matrix(const int_t iChunk, reg_t &qubit cvector_t diag_in; for(i=0;i> (qubits[i] - chunk_bits_/qubit_scale())) & 1) + if((gid >> (qubits[i] - chunk_bits_)) & 1) mask_id |= (1ull << i); } } @@ -860,7 +862,7 @@ void 
StateChunk::apply_chunk_swap(const reg_t &qubits) q1 = t; } - if(q1 < chunk_bits_){ + if(q1 < chunk_bits_*qubit_scale()){ //device #pragma omp parallel for if(chunk_omp_parallel_) private(iChunk) for(iChunk=0;iChunk::apply_chunk_swap(const reg_t &qubits) uint_t nPair,mask0,mask1; uint_t baseChunk,iChunk1,iChunk2; - if(q0 < chunk_bits_) + if(q0 < chunk_bits_*qubit_scale()) nLarge = 1; else nLarge = 2; mask0 = (1ull << q0); mask1 = (1ull << q1); - mask0 >>= chunk_bits_; - mask1 >>= chunk_bits_; + mask0 >>= (chunk_bits_*qubit_scale()); + mask1 >>= (chunk_bits_*qubit_scale()); int proc_bits = 0; uint_t procs = distributed_procs_; @@ -893,8 +895,8 @@ void StateChunk::apply_chunk_swap(const reg_t &qubits) procs >>= 1; } - if(distributed_procs_ == 1 || (proc_bits >= 0 && q1 < (num_qubits_ - proc_bits))){ //no data transfer between processes is needed - if(q0 < chunk_bits_){ + if(distributed_procs_ == 1 || (proc_bits >= 0 && q1 < (num_qubits_*qubit_scale() - proc_bits))){ //no data transfer between processes is needed + if(q0 < chunk_bits_*qubit_scale()){ nPair = num_local_chunks_ >> 1; } else{ @@ -903,7 +905,7 @@ void StateChunk::apply_chunk_swap(const reg_t &qubits) #pragma omp parallel for if(chunk_omp_parallel_) private(iPair,baseChunk,iChunk1,iChunk2) for(iPair=0;iPair::apply_chunk_swap(const reg_t &qubits) uint_t iLocalChunk,iRemoteChunk,iProc; int i; - if(q0 < chunk_bits_){ + if(q0 < chunk_bits_*qubit_scale()){ nLarge = 1; - nu[0] = 1ull << (q1 - chunk_bits_); + nu[0] = 1ull << (q1 - chunk_bits_*qubit_scale()); ub[0] = 0; iu[0] = 0; - nu[1] = 1ull << (num_qubits_ - q1 - 1); - ub[1] = (q1 - chunk_bits_) + 1; + nu[1] = 1ull << (num_qubits_*qubit_scale() - q1 - 1); + ub[1] = (q1 - chunk_bits_*qubit_scale()) + 1; iu[1] = 0; } else{ nLarge = 2; - nu[0] = 1ull << (q0 - chunk_bits_); + nu[0] = 1ull << (q0 - chunk_bits_*qubit_scale()); ub[0] = 0; iu[0] = 0; nu[1] = 1ull << (q1 - q0 - 1); - ub[1] = (q0 - chunk_bits_) + 1; + ub[1] = (q0 - chunk_bits_*qubit_scale()) + 1; iu[1] 
= 0; - nu[2] = 1ull << (num_qubits_ - q1 - 1); - ub[2] = (q1 - chunk_bits_) + 1; + nu[2] = 1ull << (num_qubits_*qubit_scale() - q1 - 1); + ub[2] = (q1 - chunk_bits_*qubit_scale()) + 1; iu[2] = 0; } - nPair = 1ull << (num_qubits_ - chunk_bits_ - nLarge); + nPair = 1ull << (num_qubits_*qubit_scale() - chunk_bits_*qubit_scale() - nLarge); for(iPair=0;iPair::apply_chunk_swap(const reg_t &qubits) template -void StateChunk::send_chunk(uint_t local_chunk_index, uint_t global_chunk_index) +void StateChunk::send_chunk(uint_t local_chunk_index, uint_t global_pair_index) { #ifdef AER_MPI MPI_Request reqSend; @@ -1029,17 +1031,17 @@ void StateChunk::send_chunk(uint_t local_chunk_index, uint_t global_chu uint_t sizeSend; uint_t iProc; - iProc = get_process_by_chunk(global_chunk_index); + iProc = get_process_by_chunk(global_pair_index); auto pSend = qregs_[local_chunk_index].send_buffer(sizeSend); - MPI_Isend(pSend,sizeSend,MPI_BYTE,iProc,0,distributed_comm_,&reqSend); + MPI_Isend(pSend,sizeSend,MPI_BYTE,iProc,local_chunk_index + global_chunk_index_,distributed_comm_,&reqSend); MPI_Wait(&reqSend,&st); #endif } template -void StateChunk::recv_chunk(uint_t local_chunk_index, uint_t global_chunk_index) +void StateChunk::recv_chunk(uint_t local_chunk_index, uint_t global_pair_index) { #ifdef AER_MPI MPI_Request reqRecv; @@ -1047,10 +1049,44 @@ void StateChunk::recv_chunk(uint_t local_chunk_index, uint_t global_chu uint_t sizeRecv; uint_t iProc; - iProc = get_process_by_chunk(global_chunk_index); + iProc = get_process_by_chunk(global_pair_index); auto pRecv = qregs_[local_chunk_index].recv_buffer(sizeRecv); - MPI_Irecv(pRecv,sizeRecv,MPI_BYTE,iProc,0,distributed_comm_,&reqRecv); + MPI_Irecv(pRecv,sizeRecv,MPI_BYTE,iProc,global_pair_index,distributed_comm_,&reqRecv); + + MPI_Wait(&reqRecv,&st); +#endif +} + +template +template +void StateChunk::send_data(data_t* pSend, uint_t size, uint_t myid,uint_t pairid) +{ +#ifdef AER_MPI + MPI_Request reqSend; + MPI_Status st; + uint_t iProc; + 
+ iProc = get_process_by_chunk(pairid); + + MPI_Isend(pSend,size*sizeof(data_t),MPI_BYTE,iProc,myid,distributed_comm_,&reqSend); + + MPI_Wait(&reqSend,&st); +#endif +} + +template +template +void StateChunk::recv_data(data_t* pRecv, uint_t size, uint_t myid,uint_t pairid) +{ +#ifdef AER_MPI + MPI_Request reqRecv; + MPI_Status st; + uint_t iProc; + + iProc = get_process_by_chunk(pairid); + + MPI_Irecv(pRecv,size*sizeof(data_t),MPI_BYTE,iProc,pairid,distributed_comm_,&reqRecv); MPI_Wait(&reqRecv,&st); #endif diff --git a/src/simulators/statevector/chunk/chunk.hpp b/src/simulators/statevector/chunk/chunk.hpp index dc9c4c0894..e56446e58e 100644 --- a/src/simulators/statevector/chunk/chunk.hpp +++ b/src/simulators/statevector/chunk/chunk.hpp @@ -51,6 +51,8 @@ class Chunk } ~Chunk() { + if(cache_) + cache_.reset(); } void set_device(void) const diff --git a/src/simulators/statevector/chunk/chunk_container.hpp b/src/simulators/statevector/chunk/chunk_container.hpp index e90f4592c8..b024313ad2 100644 --- a/src/simulators/statevector/chunk/chunk_container.hpp +++ b/src/simulators/statevector/chunk/chunk_container.hpp @@ -517,7 +517,6 @@ template void ChunkContainer::UnmapChunk(std::shared_ptr> chunk) { chunk->unmap(); -// chunk.reset(); } template @@ -546,7 +545,6 @@ void ChunkContainer::UnmapBuffer(std::shared_ptr> buf) #pragma omp critical { buf->unmap(); -// buf.reset(); } } @@ -585,7 +583,6 @@ void ChunkContainer::UnmapCheckpoint(std::shared_ptr> buf) #pragma omp critical { buf->unmap(); -// buf.reset(); } } } diff --git a/src/simulators/statevector/chunk/device_chunk_container.hpp b/src/simulators/statevector/chunk/device_chunk_container.hpp index 42b8e78892..8fe2ba9250 100644 --- a/src/simulators/statevector/chunk/device_chunk_container.hpp +++ b/src/simulators/statevector/chunk/device_chunk_container.hpp @@ -33,7 +33,7 @@ class DeviceChunkContainer : public ChunkContainer protected: AERDeviceVector> data_; //device vector to chunks and buffers AERDeviceVector> 
matrix_; //storage for large matrix - mutable AERDeviceVector params_; //storage for additional parameters + mutable AERDeviceVector params_; //storage for additional parameters AERDeviceVector reduce_buffer_; //buffer for reduction int device_id_; //device index std::vector peer_access_; //to which device accepts peer access @@ -349,6 +349,8 @@ uint_t DeviceChunkContainer::Resize(uint_t chunks,uint_t buffers,uint_t template void DeviceChunkContainer::Deallocate(void) { + set_device(); + data_.clear(); data_.shrink_to_fit(); matrix_.clear(); @@ -371,7 +373,6 @@ void DeviceChunkContainer::Deallocate(void) } stream_.clear(); #endif - } template diff --git a/src/simulators/statevector/chunk/host_chunk_container.hpp b/src/simulators/statevector/chunk/host_chunk_container.hpp index b4b9fb8d96..a6b32d1375 100644 --- a/src/simulators/statevector/chunk/host_chunk_container.hpp +++ b/src/simulators/statevector/chunk/host_chunk_container.hpp @@ -166,8 +166,11 @@ template void HostChunkContainer::Deallocate(void) { data_.clear(); + data_.shrink_to_fit(); matrix_.clear(); + matrix_.shrink_to_fit(); params_.clear(); + params_.shrink_to_fit(); } diff --git a/src/simulators/statevector/qubitvector_thrust.hpp b/src/simulators/statevector/qubitvector_thrust.hpp index c68e7b1ce6..a3155f2c84 100644 --- a/src/simulators/statevector/qubitvector_thrust.hpp +++ b/src/simulators/statevector/qubitvector_thrust.hpp @@ -956,15 +956,11 @@ bool QubitVectorThrust::fetch_chunk(void) const int tid,nid; int idev; - tid = omp_get_thread_num(); - nid = omp_get_num_threads(); - - idev = tid * chunk_manager_.num_devices() / nid; - if(chunk_->device() < 0){ //on host + idev = 0; do{ - buffer_chunk_ = chunk_manager_.MapBufferChunk(idev); + buffer_chunk_ = chunk_manager_.MapBufferChunk(idev++ % chunk_manager_.num_devices()); }while(!buffer_chunk_); chunk_->set_cache(buffer_chunk_); buffer_chunk_->CopyIn(chunk_); @@ -2587,7 +2583,7 @@ void QubitVectorThrust::apply_chunk_swap(const reg_t &qubits, 
QubitVecto else{ thrust::complex* pChunk0; thrust::complex* pChunk1; - std::shared_ptr> pBuffer0; + std::shared_ptr> pBuffer0 = nullptr; std::shared_ptr> pExec; if(chunk_->device() >= 0){ diff --git a/src/simulators/statevector/statevector_state.hpp b/src/simulators/statevector/statevector_state.hpp index e7364fc7d3..24da46f3ce 100755 --- a/src/simulators/statevector/statevector_state.hpp +++ b/src/simulators/statevector/statevector_state.hpp @@ -149,7 +149,7 @@ class State : public Base::State { virtual std::vector sample_measure(const reg_t &qubits, uint_t shots, RngEngine &rng) override; - virtual void allocate(uint_t num_qubits) override; + virtual void allocate(uint_t num_qubits,uint_t block_bits) override; //----------------------------------------------------------------------- // Additional methods @@ -437,7 +437,7 @@ const stringmap_t State::snapshotset_( // Initialization //------------------------------------------------------------------------- template -void State::allocate(uint_t num_qubits) +void State::allocate(uint_t num_qubits,uint_t block_bits) { BaseState::qreg_.chunk_setup(num_qubits,num_qubits,0,1); } diff --git a/src/simulators/statevector/statevector_state_chunk.hpp b/src/simulators/statevector/statevector_state_chunk.hpp index 5736a937fd..2ba9ce0855 100644 --- a/src/simulators/statevector/statevector_state_chunk.hpp +++ b/src/simulators/statevector/statevector_state_chunk.hpp @@ -33,16 +33,24 @@ namespace AER { namespace StatevectorChunk { +using OpType = Operations::OpType; + +// OpSet of supported instructions const Operations::OpSet StateOpSet( // Op types - {Operations::OpType::gate, Operations::OpType::measure, - Operations::OpType::reset, Operations::OpType::initialize, - Operations::OpType::snapshot, Operations::OpType::barrier, - Operations::OpType::bfunc, Operations::OpType::roerror, - Operations::OpType::matrix, Operations::OpType::diagonal_matrix, - Operations::OpType::multiplexer, Operations::OpType::kraus, - 
Operations::OpType::sim_op, Operations::OpType::save_expval, - Operations::OpType::save_expval_var}, + {OpType::gate, OpType::measure, + OpType::reset, OpType::initialize, + OpType::snapshot, OpType::barrier, + OpType::bfunc, OpType::roerror, + OpType::matrix, OpType::diagonal_matrix, + OpType::multiplexer, OpType::kraus, + OpType::sim_op, OpType::save_expval, + OpType::save_expval_var, OpType::save_densmat, + OpType::save_probs, OpType::save_probs_ket, + OpType::save_amps, OpType::save_amps_sq, + OpType::save_statevec + // OpType::save_statevec_ket // TODO + }, // Gates {"u1", "u2", "u3", "u", "U", "CX", "cx", "cz", "cy", "cp", "cu1", "cu2", "cu3", "swap", "id", "p", @@ -52,19 +60,13 @@ const Operations::OpSet StateOpSet( "mcswap", "mcphase", "mcr", "mcrx", "mcry", "mcry", "sx", "csx", "mcsx", "delay", "pauli", "mcx_gray"}, // Snapshots - {"memory", "register", "probabilities", + {"statevector", "memory", "register", "probabilities", "probabilities_with_variance", "expectation_value_pauli", "density_matrix", + "density_matrix_with_variance", "expectation_value_pauli_with_variance", "expectation_value_matrix_single_shot", "expectation_value_matrix", "expectation_value_matrix_with_variance", "expectation_value_pauli_single_shot"}); -// Allowed gates enum class -enum class Gates { - id, h, s, sdg, t, tdg, - rxx, ryy, rzz, rzx, - mcx, mcy, mcz, mcr, mcrx, mcry, - mcrz, mcp, mcu2, mcu3, mcswap, mcsx, pauli -}; //========================================================================= // QubitVector State subclass @@ -119,6 +121,7 @@ class State : public Base::StateChunk { void initialize_omp(); auto move_to_vector(); + auto copy_to_vector(); protected: @@ -185,6 +188,30 @@ class State : public Base::StateChunk { // Save data instructions //----------------------------------------------------------------------- + // Save the current state of the statevector simulator + // If `last_op` is True this will use move semantics to move the simulator + // state to the results, 
otherwise it will use copy semantics to leave + // the current simulator state unchanged. + void apply_save_statevector(const Operations::Op &op, + ExperimentResult &result, + bool last_op); + + // Save the current state of the statevector simulator as a ket-form map. + void apply_save_statevector_ket(const Operations::Op &op, + ExperimentResult &result); + + // Save the current density matrix or reduced density matrix + void apply_save_density_matrix(const Operations::Op &op, + ExperimentResult &result); + + // Helper function for computing expectation value + void apply_save_probs(const Operations::Op &op, + ExperimentResult &result); + + // Helper function for saving amplitudes and amplitudes squared + void apply_save_amplitudes(const Operations::Op &op, + ExperimentResult &result); + // Helper function for computing expectation value virtual double expval_pauli(const reg_t &qubits, const std::string& pauli) override; @@ -480,6 +507,35 @@ auto State::move_to_vector() } } +template +auto State::copy_to_vector() +{ + if(BaseState::num_global_chunks_ == 1){ + return BaseState::qregs_[0].copy_to_vector(); + } + else{ + int_t iChunk; + auto state = BaseState::qregs_[0].copy_to_vector(); + + //TO DO check memory availability + state.resize(BaseState::num_local_chunks_ << BaseState::chunk_bits_); + +#pragma omp parallel for if(BaseState::chunk_omp_parallel_) private(iChunk) + for(iChunk=1;iChunk::apply_op(const int_t iChunk,const Operations::Op &op, case Operations::OpType::save_expval_var: BaseState::apply_save_expval(op, result); break; + case Operations::OpType::save_densmat: + apply_save_density_matrix(op, result); + break; + case Operations::OpType::save_statevec: + apply_save_statevector(op, result, final_ops); + break; + // case Operations::OpType::save_statevec_ket: + // apply_save_statevector_ket(op, result); + // break; + case Operations::OpType::save_probs: + case Operations::OpType::save_probs_ket: + apply_save_probs(op, result); + break; + case 
Operations::OpType::save_amps: + case Operations::OpType::save_amps_sq: + apply_save_amplitudes(op, result); + break; default: throw std::invalid_argument("QubitVector::State::invalid instruction \'" + op.name + "\'."); @@ -546,6 +619,22 @@ void State::apply_op(const int_t iChunk,const Operations::Op &op, // Implementation: Save data //========================================================================= +template +void State::apply_save_probs(const Operations::Op &op, + ExperimentResult &result) { + // get probs as hexadecimal + auto probs = measure_probs(op.qubits); + if (op.type == Operations::OpType::save_probs_ket) { + // Convert to ket dict + BaseState::save_data_average(result, op.string_params[0], + Utils::vec2ket(probs, json_chop_threshold_, 16), + op.save_type); + } else { + BaseState::save_data_average(result, op.string_params[0], + std::move(probs), op.save_type); + } +} + template double State::expval_pauli(const reg_t &qubits, const std::string& pauli) @@ -585,7 +674,7 @@ double State::expval_pauli(const reg_t &qubits, bool on_same_process = true; #ifdef AER_MPI int proc_bits = 0; - uint_t procs = distributed_procs_; + uint_t procs = BaseState::distributed_procs_; while(procs > 1){ if((procs & 1) != 0){ proc_bits = -1; @@ -618,12 +707,12 @@ double State::expval_pauli(const reg_t &qubits, z_count_pair = AER::Utils::popcount(pair_chunk & z_mask); if(iProc == BaseState::distributed_rank_){ //pair is on the same process - expval += BaseState::qregs_[iChunk-BaseState::global_chunk_index_].expval_pauli(qubits_in_chunk, pauli_in_chunk,BaseState::qregs_[pair_chunk - BaseState::global_chunk_index_],z_count,z_count_pair); + expval += BaseState::qregs_[iChunk-BaseState::global_chunk_index_].expval_pauli(qubits_in_chunk, pauli_in_chunk,BaseState::qregs_[pair_chunk - BaseState::global_chunk_index_],z_count,z_count_pair,phase); } else{ BaseState::recv_chunk(iChunk-BaseState::global_chunk_index_,pair_chunk); //refer receive buffer to calculate expectation value 
- expval += BaseState::qregs_[iChunk-BaseState::global_chunk_index_].expval_pauli(qubits_in_chunk, pauli_in_chunk,BaseState::qregs_[iChunk-BaseState::global_chunk_index_],z_count,z_count_pair); + expval += BaseState::qregs_[iChunk-BaseState::global_chunk_index_].expval_pauli(qubits_in_chunk, pauli_in_chunk,BaseState::qregs_[iChunk-BaseState::global_chunk_index_],z_count,z_count_pair,phase); } } else if(iProc == BaseState::distributed_rank_){ //pair is on this process @@ -655,6 +744,111 @@ double State::expval_pauli(const reg_t &qubits, return expval; } +template +void State::apply_save_statevector(const Operations::Op &op, + ExperimentResult &result, + bool last_op) +{ + if (op.qubits.size() != BaseState::num_qubits_) { + throw std::invalid_argument( + op.name + " was not applied to all qubits." + " Only the full statevector can be saved."); + } + if (last_op) { + BaseState::save_data_pershot(result, op.string_params[0], + move_to_vector(), + op.save_type); + } else { + BaseState::save_data_pershot(result, op.string_params[0], + copy_to_vector(), + op.save_type); + } +} + +template +void State::apply_save_statevector_ket(const Operations::Op &op, + ExperimentResult &result) +{ + if (op.qubits.size() != BaseState::num_qubits_) { + throw std::invalid_argument( + op.name + " was not applied to all qubits." 
+ " Only the full statevector can be saved."); + } + // TODO: compute state ket + std::map state_ket; + + BaseState::save_data_pershot(result, op.string_params[0], + std::move(state_ket), op.save_type); +} + +template +void State::apply_save_density_matrix(const Operations::Op &op, + ExperimentResult &result) +{ + cmatrix_t reduced_state; + + // Check if tracing over all qubits + if (op.qubits.empty()) { + reduced_state = cmatrix_t(1, 1); + + double sum = 0.0; +#pragma omp parallel for if(BaseState::chunk_omp_parallel_) reduction(+:sum) + for(int_t i=0;i +void State::apply_save_amplitudes(const Operations::Op &op, + ExperimentResult &result) +{ + if (op.int_params.empty()) { + throw std::invalid_argument("Invalid save_amplitudes instructions (empty params)."); + } + const int_t size = op.int_params.size(); + if (op.type == Operations::OpType::save_amps) { + Vector amps(size, false); + for (int_t i = 0; i < size; ++i) { + uint_t iChunk = op.int_params[i] >> BaseState::chunk_bits_; + amps[i] = 0.0; + if(iChunk >= BaseState::global_chunk_index_ && iChunk < BaseState::global_chunk_index_ + BaseState::num_local_chunks_){ + amps[i] = BaseState::qregs_[iChunk - BaseState::global_chunk_index_].get_state(op.int_params[i] - (iChunk << BaseState::chunk_bits_)); + } +#ifdef AER_MPI + complex_t amp = amps[i]; + BaseState::reduce_sum(amp); + amps[i] = amp; +#endif + } + BaseState::save_data_pershot(result, op.string_params[0], + std::move(amps), op.save_type); + } + else{ + rvector_t amps_sq(size,0); + for (int_t i = 0; i < size; ++i) { + uint_t iChunk = op.int_params[i] >> BaseState::chunk_bits_; + if(iChunk >= BaseState::global_chunk_index_ && iChunk < BaseState::global_chunk_index_ + BaseState::num_local_chunks_){ + amps_sq[i] = BaseState::qregs_[iChunk - BaseState::global_chunk_index_].probability(op.int_params[i] - (iChunk << BaseState::chunk_bits_)); + } + } +#ifdef AER_MPI + BaseState::reduce_sum(amps_sq); +#endif + BaseState::save_data_average(result, 
op.string_params[0], + std::move(amps_sq), op.save_type); + } +} + //========================================================================= // Implementation: Snapshots //========================================================================= @@ -926,7 +1120,7 @@ cmatrix_t State::vec2density(const reg_t &qubits, const T &vec) { // Return full density matrix cmatrix_t densmat(DIM, DIM); - if ((N == BaseState::qregs_[0].num_qubits()) && (qubits == qubits_sorted)) { + if ((N == BaseState::num_qubits_) && (qubits == qubits_sorted)) { const int_t mask = QV::MASKS[N]; #pragma omp parallel for if (2 * N > omp_qubit_threshold_ && \ BaseState::threads_ > 1) \ @@ -937,7 +1131,7 @@ cmatrix_t State::vec2density(const reg_t &qubits, const T &vec) { densmat(row, col) = complex_t(vec[row]) * complex_t(std::conj(vec[col])); } } else { - const size_t END = 1ULL << (BaseState::qregs_[0].num_qubits() - N); + const size_t END = 1ULL << (BaseState::num_qubits_ - N); // Initialize matrix values with first block { const auto inds = QV::indexes(qubits, qubits_sorted, 0); diff --git a/src/simulators/unitary/unitary_state.hpp b/src/simulators/unitary/unitary_state.hpp index 3b63562721..17bdd91c4b 100755 --- a/src/simulators/unitary/unitary_state.hpp +++ b/src/simulators/unitary/unitary_state.hpp @@ -104,7 +104,7 @@ class State : public Base::State { // Config: {"omp_qubit_threshold": 7} virtual void set_config(const json_t &config) override; - virtual void allocate(uint_t num_qubits) override; + virtual void allocate(uint_t num_qubits,uint_t block_bits) override; //----------------------------------------------------------------------- // Additional methods @@ -256,7 +256,7 @@ const stringmap_t State::gateset_({ }); template -void State::allocate(uint_t num_qubits) +void State::allocate(uint_t num_qubits,uint_t block_bits) { BaseState::qreg_.chunk_setup(num_qubits*2,num_qubits*2,0,1); } diff --git a/src/simulators/unitary/unitary_state_chunk.hpp 
b/src/simulators/unitary/unitary_state_chunk.hpp index d98f0cac35..a0276cc7d1 100644 --- a/src/simulators/unitary/unitary_state_chunk.hpp +++ b/src/simulators/unitary/unitary_state_chunk.hpp @@ -27,8 +27,6 @@ #include "unitarymatrix_thrust.hpp" #endif -//#include "unitary_state.hpp" - namespace AER { namespace QubitUnitaryChunk { @@ -36,7 +34,8 @@ namespace QubitUnitaryChunk { const Operations::OpSet StateOpSet( // Op types {Operations::OpType::gate, Operations::OpType::barrier, - Operations::OpType::matrix, Operations::OpType::diagonal_matrix}, + Operations::OpType::matrix, Operations::OpType::diagonal_matrix, + Operations::OpType::snapshot, Operations::OpType::save_unitary}, // Gates {"u1", "u2", "u3", "u", "U", "CX", "cx", "cz", "cy", "cp", "cu1", "cu2", "cu3", "swap", "id", "p", @@ -46,13 +45,7 @@ const Operations::OpSet StateOpSet( "mcswap", "mcphase", "mcr", "mcrx", "mcry", "mcry", "sx", "csx", "mcsx", "delay", "pauli"}, // Snapshots - {}); - -// Allowed gates enum class -enum class Gates { - id, h, s, sdg, t, tdg, rxx, ryy, rzz, rzx, - mcx, mcy, mcz, mcr, mcrx, mcry, mcrz, mcp, mcu2, mcu3, mcswap, mcsx, pauli, -}; + {"unitary"}); //========================================================================= // QubitUnitary State subclass @@ -128,6 +121,9 @@ class State : public Base::StateChunk { // Apply a matrix to given qubits (identity on all other qubits) void apply_matrix(const uint_t iChunk,const reg_t &qubits, const cvector_t &vmat); + // Apply a diagonal matrix + void apply_diagonal_matrix(const uint_t iChunk,const reg_t &qubits, const cvector_t &diag); + //----------------------------------------------------------------------- // 1-Qubit Gates //----------------------------------------------------------------------- @@ -197,7 +193,7 @@ void State::apply_op(const int_t iChunk,const Operations::Op & apply_matrix(iChunk,op.qubits, op.mats[0]); break; case Operations::OpType::diagonal_matrix: - BaseState::qregs_[iChunk].apply_diagonal_matrix(op.qubits, 
op.params); + apply_diagonal_matrix(iChunk,op.qubits, op.params); break; default: throw std::invalid_argument( @@ -240,7 +236,7 @@ void State::initialize_qreg(uint_t num_qubits) if(BaseState::chunk_bits_ == BaseState::num_qubits_){ for(i=0;i::initialize_qreg(uint_t num_qubits) else{ //multi-chunk distribution #pragma omp parallel for if(BaseState::chunk_omp_parallel_) private(i) for(i=0;inum_qubits_ == this->chunk_bits_){ BaseState::qregs_[i].initialize(); } @@ -278,19 +274,19 @@ void State::initialize_qreg(uint_t num_qubits, int_t iChunk; if(BaseState::chunk_bits_ == BaseState::num_qubits_){ for(iChunk=0;iChunk::initialize_qreg(uint_t num_qubits, int_t iChunk; if(BaseState::chunk_bits_ == BaseState::num_qubits_){ for(iChunk=0;iChunk::move_to_matrix() if(BaseState::num_global_chunks_ == 1){ return BaseState::qregs_[0].move_to_matrix(); } - else{ - int_t iChunk; - auto state = BaseState::qregs_[0].vector(); //using vector to gather distributed matrix + int_t iChunk; + uint_t size = 1ull << (BaseState::chunk_bits_*2); + uint_t mask = (1ull << (BaseState::chunk_bits_)) - 1; + uint_t num_threads = BaseState::qregs_[0].get_omp_threads(); + + auto matrix = BaseState::qregs_[0].copy_to_matrix(); + if(BaseState::distributed_rank_ == 0){ //TO DO check memory availability - state.resize(BaseState::num_local_chunks_ << BaseState::chunk_bits_); - -#pragma omp parallel for if(BaseState::chunk_omp_parallel_) private(iChunk) - for(iChunk=1;iChunk 1) num_threads(num_threads) + for(i=0;i> (BaseState::chunk_bits_); + uint_t icol = i & mask; + matrix[offset+i] = recv(icol,irow); } } +#endif + for(iChunk=0;iChunk 1) num_threads(num_threads) + for(i=0;i> (BaseState::chunk_bits_); + uint_t icol = i & mask; + matrix[offset+i] = tmp(icol,irow); + } + } + } + else{ #ifdef AER_MPI - BaseState::gather_state(state); + //send matrices to process 0 + for(iChunk=0;iChunk::apply_gate(const uint_t iChunk,const Operations::O BaseState::qregs_[iChunk].apply_matrix(op.qubits, 
Linalg::VMatrix::ryy(op.params[0])); break; case QubitUnitary::Gates::rzz: - BaseState::qregs_[iChunk].apply_diagonal_matrix(op.qubits, Linalg::VMatrix::rzz_diag(op.params[0])); + apply_diagonal_matrix(iChunk,op.qubits, Linalg::VMatrix::rzz_diag(op.params[0])); break; case QubitUnitary::Gates::rzx: BaseState::qregs_[iChunk].apply_matrix(op.qubits, Linalg::VMatrix::rzx(op.params[0])); @@ -497,12 +521,28 @@ void State::apply_matrix(const uint_t iChunk,const reg_t &qubi const cvector_t &vmat) { // Check if diagonal matrix if (vmat.size() == 1ULL << qubits.size()) { - BaseState::qregs_[iChunk].apply_diagonal_matrix(qubits, vmat); + apply_diagonal_matrix(iChunk,qubits, vmat); } else { BaseState::qregs_[iChunk].apply_matrix(qubits, vmat); } } +template +void State::apply_diagonal_matrix(const uint_t iChunk, const reg_t &qubits, const cvector_t &diag) +{ + if(BaseState::gpu_optimization_){ + //GPU computes all chunks in one kernel, so pass qubits and diagonal matrix as is + BaseState::qregs_[iChunk].apply_diagonal_matrix(qubits,diag); + } + else{ + reg_t qubits_in = qubits; + cvector_t diag_in = diag; + + BaseState::block_diagonal_matrix(iChunk,qubits_in,diag_in); + BaseState::qregs_[iChunk].apply_diagonal_matrix(qubits_in,diag_in); + } +} + template void State::apply_gate_phase(const uint_t iChunk,uint_t qubit, complex_t phase) { cmatrix_t diag(1, 2); @@ -540,8 +580,7 @@ void State::apply_global_phase() { int_t i; #pragma omp parallel for if(BaseState::chunk_omp_parallel_) private(i) for(i=0;i Date: Thu, 11 Mar 2021 00:05:51 +0900 Subject: [PATCH 7/7] Add Fusion variations (#1110) Co-authored-by: Victor Villar --- src/framework/operations.hpp | 13 + src/transpile/fusion.hpp | 1010 +++++++++++++---- .../backends/qasm_simulator/qasm_fusion.py | 80 +- 3 files changed, 863 insertions(+), 240 deletions(-) diff --git a/src/framework/operations.hpp b/src/framework/operations.hpp index 01bddc5608..0c22520ceb 100755 --- a/src/framework/operations.hpp +++ 
b/src/framework/operations.hpp @@ -283,6 +283,19 @@ inline Op make_unitary(const reg_t &qubits, cmatrix_t &&mat, std::string label = return op; } +inline Op make_diagonal(const reg_t &qubits, cvector_t &&vec, std::string label = "") { + Op op; + op.type = OpType::diagonal_matrix; + op.name = "diagonal"; + op.qubits = qubits; + op.params = std::move(vec); + + if (label != "") + op.string_params = {label}; + + return op; +} + inline Op make_superop(const reg_t &qubits, const cmatrix_t &mat) { Op op; op.type = OpType::superop; diff --git a/src/transpile/fusion.hpp b/src/transpile/fusion.hpp index ae84c0cab8..10ae3bff03 100644 --- a/src/transpile/fusion.hpp +++ b/src/transpile/fusion.hpp @@ -32,6 +32,642 @@ using oplist_t = std::vector; using opset_t = Operations::OpSet; using reg_t = std::vector; +class FusionMethod { +public: + // Return name of method + virtual std::string name() = 0; + + virtual bool support_diagonal() const = 0; + + // Aggregate a subcircuit of operations into a single operation + virtual op_t generate_operation(std::vector& fusioned_ops, bool diagonal = false) const { + std::set fusioned_qubits; + for (auto & op: fusioned_ops) + fusioned_qubits.insert(op.qubits.begin(), op.qubits.end()); + + reg_t remapped2orig(fusioned_qubits.begin(), fusioned_qubits.end()); + std::unordered_map orig2remapped; + reg_t arg_qubits; + arg_qubits.assign(fusioned_qubits.size(), 0); + for (size_t i = 0; i < remapped2orig.size(); i++) { + orig2remapped[remapped2orig[i]] = i; + arg_qubits[i] = i; + } + + // Remap qubits + for (auto & op: fusioned_ops) + for (size_t i = 0; i < op.qubits.size(); i++) + op.qubits[i] = orig2remapped[op.qubits[i]]; + + auto fusioned_op = generate_operation_internal(fusioned_ops, arg_qubits); + + // Revert qubits + for (size_t i = 0; i < fusioned_op.qubits.size(); i++) + fusioned_op.qubits[i] = remapped2orig[fusioned_op.qubits[i]]; + + if (diagonal) { + std::vector vec; + vec.assign((1UL << fusioned_op.qubits.size()), 0); + for (size_t i = 0; 
i < vec.size(); ++i) + vec[i] = fusioned_op.mats[0](i, i); + fusioned_op = Operations::make_diagonal(fusioned_op.qubits, std::move(vec), std::string("fusion")); + } + + return fusioned_op; + }; + + virtual op_t generate_operation_internal(const std::vector& fusioned_ops, + const reg_t &fusioned_qubits) const = 0; + + virtual bool can_apply(const op_t& op, uint_t max_fused_qubits) const = 0; + + virtual bool can_ignore(const op_t& op) const { + switch (op.type) { + case optype_t::barrier: + return true; + case optype_t::gate: + return op.name == "id" || op.name == "u0"; + default: + return false; + } + } + + static FusionMethod& find_method(const Circuit& circ, + const opset_t &allowed_opset, + const bool allow_superop, + const bool allow_kraus); + + static bool exist_non_unitary(const std::vector& fusioned_ops) { + for (auto & op: fusioned_ops) + if (noise_opset_.contains(op.type)) + return true; + return false; + }; + +private: + const static Operations::OpSet noise_opset_; +}; + +const Operations::OpSet FusionMethod::noise_opset_( + {Operations::OpType::kraus, + Operations::OpType::superop, + Operations::OpType::reset}, + {}, {} +); + +class UnitaryFusion : public FusionMethod { +public: + virtual std::string name() override { return "unitary"; }; + + virtual bool support_diagonal() const override { return true; } + + virtual op_t generate_operation_internal (const std::vector& fusioned_ops, + const reg_t &qubits) const override { + // Run simulation + RngEngine dummy_rng; + ExperimentResult dummy_result; + + // Unitary simulation + QubitUnitary::State<> unitary_simulator; + unitary_simulator.initialize_qreg(qubits.size()); + unitary_simulator.apply_ops(fusioned_ops, dummy_result, dummy_rng); + return Operations::make_unitary(qubits, unitary_simulator.qreg().move_to_matrix(), + std::string("fusion")); + }; + + virtual bool can_apply(const op_t& op, uint_t max_fused_qubits) const { + if (op.conditional) + return false; + switch (op.type) { + case optype_t::matrix: 
+ return op.mats.size() == 1 && op.qubits.size() <= max_fused_qubits; + case optype_t::diagonal_matrix: + return op.qubits.size() <= max_fused_qubits; + case optype_t::gate: { + if (op.qubits.size() > max_fused_qubits) + return false; + return QubitUnitary::StateOpSet.contains_gates(op.name); + } + default: + return false; + } + }; +}; + +class SuperOpFusion : public UnitaryFusion { +public: + virtual std::string name() override { return "superop"; }; + + virtual bool support_diagonal() const override { return false; } + + virtual op_t generate_operation_internal(const std::vector& fusioned_ops, + const reg_t &qubits) const override { + + if (!exist_non_unitary(fusioned_ops)) + return UnitaryFusion::generate_operation_internal(fusioned_ops, qubits); + + // Run simulation + RngEngine dummy_rng; + ExperimentResult dummy_result; + + // For both Kraus and SuperOp method we simulate using superoperator + // simulator + QubitSuperoperator::State<> superop_simulator; + superop_simulator.initialize_qreg(qubits.size()); + superop_simulator.apply_ops(fusioned_ops, dummy_result, dummy_rng); + auto superop = superop_simulator.qreg().move_to_matrix(); + + return Operations::make_superop(qubits, std::move(superop)); + }; + + virtual bool can_apply(const op_t& op, uint_t max_fused_qubits) const { + if (op.conditional) + return false; + switch (op.type) { + case optype_t::kraus: + case optype_t::reset: + case optype_t::superop: { + return op.qubits.size() <= max_fused_qubits; + } + case optype_t::gate: { + if (op.qubits.size() > max_fused_qubits) + return false; + return QubitSuperoperator::StateOpSet.contains_gates(op.name); + } + default: + return UnitaryFusion::can_apply(op, max_fused_qubits); + } + }; +}; + +class KrausFusion : public UnitaryFusion { +public: + virtual std::string name() override { return "kraus"; }; + + virtual bool support_diagonal() const override { return false; } + + virtual op_t generate_operation_internal(const std::vector& fusioned_ops, + const reg_t 
&qubits) const override { + + if (!exist_non_unitary(fusioned_ops)) + return UnitaryFusion::generate_operation_internal(fusioned_ops, qubits); + + // Run simulation + RngEngine dummy_rng; + ExperimentResult dummy_result; + + // For both Kraus and SuperOp method we simulate using superoperator + // simulator + QubitSuperoperator::State<> superop_simulator; + superop_simulator.initialize_qreg(qubits.size()); + superop_simulator.apply_ops(fusioned_ops, dummy_result, dummy_rng); + auto superop = superop_simulator.qreg().move_to_matrix(); + + // If Kraus method we convert superop to canonical Kraus representation + size_t dim = 1 << qubits.size(); + return Operations::make_kraus(qubits, Utils::superop2kraus(superop, dim)); + }; + + virtual bool can_apply(const op_t& op, uint_t max_fused_qubits) const { + if (op.conditional) + return false; + switch (op.type) { + case optype_t::kraus: + case optype_t::reset: + case optype_t::superop: { + return op.qubits.size() <= max_fused_qubits; + } + case optype_t::gate: { + if (op.qubits.size() > max_fused_qubits) + return false; + return QubitSuperoperator::StateOpSet.contains_gates(op.name); + } + default: + return UnitaryFusion::can_apply(op, max_fused_qubits); + } + }; +}; + +FusionMethod& FusionMethod::find_method(const Circuit& circ, + const opset_t &allowed_opset, + const bool allow_superop, + const bool allow_kraus) { + static UnitaryFusion unitary; + static SuperOpFusion superOp; + static KrausFusion kraus; + + if (allow_superop && allowed_opset.contains(optype_t::superop) && + (circ.opset().contains(optype_t::kraus) + || circ.opset().contains(optype_t::superop) + || circ.opset().contains(optype_t::reset))) { + return superOp; + } else if (allow_kraus && allowed_opset.contains(optype_t::kraus) && + (circ.opset().contains(optype_t::kraus) + || circ.opset().contains(optype_t::superop))) { + return kraus; + } else { + return unitary; + } +} + +class Fuser { +public: + virtual std::string name() const = 0; + + virtual void 
set_config(const json_t &config) = 0; + + virtual void set_metadata(ExperimentResult &result) const { }; //nop + + virtual bool aggregate_operations(oplist_t& ops, + const int fusion_start, + const int fusion_end, + const uint_t max_fused_qubits, + const FusionMethod& method) const = 0; + + virtual void allocate_new_operation(oplist_t& ops, + const uint_t idx, + const std::vector& fusioned_ops_idxs, + const FusionMethod& method, + const bool diagonal = false) const; +}; + +void Fuser::allocate_new_operation(oplist_t& ops, + const uint_t idx, + const std::vector& idxs, + const FusionMethod& method, + const bool diagonal) const { + + oplist_t fusing_ops; + for (uint_t i: idxs) + fusing_ops.push_back(ops[i]); + ops[idx] = method.generate_operation(fusing_ops, diagonal); + for (auto i: idxs) + if (i != idx) + ops[i].type = optype_t::nop; +} + +class CostBasedFusion : public Fuser { +public: + CostBasedFusion() { + std::fill_n(costs, 64, -1); + }; + + virtual std::string name() const override { return "cost_base"; }; + + virtual void set_config(const json_t &config) override; + + virtual void set_metadata(ExperimentResult &result) const override; + + virtual bool aggregate_operations(oplist_t& ops, + const int fusion_start, + const int fusion_end, + const uint_t max_fused_qubits, + const FusionMethod& method) const override; + +private: + bool is_diagonal(const oplist_t& ops, + const uint_t from, + const uint_t until) const; + + double estimate_cost(const oplist_t& ops, + const uint_t from, + const uint_t until) const; + + void add_fusion_qubits(reg_t& fusion_qubits, const op_t& op) const; + +private: + bool active = true; + double cost_factor = 1.8; + double costs[64]; +}; + +template +class NQubitFusion : public Fuser { +public: + NQubitFusion(): opt_name(std::to_string(N) + "_qubits"), + activate_prop_name("fusion_enable." 
+ std::to_string(N) + "_qubits") { + } + + virtual void set_config(const json_t &config) override; + + virtual std::string name() const override { + return opt_name; + }; + + virtual bool aggregate_operations(oplist_t& ops, + const int fusion_start, + const int fusion_end, + const uint_t max_fused_qubits, + const FusionMethod& method) const override; + + bool exclude_escaped_qubits(std::vector& fusing_qubits, + const op_t& tgt_op) const; +private: + bool active = true; + const std::string opt_name; + const std::string activate_prop_name; + uint_t qubit_threshold = 5; +}; + +template +void NQubitFusion::set_config(const json_t &config) { + if (JSON::check_key("fusion_enable.n_qubits", config)) + JSON::get_value(active, "fusion_enable.n_qubits", config); + + if (JSON::check_key(activate_prop_name, config)) + JSON::get_value(active, activate_prop_name, config); +} + +template +bool NQubitFusion::exclude_escaped_qubits(std::vector& fusing_qubits, + const op_t& tgt_op) const { + bool included = true; + for (const auto qubit: tgt_op.qubits) + included &= (std::find(fusing_qubits.begin(), fusing_qubits.end(), qubit) != fusing_qubits.end()); + + if (included) + return false; + + for (const int op_qubit: tgt_op.qubits) { + auto found = std::find(fusing_qubits.begin(), fusing_qubits.end(), op_qubit); + if (found != fusing_qubits.end()) + fusing_qubits.erase(found); + } + return true; +} + +template +bool NQubitFusion::aggregate_operations(oplist_t& ops, + const int fusion_start, + const int fusion_end, + const uint_t max_fused_qubits, + const FusionMethod& method) const { + if (!active) + return false; + + std::vector>> targets; + bool fused = false; + + for (uint_t op_idx = fusion_start; op_idx < fusion_end; ++op_idx) { + // skip operations to be ignored + if (!method.can_apply(ops[op_idx], max_fused_qubits) || ops[op_idx].type == optype_t::nop) + continue; + + // 1. 
find a N-qubit operation + if (ops[op_idx].qubits.size() != N) + continue; + + std::vector fusing_op_idxs = { op_idx }; + + std::vector fusing_qubits; + fusing_qubits.insert(fusing_qubits.end(), ops[op_idx].qubits.begin(), ops[op_idx].qubits.end()); + + // 2. fuse operations with backwarding + for (int fusing_op_idx = op_idx - 1; fusing_op_idx >= fusion_start; --fusing_op_idx) { + auto& tgt_op = ops[fusing_op_idx]; + if (tgt_op.type == optype_t::nop) + continue; + if (!method.can_apply(tgt_op, max_fused_qubits)) + break; + // check all the qubits are in fusing_qubits + if (!exclude_escaped_qubits(fusing_qubits, tgt_op)) + fusing_op_idxs.push_back(fusing_op_idx); // All the qubits of tgt_op are in fusing_qubits + else if (fusing_qubits.empty()) + break; + } + + std::reverse(fusing_op_idxs.begin(), fusing_op_idxs.end()); + fusing_qubits.clear(); + fusing_qubits.insert(fusing_qubits.end(), ops[op_idx].qubits.begin(), ops[op_idx].qubits.end()); + + // 3. fuse operations with forwarding + for (int fusing_op_idx = op_idx + 1; fusing_op_idx < fusion_end; ++fusing_op_idx) { + auto& tgt_op = ops[fusing_op_idx]; + if (tgt_op.type == optype_t::nop) + continue; + if (!method.can_apply(tgt_op, max_fused_qubits)) + break; + // check all the qubits are in fusing_qubits + if (!exclude_escaped_qubits(fusing_qubits, tgt_op)) + fusing_op_idxs.push_back(fusing_op_idx); // All the qubits of tgt_op are in fusing_qubits + else if (fusing_qubits.empty()) + break; + } + + if (fusing_op_idxs.size() <= 1) + continue; + + // 4. 
generate a fused operation
+    allocate_new_operation(ops, op_idx, fusing_op_idxs, method, false);
+
+    fused = true;
+  }
+
+  return fused;
+}
+
+class DiagonalFusion : public Fuser {
+public:
+  DiagonalFusion() = default;
+
+  virtual ~DiagonalFusion() = default;
+
+  virtual std::string name() const override { return "diagonal"; };
+
+  virtual void set_config(const json_t &config) override;
+
+  virtual bool aggregate_operations(oplist_t& ops,
+                                    const int fusion_start,
+                                    const int fusion_end,
+                                    const uint_t max_fused_qubits,
+                                    const FusionMethod& method) const override;
+
+private:
+  bool is_diagonal_op(const op_t& op) const;
+
+  int get_next_diagonal_end(const oplist_t& ops, const int from, std::set<uint_t>& fusing_qubits) const;
+
+  const std::shared_ptr<FusionMethod> method;
+  uint_t min_qubit = 3;
+  bool active = true;
+};
+
+void DiagonalFusion::set_config(const json_t &config) {
+  if (JSON::check_key("fusion_enable.diagonal", config))
+    JSON::get_value(active, "fusion_enable.diagonal", config);
+  if (JSON::check_key("fusion_min_qubit.diagonal", config))
+    JSON::get_value(min_qubit, "fusion_min_qubit.diagonal", config);
+}
+
+bool DiagonalFusion::is_diagonal_op(const op_t& op) const {
+
+  if (op.type == Operations::OpType::diagonal_matrix)
+    return true;
+
+  if (op.type == Operations::OpType::gate) {
+    if (op.name == "p" || op.name == "cp" || op.name == "u1" || op.name == "cu1"
+        || op.name == "mcu1" || op.name == "rz" || op.name == "rzz")
+      return true;
+    if (op.name == "u3")
+      return op.params[0] == std::complex<double>(0.)
+             && op.params[1] == std::complex<double>(0.);
+    else
+      return false;
+  }
+
+  return false;
+}
+
+int DiagonalFusion::get_next_diagonal_end(const oplist_t& ops,
+                                          const int from,
+                                          std::set<uint_t>& fusing_qubits) const {
+
+  if (is_diagonal_op(ops[from])) {
+    for (const auto qubit: ops[from].qubits)
+      fusing_qubits.insert(qubit);
+    return from;
+  }
+
+  if (ops[from].type != Operations::OpType::gate)
+    return -1;
+
+  auto pos = from;
+
+  // find a diagonal gate that has the same lists of CX before and after it
+  //      ┌───┐                                   ┌───┐
+  // q_0: ┤ X ├───────────────────────────────────┤ X ├
+  //      └─┬─┘┌───┐            ┌──────────┐ ┌───┐└─┬─┘
+  // q_1: ──■──┤ X ├────────────┤ diagonal ├─┤ X ├──■──
+  //           └─┬─┘┌──────────┐└──────────┘ └─┬─┘
+  // q_2: ───────■──┤ diagonal ├───────────────■───────
+  //                └──────────┘
+  //        ■ [from,pos]
+
+  // find first cx list
+  for (; pos < ops.size(); ++pos)
+    if (ops[pos].type != Operations::OpType::gate || ops[pos].name != "cx")
+      break;
+
+  if (pos == from || pos == ops.size())
+    return -1;
+
+  auto cx_end = pos - 1;
+
+  //      ┌───┐                                   ┌───┐
+  // q_0: ┤ X ├───────────────────────────────────┤ X ├
+  //      └─┬─┘┌───┐            ┌──────────┐ ┌───┐└─┬─┘
+  // q_1: ──■──┤ X ├────────────┤ diagonal ├─┤ X ├──■──
+  //           └─┬─┘┌──────────┐└──────────┘ └─┬─┘
+  // q_2: ───────■──┤ diagonal ├───────────────■───────
+  //                └──────────┘
+  //        ■ [from]     ■ [pos]
+  //             ■ [cx_end]
+
+  bool found = false;
+  // find diagonals
+  for (; pos < ops.size(); ++pos)
+    if (is_diagonal_op(ops[pos]))
+      found = true;
+    else
+      break;
+
+  if (!found)
+    return -1;
+
+  if (pos == ops.size())
+    return -1;
+
+  auto u1_end = pos;
+
+  //      ┌───┐                                   ┌───┐
+  // q_0: ┤ X ├───────────────────────────────────┤ X ├
+  //      └─┬─┘┌───┐            ┌──────────┐ ┌───┐└─┬─┘
+  // q_1: ──■──┤ X ├────────────┤ diagonal ├─┤ X ├──■──
+  //           └─┬─┘┌──────────┐└──────────┘ └─┬─┘
+  // q_2: ───────■──┤ diagonal ├───────────────■───────
+  //                └──────────┘
+  //        ■ [from]                           ■ [pos,u1_end]
+  //             ■ [cx_end]
+
+  // find second cx list that is the reverse of the first
+  for (; pos < ops.size(); ++pos) {
+    if (ops[pos].type == Operations::OpType::gate
+        && ops[pos].name == ops[cx_end].name
+        && ops[pos].qubits == ops[cx_end].qubits) {
+      if (cx_end == from)
+        break;
+      --cx_end;
+    } else {
+      return -1;
+    }
+  }
+
+  if (pos == ops.size())
+    return -1;
+
+  //      ┌───┐                                   ┌───┐
+  // q_0: ┤ X ├───────────────────────────────────┤ X ├
+  //      └─┬─┘┌───┐            ┌──────────┐ ┌───┐└─┬─┘
+  // q_1: ──■──┤ X ├────────────┤ diagonal ├─┤ X ├──■──
+  //           └─┬─┘┌──────────┐└──────────┘ └─┬─┘
+  // q_2: ───────■──┤ diagonal ├───────────────■───────
+  //                └──────────┘
+  //        ■ [from]                                ■ [pos]
+  //             ■ [cx_end]                    ■ [u1_end]
+
+  for (auto i = from; i < u1_end; ++i)
+    for (const auto qubit: ops[i].qubits)
+      fusing_qubits.insert(qubit);
+
+  return pos;
+}
+
+bool DiagonalFusion::aggregate_operations(oplist_t& ops,
+                                          const int fusion_start,
+                                          const int fusion_end,
+                                          const uint_t max_fused_qubits,
+                                          const FusionMethod& method) const {
+
+  if (!active || !method.support_diagonal())
+    return false;
+
+  // current impl is sensitive to ordering of gates
+  for (int op_idx = fusion_start; op_idx < fusion_end; ++op_idx) {
+
+    std::set<uint_t> checking_qubits_set;
+    auto next_diagonal_end = get_next_diagonal_end(ops, op_idx, checking_qubits_set);
+
+    if (next_diagonal_end < 0)
+      continue;
+
+    if (checking_qubits_set.size() > max_fused_qubits)
+      continue;
+
+    auto next_diagonal_start = next_diagonal_end + 1;
+
+    int cnt = 0;
+    while (true) {
+      auto next_diagonal_end = get_next_diagonal_end(ops, next_diagonal_start, checking_qubits_set);
+      if (next_diagonal_end < 0)
+        break;
+      if (checking_qubits_set.size() > max_fused_qubits)
+        break;
+      next_diagonal_start = next_diagonal_end + 1;
+    }
+
+    if (checking_qubits_set.size() < min_qubit)
+      continue;
+
+    std::vector<int> fusing_op_idxs;
+    for (; op_idx < next_diagonal_start; ++op_idx)
+      fusing_op_idxs.push_back(op_idx);
+
+    --op_idx;
+    allocate_new_operation(ops, op_idx, fusing_op_idxs, method, true);
+  }
+
+  return true;
+}
 
 class Fusion : public CircuitOptimization {
public: @@ -49,15 +685,8 @@ class Fusion : public CircuitOptimization { * - fusion_cost_factor (double): a cost function to estimate an aggregate * gate [Default: 1.8] */ - Fusion(uint_t _max_qubit = 5, uint_t _threshold = 14, double _cost_factor = 1.8) - : max_qubit(_max_qubit), threshold(_threshold), cost_factor(_cost_factor) {} + Fusion(); - // Allowed fusion methods: - // - Unitary: only fuse gates into unitary instructions - // - SuperOp: fuse gates, reset, kraus, and superops into kraus instuctions - // - Kraus: fuse gates, reset, kraus, and superops into kraus instuctions - enum class Method {unitary, kraus, superop}; - void set_config(const json_t &config) override; virtual void set_parallelization(uint_t num) { parallelization_ = num; }; @@ -70,9 +699,9 @@ class Fusion : public CircuitOptimization { ExperimentResult &result) const override; // Qubit threshold for activating fusion pass - uint_t max_qubit; - uint_t threshold; - double cost_factor; + uint_t max_qubit = 5; + uint_t threshold = 14; + bool verbose = false; bool active = true; bool allow_superop = false; @@ -84,57 +713,52 @@ class Fusion : public CircuitOptimization { uint_t parallel_threshold_ = 10000; private: - bool can_ignore(const op_t& op) const; - - bool can_apply_fusion(const op_t& op, - uint_t max_max_fused_qubits, - Method method) const; - - double get_cost(const op_t& op) const; - void optimize_circuit(Circuit& circ, - Noise::NoiseModel& noise, + const Noise::NoiseModel& noise, const opset_t &allowed_opset, - uint_t ops_start, - uint_t ops_end) const; - - bool aggregate_operations(oplist_t& ops, - const int fusion_start, - const int fusion_end, - uint_t max_fused_qubits, - Method method) const; - - // Aggregate a subcircuit of operations into a single operation - op_t generate_fusion_operation(const std::vector& fusioned_ops, - const reg_t &num_qubits, - Method method) const; - - bool is_diagonal(const oplist_t& ops, - const uint_t from, - const uint_t until) const; - - double 
estimate_cost(const oplist_t& ops, - const uint_t from, - const uint_t until) const; - - void add_fusion_qubits(reg_t& fusion_qubits, const op_t& op) const; + const uint_t ops_start, + const uint_t ops_end, + const std::shared_ptr& fuser, + const FusionMethod& method) const; #ifdef DEBUG - void dump(const Circuit& circuit) const; + void dump(const Circuit& circuit) const { + auto& ops = circuit.ops; + for (uint_t op_idx = 0; op_idx < ops.size(); ++op_idx) { + std::cout << std::setw(3) << op_idx << ": "; + if (ops[op_idx].type == optype_t::nop) { + std::cout << std::setw(15) << "nop" << ": "; + } else { + std::cout << std::setw(15) << ops[op_idx].name << "-" << ops[op_idx].qubits.size() << ": "; + if (ops[op_idx].qubits.size() > 0) { + auto qubits = ops[op_idx].qubits; + std::sort(qubits.begin(), qubits.end()); + int pos = 0; + for (int j = 0; j < qubits.size(); ++j) { + int q_pos = 1 + qubits[j] * 2; + for (int k = 0; k < (q_pos - pos); ++k) { + std::cout << " "; + } + pos = q_pos + 1; + std::cout << "X"; + } + } + } + std::cout << std::endl; + } + } #endif private: - const static Operations::OpSet noise_opset_; + std::vector> fusers; }; - -const Operations::OpSet Fusion::noise_opset_( - {Operations::OpType::kraus, - Operations::OpType::superop, - Operations::OpType::reset}, - {}, {} -); - +Fusion::Fusion() { + fusers.push_back(std::make_shared()); + fusers.push_back(std::make_shared>()); + fusers.push_back(std::make_shared>()); + fusers.push_back(std::make_shared()); +} void Fusion::set_config(const json_t &config) { @@ -152,9 +776,9 @@ void Fusion::set_config(const json_t &config) { if (JSON::check_key("fusion_threshold", config_)) JSON::get_value(threshold, "fusion_threshold", config_); - if (JSON::check_key("fusion_cost_factor", config)) - JSON::get_value(cost_factor, "fusion_cost_factor", config); - + for (std::shared_ptr& fuser: fusers) + fuser->set_config(config_); + if (JSON::check_key("fusion_allow_kraus", config)) JSON::get_value(allow_kraus, 
"fusion_allow_kraus", config); @@ -170,6 +794,11 @@ void Fusion::optimize_circuit(Circuit& circ, const opset_t &allowed_opset, ExperimentResult &result) const { +#ifdef DEBUG + std::cout << "original" << std::endl; + dump(circ); +#endif + // Start timer using clock_t = std::chrono::high_resolution_clock; auto timer_start = clock_t::now(); @@ -182,7 +811,6 @@ void Fusion::optimize_circuit(Circuit& circ, result.metadata.add(true, "fusion", "enabled"); result.metadata.add(threshold, "fusion", "threshold"); - result.metadata.add(cost_factor, "fusion", "cost_factor"); result.metadata.add(max_qubit, "fusion", "max_fused_qubits"); // Check qubit threshold @@ -190,185 +818,108 @@ void Fusion::optimize_circuit(Circuit& circ, result.metadata.add(false, "fusion", "applied"); return; } + // Determine fusion method - // TODO: Support Kraus fusion method - Method method = Method::unitary; - if (allow_superop && allowed_opset.contains(optype_t::superop) && - (circ.opset().contains(optype_t::kraus) - || circ.opset().contains(optype_t::superop) - || circ.opset().contains(optype_t::reset))) { - method = Method::superop; - } else if (allow_kraus && allowed_opset.contains(optype_t::kraus) && - (circ.opset().contains(optype_t::kraus) - || circ.opset().contains(optype_t::superop))) { - method = Method::kraus; - } - if (method == Method::unitary) { - result.metadata.add("unitary", "fusion", "method"); - } else if (method == Method::superop) { - result.metadata.add("superop", "fusion", "method"); - } else if (method == Method::kraus) { - result.metadata.add("kraus", "fusion", "method"); - } + FusionMethod& method = FusionMethod::find_method(circ, allowed_opset, allow_superop, allow_kraus); + result.metadata.add(method.name(), "fusion", "method"); - if (circ.ops.size() < parallel_threshold_ || parallelization_ <= 1) { - optimize_circuit(circ, noise, allowed_opset, 0, circ.ops.size()); - } else { - // determine unit for each OMP thread - int_t unit = circ.ops.size() / parallelization_; - if 
(circ.ops.size() % parallelization_) - ++unit; + bool applied = false; + for (const std::shared_ptr& fuser: fusers) { + fuser->set_metadata(result); + + if (circ.ops.size() < parallel_threshold_ || parallelization_ <= 1) { + optimize_circuit(circ, noise, allowed_opset, 0, circ.ops.size(), fuser, method); + result.metadata.add(1, "fusion", "parallelization"); + } else { + // determine unit for each OMP thread + int_t unit = circ.ops.size() / parallelization_; + if (circ.ops.size() % parallelization_) + ++unit; #pragma omp parallel for if (parallelization_ > 1) num_threads(parallelization_) - for (int_t i = 0; i < parallelization_; i++) { - int_t start = unit * i; - int_t end = std::min(start + unit, (int_t) circ.ops.size()); - optimize_circuit(circ, noise, allowed_opset, start, end); + for (int_t i = 0; i < parallelization_; i++) { + int_t start = unit * i; + int_t end = std::min(start + unit, (int_t) circ.ops.size()); + optimize_circuit(circ, noise, allowed_opset, start, end, fuser, method); + } + result.metadata.add(parallelization_, "fusion", "parallelization"); } - } - - result.metadata.add(parallelization_, "fusion", "parallelization"); - auto timer_stop = clock_t::now(); - result.metadata.add(std::chrono::duration(timer_stop - timer_start).count(), "fusion", "time_taken"); + size_t idx = 0; + for (size_t i = 0; i < circ.ops.size(); ++i) { + if (circ.ops[i].type != optype_t::nop) { + if (i != idx) + circ.ops[idx] = circ.ops[i]; + ++idx; + } + } - size_t idx = 0; - for (size_t i = 0; i < circ.ops.size(); ++i) { - if (circ.ops[i].type != optype_t::nop) { - if (i != idx) - circ.ops[idx] = circ.ops[i]; - ++idx; + if (idx != circ.ops.size()) { + applied = true; + circ.ops.erase(circ.ops.begin() + idx, circ.ops.end()); + circ.set_params(); } - } - if (idx == circ.ops.size()) { - result.metadata.add(false, "fusion", "applied"); - } else { - circ.ops.erase(circ.ops.begin() + idx, circ.ops.end()); - result.metadata.add(true, "fusion", "applied"); - circ.set_params(); 
+#ifdef DEBUG + std::cout << fuser->name() << std::endl; + dump(circ); +#endif - if (verbose) - result.metadata.add(circ.ops, "fusion", "output_ops"); } + result.metadata.add(applied, "fusion", "applied"); + if (applied && verbose) + result.metadata.add(circ.ops, "fusion", "output_ops"); + + auto timer_stop = clock_t::now(); + result.metadata.add(std::chrono::duration(timer_stop - timer_start).count(), "fusion", "time_taken"); } void Fusion::optimize_circuit(Circuit& circ, - Noise::NoiseModel& noise, + const Noise::NoiseModel& noise, const opset_t &allowed_opset, - uint_t ops_start, - uint_t ops_end) const { - - // Determine fusion method - // TODO: Support Kraus fusion method - Method method = Method::unitary; - if (allow_superop && allowed_opset.contains(optype_t::superop) && - (circ.opset().contains(optype_t::kraus) - || circ.opset().contains(optype_t::superop) - || circ.opset().contains(optype_t::reset))) { - method = Method::superop; - } else if (allow_kraus && allowed_opset.contains(optype_t::kraus) && - (circ.opset().contains(optype_t::kraus) - || circ.opset().contains(optype_t::superop))) { - method = Method::kraus; - } + const uint_t ops_start, + const uint_t ops_end, + const std::shared_ptr& fuser, + const FusionMethod& method) const { uint_t fusion_start = ops_start; uint_t op_idx; for (op_idx = ops_start; op_idx < ops_end; ++op_idx) { - if (can_ignore(circ.ops[op_idx])) + if (method.can_ignore(circ.ops[op_idx])) continue; - if (!can_apply_fusion(circ.ops[op_idx], max_qubit, method) || op_idx == (ops_end - 1)) { - aggregate_operations(circ.ops, fusion_start, op_idx, max_qubit, method); + if (!method.can_apply(circ.ops[op_idx], max_qubit) || op_idx == (ops_end - 1)) { + fuser->aggregate_operations(circ.ops, fusion_start, op_idx, max_qubit, method); fusion_start = op_idx + 1; } } } -bool Fusion::can_ignore(const op_t& op) const { - switch (op.type) { - case optype_t::barrier: - return true; - case optype_t::gate: - return op.name == "id" || op.name == 
"u0"; - default: - return false; - } -} - -bool Fusion::can_apply_fusion(const op_t& op, uint_t max_fused_qubits, Method method) const { - if (op.conditional) - return false; - switch (op.type) { - case optype_t::matrix: - return op.mats.size() == 1 && op.qubits.size() <= max_fused_qubits; - case optype_t::kraus: - case optype_t::reset: - case optype_t::superop: { - return method != Method::unitary && op.qubits.size() <= max_fused_qubits; - } - case optype_t::gate: { - if (op.qubits.size() > max_fused_qubits) - return false; - return (method == Method::unitary) - ? QubitUnitary::StateOpSet.contains_gates(op.name) - : QubitSuperoperator::StateOpSet.contains_gates(op.name); - } - case optype_t::measure: - case optype_t::bfunc: - case optype_t::roerror: - case optype_t::snapshot: - case optype_t::barrier: - default: - return false; - } -} - -double Fusion::get_cost(const op_t& op) const { - if (can_ignore(op)) - return .0; - else - return cost_factor; +void CostBasedFusion::set_metadata(ExperimentResult &result) const { + result.metadata.add(cost_factor, "fusion", "cost_factor"); } +void CostBasedFusion::set_config(const json_t &config) { -op_t Fusion::generate_fusion_operation(const std::vector& fusioned_ops, - const reg_t &qubits, - Method method) const { - // Run simulation - RngEngine dummy_rng; - ExperimentResult dummy_result; - - if (method == Method::unitary) { - // Unitary simulation - QubitUnitary::State<> unitary_simulator; - unitary_simulator.initialize_qreg(qubits.size()); - unitary_simulator.apply_ops(fusioned_ops, dummy_result, dummy_rng); - return Operations::make_unitary(qubits, unitary_simulator.move_to_matrix(), - std::string("fusion")); - } + if (JSON::check_key("fusion_cost_factor", config)) + JSON::get_value(cost_factor, "fusion_cost_factor", config); - // For both Kraus and SuperOp method we simulate using superoperator - // simulator - QubitSuperoperator::State<> superop_simulator; - superop_simulator.initialize_qreg(qubits.size()); - 
superop_simulator.apply_ops(fusioned_ops, dummy_result, dummy_rng); - auto superop = superop_simulator.move_to_matrix(); + if (JSON::check_key("fusion_enable.cost_based", config)) + JSON::get_value(active, "fusion_enable.cost_based", config); - if (method == Method::superop) { - return Operations::make_superop(qubits, std::move(superop)); + for (int i = 0; i < 64; ++i) { + auto prop_name = "fusion_cost." + std::to_string(i + 1); + if (JSON::check_key(prop_name, config)) + JSON::get_value(costs[i], prop_name, config); } - - // If Kraus method we convert superop to canonical Kraus representation - size_t dim = 1 << qubits.size(); - return Operations::make_kraus(qubits, Utils::superop2kraus(superop, dim)); } -bool Fusion::aggregate_operations(oplist_t& ops, +bool CostBasedFusion::aggregate_operations(oplist_t& ops, const int fusion_start, const int fusion_end, - uint_t max_fused_qubits, - Method method) const { + const uint_t max_fused_qubits, + const FusionMethod& method) const { + if (!active) + return false; // costs[i]: estimated cost to execute from 0-th to i-th in original.ops std::vector costs; @@ -377,14 +928,14 @@ bool Fusion::aggregate_operations(oplist_t& ops, // set costs and fusion_to of fusion_start fusion_to.push_back(fusion_start); - costs.push_back(get_cost(ops[fusion_start])); + costs.push_back(method.can_ignore(ops[fusion_start])? .0 : cost_factor); bool applied = false; // calculate the minimal path to each operation in the circuit for (int i = fusion_start + 1; i < fusion_end; ++i) { // init with fusion from i-th to i-th fusion_to.push_back(i); - costs.push_back(costs[i - fusion_start - 1] + get_cost(ops[i])); + costs.push_back(costs[i - fusion_start - 1] + (method.can_ignore(ops[i])? 
.0 : cost_factor)); for (int num_fusion = 2; num_fusion <= static_cast (max_fused_qubits); ++num_fusion) { // calculate cost if {num_fusion}-qubit fusion is applied @@ -416,36 +967,13 @@ bool Fusion::aggregate_operations(oplist_t& ops, // generate a new circuit with the minimal path to the last operation in the circuit for (int i = fusion_end - 1; i >= fusion_start;) { - int to = fusion_to[i - fusion_start]; - if (to != i) { - std::vector fusioned_ops; - std::set fusioned_qubits; - for (int j = to; j <= i; ++j) { - fusioned_ops.push_back(ops[j]); - fusioned_qubits.insert(ops[j].qubits.cbegin(), ops[j].qubits.cend()); - ops[j].type = optype_t::nop; - } - if (!fusioned_ops.empty()) { - // We need to remap qubits in fusion subcircuits for simulation - // TODO: This could be done above during the fusion cost calculation - reg_t qubits(fusioned_qubits.begin(), fusioned_qubits.end()); - std::unordered_map qubit_mapping; - for (size_t j = 0; j < qubits.size(); j++) { - qubit_mapping[qubits[j]] = j; - } - // Remap qubits and determine method - bool non_unitary = false; - for (auto & op: fusioned_ops) { - non_unitary |= noise_opset_.contains(op.type); - for (size_t j = 0; j < op.qubits.size(); j++) { - op.qubits[j] = qubit_mapping[op.qubits[j]]; - } - } - Method required_method = (non_unitary) ? 
method : Method::unitary; - ops[i] = generate_fusion_operation(fusioned_ops, qubits, required_method); - } + std::vector fusing_op_idxs; + for (int j = to; j <= i; ++j) + fusing_op_idxs.push_back(j); + if (!fusing_op_idxs.empty()) + allocate_new_operation(ops, i, fusing_op_idxs, method, false); } i = to - 1; } @@ -456,7 +984,7 @@ bool Fusion::aggregate_operations(oplist_t& ops, // Gate-swap optimized helper functions //------------------------------------------------------------------------------ -bool Fusion::is_diagonal(const std::vector& ops, +bool CostBasedFusion::is_diagonal(const std::vector& ops, const uint_t from, const uint_t until) const { @@ -485,34 +1013,38 @@ bool Fusion::is_diagonal(const std::vector& ops, return true; } -double Fusion::estimate_cost(const std::vector& ops, +double CostBasedFusion::estimate_cost(const std::vector& ops, const uint_t from, const uint_t until) const { if (is_diagonal(ops, from, until)) - return cost_factor; + return 1.0; reg_t fusion_qubits; for (uint_t i = from; i <= until; ++i) add_fusion_qubits(fusion_qubits, ops[i]); + auto configured_cost = costs[fusion_qubits.size() - 1]; + if (configured_cost > 0) + return configured_cost; + if(is_avx2_supported()){ switch (fusion_qubits.size()) { case 1: // [[ falling through :) ]] case 2: - return cost_factor; + return 1.0; case 3: - return cost_factor * 1.1; + return 1.1; case 4: - return cost_factor * 3; + return 3; default: - return pow(cost_factor, (double) std::max(fusion_qubits.size() - 1, size_t(1))); + return pow(cost_factor, (double) std::max(fusion_qubits.size() - 2, size_t(1))); } } return pow(cost_factor, (double) std::max(fusion_qubits.size() - 1, size_t(1))); } -void Fusion::add_fusion_qubits(reg_t& fusion_qubits, const op_t& op) const { +void CostBasedFusion::add_fusion_qubits(reg_t& fusion_qubits, const op_t& op) const { for (const auto &qubit: op.qubits){ if (find(fusion_qubits.begin(), fusion_qubits.end(), qubit) == fusion_qubits.end()){ 
      fusion_qubits.push_back(qubit);
diff --git a/test/terra/backends/qasm_simulator/qasm_fusion.py b/test/terra/backends/qasm_simulator/qasm_fusion.py
index 838810cc5c..9412785588 100644
--- a/test/terra/backends/qasm_simulator/qasm_fusion.py
+++ b/test/terra/backends/qasm_simulator/qasm_fusion.py
@@ -14,9 +14,10 @@
 """
 # pylint: disable=no-member
 import copy
+import numpy as np
 
 from qiskit import QuantumRegister, ClassicalRegister, QuantumCircuit
-from qiskit.circuit.library import QuantumVolume, QFT
+from qiskit.circuit.library import QuantumVolume, QFT, RealAmplitudes
 from qiskit.compiler import assemble, transpile
 from qiskit.providers.aer import QasmSimulator
 from qiskit.providers.aer.noise import NoiseModel
@@ -463,3 +464,80 @@ def test_fusion_parallelization(self):
             result_serial.get_counts(circuit),
             delta=0.0,
             msg="parallelized fusion was failed")
+
+    def test_fusion_two_qubits(self):
+        """Test 2-qubit fusion"""
+        shots = 100
+        num_qubits = 8
+        reps = 3
+
+        circuit = RealAmplitudes(num_qubits=num_qubits, entanglement='linear', reps=reps)
+        param_binds = {}
+        for param in circuit.parameters:
+            param_binds[param] = np.random.random()
+
+        circuit = transpile(circuit.bind_parameters(param_binds),
+                            backend=self.SIMULATOR,
+                            optimization_level=0)
+        circuit.measure_all()
+
+        qobj = assemble([circuit],
+                        self.SIMULATOR,
+                        shots=shots,
+                        seed_simulator=1)
+
+        backend_options = self.fusion_options(enabled=True, threshold=1)
+        backend_options['fusion_verbose'] = True
+
+        backend_options['fusion_enable.2_qubits'] = False
+        result_disabled = self.SIMULATOR.run(qobj, **backend_options).result()
+        meta_disabled = self.fusion_metadata(result_disabled)
+
+        backend_options['fusion_enable.2_qubits'] = True
+        result_enabled = self.SIMULATOR.run(qobj, **backend_options).result()
+        meta_enabled = self.fusion_metadata(result_enabled)
+
+        self.assertTrue(getattr(result_disabled, 'success', False))
+        self.assertTrue(getattr(result_enabled, 'success', False))
+
+        self.assertTrue(
+            (len(meta_enabled['output_ops']) if 'output_ops' in meta_enabled else len(circuit.data)) <
+            (len(meta_disabled['output_ops']) if 'output_ops' in meta_disabled else len(circuit.data)))
+
+    def test_fusion_diagonal(self):
+        """Test diagonal fusion"""
+        shots = 100
+        num_qubits = 8
+
+        circuit = QuantumCircuit(num_qubits)
+        for i in range(num_qubits):
+            circuit.p(0.1, i)
+
+        for i in range(num_qubits - 1):
+            circuit.cp(0.1, i, i + 1)
+
+        circuit = transpile(circuit,
+                            backend=self.SIMULATOR,
+                            optimization_level=0)
+        circuit.measure_all()
+
+        qobj = assemble([circuit],
+                        self.SIMULATOR,
+                        shots=shots,
+                        seed_simulator=1)
+
+        backend_options = self.fusion_options(enabled=True, threshold=1)
+        backend_options['fusion_verbose'] = True
+
+        backend_options['fusion_enable.cost_based'] = False
+        result = self.SIMULATOR.run(qobj, **backend_options).result()
+        meta = self.fusion_metadata(result)
+
+        method = result.results[0].metadata.get('method')
+        if method not in ['statevector']:
+            return
+
+        for op in meta['output_ops']:
+            op_name = op['name']
+            if op_name == 'measure':
+                break
+            self.assertEqual(op_name, 'diagonal')
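Note for reviewers: `CostBasedFusion::aggregate_operations` above chooses fusion boundaries with a shortest-path style dynamic program — `costs[i]` is the cheapest way to execute ops `fusion_start..i`, and each `fusion_to[i]` records the start of the block that ends at `i`. The sketch below illustrates that idea in plain Python with a deliberately simplified, hypothetical cost model (`cost_factor ** (nqubits - 1)` per fused block); the real pass additionally special-cases diagonal blocks, AVX2-tuned per-size costs, and ignorable ops, so this is not the Aer implementation.

```python
def optimal_fusion(op_qubits, max_fused_qubits, cost_factor=1.8):
    """Pick fusion blocks minimizing total estimated cost.

    op_qubits: one list of qubit indices per operation.
    Returns (total_cost, fusion_to) where fusion_to[i] is the index of
    the first operation fused into the block that ends at operation i.
    Hypothetical cost model: an unfused op costs cost_factor; a fused
    block over q qubits costs cost_factor ** max(q - 1, 1).
    """
    costs = []      # costs[i]: minimal cost to execute ops 0..i
    fusion_to = []
    for i, qubits_i in enumerate(op_qubits):
        # default: operation i runs on its own
        best = (costs[i - 1] if i > 0 else 0.0) + cost_factor
        best_to = i
        # try fusing ops j..i while the union of their qubits still fits
        union = set(qubits_i)
        for j in range(i - 1, -1, -1):
            union |= set(op_qubits[j])
            if len(union) > max_fused_qubits:
                break
            fused_cost = cost_factor ** max(len(union) - 1, 1)
            cand = (costs[j - 1] if j > 0 else 0.0) + fused_cost
            if cand < best:
                best, best_to = cand, j
        costs.append(best)
        fusion_to.append(best_to)
    return costs[-1], fusion_to
```

For example, three single-qubit gates on the same qubit collapse into one block of cost `1.8` instead of `5.4`, while two gates on disjoint qubits with `max_fused_qubits=1` stay unfused.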