diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md
index 9583b82a..10e927ad 100644
--- a/.github/CONTRIBUTING.md
+++ b/.github/CONTRIBUTING.md
@@ -1,50 +1,176 @@
-
 # Contributing to **Cooper**
+
+We encourage contributions to **Cooper**.
 
-We want to make contributing to **Cooper** as easy and transparent as
-possible.
+Some things that we would like to see in the future, but have not yet had time to implement, are:
+- More tutorials showing how to use **Cooper** in non-deep learning applications.
 
-## Building
+## How to contribute
 
-Using `pip`, you can install the package in development mode by running:
+Please follow these steps to contribute:
 
-```sh
-pip install --editable ".[dev]"
-```
+1. If you plan to contribute new features, please first open an issue and discuss the feature with us.
 
-## Testing
+2. Fork the **Cooper** repository by clicking the **Fork** button on the
+   [repository page](http://www.github.com/cooper-org/cooper). This creates
+   a copy of the **Cooper** repository in your own account.
 
-We test the package using `pytest`, which you can run locally by typing
+3. Install Python >= 3.9 locally in order to run tests.
 
-```sh
-pytest tests
-```
+4. Install your fork from source using `pip`. This allows you to modify the code
+   and immediately test it out:
+   ```bash
+   git clone https://github.com/YOUR_USERNAME/cooper
+   cd cooper
+   pip install --editable .  # Without tests.
+   pip install --editable '.[tests]'  # Matches test environment.
+   pip install --editable '.[dev]'  # Matches development environment.
+   pip install --editable '.[notebooks]'  # Install dependencies for running notebooks.
+   pip install --editable '.[docs]'  # Used to generate the documentation.
+   pip install --editable '.[dev, docs]'  # Install development and documentation dependencies.
+   ```
+
+5. Add the **Cooper** repo as an upstream remote, so you can use it to sync your
+   changes.
+
+   ```bash
+   git remote add upstream https://www.github.com/cooper-org/cooper
+   ```
+
+6. Create a branch where you will develop from:
+
+   ```bash
+   git checkout -b name-of-change
+   ```
+
+7. Make sure your code passes **Cooper**'s lint and type checks by running the following from
+   the top of the repository:
+
+   ```bash
+   pip install pre-commit
+   pre-commit run --all-files
+   ```
+
+8. Make sure the tests pass by running the following command from the top of
+   the repository:
+
+   ```bash
+   pytest tests
+   ```
+
+   **Cooper**'s pipeline tests can take a while to run, so if you know the specific test file that covers your changes, you can limit the tests to that file; for example:
 
-## Pull Requests
+   ```bash
+   pytest tests/multipliers/test_explicit_multipliers.py
+   ```
 
-We actively welcome your pull requests.
+   You can narrow the tests further by using the `pytest -k` flag to match particular test
+   names:
 
-1. Fork the repo and create your branch from `master`.
-2. If you've added code that should be tested, add tests.
-3. If you've changed APIs, update the documentation.
-4. Ensure the test suite passes.
-5. Make sure your code lints.
+   ```bash
+   pytest tests/test_cmp.py -k test_cmp_state_dict
+   ```
 
+9. Once you are satisfied with your change, create a commit as follows (see
+   [how to write a commit message](https://chris.beams.io/posts/git-commit/)):
+
+   ```bash
+   git add file1.py file2.py ...
+   git commit -m "Your commit message"
+   ```
+
+   Then sync your code with the main repo:
+
+   ```bash
+   git fetch upstream
+   git rebase upstream/main
+   ```
+
+   Finally, push your commit from your development branch, creating a remote
+   branch in your fork that you can use to open a pull request:
+
+   ```bash
+   git push --set-upstream origin name-of-change
+   ```
+
+10. Create a pull request to the **Cooper** repository and send it for review. The pull request should target the `dev` branch.
+
+If you have any questions, please feel free to ask in the issue you opened, or reach out via our [Discord server](https://discord.gg/Aq5PjH8m6E).
 
 ## Issues
 
 We use GitHub issues to track public bugs. Please ensure your description is
 clear and has sufficient instructions to be able to reproduce the issue.
 
-## Coding Style
+## Code Style
+
+We use [ruff](https://docs.astral.sh/ruff/) for linting, formatting and import sorting. We ask for type hints for all code committed to **Cooper** and check for compliance with [mypy](https://mypy.readthedocs.io/).
+The CI system should check this when you submit your pull requests.
+The easiest way to run these checks locally is via the
+[pre-commit](https://pre-commit.com/) framework:
+
+```bash
+pip install pre-commit
+pre-commit run --all-files
+```
+
+## Update notebooks
+
+We use [jupytext](https://jupytext.readthedocs.io/) to maintain two synced copies of the notebooks
+in `docs/source/notebooks`: one in `ipynb` format, and one in `md` format. The advantage of the former
+is that it can be opened and executed directly in Colab; the advantage of the latter is that
+it makes it much easier to track diffs within version control.
+
+```bash
+pip install jupytext==1.16.4
+jupytext --sync docs/source/notebooks/new_tutorial.ipynb
+```
+
+The jupytext version should match that specified in
+[.pre-commit-config.yaml](https://github.com/cooper-org/cooper/blob/master/.pre-commit-config.yaml).
+
+To check that the markdown and ipynb files are properly synced, you may use the
+[pre-commit](https://pre-commit.com/) framework to perform the same check used
+by the GitHub CI:
+
+```bash
+pip install pre-commit
+pre-commit run jupytext --all-files
+```
+
+## Update documentation
+
+To rebuild the documentation, install the documentation dependencies:
+
+```bash
+pip install -e '.[docs]'
+```
+
+And then run:
+
+```bash
+sphinx-build -b html docs/source docs/source/build/html -j auto
+```
+
+This can take a long time because it executes many of the notebooks in the documentation source;
+if you'd prefer to build the docs without executing the notebooks, you can run:
+
+```bash
+sphinx-build -b html -D nb_execution_mode=off docs/source docs/source/build/html -j auto
+```
+
+You can then see the generated documentation in `docs/source/build/html/index.html`.
+
+The `-j auto` option controls the parallelism of the build. You can use a number
+in place of `auto` to control how many CPU cores to use.
 
-We use `ruff` for linting, formatting and import sorting. We ask for type hints for all code committed to **Cooper** and check
-for compliance with `mypy`. The CI system should check of this when you submit
-your pull requests.
 
 ## License
 
 By contributing to **Cooper**, you agree that your contributions will be
 licensed under the LICENSE file in the root directory of this source tree.
+
+## Acknowledgements
+
+This CONTRIBUTING.md file is based on the one from [JAX](https://jax.readthedocs.io/en/latest/contributing.html).
diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml index 9ea7665a..251bc0f6 100644 --- a/.github/workflows/ci.yaml +++ b/.github/workflows/ci.yaml @@ -5,6 +5,14 @@ on: push: branches: [ master ] +concurrency: + # github.workflow: name of the workflow + # github.event.pull_request.number || github.ref: pull request number or branch name if not a pull request + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }} + + # Cancel in-progress runs when a new workflow with the same group name is triggered + cancel-in-progress: true + jobs: ci: runs-on: ubuntu-latest @@ -24,12 +32,13 @@ jobs: uses: actions/setup-python@v5 with: python-version: "3.10" + cache: "pip" - name: Pre-commit checks uses: pre-commit/action@v3.0.1 - name: Install package & dependencies - run: pip install --editable '.[dev, docs, tests, examples]' + run: pip install '.[tests]' - name: Launch tests & generate coverage report run: coverage run -m pytest tests @@ -39,6 +48,10 @@ jobs: uses: py-cov-action/python-coverage-comment-action@v3 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + # If the coverage percentage is above or equal to this value, the badge will be green. + MINIMUM_GREEN: 90 + # If the coverage percentage is below this value, the badge will be red. + MINIMUM_ORANGE: 80 - name: Store Pull Request comment to be posted uses: actions/upload-artifact@v4 diff --git a/.github/workflows/coverage.yaml b/.github/workflows/coverage.yaml index 6e24509e..bdc7bdb9 100644 --- a/.github/workflows/coverage.yaml +++ b/.github/workflows/coverage.yaml @@ -22,9 +22,12 @@ jobs: # artifact that contains the comment to be published actions: read steps: - - name: Post comment uses: py-cov-action/python-coverage-comment-action@v3 with: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} GITHUB_PR_RUN_ID: ${{ github.event.workflow_run.id }} + # If the coverage percentage is above or equal to this value, the badge will be green. + MINIMUM_GREEN: 90 + # If the coverage percentage is below this value, the badge will be red. 
+ MINIMUM_ORANGE: 80 diff --git a/.github/workflows/publish.yaml b/.github/workflows/publish.yaml index a00ff0fc..9d818f8e 100644 --- a/.github/workflows/publish.yaml +++ b/.github/workflows/publish.yaml @@ -4,7 +4,7 @@ on: workflow_dispatch: push: tags: - - v.** + - 'v[0-9]+\.[0-9]+\.[0-9]+' jobs: publish: diff --git a/.github/workflows/test.yaml b/.github/workflows/test.yaml new file mode 100644 index 00000000..a42b21ff --- /dev/null +++ b/.github/workflows/test.yaml @@ -0,0 +1,48 @@ +name: Run tests + +on: + pull_request: + push: + branches: [ master ] + +concurrency: + # github.workflow: name of the workflow + # github.event.pull_request.number || github.ref: pull request number or branch name if not a pull request + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }} + + # Cancel in-progress runs when a new workflow with the same group name is triggered + cancel-in-progress: true + +jobs: + test: + runs-on: ubuntu-latest + strategy: + fail-fast: false + matrix: + python-version: [ "3.9", "3.10", "3.11" ] + TORCH_VERSION: [ "1.13.1", "2.0.1", "2.1.2", "2.2.2", "2.3.1", "2.4.0" ] + include: + - python-version: "3.12" + TORCH_VERSION: "2.4.0" + exclude: + - python-version: "3.11" + TORCH_VERSION: "1.13.1" + steps: + - uses: actions/checkout@v4 + + - name: Set up Python ${{ matrix.python-version }} + uses: actions/setup-python@v5 + with: + python-version: ${{ matrix.python-version }} + cache: "pip" + + - name: Install dependencies and PyTorch ${{ matrix.TORCH_VERSION }} + run: | + pip install --upgrade pip + pip install torch==${{ matrix.TORCH_VERSION }} '.[tests]' + + - name: Launch tests + # Only run the unit tests, not the pipeline tests. + # Pipeline tests are too expensive to run for every python/PyTorch version. 
# However, they are run as part of the coverage job in the CI workflow
+      run: pytest --ignore=tests/pipeline tests
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 0aac89dc..7253f832 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -19,7 +19,7 @@ repos:
         files: \.py$
 
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.5.6
+    rev: v0.6.1
    hooks:
      - id: ruff
        types_or: [ python, pyi, jupyter ]
@@ -28,7 +28,7 @@
        types_or: [ python, pyi, jupyter ]
 
  - repo: https://github.com/mwouts/jupytext
-    rev: v1.16.3
+    rev: v1.16.4
    hooks:
      - id: jupytext
        files: docs/source/notebooks
diff --git a/CITATION.cff b/CITATION.cff
index 02fb739a..99e836c6 100644
--- a/CITATION.cff
+++ b/CITATION.cff
@@ -7,6 +7,8 @@ authors:
     given-names: "Juan"
   - family-names: "Hashemizadeh"
     given-names: "Meraj"
-title: "Cooper: a toolkit for Lagrangian-based constrained optimization"
-date-released: 2022-03-15
+  - family-names: "Lacoste-Julien"
+    given-names: "Simon"
+title: "Cooper: A Library for Constrained Optimization in Deep Learning"
+date-released: 2024-09-01
 url: "https://github.com/cooper-org/cooper"
diff --git a/LICENSE b/LICENSE
index 12782ee7..41222ac7 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2022, The Cooper Developers
+Copyright (c) 2024, The Cooper Developers
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
diff --git a/Makefile b/Makefile
deleted file mode 100644
index 46f3d1ef..00000000
--- a/Makefile
+++ /dev/null
@@ -1,89 +0,0 @@
-.PHONY: clean clean-build clean-pyc clean-test coverage dist docs help install lint lint/flake8 lint/black
-.DEFAULT_GOAL := help
-
-define BROWSER_PYSCRIPT
-import os, webbrowser, sys
-
-from urllib.request import pathname2url
-
-webbrowser.open("file://" + pathname2url(os.path.abspath(sys.argv[1])))
-endef
-export BROWSER_PYSCRIPT
-
-define PRINT_HELP_PYSCRIPT
-import re, sys
-
-for line in sys.stdin:
-	match = re.match(r'^([a-zA-Z_-]+):.*?## (.*)$$', line)
-	if match:
-		target, help = match.groups()
-		print("%-20s %s" % (target, help))
-endef
-export PRINT_HELP_PYSCRIPT
-
-BROWSER := python -c "$$BROWSER_PYSCRIPT"
-
-help:
-	@python -c "$$PRINT_HELP_PYSCRIPT" < $(MAKEFILE_LIST)
-
-clean: clean-build clean-pyc clean-test ## remove all build, test, coverage and Python artifacts
-
-clean-build: ## remove build artifacts
-	rm -fr build/
-	rm -fr dist/
-	rm -fr .eggs/
-	find . -name '*.egg-info' -exec rm -fr {} +
-	find . -name '*.egg' -exec rm -f {} +
-
-clean-pyc: ## remove Python file artifacts
-	find . -name '*.pyc' -exec rm -f {} +
-	find . -name '*.pyo' -exec rm -f {} +
-	find . -name '*~' -exec rm -f {} +
-	find . 
-name '__pycache__' -exec rm -fr {} +
-
-clean-test: ## remove test and coverage artifacts
-	rm -fr .tox/
-	rm -f .coverage
-	rm -fr htmlcov/
-	rm -fr .pytest_cache
-
-lint/flake8: ## check style with flake8
-	flake8 cooper tests
-lint/black: ## check style with black
-	black --check cooper tests
-
-lint: lint/flake8 lint/black ## check style
-
-test: ## run tests quickly with the default Python
-	pytest
-
-test-all: ## run tests on every Python version with tox
-	tox
-
-coverage: ## check code coverage quickly with the default Python
-	coverage run --source cooper -m pytest
-	coverage report -m
-	coverage html
-	$(BROWSER) htmlcov/index.html
-
-docs: ## generate Sphinx HTML documentation, including API docs
-	rm -f docs/cooper.rst
-	rm -f docs/modules.rst
-	sphinx-apidoc -o docs/ cooper
-	$(MAKE) -C docs clean
-	$(MAKE) -C docs html
-	$(BROWSER) docs/_build/html/index.html
-
-servedocs: docs ## compile the docs watching for changes
-	watchmedo shell-command -p '*.rst' -c '$(MAKE) -C docs html' -R -D .
-
-release: dist ## package and upload a release
-	twine upload dist/*
-
-dist: clean ## builds source and wheel package
-	python setup.py sdist
-	python setup.py bdist_wheel
-	ls -l dist
-
-install: clean ## install the package to the active Python's site-packages
-	python setup.py install
diff --git a/README.md b/README.md
index 537778cf..2414735a 100644
--- a/README.md
+++ b/README.md
@@ -1,180 +1,152 @@
-# Cooper
+# **Cooper**
 
-[![LICENSE](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/cooper-org/cooper/tree/master/LICENSE)
-[![DOCS](https://readthedocs.org/projects/cooper/badge/?version=latest)](https://cooper.readthedocs.io/en/latest/?version=latest)
-[![Build and Test](https://github.com/cooper-org/cooper/actions/workflows/build.yml/badge.svg)](https://github.com/cooper-org/cooper/actions/workflows/build.yml)
-[![Coverage](https://codecov.io/gh/cooper-org/cooper/graph/badge.svg?token=4U41P8JCE1)](https://codecov.io/gh/cooper-org/cooper)
-[![HitCount](https://hits.dwyl.com/cooper-org/cooper.svg?style=flat-square)](https://cooper.readthedocs.io/en/latest/?version=latest)
-[![contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](https://github.com/cooper-org/cooper/issues)
+[![LICENSE](https://img.shields.io/pypi/l/cooper-optim)](https://github.com/cooper-org/cooper/tree/master/LICENSE)
+[![Version](https://img.shields.io/pypi/v/cooper-optim?label=version)](https://pypi.python.org/pypi/cooper-optim)
+[![Downloads](https://img.shields.io/pepy/dt/cooper-optim?color=blue)](https://pypi.python.org/pypi/cooper-optim)
+[![Python](https://img.shields.io/pypi/pyversions/cooper-optim?label=Python&logo=python&logoColor=white)](https://pypi.python.org/pypi/cooper-optim)
+[![PyTorch](https://img.shields.io/badge/PyTorch-1.13.1+-EE4C2C?logo=pytorch)](https://pytorch.org/docs/stable/index.html)
+[![DOCS](https://img.shields.io/readthedocs/cooper)](https://cooper.readthedocs.io/en/latest/?version=latest)
+[![Coverage badge](https://raw.githubusercontent.com/cooper-org/cooper/python-coverage-comment-action-data/badge.svg)](https://github.com/cooper-org/cooper/tree/python-coverage-comment-action-data)
+[![Continuous Integration](https://github.com/cooper-org/cooper/actions/workflows/ci.yaml/badge.svg)](https://github.com/cooper-org/cooper/actions/workflows/ci.yaml)
+[![Stars](https://img.shields.io/github/stars/cooper-org/cooper)](https://github.com/cooper-org/cooper)
+[![HitCount](https://img.shields.io/endpoint?url=https://hits.dwyl.com/cooper-org/cooper.json&color=brightgreen)](https://cooper.readthedocs.io/en/latest/?version=latest)
+[![contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen)](https://github.com/cooper-org/cooper/issues)
+[![Discord](https://img.shields.io/badge/Discord-5865F2?logo=discord&logoColor=white)](https://discord.gg/Aq5PjH8m6E)
+[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
 
-## About
+## What is **Cooper**?
 
-**Cooper** is a toolkit for Lagrangian-based constrained optimization in PyTorch.
-This library aims to encourage and facilitate the study of constrained
-optimization problems in machine learning.
+**Cooper** is a library for solving constrained optimization problems in [PyTorch](https://github.com/pytorch/pytorch).
 
-**Cooper** is (almost!) seamlessly integrated with PyTorch and preserves the
-usual `loss -> backward -> step` workflow. If you are already familiar with
-PyTorch, using **Cooper** will be a breeze! 🙂
+**Cooper** implements several Lagrangian-based (first-order) update schemes that are applicable to a wide range of continuous constrained optimization problems. **Cooper** is mainly aimed at deep learning applications (where gradients are estimated based on mini-batches), but it is also suitable for general continuous constrained optimization.
 
-**Cooper** was born out of the need to handle constrained optimization problems
-for which the loss or constraints are not necessarily "nicely behaved"
-or "theoretically tractable", e.g. when no (efficient) projection or proximal
-are available. Although assumptions of this kind have enabled the development of
-great PyTorch-based libraries such as [CHOP](https://github.com/openopt/chop)
-and [GeoTorch](https://github.com/Lezcano/geotorch), they are seldom satisfied
-in the context of many modern machine learning problems.
+There exist other libraries for constrained optimization in PyTorch, like [CHOP](https://github.com/openopt/chop) and [GeoTorch](https://github.com/Lezcano/geotorch), but they rely on assumptions about the constraints (such as admitting efficient projection or proximal operators). These assumptions are often not met in modern machine learning problems. **Cooper** can be applied to a wider range of constrained optimization problems (including non-convex problems) thanks to its Lagrangian-based approach.
 
-Many of the structural design ideas behind **Cooper** are heavily inspired by
-the [TensorFlow Constrained Optimization (TFCO)](https://github.com/google-research/tensorflow_constrained_optimization)
-library. We highly recommend TFCO for TensorFlow-based projects and will
-continue to integrate more of TFCO's features in future releases.
+TODO(juan43ramirez): mention Cooper MLOSS paper
 
-⚠️ This library is under active development. Future API changes might break backward
-compatibility. 
⚠️
 
+- [**Cooper**](#cooper)
+  - [What is **Cooper**?](#what-is-cooper)
+  - [Installation](#installation)
+  - [Getting Started](#getting-started)
+    - [Quick Start](#quick-start)
+    - [Example](#example)
+  - [Contributions](#contributions)
+  - [Acknowledgements](#acknowledgements)
+  - [License](#license)
+  - [How to cite **Cooper**](#how-to-cite-cooper)
+  - [FAQ](#faq)
 
-## Getting Started
 
-Here we consider a simple convex constrained optimization problem that involves
-training a Logistic Regression clasifier on the MNIST dataset. The model is
-constrained so that the squared L2 norm of its parameters is less than 1.
+## Installation
 
-This example illustrates how **Cooper** integrates with:
-- constructing a ``cooper.LagrangianFormulation`` and a ``cooper.SimultaneousOptimizer``
-- models defined using a ``torch.nn.Module``,
-- CUDA acceleration,
-- typical machine learning training loops,
-- extracting the value of the Lagrange multipliers from a ``cooper.LagrangianFormulation``.
+To install the latest release of Cooper, use the following command:
 
-Please visit the entry in the **Tutorial Gallery** for a complete version of the code.
+```bash
+pip install cooper-optim
+```
 
-```python
-import cooper
-import torch
+To install the latest **development** version, use the following command instead:
 
-train_loader = ... # Create a PyTorch Dataloader for MNIST
-loss_fn = torch.nn.CrossEntropyLoss()
+```bash
+pip install git+https://github.com/cooper-org/cooper@dev
+```
 
-# Create a Logistic Regression model
-model = torch.nn.Linear(in_features=28 * 28, out_features=10, bias=True)
-if torch.cuda.is_available():
-    model = model.cuda()
-primal_optimizer = torch.optim.Adagrad(model.parameters(), lr=5e-3)
+## Getting Started
 
-# Create a Cooper formulation, and pick a PyTorch optimizer class for the dual variables
-formulation = cooper.LagrangianFormulation()
-dual_optimizer = cooper.optim.partial_optimizer(torch.optim.SGD, lr=1e-3)
-# Create a ConstrainedOptimizer for performing simultaneous updates based on the
-# formulation, and the selected primal and dual optimizers.
-cooper_optimizer = cooper.SimultaneousOptimizer(
-    formulation, primal_optimizer, dual_optimizer
-)
+### Quick Start
 
-for epoch_num in range(50):
-    for batch_num, (inputs, targets) in enumerate(train_loader):
+To use **Cooper**, you need to:
 
-        if torch.cuda.is_available():
-            inputs, targets = inputs.cuda(), targets.cuda()
+- Implement a {py:class}`~cooper.ConstrainedMinimizationProblem` (`CMP`) class and its associated {py:meth}`~cooper.ConstrainedMinimizationProblem.compute_cmp_state` method. This method computes the objective value and constraint violations, and packages them in a {py:class}`~cooper.CMPState` object.
+- The initialization of the `CMP` must create a {py:class}`~cooper.Constraint` object for each constraint. Each constraint requires an associated {py:class}`~cooper.Multiplier` object corresponding to the Lagrange multipliers for that constraint.
+- Create a {py:class}`torch.optim.Optimizer` for the primal variables and a {py:class}`torch.optim.Optimizer(maximize=True)` for the dual variables (i.e., the multipliers). Then, wrap these two optimizers in a {py:class}`~cooper.optim.CooperOptimizer` (such as {py:class}`~cooper.optim.constrained_optimizer.SimultaneousOptimizer` for executing simultaneous updates).
+- You are now ready to perform updates on the primal and dual parameters using the {py:meth}`cooper.optim.CooperOptimizer.roll` method. 
This method triggers the following calls:
+  - {py:meth}`zero_grad` on both optimizers,
+  - {py:meth}`~cooper.ConstrainedMinimizationProblem.compute_cmp_state` on the `CMP`,
+  - computation of the Lagrangian based on the latest {py:class}`~cooper.cmp.CMPState`,
+  - {py:meth}`backward` on the Lagrangian,
+  - {py:meth}`~torch.optim.Optimizer.step` on both optimizers.
 
-        logits = model.forward(inputs.view(inputs.shape[0], -1))
-        loss = loss_fn(logits, targets)
-        sq_l2_norm = model.weight.pow(2).sum() + model.bias.pow(2).sum()
-        # Constraint defects use convention “g - \epsilon ≤ 0”
-        constraint_defect = sq_l2_norm - 1.0
+### Example
 
-        # Create a CMPState object, which contains the loss and constraint defect
-        cmp_state = cooper.CMPState(loss=loss, ineq_defect=constraint_defect)
+This is an abstract example of how to solve a constrained optimization problem with
+**Cooper**. You can find runnable notebooks in our [**Tutorials**](https://cooper.readthedocs.io/en/master/notebooks/index.html).
 
-        cooper_optimizer.zero_grad()
-        lagrangian = formulation.compute_lagrangian(pre_computed_state=cmp_state)
-        formulation.backward(lagrangian)
-        cooper_optimizer.step()
+```python
+import cooper
+import torch
 
-        # We can extract the value of the Lagrange multiplier for the constraint
-        # The dual variables are stored and updated internally by Cooper
-        lag_multiplier, _ = formulation.state()
+# Set up GPU acceleration
+DEVICE = ...
 
-```
+class MyCMP(cooper.ConstrainedMinimizationProblem):
+    def __init__(self):
+        super().__init__()
+        multiplier = cooper.multipliers.DenseMultiplier(num_constraints=..., device=DEVICE)
+        # By default, constraints are built using `formulation_type=cooper.LagrangianFormulation`
+        self.constraint = cooper.Constraint(
+            multiplier=multiplier, constraint_type=cooper.ConstraintType.INEQUALITY
+        )
 
-## Installation
+    def compute_cmp_state(self, model, inputs, targets):
+        inputs, targets = inputs.to(DEVICE), targets.to(DEVICE)
+        loss = ...
+        constraint_state = cooper.ConstraintState(violation=...)
+        observed_constraints = {self.constraint: constraint_state}
 
-### Basic Installation
+        return cooper.CMPState(loss=loss, observed_constraints=observed_constraints)
 
-```bash
-pip install git+https://github.com/cooper-org/cooper.git
-```
-
-### Development Installation
-First, clone the [repository](https://github.com/cooper-org/cooper), navigate
-to the **Cooper** root directory and install the package in development mode by running:
+train_loader = ...
+model = (...).to(DEVICE)
+cmp = MyCMP()
 
-| Setting     | Command                                   | Notes                                      |
-| ----------- | ----------------------------------------- | ------------------------------------------ |
-| Development | `pip install --editable ".[dev, tests]"`  | Editable mode. Matches test environment.   |
-| Docs        | `pip install --editable ".[docs]"`        | Used to re-generate the documentation.     |
-| Tutorials   | `pip install --editable ".[examples]"`    | Install dependencies for running examples  |
-| No Tests    | `pip install --editable .`                | Editable mode, without tests. 
|
 
+primal_optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
+# Must set `maximize=True` since the Lagrange multipliers solve a _maximization_ problem
+dual_optimizer = torch.optim.SGD(cmp.dual_parameters(), lr=1e-2, maximize=True)
 
-## Package structure
+cooper_optimizer = cooper.optim.SimultaneousOptimizer(
+    cmp=cmp, primal_optimizers=primal_optimizer, dual_optimizers=dual_optimizer
+)
 
-- `cooper` - base package
-  - `problem` - abstract class for representing ConstrainedMinimizationProblems (CMPs)
-  - `constrained_optimizer` - `torch.optim.Optimizer`-like class for handling CMPs
-  - `lagrangian_formulation` - Lagrangian formulation of a CMP
-  - `multipliers` - utility class for Lagrange multipliers
  - `optim` - aliases for PyTorch optimizers and [extra-gradient versions](https://github.com/GauthierGidel/Variational-Inequality-GAN/blob/master/optim/extragradient.py) of SGD and Adam
-- `tests` - unit tests for `cooper` components
- `tutorials` - source code for examples contained in the tutorial gallery
+for epoch_num in range(NUM_EPOCHS):
+    for inputs, targets in train_loader:
+        # `roll` is a convenience method that packages together the loss evaluation,
+        # the gradient computation, the primal and dual updates, and `zero_grad`
+        compute_cmp_state_kwargs = {"model": model, "inputs": inputs, "targets": targets}
+        roll_out = cooper_optimizer.roll(compute_cmp_state_kwargs=compute_cmp_state_kwargs)
+        # `roll_out` is a namedtuple containing the loss, last CMPState, and the primal
+        # and dual Lagrangian stores, useful for inspection and logging
+```
 
 ## Contributions
 
-Please read our [CONTRIBUTING](https://github.com/cooper-org/cooper/tree/master/.github/CONTRIBUTING.md)
-guide prior to submitting a pull request. We use `black` for formatting, `isort`
-for import sorting, `flake8` for linting, and `mypy` for type checking.
+Please read our [CONTRIBUTING](https://github.com/cooper-org/cooper/tree/master/.github/CONTRIBUTING.md) guide prior to submitting a pull request. We use `ruff` for formatting and linting, and `mypy` for type checking.
 
-We test all pull requests. We rely on this for reviews, so please make sure any
-new code is tested. Tests for `cooper` go in the `tests` folder in the root of
-the repository.
+## Acknowledgements
+
+We thank Manuel Del Verme, Daniel Otero, and Isabel Urrego for useful discussions during the early stages of **Cooper**.
 
 ## License
 
 **Cooper** is distributed under an MIT license, as found in the
 [LICENSE](https://github.com/cooper-org/cooper/tree/master/LICENSE) file.
 
-## Projects built with Cooper
-
-- J. Gallego-Posada et al. Controlled Sparsity via Constrained Optimization or: How I Learned to Stop Tuning Penalties and Love Constraints. In [NeurIPS 2022](https://arxiv.org/abs/2208.04425).
-- S. Lachapelle and S. Lacoste-Julien. Partial Disentanglement via Mechanism Sparsity. In [CLR Workshop at UAI 2022](https://arxiv.org/abs/2207.07732).
-- J. Ramirez and J. Gallego-Posada. L0onie: Compressing COINS with L0-constraints. In [Sparsity in Neural Networks Workshop 2022](https://arxiv.org/abs/2207.04144).
-
-*If you would like your work to be highlighted in this list, please open a pull request.*
+## How to cite **Cooper**
 
-## Acknowledgements
-
-**Cooper** supports the use of extra-gradient style optimizers for solving the
-min-max Lagrangian problem. We include the implementations of the
-[extra-gradient version](https://github.com/GauthierGidel/Variational-Inequality-GAN/blob/master/optim/extragradient.py)
-of SGD and Adam by Hugo Berard. 
- -We thank Manuel del Verme for insightful discussions during the early stages of -this library. - -This README follows closely the style of the [NeuralCompression](https://github.com/facebookresearch/NeuralCompression) -repository. - -## How to cite this work? - -If you find **Cooper** useful in your research, please consider citing it using -the snippet below: +To cite **Cooper**, please cite [this paper](link-to-paper): ```bibtex -@misc{gallegoPosada2022cooper, - author={Gallego-Posada, Jose and Ramirez, Juan}, - title={Cooper: a toolkit for Lagrangian-based constrained optimization}, +@misc{gallegoPosada2024cooper, + author={Gallego-Posada, Jose and Ramirez, Juan and Hashemizadeh, Meraj and Lacoste-Julien, Simon}, + title={{Cooper: A Library for Constrained Optimization in Deep Learning}}, howpublished={\url{https://github.com/cooper-org/cooper}}, - year={2022} + year={2024} } ``` + +## FAQ + +**Cooper**'s FAQ is available [here](https://cooper.readthedocs.io/en/latest/faq.html). diff --git a/docs/source/additional_features.md b/docs/source/additional_features.md index f4ae7bfb..78a7894e 100644 --- a/docs/source/additional_features.md +++ b/docs/source/additional_features.md @@ -8,8 +8,6 @@ In this section we provide details on using "advanced features" such as alternating updates, or the Augmented Lagrangian method, in conjunction with a {py:class}`~cooper.optim.constrained_optimizers.ConstrainedOptimizer`. -______________________________________________________________________ - (alternating-updates)= ## Alternating updates @@ -25,21 +23,19 @@ variables. This two-stage process is handled by **Cooper** inside the One can perform alternating updates in which the primal parameters are updated first. We refer to this update strategy as `cooper.optim.AlternationType.PRIMAL_DUAL`. -.. math: -``` +$$ x_{t+1} &= \texttt{primal_optimizers_update} \left( x_{t}, \nabla_{x} \mathcal{L}_{c_t}(x, \lambda_t)|_{x=x_t} \right)\\ \lambda_{t+1} &= \texttt{dual_optimizer_update} \left( \lambda_{t}, {\color{red} \mathbf{-}} \nabla_{\lambda} \mathcal{L}({\color{red} x_{t+1}}, \lambda)|_{\lambda=\lambda_t} \right) -``` +$$ Alternative, `cooper.optim.AlternationType.DUAL_PRIMAL` carries out an update of the dual parameters first. -.. math: -``` +$$ \lambda_{t+1} &= \texttt{dual_optimizer_update} \left( \lambda_{t}, {\color{red} \mathbf{-}} \nabla_{\lambda} \mathcal{L}({\color{red} x_{t}}, \lambda)|_{\lambda=\lambda_t} \right) \\ x_{t+1} &= \texttt{primal_optimizers_update} \left( x_{t}, \nabla_{x} \mathcal{L}_{c_t}(x, \lambda_{t+1})|_{x=x_t} \right) -``` +$$ :::{important} Selecting `alternation_type=AlternationType.DualPrimal` does not double the number @@ -57,8 +53,6 @@ allows for updating the Lagrange multiplier without having to re-evaluate the loss function, but rather only the constraints. 
::: -______________________________________________________________________ - (augmented-lagrangian-const-opt)= ## Augmented Lagrangian method @@ -92,7 +86,7 @@ This corresponds exactly to a (projected) gradient ascent update on the dual variables with "step size" $c_t$ on the function: $$ -\mathcal{L}_{c_t}(x_{t+1}, \lambda) \triangleq & \, \, {\color{gray} \overbrace{ f(x_{t+1}) +\frac{c_t}{2} ||g(x_{t+1}) \odot \mathbf{1}_{g(x_{t+1}) \ge 0 \vee \lambda_{g} > 0}||^2 + \frac{c_t}{2} ||h(x_{t+1})||^2}^{\text{do not contribute to gradient } \nabla_{\lambda} \mathcal{L}(x_{t+1}, \lambda)|_{\lambda = \lambda_t}}} \\ & \, \, + \lambda_{g}^{\top} \, g(x_{t+1}) + \lambda_{h}^{\top} \, h(x_{t+1}) +\mathcal{L}_{c_t}(x_{t+1}, \lambda) \triangleq & \, \, {\color{gray} \overbrace{ f(x_{t+1}) +\frac{c_t}{2} ||g(x_{t+1}) \odot \mathbf{1}_{g(x_{t+1}) \ge 0 \vee \lambda_{g} > 0}||^2 + \frac{c_t}{2} ||h(x_{t+1})||^2}^{\text{do not contribute to gradient } \nabla_{\lambda} \mathcal{L}(x_{t+1}, \lambda)|_{\lambda = \lambda_t}}} \\ & \, \, + \lambda_{g}^{\top} \, g(x_{t+1}) + \lambda_{h}^{\top} \, h(x_{t+1}) $$ Therefore, the sequence of Augmented Lagrangian coefficients can be identified @@ -172,8 +166,6 @@ for step_id in range(1000): coop.dual_scheduler.step() ``` -______________________________________________________________________ - (multiple-primal-optimizers)= # Multiple primal optimizers diff --git a/docs/source/coefficient_updaters.md b/docs/source/coefficient_updaters.md new file mode 100644 index 00000000..8c38a1fa --- /dev/null +++ b/docs/source/coefficient_updaters.md @@ -0,0 +1,3 @@ +(coefficient_updaters)= + +## Coefficient Updaters diff --git a/docs/source/conf.py b/docs/source/conf.py index 9963ea4f..20aa11be 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -20,7 +20,7 @@ # -- Project information ----------------------------------------------------- project = "Cooper" -copyright = "2022, The Cooper Developers" +copyright = "2024, The Cooper Developers" author = "The Cooper Developers" # The full version, including alpha/beta/rc tags @@ -50,21 +50,30 @@ ] mathjax3_config = { - "extensions": ["tex2jax.js"], - "TeX": { - "Macros": { - "argmin": "\\DeclareMathOperator*{\\argmin}{\\mathbf{arg\\,min}}", - "argmax": "\\DeclareMathOperator*{\\argmin}{\\mathbf{arg\\,max}}", - "bs": "\\newcommand{\\bs}[1]{\\boldsymbol{#1}}", - }, + "tex": { + "macros": { + "argmin": ["\\underset{#1}{\\text{argmin}}", 1], + "argmax": ["\\underset{#1}{\\text{argmax}}", 1], + "reals": "\\mathbb{R}", + "bs": ["\\boldsymbol{#1}", 1], + "vx": "\\bs{x}", + "vlambda": "\\bs{\\lambda}", + "vmu": "\\bs{\\mu}", + "vg": "\\bs{g}", + "vh": "\\bs{h}", + "vc": "\\bs{c}", + "vzero": "\\bs{0}", + "xstar": "\\bs{x}^*", + "lambdastar": "\\bs{\\lambda}^*", + "mustar": "\\bs{\\mu}^*", + "gtilde": "\\tilde{\\vg}", + "htilde": "\\tilde{\\vh}", + } }, - "tex2jax": { - "inlineMath": [["$", "$"], [r"\(", r"\)"]], - }, - "jax": ["input/TeX", "output/HTML-CSS"], - "displayAlign": "left", } +autodoc_member_order = "bysource" + source_suffix = [".ipynb", ".md"] myst_enable_extensions = [ @@ -73,6 +82,10 @@ "dollarmath", ] +# For adding implicit referencing, see: +# https://myst-parser.readthedocs.io/en/latest/syntax/cross-referencing.html#implicit-targets +myst_heading_anchors = 6 + nb_execution_mode = "force" nb_execution_allow_errors = False nb_merge_streams = True @@ -125,7 +138,6 @@ # intersphinx maps intersphinx_mapping = { - # "python": ("https://docs.python.org/3", None), "python": ("https://python.readthedocs.io/en/latest", 
None),
     "numpy": ("https://numpy.org/doc/stable", None),
     "torch": ("https://pytorch.org/docs/stable/", None),
diff --git a/docs/source/constrained_optimization.md b/docs/source/constrained_optimization.md
new file mode 100644
index 00000000..0047b9f1
--- /dev/null
+++ b/docs/source/constrained_optimization.md
@@ -0,0 +1,114 @@
+(overview)=
+
+# Overview of Constrained Optimization
+
+## Constrained Minimization Problems
+
+We consider constrained optimization problems expressed as:
+
+$$
+\min_{\vx \in \reals^d} & \,\, f(\vx) \\ \text{s.t. }
+& \,\, \vg(\vx) \le \vzero \\ & \,\, \vh(\vx) = \vzero
+$$
+
+:::{admonition} Conventions and terminology
+
+- We refer to $f$ as the **loss** or **objective** to be minimized.
+- We adopt the convention $\vg(\vx) \le \vzero$ for **inequality constraints** and $\vh(\vx) = \vzero$ for **equality constraints**. If your constraints are different, for example $\vc(\vx) \ge \epsilon$, you should provide **Cooper** with $\vg(\vx) = \epsilon - \vc(\vx) \le \vzero$.
+:::
+
+:::{warning}
+We use the term constraint violation to refer to both $\vg(\vx)$ and $\vh(\vx)$. For equality constraints, $\vh(\vx)$ is satisfied only when its violation is zero, i.e., $\vh(\vx) = \vzero$. For inequality constraints, a negative violation of $\vg(\vx)$ indicates the constraint is strictly satisfied (i.e., $\vg(\vx) < \vzero$), whereas a positive violation indicates the constraint is violated (i.e., $\vg(\vx) > \vzero$).
+
+Note that we still refer to $\vg(\vx)$ and $\vh(\vx)$ as "violations" even when the constraints are satisfied. This differs from the convention in some of the optimization literature, which uses the term "violation" to refer to the amount by which a constraint is violated (i.e., $\max\{\vzero, \vg(\vx)\}$ for inequality constraints and $|\vh(\vx)|$ for equality constraints).
+:::
+
+We group together all the inequality constraints in $\vg$, and all the equality constraints in $\vh$.
+In other words, $f$ is a scalar-valued function, whereas $\vg$ and $\vh$ are vector-valued functions of $\vx$.
+A component function $h_i(\vx)$ corresponds to the scalar constraint
+$h_i(\vx) = 0$.
+
+## The Lagrangian Approach
+
+An approach for solving general nonconvex constrained optimization problems is to formulate their Lagrangian and find a min-max point:
+
+$$
+\xstar, \lambdastar, \mustar = \argmin{\vx \in \reals^d} \, \, \argmax{\vlambda \ge \vzero, \vmu} \, \, \mathcal{L}(\vx, \vlambda, \vmu)
+$$
+
+where $\mathcal{L}(\vx, \vlambda, \vmu) = f(\vx) + \vlambda^\top \vg(\vx) + \vmu^\top \vh(\vx)$ is the Lagrangian function associated with the constrained minimization problem. $\vlambda \geq \vzero$ and $\vmu$ are the Lagrange multipliers associated with the inequality and equality constraints, respectively.
+We refer to $\vx$ as the **primal variables** of the CMP, and $\vlambda$ and $\vmu$ as the **dual variables**.
+
+:::{note}
+$\mathcal{L}(\vx, \vlambda, \vmu)$ is a concave function of $\vlambda$ and $\vmu$ regardless of the convexity properties of $f$, $\vg$, and $\vh$.
+:::
+
+An argmin-argmax point of the Lagrangian corresponds to a solution of the original CMP {cite:p}`boyd2004convex`. We refer to finding such a point as the **Lagrangian approach** to solving a constrained minimization problem. **Cooper** is primarily designed to solve constrained optimization problems using the Lagrangian approach, and it also implements alternative formulations such as the {py:class}`~cooper.formulation.AugmentedLagrangianFormulation` (see {doc}`formulations`). 
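+
+For fixed primal and dual variables, evaluating the Lagrangian amounts to adding the multiplier-weighted constraint violations to the loss. The following standalone sketch illustrates this computation in PyTorch; it is an illustration of the math above, not **Cooper**'s internal implementation, and the toy `loss_fn`, `g_fn`, and `h_fn` stand in for $f$, $\vg$, and $\vh$:
+
+```python
+import torch
+
+def lagrangian(x, lam, mu, loss_fn, g_fn, h_fn):
+    # L(x, lam, mu) = f(x) + lam^T g(x) + mu^T h(x), with lam >= 0.
+    return loss_fn(x) + lam @ g_fn(x) + mu @ h_fn(x)
+
+x = torch.randn(5, requires_grad=True)  # primal variables
+lam = torch.ones(2)                     # multipliers for g(x) <= 0
+mu = torch.ones(1)                      # multipliers for h(x) = 0
+
+L = lagrangian(
+    x, lam, mu,
+    loss_fn=lambda x: (x ** 2).sum(),   # toy objective
+    g_fn=lambda x: x[:2] - 1.0,         # toy inequality constraints
+    h_fn=lambda x: x[2:3],              # toy equality constraint
+)
+L.backward()  # x.grad now holds the primal gradient of the Lagrangian
+```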
+
+:::{admonition} Why does **Cooper** use the Lagrangian approach?
+**Cooper** is designed for solving constrained optimization problems that arise in deep learning applications. These problems are often **nonconvex** and **high-dimensional**, and may require **estimating constraints stochastically** from mini-batches of data. The Lagrangian approach is well-suited to these problems for several reasons:
+
+- **Nonconvexity**. The Lagrangian approach does not require the loss or constraints to be convex or follow a specific structure, making it applicable to general nonconvex problems.
+
+- **Scalability**. First-order optimization methods, such as gradient descent-ascent, can be used to find min-max points of the Lagrangian. These methods are well-supported by automatic differentiation frameworks such as PyTorch and scale to high-dimensional problems.
+  Moreover, the overhead (relative to unconstrained minimization) of storing and updating the Lagrange multipliers is generally negligible in deep learning problems, where the computational cost of calculating the loss, constraints, and their gradients represents the main bottleneck.
+
+- **Stochastic estimates of the constraints**. Gradient-based methods can utilize stochastic estimates of the loss and constraints, making them applicable to problems where computing the exact loss and constraints is prohibitively expensive.
+:::
+
+:::{warning}
+**Cooper** is primarily designed for **nonconvex** constrained optimization problems that arise in many deep learning applications. While the techniques implemented in **Cooper** are applicable to convex problems as well, we recommend using specialized solvers for convex optimization problems whenever possible.
+:::
+
+## Min-max Optimization
+
+A simple approach for finding min-max points of the Lagrangian is to perform gradient descent on the primal variables and gradient ascent on the dual variables. Simultaneous **gradient descent-ascent** has the following updates:
+
+$$
+\vx_{t+1} &= \vx_t - \eta_{\vx} \nabla_{\vx} \mathcal{L}(\vx_t, \vlambda_t, \vmu_t) \\
+\vlambda_{t+1} &= \left [ \vlambda_t + \eta_{\vlambda} \nabla_{\vlambda} \mathcal{L}(\vx_t, \vlambda_t, \vmu_t) \right ]_+ \\
+\vmu_{t+1} &= \vmu_t + \eta_{\vmu} \nabla_{\vmu} \mathcal{L}(\vx_t, \vlambda_t, \vmu_t)
+$$
+
+where $\eta_{\vx}, \eta_{\vlambda}, \eta_{\vmu}$ are the step sizes for the primal and dual variables. The projection operator $[\cdot]_+$ ensures that the dual variables associated with the inequality constraints remain non-negative.
+
+Plugging in the gradients of the Lagrangian, we get the following updates:
+
+$$
+\vx_{t+1} &= \vx_t - \eta_{\vx} \left [ \nabla_{\vx} f(\vx_t) + \vlambda_t^\top \nabla_{\vx} \vg(\vx_t) + \vmu_t^\top \nabla_{\vx} \vh(\vx_t) \right ] \\
+\vlambda_{t+1} &= \left [ \vlambda_t + \eta_{\vlambda} \vg(\vx_t) \right ]_+ \\
+\vmu_{t+1} &= \vmu_t + \eta_{\vmu} \vh(\vx_t)
+$$
+
+The primal updates follow a linear combination of the gradients of the loss and constraints, with the coefficients corresponding to the Lagrange multipliers. Larger values of a Lagrange multiplier result in a stronger influence of the corresponding constraint on the primal updates, promoting feasibility. Conversely, smaller values (or zero) reduce the influence of the constraint, prioritizing loss reduction.
+
+The dual updates accumulate the constraint violations. 
Together with the primal updates, this mechanism drives the iterates towards satisfying the constraints:
+- **Inequality constraints**: When a constraint is violated ($\vg(\vx) > \vzero$), the corresponding Lagrange multiplier increases to penalize the violation. If the constraint is strictly satisfied ($\vg(\vx) < \vzero$), the multiplier decreases, allowing the focus to shift toward loss reduction.
+- **Equality constraints**: For a positive (resp. negative) violation, the Lagrange multiplier increases (resp. decreases). The multiplier stabilizes when the constraint is satisfied ($\vh(\vx) = \vzero$).
+
+**Cooper** leverages PyTorch's automatic differentiation framework to efficiently perform gradient-based optimization of the Lagrangian.
+**Cooper** supports simultaneous gradient descent-ascent, as well as other variants like alternating gradient descent-ascent and the {py:class}`~cooper.optim.Extragradient` method {cite:p}`korpelevich1976extragradient`.
+
+With **Cooper**, you can specify {py:class}`~torch.optim.Optimizer` objects for the primal and dual updates (see {doc}`optim`), allowing you to apply familiar optimization techniques such as Adam, just as you would when training deep neural networks.
+
+(proxy)=
+## Non-differentiable Constraints
+
+{cite:t}`cotter2019proxy` introduce the concept of **proxy constraints** to address problems with non-differentiable constraints. In these cases, the gradient of the Lagrangian with respect to the primal variables cannot be computed, making standard gradient descent-ascent updates inadmissible.
+
+Proxy constraints allow for considering a differentiable surrogate of the constraint when updating the primal variables, while still using the original non-differentiable constraint for updating the dual variables. This approach enables the use of gradient-based optimization methods for problems with non-differentiable constraints, **while still ensuring that the original non-differentiable constraints are satisfied**.
+
+Formally, the optimization problem becomes:
+
+$$
+\xstar &\in \argmin{\vx \in \reals^d} \, \, f(\vx) + [\lambdastar]^\top \gtilde(\vx) + [\mustar]^\top \htilde(\vx) \\
+\lambdastar, \mustar &\in \argmax{\vlambda \ge \vzero, \vmu} \, \, f(\xstar) + \vlambda^\top \vg(\xstar) + \vmu^\top \vh(\xstar)
+$$
+
+where $\vg(\vx) \le \vzero$ and $\vh(\vx) = \vzero$ are the non-differentiable constraints of the problem, and $\gtilde(\vx) \le \vzero$ and $\htilde(\vx) = \vzero$ are differentiable surrogates of $\vg(\vx)$ and $\vh(\vx)$, respectively.
+
+The proxy constraints problem can be solved with the same gradient descent-ascent updates as before, but using the differentiable surrogates $\gtilde(\vx)$ and $\htilde(\vx)$ for the primal updates, and the original non-differentiable constraints $\vg(\vx)$ and $\vh(\vx)$ for the dual updates.
+
+**Cooper** supports proxy constraints when a `strict_violation` is provided in the {py:class}`~cooper.constraints.ConstraintState`. Here, `strict_violation` corresponds to the violation of the original non-differentiable constraint, while `violation` represents the violation of the differentiable surrogate.
diff --git a/docs/source/constrained_optimizer.md b/docs/source/constrained_optimizer.md
deleted file mode 100644
index 5b3ced5c..00000000
--- a/docs/source/constrained_optimizer.md
+++ /dev/null
@@ -1,221 +0,0 @@
-# Constrained Optimizer
-
-```{eval-rst}
-.. currentmodule:: cooper.optim.constrained_optimizers.constrained_optimizer
-```
-
-## `ConstrainedOptimizer` Class
-
-```{eval-rst}
-.. 
autoclass:: ConstrainedOptimizer - :members: -``` - -## How to use a `ConstrainedOptimizer` - -The {py:class}`ConstrainedOptimizer` class is the cornerstone of **Cooper**. A -{py:class}`ConstrainedOptimizer` performs parameter updates to solve a -{py:class}`~cooper.problem.ConstrainedMinimizationProblem` given a chosen -{py:class}`~cooper.formulation.Formulation`. - -A `ConstrainedOptimizer` wraps a {py:class}`torch.optim.Optimizer` -used for updating the "primal" parameters associated directly with the -optimization problem. These might be, for example, the parameters of the model -you are training. - -Additionally, a `ConstrainedOptimizer` includes a second -{py:class}`torch.optim.Optimizer`, which performs updates on the "dual" -parameters (e.g. the multipliers used in a -{py:class}`~cooper.formulation.LagrangianFormulation`). - -### Construction - -The main ingredients to build a `ConstrainedOptimizer` are a -{py:class}`~cooper.formulation.Formulation` (associated with a -{py:class}`~cooper.problem.ConstrainedMinimizationProblem`) and a -{py:class}`torch.optim.Optimizer` corresponding to a `primal_optimizer`. - -:::{note} -**Cooper** supports the use of multiple `primal_optimizers`, each -corresponding to different groups of primal variables. The -`primal_optimizers` argument accepts a single optimizer, or a list -of optimizers. See {ref}`multiple-primal_optimizers`. -::: - -If the `ConstrainedMinimizationProblem` you are dealing with is in fact -constrained, depending on your formulation, you might also need to provide a -`dual_optimizer`. Check out the section on {ref}`partial_optimizer_instantiation` -for more details on defining `dual_optimizer`s. - -:::{note} -**Cooper** includes extra-gradient implementations of SGD and Adam which can -be used as primal or dual optimizers. See {ref}`extra-gradient_optimizers`. -::: - -#### Examples - -The highlighted lines below show the small changes required to go from an -unconstrained to a constrained problem. Note that these changes should also be -accompanied with edits to the custom problem class which inherits from -{py:class}`~cooper.problem.ConstrainedMinimizationProblem`. More details on -the definition of a CMP can be found under the entry for {ref}`cmp`. - -- **Unconstrained problem** - - > ```{code-block} python - > :linenos: true - > - > model = ModelClass(...) - > cmp = cooper.ConstrainedMinimizationProblem() - > formulation = cooper.formulation.Formulation(...) - > - > primal_optimizer = torch.optim.Adam(model.parameters(), lr=1e-2) - > - > constrained_optimizer = cooper.UnconstrainedOptimizer( - > formulation=formulation, - > primal_optimizers=primal_optimizer, - > ) - > ``` - -- **Constrained problem** - - > ```{code-block} python - > :emphasize-lines: 7,9,12 - > :linenos: true - > - > model = ModelClass(...) - > cmp = cooper.ConstrainedMinimizationProblem() - > formulation = cooper.formulation.Formulation(...) 
- > - > primal_optimizer = torch.optim.Adam(model.parameters(), lr=1e-2) - > # Note that dual_optimizer is "partly instantiated", *without* parameters - > dual_optimizer = cooper.optim.partial_optimizer(torch.optim.SGD, lr=1e-3, momentum=0.9) - > - > constrained_optimizer = cooper.SimultaneousOptimizer( - > formulation=formulation, - > primal_optimizers=primal_optimizer, - > dual_optimizer=dual_optimizer, - > ) - > ``` - -### The training loop - -We have gathered all the ingredients we need for tackling our CMP: the -custom {py:class}`~cooper.problem.ConstrainedMinimizationProblem` class, along -with your {py:class}`ConstrainedOptimizer` of choice and a -{py:class}`ConstrainedOptimizer` for updating the parameters. Now it is time to -put them to good use. - -The typical training loop for solving a CMP in a machine learning set up using -**Cooper** (with a {ref}`Lagrangian Formulation`) -will involve the following steps: - -:::{admonition} Overview of main steps in a training loop -:class: hint - -1. (Optional) Iterate over your dataset and sample of mini-batch. -2. Call {py:meth}`constrained_optimizer.zero_grad()` to reset the parameters' gradients -3. Compute the current {py:class}`CMPState` (or estimate it with the minibatch) and calculate the Lagrangian using {py:meth}`formulation.compute_lagrangian(cmp.closure, ...)`. -4. Populate the primal and dual gradients with {py:meth}`formulation.backward(lagrangian)` -5. Perform updates on the parameters using the primal and dual optimizers based on the recently computed gradients, via a call to {py:meth}`constrained_optimizer.step()`. -::: - -#### Example - -> ```{code-block} python -> :linenos: true -> -> model = ModelClass(...) -> cmp = cooper.ConstrainedMinimizationProblem(...) -> formulation = cooper.LagrangianFormulation(...) -> -> primal_optimizer = torch.optim.SGD(model.parameters(), lr=primal_lr) -> # Note that dual_optimizer is "partly instantiated", *without* parameters -> dual_optimizer = cooper.optim.partial_optimizer(torch.optim.SGD, lr=primal_lr) -> -> constrained_optimizer = cooper.ConstrainedOptimizer( -> formulation=formulation, -> primal_optimizers=primal_optimizer, -> dual_optimizer=dual_optimizer, -> ) -> -> for inputs, targets in dataset: -> # Clear gradient buffers -> constrained_optimizer.zero_grad() -> -> # The closure is required to compute the Lagrangian -> # The closure might in turn require the model, inputs, targets, etc. -> lagrangian = formulation.compute_lagrangian(cmp.closure, ...) -> -> # Populate the primal and dual gradients -> formulation.backward(lagrangian) -> -> # Perform primal and dual parameter updates -> constrained_optimizer.step() -> ``` - -(basic-parameter-updates)= - -### Parameter updates - -By default, parameter updates are performed using **simultaneous** gradient -descent-ascent updates (according to the choice of primal and dual optimizers). -Formally, - -$$ -x_{t+1} &= \texttt{primal_optimizer_update} \left( x_{t}, \nabla_{x} \mathcal{L}(x, \lambda_t)|_{x=x_t} \right)\\ -\lambda_{t+1} &= \texttt{dual_optimizer_update} \left( \lambda_{t}, {\color{red} \mathbf{-}} \nabla_{\lambda} \mathcal{L}({x_{t}}, \lambda)|_{\lambda=\lambda_t} \right) -$$ - -:::{note} -We explicitly include a negative sign in front of the gradient for -$\lambda$ in order to highlight the fact that $\lambda$ solves -**maximization** problem. **Cooper** handles the sign flipping internally, so -you should provide your definition for a `dual_optimizer` using a non-negative -learning rate, as usual! 
-::: - -:::{admonition} Multiplier projection -:class: note - -Lagrange multipliers associated with inequality constraints should remain -non-negative. **Cooper** executes the standard projection to -$\mathbb{R}^{+}$ by default for -{py:class}`~cooper.optim.multipliers.DenseMultiplier`s. For more details -on using custom projection operations, see the section on {ref}`multipliers`. -::: - -Other update strategies supported by {py:class}`~ConstrainedOptimizer` include: - -- {ref}`Alternating updates` for (projected) gradient descent-ascent - -- The {ref}`Augmented Lagrangian` method (ALM) - -- Using {ref}`Extra-gradient` - - - Extra-gradient-based optimizers require an extra call to the - {py:meth}`cmp.closure()`. - See the section on {ref}`extra-gradient_optimizers` for usage details. - -The `ConstrainedOptimizer` implements a {py:meth}`ConstrainedOptimizer.step` -method, that updates the primal and dual parameters (if `Formulation` has any). -The nature of the update depends on the attributes provided during the -initialization of the `ConstrainedOptimizer`. By default, updates are via -gradient descent on the primal parameters and (projected) ascent -on the dual parameters, with simultaneous updates. - -:::{note} -When applied to an unconstrained problem, {py:meth}`ConstrainedOptimizer.step` -will be equivalent to performing `optimizer.step()` on all of the -`primal_optimizers` based on the gradient of the loss with respect to the -primal parameters. -::: - -```{eval-rst} -.. include:: additional_features.rst -``` - -```{eval-rst} -.. autoclass:: AlternatingPrimalDualOptimizer - :members: -``` diff --git a/docs/source/faq.md b/docs/source/faq.md new file mode 100644 index 00000000..86b8f221 --- /dev/null +++ b/docs/source/faq.md @@ -0,0 +1,243 @@ +# FAQ + +TODO: emojis? + +How can I tell if Cooper found a good solution? + As a reference, consider the solution of the unconstrained problem, which is a lower bound on the solution to the constrained problem + Nuance with the fact that you may not actually solve the problem in the nonconvex case +Primal optimization pipeline + Tune with unconstrained +How to choose dual lr + 1e-3 to start + If dual lr is Larger, pushing for feasibility faster. + Relationship between mini-batch size, and the relative frequency of multiplier updates. +Noise + What is noise? Constraints are estimated stochastically + Also makes it tricky to determine if you are feasible. + Difficult to achieve feasibility + Consider evaluating the constraints at the epoch level/averaging out constraints + Increase batch size + Variance reduction + + +**What are common pitfalls when implementing a CMP?** + +> * Make sure your constraints comply with **Cooper**'s convention $g(\boldsymbol{x}) \leq 0$ for inequality constraints and $h(x) = 0$ for equality constraints. If you have a greater than or equal constraint $g(\boldsymbol{x}) \geq 0$, you should provide **Cooper** with $-g(\boldsymbol{x}) \leq 0$. +> +> * Make sure that the tensors corresponding to the loss and constraints have gradients. Avoid "creating **new** tensors" for packing multiple constraints in a single tensor as this could block gradient backpropagation: do not use `torch.tensor([g1, g2, ...])`; instead, use `torch.cat([g1, g2, ...])`. You can use the {py:meth}`~cooper.ConstrainedMinimizationProblem.sanity_check_cmp_state` to check this. +> +> * For efficiency, we suggest reusing as much of the computational graph as possible between loss and the constraints. 
For example, if both depend on the outputs of a neural network, we recommend performing a single forward pass and reusing the computed outputs for both the loss and the constraints.
+
+**What types of problems can I solve with Cooper?**
+
+> **Cooper** is aimed at non-convex constrained optimization problems, including problems where the loss and constraints are estimated stochastically from mini-batches. The objective and constraints should be differentiable with autograd; non-differentiable constraints are supported via differentiable surrogates (proxy constraints). For convex problems or problems with special structure, we recommend using specialized solvers instead.
+
+ + Where can I get help with Cooper? + +
+  You can ask questions and get help on our [Discord server](https://discord.gg/Aq5PjH8m6E).
+
+ +
+ + Where can I learn more about constrained optimization? + +
+  You can find more on convex constrained optimization in *Convex Optimization* by Boyd and Vandenberghe {cite:p}`boyd2004convex`.
+  For non-convex constrained optimization, you can check out *Nonlinear Programming* by Bertsekas {cite:p}`bertsekas1999NonlinearProgramming`.
+
+ +### Formulations + +
+ + What problem formulations does Cooper support? + +
+  Cooper currently supports the Lagrangian and the Augmented Lagrangian formulations. See the {doc}`formulations` section for details.
+
+
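+For example, attaching the Lagrangian formulation to a constraint looks as follows (a sketch mirroring the {doc}`formulations` section):
+
+```python
+import cooper
+
+multiplier = cooper.multipliers.DenseMultiplier(num_constraints=1)
+constraint = cooper.Constraint(
+    constraint_type=cooper.ConstraintType.INEQUALITY,
+    multiplier=multiplier,
+    formulation_type=cooper.LagrangianFormulation,  # the default formulation
+)
+```
+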
+ +### Optimizers + +
+ + What is a good configuration for the primal optimizer? + +
+  You can use whichever optimizer you prefer for your task (e.g., SGD or Adam). A practical recipe is to first tune your optimization pipeline on the unconstrained version of the problem, and then reuse that configuration when solving the constrained problem.
+
+ +
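+For example (a sketch, with a hypothetical `model`), reusing a configuration tuned on the unconstrained problem:
+
+```python
+import torch
+
+# Reuse the optimizer and hyperparameters that work well without constraints.
+primal_optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
+```
+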
+ + What is a good configuration for the dual optimizer? + +
+  For the dual optimizer, we recommend starting with SGD and a dual learning rate of around 1e-3. Larger dual learning rates push for feasibility faster, at the cost of less stable dynamics. With mini-batch estimates of the constraints, also keep in mind the relationship between the batch size and the relative frequency of the multiplier updates. If the dual learning rate is difficult to tune, or if the Lagrange multipliers present oscillations, we recommend using nuPI.
+
+ +
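+A reasonable starting configuration (a sketch, assuming a CMP object `cmp` that exposes its multiplier parameters via `dual_parameters()`):
+
+```python
+import torch
+
+# `maximize=True` because the multipliers perform *ascent* on the Lagrangian.
+dual_optimizer = torch.optim.SGD(cmp.dual_parameters(), lr=1e-3, maximize=True)
+```
+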
+ + Which Cooper optimizer should I use? + +
+  Cooper provides a range of CooperOptimizers to choose from. The AlternatingDualPrimalOptimizer is a good starting point. For details, see the {doc}`optim` section.
+
+
+### Debugging and troubleshooting
+
+**Why is my solution not becoming feasible?**
+
+> Start by assessing whether your problem is feasible at all, for instance by inspecting the constraints. Alternatively, you may try to solve a "feasibility problem" obtained by removing the loss (see the sketch below). However, note that determining feasibility for a non-convex constrained optimization problem is intractable in general.
+>
+> Once you have determined that your problem is feasible, monitor the model's progress towards feasibility. If the primal parameters are not moving towards feasibility fast enough, you may need to increase the dual learning rate.
+
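+The following sketch sets up such a feasibility problem by zeroing out the loss, so that only the constraint terms drive the primal updates (`MyCMP` is a hypothetical, user-defined CMP class):
+
+```python
+import torch
+
+class FeasibilityCMP(MyCMP):
+    def compute_cmp_state(self, *args, **kwargs):
+        cmp_state = super().compute_cmp_state(*args, **kwargs)
+        # Replace the loss with a constant zero, leaving only the constraints.
+        cmp_state.loss = torch.zeros(())
+        return cmp_state
+```
+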
+ + Why is my objective function increasing? 😟 + +
+  There are several reasons why this might happen. Some increase is expected when starting from an infeasible point: as the multipliers grow, the primal updates trade off loss in order to reduce the constraint violations. If the increase is excessive or accompanied by oscillations, the most common cause is a dual learning rate that is too high; try reducing it.
+
+
+
+**How can I tell if Cooper found a "good" solution?**
+
+> As a reference, consider the solution of the unconstrained problem: its optimal value is a lower bound on the optimal value of the constrained problem, so the gap between the two indicates the "price" paid for feasibility. Keep in mind, however, that for non-convex problems neither the unconstrained nor the constrained problem is typically solved to global optimality, so this comparison is only indicative. Always check that the constraints are (approximately) satisfied at the candidate solution.
+
+
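+In practice, this comparison can be as simple as the following sketch (`evaluate_loss` is a hypothetical helper returning the average loss over a data loader):
+
+```python
+unconstrained_loss = evaluate_loss(unconstrained_model, val_loader)
+constrained_loss = evaluate_loss(constrained_model, val_loader)
+
+# The gap is the "price" paid for (approximately) satisfying the constraints.
+print(f"Loss gap vs. unconstrained reference: {constrained_loss - unconstrained_loss:.4f}")
+```
+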
+ + What quantities should I log for sanity-checking? + +
+ Log the loss, the constraint violations, the multiplier values, and the Lagrangian. +
+
+ +
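+A sketch of such logging, assuming a CMP holding a single constraint in a `constraint` attribute with a {py:class}`~cooper.multipliers.DenseMultiplier` (names are hypothetical):
+
+```python
+cmp_state = cmp.compute_cmp_state(model, inputs, targets)
+violation = cmp_state.observed_constraints[cmp.constraint].violation
+
+metrics = {
+    "loss": cmp_state.loss.item(),
+    "max_violation": violation.max().item(),
+    # Calling a DenseMultiplier returns its current values.
+    "multipliers": cmp.constraint.multiplier().detach().cpu(),
+}
+```
+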
+ + What do typical multiplier dynamics look like? + +
+  At a solution, complementary slackness should approximately hold: multipliers associated with strictly satisfied inequality constraints are driven to zero, while multipliers of active constraints settle at non-negative values. During training, a multiplier typically grows while its constraint is violated and shrinks once the constraint becomes satisfied; flattening multiplier curves are a sign that the dual dynamics are stabilizing.
+
+
+**What should I do if my Lagrange multipliers diverge?**
+> * Start by ensuring that your problem is feasible: for infeasible problems, the Lagrange multipliers of violated constraints grow without bound.
+> * Normally, the growth in the Lagrange multipliers (due to the accumulation of the violation) is accompanied by a "response" from the primal parameters moving towards feasibility. A lack of primal response could be due to the primal learning rate being too low.
+> * Having tuned the primal learning rate, a persistent lack of primal response could indicate (i) that your problem is infeasible, or (ii) that the constraint gradients are vanishing (impeding movement towards feasibility). In situation (ii), you may attempt to reformulate the constraints to avoid the vanishing gradients.
+
+ + What should I do if my Lagrange multipliers oscillate too much? + +
+  Oscillations are a common side effect of gradient descent-ascent dynamics on the Lagrangian. Try reducing the dual learning rate, or use the nuPI optimizer for the dual variables, which is designed to dampen multiplier oscillations.
+
+ +
+ + What should I do if my Lagrange multipliers are too noisy? + +
+  Noisy multipliers usually stem from stochastic (e.g., mini-batch) estimates of the constraints. This noise also makes it harder to assess feasibility, and to achieve it. Consider evaluating the constraints at the epoch level or averaging the violations across batches, increasing the batch size, or applying variance reduction techniques. A smaller dual learning rate also smooths the multiplier dynamics.
+
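+For example, you can average the observed violations over an epoch before judging feasibility (a sketch, with hypothetical names):
+
+```python
+violation_sum, num_batches = 0.0, 0
+for inputs, targets in val_loader:
+    cmp_state = cmp.compute_cmp_state(model, inputs, targets)
+    violation_sum = violation_sum + cmp_state.observed_constraints[cmp.constraint].violation.detach()
+    num_batches += 1
+
+avg_violation = violation_sum / num_batches  # a less noisy feasibility estimate
+```
+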
+ +### Computational considerations + +
+ + Is Cooper computationally expensive? + +
+  The overhead on top of unconstrained training is typically modest: the multiplier updates themselves are cheap, and the dominant extra costs are evaluating the constraints and backpropagating through them. Note that some CooperOptimizers (e.g., alternating or extra-gradient optimizers) evaluate the loss and/or constraints more than once per step.
+
+ + +
+ + Does Cooper support GPU acceleration? + +
+  Yes. Cooper is built on PyTorch, so computations take place on whichever device your tensors live on. Remember to place the multipliers on the appropriate device, e.g., via the `device` argument of {py:class}`~cooper.multipliers.DenseMultiplier`.
+
+ +
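+For example (a sketch, with a hypothetical `model`):
+
+```python
+import cooper
+
+device = "cuda"  # or "cpu"
+model = model.to(device)
+multiplier = cooper.multipliers.DenseMultiplier(num_constraints=10, device=device)
+```
+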
+ + Does Cooper support DDP execution? + +
+ Answer here. +
+
+ +
+ + Does Cooper support AMP? + +
+ Answer here. +
+
+ +
+ + What if my problem has a lot of constraints? + +
+  When the number of constraints is very large, evaluating all of them (and updating all of their multipliers) at every step may be impractical. You can instead observe a subset of the constraints at each step and use {py:class}`~cooper.multipliers.IndexedMultiplier`s, which only update the multipliers of the observed constraints. Alternatively, {py:class}`~cooper.multipliers.ImplicitMultiplier`s allow you to parameterize the multipliers with a model (e.g., a neural network) instead of storing one value per constraint.
+
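+A sketch of observing only a subset of the constraints at each step (the constructor arguments and tensor names are assumptions for illustration):
+
+```python
+import torch
+import cooper
+
+# A multiplier that supports indexing into a large set of constraints.
+multiplier = cooper.multipliers.IndexedMultiplier(num_constraints=10_000)
+
+# At each step, evaluate and report only the sampled constraints.
+observed_idx = torch.randint(0, 10_000, (128,))
+constraint_state = cooper.ConstraintState(
+    violation=violations[observed_idx],  # violations of the sampled constraints only
+    constraint_features=observed_idx,    # their indices
+)
+```
+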
+ +### Advanced topics + + +### Miscellaneous + +
+ + How do I cite Cooper? + +
+ Answer here. +
+
+ +
+ + Is there a JAX version of Cooper? + +
+ Answer here. +
+
+ +
+ + Is there a TensorFlow version of Cooper? + +
+  There is currently no TensorFlow version of Cooper. [TensorFlow Constrained Optimization (TFCO)](https://github.com/google-research/tensorflow_constrained_optimization) is a good alternative.
+
+
diff --git a/docs/source/formulations.md b/docs/source/formulations.md
new file mode 100644
index 00000000..5d7d5f9b
--- /dev/null
+++ b/docs/source/formulations.md
@@ -0,0 +1,162 @@
+(formulations)=
+
+```{eval-rst}
+.. currentmodule:: cooper.formulations
+```
+
+# Formulations
+
+Once equipped with a {ref}`constrained minimization problem (CMP)`, several algorithmic approaches can be adopted to find a solution. This proceeds in two stages: the **formulation** of the optimization problem, and the choice of the **optimization algorithm** to solve it.
+
+This section focuses on formulations of the CMP. {ref}`Here` we discuss the algorithms for solving the formulated problem (e.g., simultaneous gradient descent-ascent).
+
+The formulations supported by **Cooper** are of the form:
+
+$$
+\min_{\vx \in \reals^d} \,\, \max_{\vlambda \ge \vzero, \vmu} \,\, f(\vx) + P(\vg(\vx), \vlambda, \vc_{\vg}) + Q(\vh(\vx), \vmu, \vc_{\vh}),
+$$
+
+where $P$ and $Q$ are penalty functions aimed at enforcing the satisfaction of the constraints, with parameters $\vlambda$ and $\vmu$, and hyper-parameters $\vc_{\vg}$ and $\vc_{\vh}$.
+
+
+:::{warning}
+**Cooper**'s framework for formulations supports a wide range of approaches for solving constrained optimization problems, including:
+- The Lagrangian (with $\vlambda$ and $\vmu$ as Lagrange multipliers)
+- The Augmented Lagrangian (with $\vc_{\vg}$ and $\vc_{\vh}$ as penalty coefficients)
+- Penalty methods (e.g., the quadratic penalty method; not currently implemented)
+- Interior-point methods (not currently implemented)
+
+However, this framework is not exhaustive, and formulations such as Sequential Quadratic Programming (SQP) are not supported in **Cooper**.
+:::
+
+To specify your formulation of choice, pass the corresponding class to the `formulation_type` argument of the {py:class}`~cooper.constraints.Constraint` class:
+
+```python
+my_constraint = cooper.Constraint(
+    constraint_type=cooper.ConstraintType.INEQUALITY,
+    multiplier=multiplier,
+    formulation_type=cooper.LagrangianFormulation,
+)
+```
+
+:::{note}
+**Cooper** offers flexibility in that a single CMP can be solved using *different* formulations. Crucially, the choice of formulation is tied to the **constraints**, rather than the CMP itself. By specifying different {py:class}`~cooper.constraints.Constraint`s within a CMP, you can apply a different formulation to each.
+:::
+
+## Lagrangian Formulations
+
+The **Lagrangian** formulation of a CMP is:
+
+$$
+\min_{\vx \in \reals^d} \,\, \max_{\vlambda \ge \vzero, \vmu} \,\, \mathcal{L}(\vx, \vlambda, \vmu) = f(\vx) + \vlambda^\top \vg(\vx) + \vmu^\top \vh(\vx),
+$$
+
+where $\vlambda \geq \vzero$ and $\vmu$ are the Lagrange multipliers or **dual variables** associated with the inequality and equality constraints, respectively.
+
+
+:::{warning}
+There is no guarantee that a general non-convex constrained optimization problem admits optimal Lagrange multipliers at its solution $\xstar$. In such cases, searching for $\xstar, \lambdastar, \mustar$ as an argmin-argmax point of the Lagrangian is a futile approach to solving the problem, since $\lambdastar$ and $\mustar$ do not exist.

+See {cite:t}`boyd2004convex` for conditions under which Lagrange multipliers are guaranteed to exist.
+:::
+
+
+```{eval-rst}
+.. 
autoclass:: LagrangianFormulation
+   :members:
+```
+
+(augmented-lagrangian-formulation)=
+
+## Augmented Lagrangian Formulation
+
+The Augmented Lagrangian function is a modification of the Lagrangian function that includes a quadratic penalty term on the constraints:
+
+$$
+\mathcal{L}_{c}(\vx, \vlambda, \vmu) = f(\vx) + \vlambda^\top \vg(\vx) + \vmu^\top \vh(\vx) + \frac{c}{2} ||\max\{\vzero, \vg(\vx)\}||^2 + \frac{c}{2} ||\vh(\vx)||^2,
+$$
+
+where $c > 0$ is a penalty coefficient.
+
+The main advantage of the Augmented Lagrangian method (ALM) compared to the quadratic penalty method
+(see $\S$ 4.2.1 in {cite:p}`bertsekas1999NonlinearProgramming`) is that,
+under some reasonable assumptions, the algorithm can be successful without
+requiring the unbounded increase of the penalty parameter sequence $c^t$.
+The use of explicit estimates for the Lagrange multipliers contributes to
+avoiding the ill-conditioning that is inherent in the quadratic penalty method.
+
+See $\S$ 4.2.1 in {cite:p}`bertsekas1999NonlinearProgramming` and
+$\S$ 17 in {cite:p}`nocedal2006NumericalOptimization` for a comprehensive
+treatment of the Augmented Lagrangian method.
+
+
+To use the Augmented Lagrangian formulation in **Cooper**, first define a penalty coefficient (see {ref}`multipliers` for details):
+
+```python
+from cooper.multipliers import DensePenaltyCoefficient
+
+penalty_coefficient = DensePenaltyCoefficient(init=1.0)
+```
+
+Then, pass the {py:class}`~cooper.formulations.AugmentedLagrangianFormulation` class and the `penalty_coefficient` to the {py:class}`~cooper.constraints.Constraint` constructor:
+
+```python
+my_constraint = cooper.Constraint(
+    constraint_type=cooper.ConstraintType.INEQUALITY,
+    multiplier=multiplier,
+    formulation_type=cooper.AugmentedLagrangianFormulation,
+    penalty_coefficient=penalty_coefficient,
+)
+```
+
+:::{note}
+**Cooper** also allows for having different penalty coefficients for different constraints. This can be achieved by passing a tensor of coefficients to the `init` argument of a {py:class}`~cooper.multipliers.PenaltyCoefficient`.
+:::
+
+**Cooper** also supports the use of a scheduler for the penalty coefficient. For more information, see {ref}`coefficient_updaters`.
+
+:::{warning}
+We make a distinction between the Augmented Lagrangian function/formulation and the Augmented Lagrangian *method* ($\S$ 4.2.1 in {cite:p}`bertsekas1999NonlinearProgramming`). The Augmented Lagrangian method is an optimization algorithm over the Augmented Lagrangian function above:
+
+$$
+\vx_{t+1} &\in \argmin_{\vx \in \reals^d} \,\, \mathcal{L}_{c_t}(\vx, \vlambda_t, \vmu_t) \\
+\vlambda_{t+1} &= \left[\vlambda_t + {\color{red} c_t} \, \vg(\vx_{\color{red} t+1})\right]_+ \\
+\vmu_{t+1} &= \vmu_t + {\color{red} c_t} \, \vh(\vx_{\color{red} t+1}) \\
+c_{t+1} &\ge c_t
+$$
+
+The Augmented Lagrangian method has the following distinguishing features:
+- The minimization with respect to the primal variables $\vx$ is (usually) solved completely or approximately (in contrast to taking a single gradient step).
+- It uses alternating updates (the updated primal iterate $\vx_{t+1}$ is used to update the Lagrange multipliers $\vlambda_{t+1}$).
+- The dual learning rate matches the current value of the penalty coefficient $\eta_{\vlambda} = \eta_{\vmu} = c_t$. 
+ +If you are interested in using the Augmented Lagrangian method in **Cooper**, use the {py:class}`~cooper.optim.PrimalDualOptimizer` constrained optimizer and ensure that the learning rate of the dual variables is linked to the penalty coefficient by doing: + +```python +TODO +``` +::: + + +```{eval-rst} +.. autoclass:: AugmentedLagrangianFormulation + :members: +``` + +## Base Formulation Class + +If you are interested in implementing your own formulation, you can inherit from the {py:class}`~cooper.formulation.Formulation` abstract base class. + +```{eval-rst} +.. autoclass:: Formulation + :members: compute_contribution_to_primal_lagrangian, compute_contribution_to_dual_lagrangian +``` + +## Utils + +```{eval-rst} +.. automodule:: cooper.formulations.utils + :members: +``` diff --git a/docs/source/index.md b/docs/source/index.md index 7fbc9d39..976dc4c4 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -7,6 +7,8 @@ :maxdepth: 2 readme +constrained_optimization +faq notebooks/index ``` @@ -16,7 +18,7 @@ notebooks/index :maxdepth: 2 problem -lagrangian_formulation +formulations constrained_optimizer optim multipliers diff --git a/docs/source/lagrangian_formulation.md b/docs/source/lagrangian_formulation.md deleted file mode 100644 index 3796cc6e..00000000 --- a/docs/source/lagrangian_formulation.md +++ /dev/null @@ -1,167 +0,0 @@ -(lagrangian-formulations)= - -```{eval-rst} -.. currentmodule:: cooper.formulation.lagrangian -``` - -# Lagrangian Formulations - -Once equipped with a {py:class}`~cooper.problem.ConstrainedMinimizationProblem`, -several algorithmic approaches can be adopted for finding an approximation to -the solution of the constrained problem. - -Recall that we consider constrained minimization problems (CMPs) expressed as: - -$$ -\min_{x \in \Omega} & \,\, f(x) \\ -\text{s.t. } & \,\, g(x) \le \mathbf{0} \\ - & \,\, h(x) = \mathbf{0} -$$ - -## Lagrangian Formulation - -The *Lagrangian* problem associated with the CMP above is given by: - -$$ -\min_{x \in \Omega} \max_{{\lambda^g} \ge 0, \, {\lambda^h}} \mathcal{L}(x,\lambda) \triangleq f(x) + {\lambda^g}^{\top} g(x) + {\lambda^h}^{\top} h(x) -$$ - -The vectors ${\lambda^g}$ and ${\lambda^h}$ are called the **Lagrange -multipliers** or **dual variables** associated with the CMP. Observe that -$\mathcal{L}(x,\lambda)$ is a concave function of $\lambda$ regardless -of the convexity properties of $f, g$ and $h$. - -A pair $(x^*,\lambda^*)$ is called a *saddle-point* of -$\mathcal{L}(x,\lambda)$ if for all $(x,\lambda)$, - -$$ -\mathcal{L}(x^*,\lambda) \le \mathcal{L}(x^*,\lambda^*) \le \mathcal{L}(x,\lambda^*). -$$ - -This approach can be interpreted as a zero-sum two-player game, where the -"primal" player $x$ aims to minimize $\mathcal{L}(x,\lambda)$ and -the goal of the "dual" player $\lambda$ is to maximize -$\mathcal{L}(x,\lambda)$ (or equiv. minimize -$-\mathcal{L}(x,\lambda)$). - -Note that the notion of a saddle-point of the Lagrangian is in fact equivalent -to that of a (pure) Nash equilibrium of the zero-sum game. If -$(x^*,\lambda^*)$ is a saddle-point of $\mathcal{L}(x,\lambda)$, -then, by definition, neither of the two players can improve their payoffs by -unilaterally deviating from $(x^*,\lambda^*)$. - -In the context of a convex CMP (convex objectives, constraints and -$\Omega$), given certain technical conditions (e.g. 
[Slater's condition](https://en.wikipedia.org/wiki/Slater%27s_condition) -(see $\S$ 5.2.3 in {cite:p}`boyd2004convex`), or compactness of the domains), -the existence of a pure Nash equilibrium is guaranteed {cite:p}`vonNeumann1928theorie`. - -:::{warning} -A constrained non-convex problem might have an optimal feasible solution, -and yet its Lagrangian might not have a pure Nash equilibrium. See example -in Fig 1. of {cite:t}`cotter2019JMLR`. -::: - -% .. admonition:: Theorem (:math:`\S` 5.2.3, :cite:t:`boyd2004convex`) - -% :class: hint - -% Convex problem + Slater condition :math:`\Rightarrow` Strong duality - -% .. admonition:: Theorem (:math:`\S` 5.4.2, :cite:t:`boyd2004convex`) - -% :class: hint - -% (:math:`x^*, \lambda^*`) primal and dual optimal and strong duality - -% :math:`\Leftrightarrow` (:math:`x^*, \lambda^*`) is a saddle point of the - -% Lagrangian. - -% .. admonition:: Theorem - -% :class: hint - -% Every convex CMP with compact domain (for which strong duality holds) has - -% a Lagrangian for which a saddle point (i.e. pure Nash Equilibrium) exists. - -% (cite von Neumann?) - -```{eval-rst} -.. autoclass:: LagrangianFormulation - :members: - -``` - -```{eval-rst} -.. currentmodule:: cooper.formulation.augmented_lagrangian -``` - -(augmented-lagrangian-formulation)= - -## Augmented Lagrangian Formulation - -The Augmented Lagrangian Method (ALM) considers a `sequence` of unconstrained -minimization problems on the primal variables: - -$$ -L_{c_t}(x, \lambda^t) \triangleq f(x) + \lambda_{g, t}^{\top} \, g(x) + \lambda_{h, t}^{\top} \, h(x) + \frac{c_t}{2} ||g(x) \odot \mathbf{1}_{g(x_t) \ge 0 \vee \lambda_{g, t} > 0}||^2 + \frac{c_t}{2} ||h(x_t)||^2 -$$ - -This problem is (approximately) minimized over the primal variables to obtain: - -$$ -x^{t+1} = \arg \min_{x \in \Omega} \mathcal{L}_{c^t}(x, \lambda^t) -$$ - -The found $x^{t+1}$ is used to update the estimate for the Lagrange multiplier: - -$$ -\begin{align} \lambda_{t+1}^g &= \left[\lambda_t^g + c^t g(x^{t+1}) \right]^+ \\ \lambda_{t+1}^h &= \lambda_t^h + c^t h(x^{t+1}) \end{align} -$$ - -The main advantage of the ALM compared to the quadratic penalty method -(see $\S$ 4.2.1 in {cite:p}`bertsekas1999NonlinearProgramming`) is that -(under some reasonable assumptions), the algorithm can be successful without -requiring the unbounded increase of the penalty parameter sequence $c^t$. -The use of explicit estimates for the Lagrange multipliers contribute to -avoiding the ill-conditioning that is inherent in the quadratic penalty method. - -See $\S$ 4.2.1 in {cite:p}`bertsekas1999NonlinearProgramming` and -$\S$ 17 in {cite:p}`nocedal2006NumericalOptimization` for a comprehensive -treatment of the Augmented Lagrangian method. - -:::{important} -Please visit {ref}`this section` for -practical considerations on using the Augmented Lagrangian method in -**Cooper**. - -In particular, the sequence of penalty coefficients $c_t$ is handled -in **Cooper** as a -{ref}`scheduler on the dual learning rate`. -::: - -```{eval-rst} -.. autoclass:: AugmentedLagrangianFormulation - :members: - - -``` - -```{eval-rst} -.. currentmodule:: cooper.formulation.lagrangian -``` - -## Proxy-Lagrangian Formulation - -```{eval-rst} -.. autoclass:: ProxyLagrangianFormulation - :members: -``` - -## Base Lagrangian Formulation - -```{eval-rst} -.. 
autoclass:: BaseLagrangianFormulation - :members: -``` diff --git a/docs/source/multipliers.md b/docs/source/multipliers.md index b872e69c..c282d2e6 100644 --- a/docs/source/multipliers.md +++ b/docs/source/multipliers.md @@ -1,11 +1,18 @@ (multipliers)= -# Multipliers +# Multipliers and Penalty Coefficients + ```{eval-rst} .. currentmodule:: cooper.multipliers ``` +```{eval-rst} +.. automodule:: cooper.multipliers + :members: +``` + + :::{note} Multipliers are mostly handled internally by the {py:class}`~cooper.formulation.Formulation`s. This handling includes: diff --git a/docs/source/notebooks/plot_gaussian_mixture.md b/docs/source/notebooks/plot_gaussian_mixture.md index c98a02e0..6e711dec 100644 --- a/docs/source/notebooks/plot_gaussian_mixture.md +++ b/docs/source/notebooks/plot_gaussian_mixture.md @@ -5,7 +5,7 @@ jupytext: extension: .md format_name: myst format_version: 0.13 - jupytext_version: 1.16.3 + jupytext_version: 1.16.4 kernelspec: display_name: Python 3 language: python diff --git a/docs/source/notebooks/plot_infrequent_true_constraint.md b/docs/source/notebooks/plot_infrequent_true_constraint.md index 858b5648..2111fdc7 100644 --- a/docs/source/notebooks/plot_infrequent_true_constraint.md +++ b/docs/source/notebooks/plot_infrequent_true_constraint.md @@ -5,7 +5,7 @@ jupytext: extension: .md format_name: myst format_version: 0.13 - jupytext_version: 1.16.3 + jupytext_version: 1.16.4 kernelspec: display_name: Python 3 language: python diff --git a/docs/source/notebooks/plot_max_entropy.md b/docs/source/notebooks/plot_max_entropy.md index 4f44f046..bcc85394 100644 --- a/docs/source/notebooks/plot_max_entropy.md +++ b/docs/source/notebooks/plot_max_entropy.md @@ -5,7 +5,7 @@ jupytext: extension: .md format_name: myst format_version: 0.13 - jupytext_version: 1.16.3 + jupytext_version: 1.16.4 kernelspec: display_name: Python 3 language: python diff --git a/docs/source/notebooks/plot_max_entropy_augmented_lagrangian.md b/docs/source/notebooks/plot_max_entropy_augmented_lagrangian.md index 0031be1b..62ee4e62 100644 --- a/docs/source/notebooks/plot_max_entropy_augmented_lagrangian.md +++ b/docs/source/notebooks/plot_max_entropy_augmented_lagrangian.md @@ -5,7 +5,7 @@ jupytext: extension: .md format_name: myst format_version: 0.13 - jupytext_version: 1.16.3 + jupytext_version: 1.16.4 kernelspec: display_name: Python 3 language: python diff --git a/docs/source/notebooks/plot_min_norm.md b/docs/source/notebooks/plot_min_norm.md index 398c73a0..d81fc719 100644 --- a/docs/source/notebooks/plot_min_norm.md +++ b/docs/source/notebooks/plot_min_norm.md @@ -5,7 +5,7 @@ jupytext: extension: .md format_name: myst format_version: 0.13 - jupytext_version: 1.16.3 + jupytext_version: 1.16.4 kernelspec: display_name: Python 3 name: python3 diff --git a/docs/source/notebooks/plot_mnist_logistic_regression.ipynb b/docs/source/notebooks/plot_mnist_logistic_regression.ipynb index 7b24ff32..bfebfded 100644 --- a/docs/source/notebooks/plot_mnist_logistic_regression.ipynb +++ b/docs/source/notebooks/plot_mnist_logistic_regression.ipynb @@ -136,7 +136,7 @@ " start_epoch = 0\n", " all_metrics = defaultdict(list)\n", "else:\n", - " checkpoint = torch.load(checkpoint_path + \"/checkpoint.pth\")\n", + " checkpoint = torch.load(checkpoint_path + \"/checkpoint.pth\", weights_only=True)\n", " batch_ix = checkpoint[\"batch_ix\"]\n", " start_epoch = checkpoint[\"epoch\"] + 1\n", " all_metrics = checkpoint[\"all_metrics\"]\n", @@ -182,7 +182,7 @@ "del batch_ix, all_metrics, model, cmp, cooper_optimizer\n", "\n", 
"# Post-training analysis and plotting\n", - "all_metrics = torch.load(checkpoint_path + \"/checkpoint.pth\")[\"all_metrics\"]\n", + "all_metrics = torch.load(checkpoint_path + \"/checkpoint.pth\", weights_only=True)[\"all_metrics\"]\n", "\n", "fig, (ax0, ax1, ax2, ax3) = plt.subplots(nrows=1, ncols=4, sharex=True, figsize=(18, 4))\n", "\n", diff --git a/docs/source/notebooks/plot_mnist_logistic_regression.md b/docs/source/notebooks/plot_mnist_logistic_regression.md index bc76a855..806fbdb7 100644 --- a/docs/source/notebooks/plot_mnist_logistic_regression.md +++ b/docs/source/notebooks/plot_mnist_logistic_regression.md @@ -5,7 +5,7 @@ jupytext: extension: .md format_name: myst format_version: 0.13 - jupytext_version: 1.16.3 + jupytext_version: 1.16.4 kernelspec: display_name: Python 3 name: python3 @@ -121,7 +121,7 @@ if not os.path.isfile(checkpoint_path + "/checkpoint.pth"): start_epoch = 0 all_metrics = defaultdict(list) else: - checkpoint = torch.load(checkpoint_path + "/checkpoint.pth") + checkpoint = torch.load(checkpoint_path + "/checkpoint.pth", weights_only=True) batch_ix = checkpoint["batch_ix"] start_epoch = checkpoint["epoch"] + 1 all_metrics = checkpoint["all_metrics"] @@ -167,7 +167,7 @@ for epoch_num in range(start_epoch, 7): del batch_ix, all_metrics, model, cmp, cooper_optimizer # Post-training analysis and plotting -all_metrics = torch.load(checkpoint_path + "/checkpoint.pth")["all_metrics"] +all_metrics = torch.load(checkpoint_path + "/checkpoint.pth", weights_only=True)["all_metrics"] fig, (ax0, ax1, ax2, ax3) = plt.subplots(nrows=1, ncols=4, sharex=True, figsize=(18, 4)) diff --git a/docs/source/optim.md b/docs/source/optim.md index 511e140d..d6ecf46b 100644 --- a/docs/source/optim.md +++ b/docs/source/optim.md @@ -1,210 +1,338 @@ -# Optim Module +(optim) = + ```{eval-rst} .. currentmodule:: cooper.optim ``` -(partial-optimizer-instantiation)= - -## Partial optimizer instantiation - -When constructing a {py:class}`~cooper.optim.constrained_optimizesr.ConstrainedOptimizer`, the -`dual_optimizer` parameter is expected to be a -{py:class}`torch.optim.Optimizer` for which the `params` argument has **not -yet** been passed. The rest of the instantiation of the `dual_optimizer` is -handled internally by **Cooper**. +```{eval-rst} +.. autoclass:: CooperOptimizer + :members: +``` -The {py:meth}`cooper.optim.partial_optimizer` method below allows you to provide a -configuration for your `dual_optimizer`'s hyperparameters (e.g. learning -rate, momentum, etc.) ```{eval-rst} -.. automethod:: cooper.optim.partial_optimizer +.. currentmodule:: cooper.optim.constrained_optimizers ``` -## Learning rate schedulers +# Optim -**Cooper** supports learning rate schedulers for the primal and dual optimizers. -Recall that **Cooper** handles the primal and dual optimizers in slightly -different ways: the primal optimizer is "fully" instantiated by the user, while -we expect a "partially" instantiated dual optimizer. We follow a similar pattern -for the learning rate schedulers. +Talk about ConstrainedOptimizers +We also implement torch optimizers -**Example:** +## `ConstrainedOptimizer` Class -> ```{code-block} python -> :emphasize-lines: 7,8,10,15 -> :linenos: true -> -> from torch.optim.lr_scheduler import StepLR, ExponentialLR -> -> ... -> primal_optimizer = torch.optim.SGD(...) -> dual_optimizer = cooper.optim.partial_optimizer(...) 
-> 
-> primal_scheduler = StepLR(primal_optimizer, step_size=1, gamma=0.1)
-> dual_scheduler = cooper.optim.partial_scheduler(ExponentialLR, **scheduler_kwargs)
-> 
-> const_optim = cooper.ConstrainedOptimizer(..., primal_optimizer, dual_optimizer, dual_scheduler)
-> 
-> for step in range(num_steps):
->   ...
->   const_optim.step() # Cooper calls dual_scheduler.step() internally
->   primal_scheduler.step() # You must call this explicitly
-> ```
+The {py:class}`ConstrainedOptimizer` class is the cornerstone of **Cooper**. A {py:class}`ConstrainedOptimizer` performs parameter updates to solve a
+{py:class}`~cooper.ConstrainedMinimizationProblem`.
 
-### Primal learning rate scheduler
 
-(primal-lr-scheduler)=
+A {py:class}`ConstrainedOptimizer` wraps two {py:class}`torch.optim.Optimizer` objects: one for the primal parameters and one for the dual parameters. The {py:class}`ConstrainedOptimizer` defines how to perform the calls to the `step` method of each of the two optimizers. Namely,
+- {py:class}`UnconstrainedOptimizer` performs updates on the primal parameters only. Provided for consistency with the other optimizers.
+- {py:class}`SimultaneousOptimizer` performs simultaneous updates of the primal and dual parameters.
+- {py:class}`AlternatingPrimalDualOptimizer` performs alternating updates, updating the primal parameters first and then the dual parameters.
+- {py:class}`AlternatingDualPrimalOptimizer` performs alternating updates, updating the dual parameters first and then the primal parameters.
+- {py:class}`ExtrapolationConstrainedOptimizer` performs updates using the extra-gradient method.
 
-You must instantiate the scheduler for the learning rate used by each
-`primal_optimizer` and call the scheduler's `step` method explicitly, as is
-usual in PyTorch. See {py:mod}`torch.optim.lr_scheduler` for details.
 
-### Dual learning rate scheduler
+All of these optimizers expose a `roll` method, which packages a full iteration: it zeroes out the gradients, evaluates the loss and constraints, assembles the Lagrangian, backpropagates, and performs the primal and dual updates.
 
-(dual-lr-scheduler)=
 
-When constructing a
-{py:class}`~cooper.optim.constrained_optimizers.ConstrainedOptimizer`,
-the `dual_scheduler` parameter is expected to be a *partially instantiated*
-learning rate scheduler from PyTorch, for which the `optimizer` argument has
-**not yet** been passed. The {py:meth}`cooper.optim.partial_scheduler` method
-allows you to provide a configuration for your `dual_scheduler`'s
-hyperparameters. The rest of the instantiation of the `dual_scheduler` is
-managed internally by **Cooper**.
+Example of how to use a `ConstrainedOptimizer`:
+- **\[Line 8\]**: Define a `primal_optimizer` for the primal parameters.
+- **\[Line 12\]**: Define a `dual_optimizer` for the dual parameters. Set `maximize=True` since the dual parameters maximize the Lagrangian.
+- **\[Lines 17-21\]**: Define a `ConstrainedOptimizer` with the `primal_optimizer`, `dual_optimizer`, and the `cmp`. **Cooper** supports multiple `primal_optimizers` and `dual_optimizers` by passing an iterable of optimizers.
+- **\[Line 28\]**: Use the `roll` method to perform a single update step of both the primal and dual parameters.
 
-:::{note}
-The call to the `step()` method of the dual optimizer is handled
-internally by **Cooper**. However, you must perform the call to the dual
-scheduler's `step` method manually. This will usually come after several
-calls to {py:meth}`cooper.optim.constrained_optimizers.ConstrainedOptimizer.step`.
 
-The reasoning behind this design is to provide you, the user, with greater
-visibility and control on the dual learning rate scheduler. 
For example, you
-might want to synchronize the changes in the dual learning rate scheduler
-depending on the number of training epochs ellapsed so far.
-
-This flexibility is also desirable when using an
-{ref}`Augmented Lagrangian Formulation`,
-since the penalty coefficient for the augmented Lagrangian can be controlled
-directly via the dual learning rate scheduler.
-:::
 
+```{code-block} python
+:emphasize-lines: 8, 12, 17-21, 28
+:linenos: true
 
-### `PartialScheduler` Class
+import torch
+import cooper
 
-```{eval-rst}
-.. automethod:: cooper.optim.partial_scheduler
+train_loader = ...
+model = ...
+cmp = ...  # containing `Constraint`s and their associated `Multiplier`s
 
-```
+primal_optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
 
-(extra-gradient-optimizers)=
+# `cmp.dual_parameters()` returns the parameters associated with the multipliers.
+# `maximize=True` since the multipliers maximize the Lagrangian.
+dual_optimizer = torch.optim.SGD(cmp.dual_parameters(), lr=1e-3, maximize=True)
 
-## Extra-gradient optimizers
+# `ConstrainedOptimizer`s need access to the cmp to compute the loss, constraints, and
+# Lagrangian. Note that some `ConstrainedOptimizer`s do these calculations multiple
+# times.
+constrained_optimizer = cooper.SimultaneousOptimizer(
+    primal_optimizers=primal_optimizer,
+    dual_optimizer=dual_optimizer,
+    cmp=cmp,
+)
 
-The extra-gradient method {cite:p}`korpelevich1976extragradient` is a standard
-approach for solving min-max games as those appearing in the
-{py:class}`~cooper.formulation.LagrangianFormulation`.
+for inputs, targets in train_loader:
+    # kwargs used by the `cmp.compute_cmp_state` method to compute the loss and constraints.
+    kwargs = {"model": model, "inputs": inputs, "targets": targets}
 
-Given a Lagrangian $\mathcal{L}(x,\lambda)$, define the joint variable
-$\omega = (x,\lambda)$ and the "gradient" operator:
+    # `roll`: zero_grad, CMPState and Lagrangian computation, backward, and both steps.
+    constrained_optimizer.roll(compute_cmp_state_kwargs=kwargs)
+```
 
-$$
-F(\omega) = [\nabla_x \mathcal{L}(x,\lambda), -\nabla_{\lambda} \mathcal{L}(x,\lambda)]^{\top}
-$$
+### Simultaneous updates
 
-The extra-gradient update can be summarized as:
+A simple approach to updating the primal and dual parameters is to perform **simultaneous** updates. According to the choice of primal and dual optimizers, the updates are performed as follows:
 
-$$
-\omega_{t+1/2} &= P_{\Omega}[\omega_{t+} - \eta F(\omega_{t})] \\
-\omega_{t+1} &= P_{\Omega}[\omega_{t} - \eta F(\omega_{t+1/2})]
-$$
+$$
+x_{t+1} &= \texttt{primal_optimizer_update} \left( x_{t}, \nabla_{x} \mathcal{L}(x, \lambda_t)|_{x=x_t} \right)\\
+\lambda_{t+1} &= \texttt{dual_optimizer_update} \left( \lambda_{t}, {\color{red} \mathbf{-}} \nabla_{\lambda} \mathcal{L}({x_{t}}, \lambda)|_{\lambda=\lambda_t} \right)
+$$
 
-:::{note}
-In the *unconstrained* case, the extra-gradient update is "intrinsically
-different" from that of Nesterov momentum {cite:p}`gidel2018variational`.
-The current version of **Cooper** raises a {py:class}`RuntimeError` when
-trying to use an {py:class}`ExtragradientOptimizer`. This
-restriction might be lifted in future releases.
-:::
 
+```{eval-rst}
+.. 
autoclass:: SimultaneousOptimizer
+   :members:
+```
 
-The implementations of {py:class}`~cooper.optim.ExtraSGD` and
-{py:class}`~cooper.optim.ExtraAdam` included in **Cooper** are minor edits from
-those originally written by [Hugo Berard](https://github.com/GauthierGidel/Variational-Inequality-GAN/blob/master/optim/extragradient.py).
-{cite:t}`gidel2018variational` provides a concise presentation of the
-extra-gradient in the context of solving Variational Inequality Problems.
 
-:::{warning}
-If you decide to use extra-gradient optimizers for defining a
-{py:class}`~cooper.optim.constrained_optimizers.ConstrainedOptimizer`, the primal
-and dual optimizers must **both** be instances of classes inheriting from
-{py:class}`ExtragradientOptimizer`.
+### Alternating updates
 
-When provided with extrapolation-capable optimizers, **Cooper** will
-automatically trigger the calls to the extrapolation function.
+For efficiency, the alternating optimizers can avoid a second full evaluation of the loss: after the primal update, {py:class}`PrimalDualOptimizer` can re-evaluate only the constraints through the CMP's `compute_violations` method (when implemented), rather than recomputing the loss and constraints from scratch.
 
-Due to the calculation of gradients at the "look-ahead" point
-$\omega_{t+1/2}$, the call to
-{py:meth}`cooper.optim.constrained_optimizers.ConstrainedOptimizer.step` requires
-passing the parameters needed for the computation of the
-{py:meth}`cooper.problem.ConstrainedMinimizationProblem.closure`.
+```{eval-rst}
+.. autoclass:: PrimalDualOptimizer
+   :members:
+```
 
-**Example:**
+```{eval-rst}
+.. autoclass:: DualPrimalOptimizer
+   :members:
+```
 
-```{code-block} python
-:emphasize-lines: 11,12,31
-:linenos: true
 
-model = ...
 
-cmp = cooper.ConstrainedMinimizationProblem()
-formulation = cooper.Formulation(...)
+### Extra-gradient updates
 
-# Non-extra-gradient optimizers
-primal_optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
-dual_optimizer = cooper.optim.partial_optimizer(torch.optim.SGD, lr=1e-3)
+```{eval-rst}
+.. autoclass:: ExtragradientConstrainedOptimizer
+   :members:
+```
 
-# Extra-gradient optimizers
-primal_optimizer = cooper.optim.ExtraSGD(model.parameters(), lr=1e-2)
-dual_optimizer = cooper.optim.partial_optimizer(cooper.optim.ExtraSGD, lr=1e-3)
 
-const_optim = cooper.ConstrainedOptimizer(
-    formulation=formulation,
-    primal_optimizers=primal_optimizer,
-    dual_optimizer=dual_optimizer,
-)
+### Base Class
 
-for step in range(num_steps):
-    const_optim.zero_grad()
-    lagrangian = formulation.compute_lagrangian(cmp.closure, model, inputs)
-    formulation.backward(lagrangian)
+To implement a custom constrained optimizer, subclass {py:class}`ConstrainedOptimizer` and define how its updates are carried out. A step generally follows the pattern: `zero_grad`, forward pass (computing the {py:class}`~cooper.CMPState`), Lagrangian computation, `backward`, primal and dual `step`, and finally projection of the multipliers.
 
-    # Non-extra-gradient optimizers
-    # Passing (cmp.closure, model, inputs) to step will simply be ignored
-    const_optim.step()
 
-    # Extra-gradient optimizers
-    # Must pass (cmp.closure, model, inputs) to step
-    const_optim.step(cmp.closure, model, inputs)
-```
-:::
 
 ```{eval-rst}
-.. autoclass:: ExtragradientOptimizer
+.. autoclass:: ConstrainedOptimizer
    :members:
 ```
 
-```{eval-rst}
-.. autoclass:: ExtraSGD
-   :members:
-```
 
-```{eval-rst}
-.. autoclass:: ExtraAdam
-   :members:
+## Unconstrained Optimizers
+
+
+## Torch Optimizers
+
+
+### nuPI
+
+
+### Extra-gradient Optimizers
+
+
+
+:::{note}
+The material below is carried over from an earlier version of this page and is pending revision.
+:::
+
+### Construction
+
+The main ingredients to build a `ConstrainedOptimizer` are a
+{py:class}`~cooper.formulation.Formulation` (associated with a
+{py:class}`~cooper.problem.ConstrainedMinimizationProblem`) and a
+{py:class}`torch.optim.Optimizer` corresponding to a `primal_optimizer`.
+
+:::{note}
+**Cooper** supports the use of multiple `primal_optimizers`, each
+corresponding to different groups of primal variables. 
The +`primal_optimizers` argument accepts a single optimizer, or a list +of optimizers. See {ref}`multiple-primal_optimizers`. +::: + +If the `ConstrainedMinimizationProblem` you are dealing with is in fact +constrained, depending on your formulation, you might also need to provide a +`dual_optimizer`. Check out the section on {ref}`partial_optimizer_instantiation` +for more details on defining `dual_optimizer`s. + +:::{note} +**Cooper** includes extra-gradient implementations of SGD and Adam which can +be used as primal or dual optimizers. See {ref}`extra-gradient_optimizers`. +::: + +#### Examples + +The highlighted lines below show the small changes required to go from an +unconstrained to a constrained problem. Note that these changes should also be +accompanied with edits to the custom problem class which inherits from +{py:class}`~cooper.problem.ConstrainedMinimizationProblem`. More details on +the definition of a CMP can be found under the entry for {ref}`cmp`. + +- **Unconstrained problem** + + > ```{code-block} python + > :linenos: true + > + > model = ModelClass(...) + > cmp = cooper.ConstrainedMinimizationProblem() + > formulation = cooper.formulation.Formulation(...) + > + > primal_optimizer = torch.optim.Adam(model.parameters(), lr=1e-2) + > + > constrained_optimizer = cooper.UnconstrainedOptimizer( + > formulation=formulation, + > primal_optimizers=primal_optimizer, + > ) + > ``` + +- **Constrained problem** + + > ```{code-block} python + > :emphasize-lines: 7,9,12 + > :linenos: true + > + > model = ModelClass(...) + > cmp = cooper.ConstrainedMinimizationProblem() + > formulation = cooper.formulation.Formulation(...) + > + > primal_optimizer = torch.optim.Adam(model.parameters(), lr=1e-2) + > # Note that dual_optimizer is "partly instantiated", *without* parameters + > dual_optimizer = cooper.optim.partial_optimizer(torch.optim.SGD, lr=1e-3, momentum=0.9) + > + > constrained_optimizer = cooper.SimultaneousOptimizer( + > formulation=formulation, + > primal_optimizers=primal_optimizer, + > dual_optimizer=dual_optimizer, + > ) + > ``` + +### The training loop + +We have gathered all the ingredients we need for tackling our CMP: the +custom {py:class}`~cooper.problem.ConstrainedMinimizationProblem` class, along +with your {py:class}`ConstrainedOptimizer` of choice and a +{py:class}`ConstrainedOptimizer` for updating the parameters. Now it is time to +put them to good use. + +The typical training loop for solving a CMP in a machine learning set up using +**Cooper** (with a {ref}`Lagrangian Formulation`) +will involve the following steps: + +:::{admonition} Overview of main steps in a training loop +:class: hint + +1. (Optional) Iterate over your dataset and sample of mini-batch. +2. Call {py:meth}`constrained_optimizer.zero_grad()` to reset the parameters' gradients +3. Compute the current {py:class}`CMPState` (or estimate it with the minibatch) and calculate the Lagrangian using {py:meth}`formulation.compute_lagrangian(cmp.closure, ...)`. +4. Populate the primal and dual gradients with {py:meth}`formulation.backward(lagrangian)` +5. Perform updates on the parameters using the primal and dual optimizers based on the recently computed gradients, via a call to {py:meth}`constrained_optimizer.step()`. +::: + +#### Example + +> ```{code-block} python +> :linenos: true +> +> model = ModelClass(...) +> cmp = cooper.ConstrainedMinimizationProblem(...) +> formulation = cooper.LagrangianFormulation(...) 
+> +> primal_optimizer = torch.optim.SGD(model.parameters(), lr=primal_lr) +> # Note that dual_optimizer is "partly instantiated", *without* parameters +> dual_optimizer = cooper.optim.partial_optimizer(torch.optim.SGD, lr=primal_lr) +> +> constrained_optimizer = cooper.ConstrainedOptimizer( +> formulation=formulation, +> primal_optimizers=primal_optimizer, +> dual_optimizer=dual_optimizer, +> ) +> +> for inputs, targets in dataset: +> # Clear gradient buffers +> constrained_optimizer.zero_grad() +> +> # The closure is required to compute the Lagrangian +> # The closure might in turn require the model, inputs, targets, etc. +> lagrangian = formulation.compute_lagrangian(cmp.closure, ...) +> +> # Populate the primal and dual gradients +> formulation.backward(lagrangian) +> +> # Perform primal and dual parameter updates +> constrained_optimizer.step() +> ``` + +(basic-parameter-updates)= + +### Parameter updates + +By default, parameter updates are performed using **simultaneous** gradient +descent-ascent updates (according to the choice of primal and dual optimizers). +Formally, + +$$ +x_{t+1} &= \texttt{primal_optimizer_update} \left( x_{t}, \nabla_{x} \mathcal{L}(x, \lambda_t)|_{x=x_t} \right)\\ +\lambda_{t+1} &= \texttt{dual_optimizer_update} \left( \lambda_{t}, {\color{red} \mathbf{-}} \nabla_{\lambda} \mathcal{L}({x_{t}}, \lambda)|_{\lambda=\lambda_t} \right) +$$ + +:::{note} +We explicitly include a negative sign in front of the gradient for +$\lambda$ in order to highlight the fact that $\lambda$ solves +**maximization** problem. **Cooper** handles the sign flipping internally, so +you should provide your definition for a `dual_optimizer` using a non-negative +learning rate, as usual! +::: + +:::{admonition} Multiplier projection +:class: note + +Lagrange multipliers associated with inequality constraints should remain +non-negative. **Cooper** executes the standard projection to +$\mathbb{R}^{+}$ by default for +{py:class}`~cooper.optim.multipliers.DenseMultiplier`s. For more details +on using custom projection operations, see the section on {ref}`multipliers`. +::: + +Other update strategies supported by {py:class}`~ConstrainedOptimizer` include: + +- {ref}`Alternating updates` for (projected) gradient descent-ascent + +- The {ref}`Augmented Lagrangian` method (ALM) + +- Using {ref}`Extra-gradient` + + - Extra-gradient-based optimizers require an extra call to the + {py:meth}`cmp.closure()`. + See the section on {ref}`extra-gradient_optimizers` for usage details. + +The `ConstrainedOptimizer` implements a {py:meth}`ConstrainedOptimizer.step` +method, that updates the primal and dual parameters (if `Formulation` has any). +The nature of the update depends on the attributes provided during the +initialization of the `ConstrainedOptimizer`. By default, updates are via +gradient descent on the primal parameters and (projected) ascent +on the dual parameters, with simultaneous updates. + +:::{note} +When applied to an unconstrained problem, {py:meth}`ConstrainedOptimizer.step` +will be equivalent to performing `optimizer.step()` on all of the +`primal_optimizers` based on the gradient of the loss with respect to the +primal parameters. +::: + +```{include} additional_features.md ``` ```{eval-rst} -.. autoclass:: PID +.. autoclass:: AlternatingPrimalDualOptimizer :members: ``` diff --git a/docs/source/optim_old.md b/docs/source/optim_old.md new file mode 100644 index 00000000..01dee9e4 --- /dev/null +++ b/docs/source/optim_old.md @@ -0,0 +1,209 @@ +# Optim Module + +```{eval-rst} +.. 
currentmodule:: cooper.optim +``` + +(partial-optimizer-instantiation)= + +## Partial optimizer instantiation + +When constructing a {py:class}`~cooper.optim.constrained_optimizesr.ConstrainedOptimizer`, the +`dual_optimizer` parameter is expected to be a +{py:class}`torch.optim.Optimizer` for which the `params` argument has **not +yet** been passed. The rest of the instantiation of the `dual_optimizer` is +handled internally by **Cooper**. + +The {py:meth}`cooper.optim.partial_optimizer` method below allows you to provide a +configuration for your `dual_optimizer`'s hyperparameters (e.g. learning +rate, momentum, etc.) + +```{eval-rst} +.. automethod:: cooper.optim.partial_optimizer +``` + +## Learning rate schedulers + +**Cooper** supports learning rate schedulers for the primal and dual optimizers. +Recall that **Cooper** handles the primal and dual optimizers in slightly +different ways: the primal optimizer is "fully" instantiated by the user, while +we expect a "partially" instantiated dual optimizer. We follow a similar pattern +for the learning rate schedulers. + +**Example:** + +> ```{code-block} python +> :emphasize-lines: 7,8,10,15 +> :linenos: true +> +> from torch.optim.lr_scheduler import StepLR, ExponentialLR +> +> ... +> primal_optimizer = torch.optim.SGD(...) +> dual_optimizer = cooper.optim.partial_optimizer(...) +> +> primal_scheduler = StepLR(primal_optimizer, step_size=1, gamma=0.1) +> dual_scheduler = cooper.optim.partial_scheduler(ExponentialLR, **scheduler_kwargs) +> +> const_optim = cooper.ConstrainedOptimizer(..., primal_optimizer, dual_optimizer, dual_scheduler) +> +> for step in range(num_steps): +> ... +> const_optim.step() # Cooper calls dual_scheduler.step() internally +> primal_scheduler.step() # You must call this explicitly +> ``` + +### Primal learning rate scheduler + +(primal-lr-scheduler)= + +You must instantiate the scheduler for the learning rate used by each +`primal_optimizer` and call the scheduler's `step` method explicitly, as is +usual in PyTorch. See {py:mod}`torch.optim.lr_scheduler` for details. + +### Dual learning rate scheduler + +(dual-lr-scheduler)= + +When constructing a +{py:class}`~cooper.optim.constrained_optimizers.ConstrainedOptimizer`, +the `dual_scheduler` parameter is expected to be a *partially instantiated* +learning rate scheduler from PyTorch, for which the `optimizer` argument has +**not yet** been passed. The {py:meth}`cooper.optim.partial_scheduler` method +allows you to provide a configuration for your `dual_scheduler`'s +hyperparameters. The rest of the instantiation of the `dual_scheduler` is +managed internally by **Cooper**. + +:::{note} +The call to the `step()` method of the dual optimizer is handled +internally by **Cooper**. However, you must perform the call to the dual +scheduler's `step` method manually. This will usually come after several +calls to {py:meth}`cooper.optim.constrained_optimizers.ConstrainedOptimizer.step`. + +The reasoning behind this design is to provide you, the user, with greater +visibility and control on the dual learning rate scheduler. For example, you +might want to synchronize the changes in the dual learning rate scheduler +depending on the number of training epochs ellapsed so far. + +This flexibility is also desirable when using an +{ref}`Augmented Lagrangian Formulation`, +since the penalty coefficient for the augmented Lagrangian can be controlled +directly via the dual learning rate scheduler. +::: + +### `PartialScheduler` Class + +```{eval-rst} +.. 
automethod:: cooper.optim.partial_scheduler +``` + +(extra-gradient-optimizers)= + +## Extra-gradient optimizers + +The extra-gradient method {cite:p}`korpelevich1976extragradient` is a standard +approach for solving min-max games as those appearing in the +{py:class}`~cooper.formulation.LagrangianFormulation`. + +Given a Lagrangian $\mathcal{L}(x,\lambda)$, define the joint variable +$\omega = (x,\lambda)$ and the "gradient" operator: + +$$ +F(\omega) = [\nabla_x \mathcal{L}(x,\lambda), -\nabla_{\lambda} \mathcal{L}(x,\lambda)]^{\top} +$$ + +The extra-gradient update can be summarized as: + +$$ +\omega_{t+1/2} &= P_{\Omega}[\omega_{t+} - \eta F(\omega_{t})] \\ +\omega_{t+1} &= P_{\Omega}[\omega_{t} - \eta F(\omega_{t+1/2})] +$$ + +:::{note} +In the *unconstrained* case, the extra-gradient update is "intrinsically +different" from that of Nesterov momentum {cite:p}`gidel2018variational`. +The current version of **Cooper** raises a {py:class}`RuntimeError` when +trying to use an {py:class}`ExtragradientOptimizer`. This +restriction might be lifted in future releases. +::: + +The implementations of {py:class}`~cooper.optim.ExtraSGD` and +{py:class}`~cooper.optim.ExtraAdam` included in **Cooper** are minor edits from +those originally written by [Hugo Berard](https://github.com/GauthierGidel/Variational-Inequality-GAN/blob/master/optim/extragradient.py). +{cite:t}`gidel2018variational` provides a concise presentation of the +extra-gradient in the context of solving Variational Inequality Problems. + +:::{warning} +If you decide to use extra-gradient optimizers for defining a +{py:class}`~cooper.optim.constrained_optimizers.ConstrainedOptimizer`, the primal +and dual optimizers must **both** be instances of classes inheriting from +{py:class}`ExtragradientOptimizer`. + +When provided with extrapolation-capable optimizers, **Cooper** will +automatically trigger the calls to the extrapolation function. + +Due to the calculation of gradients at the "look-ahead" point +$\omega_{t+1/2}$, the call to +{py:meth}`cooper.optim.constrained_optimizers.ConstrainedOptimizer.step` requires +passing the parameters needed for the computation of the +{py:meth}`cooper.problem.ConstrainedMinimizationProblem.closure`. + +**Example:** + +```{code-block} python +:emphasize-lines: 11,12,31 +:linenos: true + +model = ... + +cmp = cooper.ConstrainedMinimizationProblem() +formulation = cooper.Formulation(...) + +# Non-extra-gradient optimizers +primal_optimizer = torch.optim.SGD(model.parameters(), lr=1e-2) +dual_optimizer = cooper.optim.partial_optimizer(torch.optim.SGD, lr=1e-3) + +# Extra-gradient optimizers +primal_optimizer = cooper.optim.ExtraSGD(model.parameters(), lr=1e-2) +dual_optimizer = cooper.optim.partial_optimizer(cooper.optim.ExtraSGD, lr=1e-3) + +const_optim = cooper.ConstrainedOptimizer( + formulation=formulation, + primal_optimizers=primal_optimizer, + dual_optimizer=dual_optimizer, +) + +for step in range(num_steps): + const_optim.zero_grad() + lagrangian = formulation.compute_lagrangian(cmp.closure, model, inputs) + formulation.backward(lagrangian) + + # Non-extra-gradient optimizers + # Passing (cmp.closure, model, inputs) to step will simply be ignored + const_optim.step() + + # Extra-gradient optimizers + # Must pass (cmp.closure, model, inputs) to step + const_optim.step(cmp.closure, model, inputs) +``` +::: + +```{eval-rst} +.. autoclass:: ExtragradientOptimizer + :members: +``` + +```{eval-rst} +.. autoclass:: ExtraSGD + :members: +``` + +```{eval-rst} +.. 
autoclass:: ExtraAdam + :members: +``` + +```{eval-rst} +.. autoclass:: PID + :members: +``` diff --git a/docs/source/problem.md b/docs/source/problem.md index 056426f5..06305f69 100644 --- a/docs/source/problem.md +++ b/docs/source/problem.md @@ -1,171 +1,120 @@ -```{eval-rst} -.. currentmodule:: cooper.problem -``` - (cmp)= -# Constrained Minimization Problem +# Constrained Minimization Problems -We consider constrained minimization problems (CMPs) expressed as: +We consider constrained minimization problems (CMPs) of the form: $$ -\min_{x \in \Omega} & \,\, f(x) \\ \text{s.t. } & \,\, g(x) \le \mathbf{0} \\ & \,\, h(x) = \mathbf{0} +\min_{\vx \in \reals^d} & \,\, f(\vx) \\ \text{s.t. } +& \,\, \vg(\vx) \le \vzero \\ & \,\, \vh(\vx) = \vzero $$ -Here $\Omega$ represents the domain of definition of the functions -$f, g$ and $h$. Note that $f$ is a scalar-valued function. -We group together all the inequality and equality constraints into the -vector-valued mappings $g$ and $h$. In other words, a component -function $h_i(x)$ corresponds to the scalar constraint -$h_i(x) \le 0$. - -:::{admonition} Brief notes on conventions and terminology -- We refer to $f$ as the **loss** or **main objective** to be minimized. -- Many authors prefer making the constraint levels explicit (e.g. - $g(x) \le \mathbf{\epsilon}$). To improve the readability of the - code, we adopt the convention that the constraint levels have been - "absorbed" in the definition of the functions $g$ and $h$. -- Based on this convention, we use the terms **defect** and **constraint violation** - interchangeably to denote the quantities $g(x)$ or $h(x)$. Note - that equality constraints $h(x)$ are satisfied *only* when their - defect is zero. On the other hand, a *negative* defect for an inequality - constraint $g(x)$ means that the constraint is *strictly* satisfied; - while a *positive* defect means that the inequality constraint is being - violated. -::: - -## CMP State - -We represent computationally the "state" of a CMP using a {py:class}`CMPState` -object. A `CMPState` is a {py:class}`dataclasses.dataclass` which contains the -information about the loss and equality/inequality violations at a given point -$x$. If a problem has no equality or inequality constraints, these -arguments can be omitted in the creation of the `CMPState`. +See {ref}`here` for a brief introduction to constrained optimization. In this section, we will discuss how to represent CMPs using **Cooper**. To do this, consider the following objects: +- {py:class}`~cooper.constraints.Constraint`: represents a group of either equality or inequality constraints. +- {py:class}`~cooper.ConstrainedMinimizationProblem`: represents the constrained minimization problem itself. It must include a method `compute_cmp_state` that computes the loss and constraint violations at a given point. -:::{admonition} Stochastic estimates in `CMPState` -:class: important - -In problems for which computing the loss or constraints exactly is prohibitively -expensive, the {py:class}`CMPState` may contain stochastic estimates of the -loss/constraints. For example, this is the case when the loss corresponds to a -sum over a large number of terms, such as training examples. In this case, the -loss and constraints may be estimated using mini-batches of data. - -Note that, just as in the unconstrained case, these approximations can -entail a compromise in the stability of the optimization process. -::: - -```{eval-rst} -.. 
autoclass:: CMPState
-   :members: as_tuple
-
-```
-
-For details on the use of proxy-constraints and the `proxy_ineq_defect` and
-`proxy_eq_defect` attributes, please see {ref}`lagrangian_formulations`.
-
-## Constrained Minimization Problem
-
-```{eval-rst}
-.. autoclass:: ConstrainedMinimizationProblem
-   :members:
-```
+Moreover, in order to package the values of the loss and constraints, we will define the following objects:
+- {py:class}`~cooper.constraints.ConstraintState`: represents the state of a {py:class}`~cooper.constraints.Constraint` by packaging its violation.
+- {py:class}`~cooper.CMPState`: represents the state of a CMP at a given point. It contains the values of the loss and {py:class}`~cooper.constraints.ConstraintState` objects for some (or all) of its associated constraints.
 
 ## Example
 
-The example below illustrates the main steps that need to be carried out to
-define a `ConstrainedMinimizationProblem` in **Cooper**.
+The example below illustrates the main steps that need to be carried out to define a {py:class}`~cooper.ConstrainedMinimizationProblem` class. In this example:
 
-1. *\[Line 4\]* Define a custom class which inherits from {py:class}`ConstrainedMinimizationProblem`.
-2. *\[Line 10\]* Write a closure function that computes the loss and constraints.
-3. *\[Line 14\]* Note how the `misc` attribute can be use to store previous results.
-4. *\[Line 18\]* Return the information about the loss and constraints packaged into a {py:class}`CMPState`.
-5. *\[Line 18\]* (Optional) Modularize the code to allow for evaluating the constraints `only`.
+1. **\[Line 4\]** Define a custom class which inherits from {py:class}`~cooper.ConstrainedMinimizationProblem`.
+2. **\[Line 7\]** Define a multiplier object for the constraints.
+3. **\[Lines 9-11\]** Define the constraint object.
+4. **\[Line 13\]** Implement the `compute_cmp_state` method that computes the loss and constraints.
+5. **\[Line 18\]** Return the information about the loss and constraints packaged into a {py:class}`~cooper.CMPState`.
+6. **\[Line 20\]** (Optional) Modularize the code to allow for evaluating the constraints **only**. This is useful for optimization algorithms that sometimes need to evaluate the constraints without computing the loss.
 
 ```{code-block} python
-:emphasize-lines: 4,10,14,18,20
+:emphasize-lines: 4, 7, 9-11, 13, 18, 20
 :linenos: true
 
 import torch
 import cooper
 
-class MyCustomCMP(cooper.ConstrainedMinimizationProblem):
-    def __init__(self, problem_attributes, criterion):
-        self.problem_attributes = problem_attributes
-        self.criterion = criterion
+class MyCMP(cooper.ConstrainedMinimizationProblem):
+    def __init__(self):
         super().__init__()
+        multiplier = cooper.multipliers.DenseMultiplier(num_constraints=..., device=...)
+        # By default, constraints are built using `formulation_type=cooper.LagrangianFormulation`
+        self.constraint = cooper.Constraint(
+            multiplier=multiplier, constraint_type=cooper.ConstraintType.INEQUALITY
+        )
+
+    def compute_cmp_state(self, model, inputs, targets):
+        loss = ...
+        cmp_state = self.compute_violations(model, inputs, targets)
+        cmp_state.loss = loss
 
-    def closure(self, model, inputs, targets):
+        return cmp_state
 
-        cmp_state = self.defect_fn(model, inputs, targets)
+    def compute_violations(self, model, inputs, targets):
+        # This method is optional. It allows for evaluating the constraints without
+        # computing the loss.
+        violation = ...  # ensure that the constraint follows the convention "g <= 0"
+        constraint_state = cooper.ConstraintState(violation=violation)
+ observed_constraints = {self.constraint: constraint_state} - logits = cmp_state.misc["logits"] - loss = self.criterion(logits, targets) - cmp_state.loss = loss + return cooper.CMPState(loss=None, observed_constraints=observed_constraints) +``` - return cmp_state - def defect_fn(self, model, inputs, targets): +## Constraints - logits = model.forward(inputs) +{py:class}`~cooper.constraints.Constraint` objects are used to group similar constraints together. While it is possible to have multiple constraints represented by the same {py:class}`~cooper.constraints.Constraint` object, they must share the same type (i.e., all equality or all inequality constraints) and all must be handled through the same {py:class}`~cooper.formulation.Formulation` (for example, all with a Lagrangian formulation). For problems with different types of constraints or formulations, you should use separate {py:class}`~cooper.constraints.Constraint` objects. - const_level0, const_level1 = self.problem_attributes +```{eval-rst} +.. currentmodule:: cooper.constraints +``` - # Remember to write the constraints using the convention "g <= 0"! - # (Greater than) Inequality that only depends on the model properties or parameters - # g_0 >= const_level0 --> const_level0 - g_0 <= 0 - defect0 = const_level0 - ineq_const0(model) +```{eval-rst} +.. autoclass:: Constraint +``` - # (Less than) Inequality that depends on the model's predictions - # g_1 <= const_level1 --> g_2 - const_level1 <= 0 - defect1 = ineq_const1(logits) - const_level1 +## ConstraintStates - # We recommend using torch.stack to ensure the dependencies in the computational - # graph are properly preserved. - ineq_defect = torch.stack([defect0, defect1]) +In their simplest form, {py:class}`~cooper.constraints.ConstraintState` objects simply contain the value of the constraint violation. However, they can be extended to enable extra functionality: +- **Sampled constraints**: if not all violations of a {py:class}`Constraint` are observed at every step, you can still use **Cooper** by providing the *observed* constraint violations in the {py:class}`~cooper.constraints.ConstraintState`. To do this, provide only the observed violations in `violation`, their corresponding indices in `constraint_features`, and make sure that you are using an {py:class}`~cooper.multipliers.IndexedMultiplier` as the multiplier associated with the constraint. **Cooper** will then know which entries to consider when computing contributions of the constraint to the Lagrangian, and which to ignore. +> +- **Implicit parameterization of the Lagrange multipliers** {cite:p}`narasimhan2020multiplier`: similar to the sampled constraints case, you can use an implicit parameterization for the Lagrange multipliers (a neural network, for example). In this case, the `constraint_features` must contain the input features to the Lagrange multiplier model associated with the evaluated constraints. Implicit multipliers are discussed in more detail in {doc}`multipliers`. +> +- **Proxy constraints** {cite:p}`cotter2019proxy`: in some settings, it is desirable to use different constraint violations for updating the primal and dual variables (see {ref}`here` for more details). This can be achieved by providing a `violation`, which will be used for updating the primal variables, and a `strict_violation`, which will be used for updating the dual variables. When following this approach, ensure that the `violation` is differentiable with respect to the primal variables. 
+Note that proxy constraints can be used in conjunction with sampled constraints and implicit parameterization of the Lagrange multipliers, by providing both `constraint_features` and `strict_constraint_features`.

-    return cooper.CMPState(ineq_defect=ineq_defect, eq_defect=None, misc={'logits': logits})
+```{eval-rst}
+.. autoclass:: ConstraintState
 ```

-:::{warning}
-**Cooper** is primarily oriented towards **non-convex** CMPs that arise
-in many machine/deep learning settings. That is, problems for which one of
-the functions $f, g, h$ or the set $\Omega$ is non-convex.
-
-Whenever possible, we provide references to appropriate literature
-describing convergence results for our implemented (under suitable
-assumptions). In general, however, the use of Lagrangian-based approaches
-for solving non-convex CMPs does not come with guarantees regarding
-optimality or feasibility.
-
-Some theoretical results can be obtained when considering mixed strategies
-(distributions over actions for the primal and dual players), or by relaxing
-the game-theoretic solution concept (i.e. aiming for approximate/correlated
-equilibria), even for problems which are non-convex on the primal (model)
-parameters. For more details, see the work of {cite:t}`cotter2019JMLR` and
-{cite:t}`lin2020gradient` and references therein. We plan to include some
-of these techniques in future versions of **Cooper**.
-
-If you are dealing with optimization problems under "nicely behaved" convex
-constraints (e.g. cones or $L_p$-balls) we encourage you to check out
-[CHOP](https://github.com/openopt/chop). If your problems involves "manifold"
-constraints (e.g. orthogonal or PSD matrices), you might consider using
-[GeoTorch](https://github.com/Lezcano/geotorch).
-:::
+
+## CMPs

 ```{eval-rst}
-.. currentmodule:: cooper.formulation
+.. currentmodule:: cooper
 ```

-## Formulation
+Users must implement their own {py:class}`ConstrainedMinimizationProblem` subclass, as exemplified in the [example](#example) above.

-Formulations denote mathematical or algorithmic techniques aimed at solving a
-specific (family of) CMP. **Cooper** is heavily (but not exclusively!) designed
-for an easy integration of Lagrangian-based formulations. You can find more
-details in {doc}`lagrangian_formulation`.
+```{eval-rst}
+.. autoclass:: ConstrainedMinimizationProblem
+   :members:
+```
+
+## CMPStates
+
+Computationally, we represent the "state" of a CMP using a {py:class}`CMPState` object. A {py:class}`CMPState` is a dataclass containing information about the loss and the equality/inequality violations at a given point $\boldsymbol{x}$. The constraints included in the {py:class}`CMPState` must be passed as a dictionary, where the keys are the {py:class}`Constraint` objects and the values are the associated {py:class}`ConstraintState` objects.
+
+:::{admonition} Stochastic estimates in {py:class}`CMPState`
+:class: important
+
+In problems for which computing the loss or constraints exactly is prohibitively expensive, the {py:class}`CMPState` may contain **stochastic estimates** of the loss/constraints, for example, estimates computed on mini-batches of data.
+
+Note that, just as in the unconstrained case, these approximations can entail a compromise in the stability of the optimization process.
+:::

 ```{eval-rst}
-.. autoclass:: Formulation
+.. 
autoclass:: CMPState :members: ``` diff --git a/docs/source/references.bib b/docs/source/references.bib index a569fb00..db52cb17 100644 --- a/docs/source/references.bib +++ b/docs/source/references.bib @@ -1,4 +1,3 @@ - @book{nocedal2006NumericalOptimization, title = {Numerical Optimization}, author = {Nocedal, Jorge and Wright, Stephen J.}, @@ -16,7 +15,7 @@ @book{bertsekas1999NonlinearProgramming publisher = {{Athena scientific}}, address = {{Belmont, Mass}}, } -@article{cotter2019JMLR, +@article{cotter2019proxy, author = {Andrew Cotter and Heinrich Jiang and Maya Gupta and Serena Wang and Taman Narayan and Seungil You and Karthik Sridharan}, title = {{Optimization with Non-Differentiable Constraints with Applications to Fairness, Recall, Churn, and Other Goals}}, journal = {Journal of Machine Learning Research}, @@ -72,7 +71,6 @@ @inproceedings{lin2020gradient year = {2020}, url = {https://proceedings.mlr.press/v119/lin20a.html} } - @inproceedings{sutskever2013initialization, title = {On the importance of initialization and momentum in deep learning}, author = {Sutskever, Ilya and Martens, James and Dahl, George and Hinton, Geoffrey}, diff --git a/pyproject.toml b/pyproject.toml index 739ede65..e82416d0 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,8 +1,8 @@ [build-system] build-backend = "hatchling.build" requires = [ - "hatchling>=1.1.0", "hatch-vcs>=0.4.0", + "hatchling>=1.1.0", ] [project] @@ -35,7 +35,6 @@ classifiers = [ ] dynamic = ["version"] dependencies = [ - "numpy>=1.22.0,<2.0.0", # PyTorch 2.2.2 and older don't support NumPy 2.0. "torch>=1.13.1", "typing-extensions>=4.8.0", ] @@ -43,43 +42,38 @@ dependencies = [ [project.optional-dependencies] dev = [ "build>=1.2.1", - "cvxpy~=1.5.1", - "mypy~=1.7.1", - "pre-commit>=3.6.0", - "pytest~=8.2.2", - "pytest-cov==4.1.0", - "ruff~=0.5.6", - "tox==4.11.4", - "twine==4.0.2", + "coverage>=7.6.1", + "cvxpy>=1.5.2", + "jupytext==1.16.4", + "mypy>=1.11.1", + "numpy>=1.22.0,<2.0.0", # PyTorch 2.2.2 and older don't support NumPy 2.0. + "pre-commit>=3.7.1", + "pytest>=8.3.2", + "ruff==0.6.1", + "twine>=5.1.1", ] docs = [ - "ipykernel>=6.5.0,<7.0.0", - "ipywidgets>=7.6.0,<8.0.0", - "jupytext>=1.16.3", - "matplotlib>=3.5.0,<4.0.0", + "matplotlib>=3.8.4,<4.0.0", "myst-nb>=1.1.1", - "sphinx>=4.3.1", - "sphinx-autobuild>=2021.3.14", - "sphinx-autodoc-typehints>=1.12.0", - "sphinx-copybutton>=0.4.0", - "sphinx-rtd-theme>=1.0.0", - "sphinxcontrib-bibtex>=2.4.1", - "sphinxcontrib-contentui>=0.2.5", - "sphinxcontrib-katex>=0.8.6", + "numpy>=1.22.0,<2.0.0", # PyTorch 2.2.2 and older don't support NumPy 2.0. + "sphinx>=7.4.7", + "sphinx-autobuild>=2024.4.16", + "sphinx-autodoc-typehints>=2.2.3", + "sphinx-copybutton>=0.5.2", + "sphinx-rtd-theme>=2.0.0", + "sphinxcontrib-bibtex>=2.6.2", "torchvision>=0.13.0,<1.0.0", ] -examples = [ - "ipykernel>=6.5.0,<7.0.0", - "ipywidgets>=7.6.0,<8.0.0", - "matplotlib>=3.5.0,<4.0.0", - "numpy==1.22.0", +notebooks = [ + "matplotlib>=3.8.4,<4.0.0", + "numpy>=1.22.0,<2.0.0", # PyTorch 2.2.2 and older don't support NumPy 2.0. "torchvision>=0.13.0,<1.0.0", ] tests = [ - "cvxpy~=1.5.1", - "pytest~=8.2.2", - "pytest-cov==4.1.0", - "tox==4.11.4", + "coverage>=7.6.1", + "cvxpy>=1.5.2", + "numpy>=1.22.0,<2.0.0", # PyTorch 2.2.2 and older don't support NumPy 2.0. 
+    "pytest>=8.3.2",
 ]

 [project.urls]
@@ -102,64 +96,61 @@ exclude = [
 source = "vcs"

 [tool.mypy]
-mypy_path = "cooper"
+packages = ["cooper"]
 warn_unused_configs = true

+[tool.coverage.run]
+relative_files = true
+
 [tool.ruff]
 line-length = 120
 target-version = "py39"
-extend-include = ["*.ipynb"]

-[tool.ruff.lint.isort]
-known-first-party = ["cooper", "tests"]
+[tool.ruff.lint.pydocstyle]
+convention = "google"
+
+[tool.ruff.lint.pycodestyle]
+max-doc-length = 88

 [tool.ruff.lint]
 preview = true
 select = ["ALL"]
 ignore = [
+    "ANN401",  # Any type annotation
     "B028",    # Stacklevel in warnings
-    "E501",    # Line length (handled by ruff-format)
-    "E731",    # Lambda function
-    "W505",    # Doc Line length
+    "COM812",  # Fixed by ruff-format
     "D1",      # TODO: Remove this line when we have docstrings for all functions
     "D205",    # 1 blank line required between summary line and description in docstrings
-    "DOC",     # Docstring missing exceptions/returns
+    "E501",    # Line length (handled by ruff-format)
+    "E731",    # Lambda function
+    "FURB140", # Use itertools.starmap instead of list comprehension
     "ISC001",  # Fixed by ruff-format
-    "COM812",  # Fixed by ruff-format
-    "FA",      # Future type annotations
-    "CPY",     # Copyright notice
-    "TRY003",  # Long Exception message
-    "SLF",     # Private (underscore) attribute access
-    "EM",      # Exception message not in seperate msg variable
+    "NPY002",  # numpy.random.Generator is preferred over numpy.random.seed
     "PLR09",   # Too many arguments
     "PLR2004", # Use of value instead of constant variable
     "PLR6104", # Forces in-place operations, for example, x += 1 instead of x = x + 1
     "PLR6301", # Self not used in method
     "PLW2901", # For loop variable is overwritten
+    "RET504",  # Unnecessary assignment before return
+    "S101",    # Use of assert
+    "TRY003",  # Long Exception message
+    "W505",    # Doc Line length
+    "CPY",     # Copyright notice
+    "DOC",     # Docstring missing exceptions/returns
+    "EM",      # Exception message not in separate msg variable
+    "FA",      # Future type annotations
     "FBT",     # Boolean trap
     "FIX",     # Fixmes
-    "TD",      # TODOs
-    "ANN401",  # Any type annotation
     "PTH",     # Use Pathlib instead of os.path
-    "FURB140", # Use itertools.starmap instead of list comprehension
-    "NPY002",  # numpy.random.Generator is preferred over numpy.random.seed
-    "RET504",  # Unnecessary assignment before return
-    "S101",    # Use of assert
+    "SLF",     # Private (underscore) attribute access
+    "TD",      # TODOs
 ]

 [tool.ruff.lint.per-file-ignores]
 "__init__.py" = ["F401"]
+"testing/*" = ["ANN", "N801", "N802", "N803", "N806"]
 "tests/*" = ["ANN", "C901", "N801", "N802", "N803", "N806"]
 "docs/*" = ["ANN"]
 "docs/source/conf.py" = ["A001", "ERA001", "INP001"]
 "docs/source/notebooks/*" = ["N801", "N802", "N803", "N806"]
 "src/cooper/optim/torch_optimizers/nupi_optimizer.py" = ["C901", "N801", "N802", "N803", "N806"]
-
-[tool.ruff.lint.pydocstyle]
-convention = "google"
-
-[tool.ruff.lint.pycodestyle]
-max-doc-length = 88
-
-[tool.coverage.run]
-relative_files = true
diff --git a/requirements.txt b/requirements.txt
index 7a155d77..9fb4985c 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1 +1 @@
---editable '.[dev, docs, tests, examples]'
+--editable '.[dev, docs, tests, notebooks]'
diff --git a/src/cooper/cmp.py b/src/cooper/cmp.py
index 68268652..e1b64741 100644
--- a/src/cooper/cmp.py
+++ b/src/cooper/cmp.py
@@ -43,10 +43,10 @@ class CMPState:
     Args:
         loss: Value of the loss or main objective to be minimized :math:`f(x)`
-        observed_constraints: Dictionary with :py:class:`~.Constraint` instances as keys
-            and :py:class:`~.ConstraintState`
instances as values. + observed_constraints: Dictionary with :py:class:`~Constraint` instances as keys + and :py:class:`~ConstraintState` instances as values. misc: Optional storage space for additional information relevant to the state of - the CMP. This dict enables persisting the results of certain computations + the ``cmp``. This dict enables persisting the results of certain computations for post-processing. For example, one may want to retain the value of the predictions/logits computed over a given minibatch during the call to :py:meth:`~.ConstrainedMinimizationProblem.compute_cmp_state` to measure or @@ -67,22 +67,19 @@ def _compute_primal_or_dual_lagrangian(self, primal_or_dual: Literal["primal", " check_contributes_fn = lambda cs: getattr(cs, f"contributes_to_{primal_or_dual}_update") contributing_constraints = {c: cs for c, cs in self.observed_constraints.items() if check_contributes_fn(cs)} - if len(contributing_constraints) == 0: - if self.loss is None: - return LagrangianStore() + if not contributing_constraints: # No observed constraints contribute to the Lagrangian. - lagrangian = self.loss.clone() if primal_or_dual == "primal" else None + lagrangian = self.loss.clone() if primal_or_dual == "primal" and self.loss is not None else None return LagrangianStore(lagrangian=lagrangian) lagrangian = self.loss.clone() if primal_or_dual == "primal" and self.loss is not None else 0.0 - multiplier_values = {} penalty_coefficient_values = {} + for constraint, constraint_state in contributing_constraints.items(): contribution_store = constraint.compute_contribution_to_lagrangian(constraint_state, primal_or_dual) if contribution_store is not None: lagrangian = lagrangian + contribution_store.lagrangian_contribution - multiplier_values[constraint] = contribution_store.multiplier_value if contribution_store.penalty_coefficient_value is not None: penalty_coefficient_values[constraint] = contribution_store.penalty_coefficient_value @@ -103,10 +100,10 @@ def compute_dual_lagrangian(self) -> LagrangianStore: """Computes and accumulates the dual-differentiable Lagrangian based on the contribution of the observed constraints. - Note: The dual Lagrangian contained in `LagrangianStore.lagrangian` ignores the + The dual Lagrangian contained in ``LagrangianStore.lagrangian`` ignores the contribution of the loss, since the objective function does not depend on the - dual variables. Therefore, `LagrangianStore.lagrangian == 0` regardless of - the value of `self.loss`. + dual variables. Therefore, ``LagrangianStore.lagrangian == 0`` regardless of + the value of ``self.loss``. """ return self._compute_primal_or_dual_lagrangian(primal_or_dual="dual") @@ -134,7 +131,7 @@ def __init__(self) -> None: self._constraints = OrderedDict() def _register_constraint(self, name: str, constraint: Constraint) -> None: - """Registers a constraint with the CMP. + """Registers a constraint with the ``cmp``. Args: name: Name of the constraint. @@ -148,32 +145,32 @@ def _register_constraint(self, name: str, constraint: Constraint) -> None: self._constraints[name] = constraint def constraints(self) -> Iterator[Constraint]: - """Return an iterator over the registered constraints of the CMP.""" + """Return an iterator over the registered constraints of the ``cmp``.""" yield from self._constraints.values() def named_constraints(self) -> Iterator[tuple[str, Constraint]]: - """Return an iterator over the registered constraints of the CMP, yielding - tuples of the form `(constraint_name, constraint)`. 
+        """Return an iterator over the registered constraints of the ``cmp``, yielding
+        tuples of the form ``(constraint_name, constraint)``.
         """
         yield from self._constraints.items()

     def multipliers(self) -> Iterator[Multiplier]:
         """Returns an iterator over the multipliers associated with the registered
-        constraints of the CMP.
+        constraints of the ``cmp``.
         """
         for constraint in self.constraints():
             yield constraint.multiplier

     def named_multipliers(self) -> Iterator[tuple[str, Multiplier]]:
         """Returns an iterator over the multipliers associated with the registered
-        constraints of the CMP, yielding tuples of the form `(constraint_name, multiplier)`.
+        constraints of the ``cmp``, yielding tuples of the form ``(constraint_name, multiplier)``.
         """
         for constraint_name, constraint in self.named_constraints():
             yield constraint_name, constraint.multiplier

     def penalty_coefficients(self) -> Iterator[PenaltyCoefficient]:
         """Returns an iterator over the penalty coefficients associated with the
-        registered constraints of the CMP. Constraints without penalty coefficients
+        registered constraints of the ``cmp``. Constraints without penalty coefficients
         are skipped.
         """
         for constraint in self.constraints():
@@ -182,8 +179,8 @@ def named_penalty_coefficients(self) -> Iterator[tuple[str, PenaltyCoefficient]]:
         """Returns an iterator over the penalty coefficients associated with the
-        registered constraints of the CMP, yielding tuples of the form
-        `(constraint_name, penalty_coefficient)`. Constraints without penalty
+        registered constraints of the ``cmp``, yielding tuples of the form
+        ``(constraint_name, penalty_coefficient)``. Constraints without penalty
         coefficients are skipped.
         """
         for constraint_name, constraint in self.named_constraints():
@@ -192,15 +189,20 @@ def dual_parameters(self) -> Iterator[torch.nn.Parameter]:
         """Return an iterator over the parameters of the multipliers associated with the
-        registered constraints of the CMP. This method is useful for instantiating the
+        registered constraints of the ``cmp``. This method is useful for instantiating the
         dual optimizers.

         If a multiplier is shared by several constraints, we only return its parameters once.
         """
         for multiplier in {constraint.multiplier for constraint in self.constraints()}:
             yield from multiplier.parameters()

+    def to_(self, *args: Any, **kwargs: Any) -> None:
+        # TODO: document, test
+        for constraint in self.constraints():
+            constraint.multiplier = constraint.multiplier.to(*args, **kwargs)
+
     def state_dict(self) -> dict:
-        """Returns the state of the CMP. This includes the state of the multipliers and penalty coefficients."""
+        """Returns the state of the ``cmp``. This includes the state of the multipliers and penalty coefficients."""
         state_dict = {
             "multipliers": {name: multiplier.state_dict() for name, multiplier in self.named_multipliers()},
             "penalty_coefficients": {name: pc.state_dict() for name, pc in self.named_penalty_coefficients()},
@@ -208,10 +210,10 @@
         return state_dict

     def load_state_dict(self, state_dict: dict) -> None:
-        """Loads the state of the CMP. This includes the state of the multipliers and penalty coefficients.
+        """Loads the state of the ``cmp``. This includes the state of the multipliers and penalty coefficients.

         Args:
-            state_dict: A state dictionary containing the state of the CMP.
+            state_dict: A state dictionary containing the state of the ``cmp``.
         """
         for name, multiplier_state_dict in state_dict["multipliers"].items():
             self._constraints[name].multiplier.load_state_dict(multiplier_state_dict)
@@ -248,37 +250,43 @@ def __repr__(self) -> str:

     @abc.abstractmethod
     def compute_cmp_state(self, *args: Any, **kwargs: Any) -> CMPState:
-        """Computes the state of the CMP based on the current value of the primal
+        """Computes the state of the ``cmp`` based on the current value of the primal
         parameters.

-        The signature of this abstract function may be changed to accommodate situations
+        The signature of this function may be changed to accommodate situations
         that require a model, (mini-batched) inputs/targets, or other arguments to be
         passed.
-
-        Structuring the CMP class around this method, enables the re-use of shared
-        sections of a computational graph. For example, consider a case where we want to
-        minimize a model's cross entropy loss subject to a constraint on the entropy of
-        its predictions. Both of these quantities depend on the predicted logits (on a
-        minibatch). This closure-centric design allows flexible problem specifications
-        while avoiding re-computation.
         """

-    def compute_violations(self, *args: Any, **kwargs: Any) -> CMPState:
-        """Computes the violation of (a subset of) the constraints of the CMP based on
-        the current value of the primal parameters. This function returns a
-        :py:class:`cooper.problem.CMPState` collecting the values of the observed
-        constraints. Note that the returned ``CMPState`` may have ``loss=None`` since,
-        by design, the value of the loss is not necessarily computed when evaluating
-        `only` the constraints.
+    def sanity_check_cmp_state(self, cmp_state: CMPState) -> None:
+        if cmp_state.loss is not None and cmp_state.loss.grad is None:
+            raise ValueError("The loss tensor must have a valid gradient.")
+
+        for constraint, constraint_state in cmp_state.observed_constraints.items():
+            if constraint_state.violation.grad is None:
+                raise ValueError(f"The violation tensor of constraint {constraint} must have a valid gradient.")
+            if constraint_state.strict_violation is not None and constraint_state.strict_violation.grad is not None:
+                raise ValueError(
+                    f"The strict violation tensor of constraint {constraint} has a non-null gradient: "
+                    f"{constraint_state.strict_violation.grad}."
+                )

-        The signature of this "abstract" function may be changed to accommodate
-        situations that require a model, (mini-batched) inputs/targets, or other
-        arguments to be passed.
+    def compute_violations(self, *args: Any, **kwargs: Any) -> CMPState:
+        """Computes the violation of the constraints of the ``cmp`` based on the current
+        value of the primal parameters. This function returns a
+        :py:class:`~.CMPState` collecting the values of the observed
+        constraints. Note that the returned :py:class:`~.CMPState` may have
+        ``loss=None`` since, by design, the value of the loss is not necessarily
+        computed when evaluating `only` the constraints.
+
+        The signature of this function may be changed to accommodate situations
+        that require a model, (mini-batched) inputs/targets, or other arguments to be
+        passed.

         Depending on the problem at hand, the computation of the constraints can be
         compartmentalized in a way that is independent of the evaluation of the loss.
- Alternatively, :py:meth:`~.ConstrainedMinimizationProblem.compute_violations` - may be called during the execution of the + In such cases, :py:meth:`~.ConstrainedMinimizationProblem.compute_violations` may + be called during the execution of the :py:meth:`~.ConstrainedMinimizationProblem.compute_cmp_state` method. """ raise NotImplementedError diff --git a/src/cooper/constraints/constraint.py b/src/cooper/constraints/constraint.py index 1694cb47..53534b42 100644 --- a/src/cooper/constraints/constraint.py +++ b/src/cooper/constraints/constraint.py @@ -7,9 +7,19 @@ class Constraint: - """Constraint.""" - - # TODO(gallego-posada): Add documentation + """This class is used to define a constraint in the optimization problem. + + Args: + constraint_type: One of :py:class:`cooper.ConstraintType.EQUALITY` or + :py:class:`cooper.ConstraintType.INEQUALITY`. + multiplier: The Lagrange multiplier associated with the constraint. + formulation_type: The type of formulation for the constrained optimization + problem. Must be a subclass of :py:class:`~cooper.formulations.Formulation`. + The default is :py:class:`~cooper.formulations.LagrangianFormulation`. + penalty_coefficient: The penalty coefficient used to penalize the constraint + violation. This is only used for some formulations, such as the + :py:class:`~cooper.formulations.AugmentedLagrangianFormulation`. + """ def __init__( self, diff --git a/src/cooper/constraints/constraint_state.py b/src/cooper/constraints/constraint_state.py index fa277247..45aacdbd 100644 --- a/src/cooper/constraints/constraint_state.py +++ b/src/cooper/constraints/constraint_state.py @@ -6,7 +6,7 @@ @dataclass class ConstraintState: - """State of a constraint describing the current constraint violation. + r"""State of a constraint, including the current constraint violation. Args: violation: Measurement of the constraint violation at some value of the primal @@ -14,28 +14,29 @@ class ConstraintState: primal parameters. constraint_features: The features of the (differentiable) constraint. This is used to evaluate the Lagrange multiplier associated with a constraint. - For example, an `IndexedMultiplier` expects the indices of the constraints - whose Lagrange multipliers are to be retrieved; while an - `ImplicitMultiplier` expects general tensor-valued features for the - constraints. This field is not used for `DenseMultiplier`//s. - This can be used in conjunction with an `IndexedMultiplier` to indicate the - measurement of the violation for only a subset of the constraints within a - `Constraint`. + For example, an :py:class:`~cooper.multipliers.IndexedMultiplier` expects the + indices of the constraints whose Lagrange multipliers are to be retrieved; + while an :py:class:`~cooper.multipliers.ImplicitMultiplier` expects general + tensor-valued features for the constraints. This can be used in + conjunction with an :py:class:`~cooper.multipliers.IndexedMultiplier` to + indicate the measurement of the violation for only a subset of the + constraints within a :py:class:`~cooper.constraints.Constraint`. + This field is ignored for :py:class:`~cooper.multipliers.DenseMultiplier`\\s. strict_violation: Measurement of the constraint violation which may be non-differentiable with respect to the primal parameters. 
When provided,
-            the (necessarily differentiable) `violation` is used to compute the gradient
+            the (necessarily differentiable) ``violation`` is used to compute the gradient
             of the Lagrangian with respect to the primal parameters, while the
-            `strict_violation` is used to compute the gradient of the Lagrangian with
+            ``strict_violation`` is used to compute the gradient of the Lagrangian with
             respect to the dual parameters. For more details, see the proxy-constraint
             proposal of :cite:t:`cotter2019proxy`.
         strict_constraint_features: The features of the (possibly non-differentiable)
-            constraint. For more details, see `constraint_features`.
-        contributes_to_primal_update: When `False`, we ignore the contribution of the
-            current observed constraint violation towards the primal Lagrangian, but
-            keep their contribution to the dual Lagrangian. In other words, the observed
-            violations affect the update for the dual variables but not the update for
-            the primal variables.
-        contributes_to_dual_update: When `False`, we ignore the contribution of the
+            constraint. For more details, see ``constraint_features``.
+        contributes_to_primal_update: When ``False``, we ignore the contribution of the
+            current observed constraint violation towards the **primal** Lagrangian, but
+            keep its contribution to the **dual** Lagrangian. In other words, the
+            observed violations affect the update for the dual variables but not the
+            update for the primal variables.
+        contributes_to_dual_update: When ``False``, we ignore the contribution of the
             current observed constraint violation towards the dual Lagrangian, but
             keep its contribution to the primal Lagrangian. In other words, the
             observed violations affect the update for the primal variables but not
             the update
@@ -52,13 +53,13 @@ class ConstraintState:

     def __post_init__(self) -> None:
         if self.strict_constraint_features is not None and self.strict_violation is None:
-            raise ValueError("strict_violation must be provided if strict_constraint_features is provided.")
+            raise ValueError("`strict_violation` must be provided if `strict_constraint_features` is provided.")

     def extract_violations(self, do_unsqueeze: bool = True) -> tuple[torch.Tensor, torch.Tensor]:
-        """Extracts the violation and strict violation from the constraint state. If
-        strict violations are not provided, patches them with the violation.
-        This function also unsqueeze the violation tensors to ensure thay have at least
-        1-dimension.
+        """Extracts the violation and strict violation from the constraint state as a
+        tuple. If strict violations are not provided, this function returns
+        `violation, violation`. If `do_unsqueeze` is set to `True`, this function also
+        unsqueezes the violation tensors to ensure they have at least one dimension.
         """

         violation = self.violation
@@ -66,7 +67,8 @@ def extract_violations(self, do_unsqueeze: bool = True) -> tuple[torch.Tensor, t

         if do_unsqueeze:
             # If the violation is a scalar, we unsqueeze it to ensure that it has at
-            # least one dimension for using einsum.
+            # least one dimension. This is important since we use einsum to compute the
+            # contribution of the constraint to the Lagrangian.
             if violation.dim() == 0:
                 violation = violation.unsqueeze(0)
             if strict_violation.dim() == 0:
@@ -75,11 +77,9 @@ def extract_violations(self, do_unsqueeze: bool = True) -> tuple[torch.Tensor, t

         return violation, strict_violation

     def extract_constraint_features(self) -> tuple[torch.Tensor, torch.Tensor]:
-        """Extracts the constraint features from the constraint state.
-        If strict constraint features are not provided, attempts to patch them with the
-        differentiable constraint features. Similarly, if differentiable constraint
-        features are not provided, attempts to patch them with the strict constraint
-        features.
+        """Extracts the constraint features from the constraint state as a
+        tuple. If strict constraint features are not provided, this function returns
+        `constraint_features, constraint_features`.
         """

         constraint_features = self.constraint_features
diff --git a/src/cooper/formulations/formulations.py b/src/cooper/formulations/formulations.py
index 03067276..70459ed2 100644
--- a/src/cooper/formulations/formulations.py
+++ b/src/cooper/formulations/formulations.py
@@ -19,6 +19,8 @@ class Formulation(abc.ABC):
     """Formulations prescribe how the different constraints contribute to the primal- and
     dual-differentiable Lagrangians. In other words, they define how the constraints
     affect the gradients of the Lagrangian with respect to the primal and dual variables.
+
+    The ``expects_penalty_coefficient`` attribute indicates whether the formulation requires a penalty coefficient.
     """

     expects_penalty_coefficient: bool
@@ -69,6 +71,7 @@ def compute_contribution_to_dual_lagrangian(
         multiplier: Multiplier,
         penalty_coefficient: Optional[PenaltyCoefficient],
     ) -> Optional[ContributionStore]:
+        """Computes the contribution of the observed constraint to the dual Lagrangian."""
         if not constraint_state.contributes_to_dual_update:
             return None
@@ -86,7 +89,8 @@ def compute_contribution_to_dual_lagrangian(

     @abc.abstractmethod
     def compute_contribution_to_primal_lagrangian(self, *args: Any, **kwargs: Any) -> Optional[ContributionStore]:
-        pass
+        """Computes the contribution of the observed constraint to the primal Lagrangian."""
+        return NotImplemented

 class LagrangianFormulation(Formulation):
@@ -118,7 +122,14 @@ class AugmentedLagrangianFormulation(Formulation):
     """Implements the Augmented Lagrangian formulation.

     .. warning::
-        The dual optimizers must all be SGD with a ``lr=1.0`` and ``maximize=True``.
+        The dual optimizers must all be SGD with ``lr=1.0`` and ``maximize=True`` to
+        replicate the updates of the Augmented Lagrangian *method*.
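+
+        For instance, a dual optimizer satisfying this requirement could be
+        instantiated as follows (an illustrative sketch, using the
+        ``dual_parameters`` helper of the ``cmp``)::
+
+            dual_optimizer = torch.optim.SGD(cmp.dual_parameters(), lr=1.0, maximize=True)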
""" expects_penalty_coefficient = True diff --git a/testing/__init__.py b/testing/__init__.py new file mode 100644 index 00000000..3f91c369 --- /dev/null +++ b/testing/__init__.py @@ -0,0 +1,8 @@ +from .cooper_helpers import ( + AlternationType, + SquaredNormLinearCMP, + build_cooper_optimizer, + build_dual_optimizers, + build_primal_optimizers, +) +from .utils import frozen_rand_generator, validate_state_dicts diff --git a/tests/helpers/cooper_test_utils.py b/testing/cooper_helpers.py similarity index 100% rename from tests/helpers/cooper_test_utils.py rename to testing/cooper_helpers.py diff --git a/tests/helpers/testing_utils.py b/testing/utils.py similarity index 100% rename from tests/helpers/testing_utils.py rename to testing/utils.py diff --git a/tests/constraints/test_constraint_state.py b/tests/constraints/test_constraint_state.py index 21c26ea4..03f7956d 100644 --- a/tests/constraints/test_constraint_state.py +++ b/tests/constraints/test_constraint_state.py @@ -4,7 +4,7 @@ import torch import cooper -from tests.helpers import testing_utils +import testing @pytest.fixture(params=[1, 10]) @@ -14,7 +14,7 @@ def num_constraints(request): @pytest.fixture def violation(num_constraints): - violation = torch.randn(num_constraints, generator=testing_utils.frozen_rand_generator(0)) + violation = torch.randn(num_constraints, generator=testing.frozen_rand_generator(0)) if num_constraints == 1: violation.squeeze_() return violation @@ -22,7 +22,7 @@ def violation(num_constraints): @pytest.fixture def strict_violation(num_constraints): - strict_violation = torch.randn(num_constraints, generator=testing_utils.frozen_rand_generator(1)) + strict_violation = torch.randn(num_constraints, generator=testing.frozen_rand_generator(1)) if num_constraints == 1: strict_violation.squeeze_() return strict_violation @@ -30,12 +30,12 @@ def strict_violation(num_constraints): @pytest.fixture def constraint_features(num_constraints): - return torch.randperm(num_constraints, generator=testing_utils.frozen_rand_generator(2)) + return torch.randperm(num_constraints, generator=testing.frozen_rand_generator(2)) @pytest.fixture def strict_constraint_features(num_constraints): - return torch.randperm(num_constraints, generator=testing_utils.frozen_rand_generator(3)) + return torch.randperm(num_constraints, generator=testing.frozen_rand_generator(3)) @pytest.fixture(params=[True, False]) @@ -87,7 +87,7 @@ def test_constraint_state_initialization( def test_constraint_state_initialization_failure(violation, strict_constraint_features): with pytest.raises( - ValueError, match="strict_violation must be provided if strict_constraint_features is provided." + ValueError, match="`strict_violation` must be provided if `strict_constraint_features` is provided." 
): cooper.ConstraintState(violation=violation, strict_constraint_features=strict_constraint_features) diff --git a/tests/helpers/__init__.py b/tests/helpers/__init__.py deleted file mode 100644 index e69de29b..00000000 diff --git a/tests/multipliers/conftest.py b/tests/multipliers/conftest.py index 69f7252b..62032d74 100644 --- a/tests/multipliers/conftest.py +++ b/tests/multipliers/conftest.py @@ -2,7 +2,7 @@ import torch import cooper -from tests.helpers import testing_utils +import testing @pytest.fixture @@ -27,7 +27,7 @@ def multiplier_class(request): @pytest.fixture def init_multiplier_tensor(constraint_type, num_constraints, random_seed): - generator = testing_utils.frozen_rand_generator(random_seed) + generator = testing.frozen_rand_generator(random_seed) raw_init = torch.randn(num_constraints, generator=generator) if constraint_type == cooper.ConstraintType.INEQUALITY: return raw_init.relu() diff --git a/tests/multipliers/test_explicit_multipliers.py b/tests/multipliers/test_explicit_multipliers.py index abe2b192..7819d672 100644 --- a/tests/multipliers/test_explicit_multipliers.py +++ b/tests/multipliers/test_explicit_multipliers.py @@ -5,7 +5,7 @@ import torch import cooper -from tests.helpers import testing_utils +import testing def evaluate_multiplier(multiplier, all_indices): @@ -126,7 +126,7 @@ def test_ineq_post_step_(constraint_type, multiplier_class, init_multiplier_tens def check_save_load_state_dict(multiplier, explicit_multiplier_class, num_constraints, random_seed): - generator = testing_utils.frozen_rand_generator(random_seed) + generator = testing.frozen_rand_generator(random_seed) multiplier_init = torch.randn(num_constraints, generator=generator) new_multiplier = explicit_multiplier_class(init=multiplier_init) @@ -134,7 +134,7 @@ def check_save_load_state_dict(multiplier, explicit_multiplier_class, num_constr # Save to file to force reading from file so we can ensure correct loading with tempfile.TemporaryDirectory() as tmpdirname: torch.save(multiplier.state_dict(), os.path.join(tmpdirname, "multiplier.pt")) - state_dict = torch.load(os.path.join(tmpdirname, "multiplier.pt")) + state_dict = torch.load(os.path.join(tmpdirname, "multiplier.pt"), weights_only=True) new_multiplier.load_state_dict(state_dict) diff --git a/tests/multipliers/test_penalty_coefficients.py b/tests/multipliers/test_penalty_coefficients.py index 20d17057..e11c86c2 100644 --- a/tests/multipliers/test_penalty_coefficients.py +++ b/tests/multipliers/test_penalty_coefficients.py @@ -1,8 +1,8 @@ import pytest import torch +import testing from cooper import multipliers -from tests.helpers import testing_utils @pytest.fixture(params=[1, 100]) @@ -12,7 +12,7 @@ def num_constraints(request): @pytest.fixture def init_tensor(num_constraints): - generator = testing_utils.frozen_rand_generator() + generator = testing.frozen_rand_generator() return torch.rand(num_constraints, generator=generator) diff --git a/tests/pipeline/conftest.py b/tests/pipeline/conftest.py index 9738dfc4..0a61a268 100644 --- a/tests/pipeline/conftest.py +++ b/tests/pipeline/conftest.py @@ -4,8 +4,7 @@ import torch import cooper -from cooper.multipliers import MultiplicativePenaltyCoefficientUpdater -from tests.helpers import cooper_test_utils +import testing PRIMAL_LR = 3e-2 DUAL_LR = 2e-1 @@ -75,13 +74,13 @@ def extrapolation(request, formulation_type): @pytest.fixture( params=[ - cooper_test_utils.AlternationType.FALSE, - cooper_test_utils.AlternationType.PRIMAL_DUAL, - cooper_test_utils.AlternationType.DUAL_PRIMAL, + 
testing.AlternationType.FALSE, + testing.AlternationType.PRIMAL_DUAL, + testing.AlternationType.DUAL_PRIMAL, ] ) def alternation_type(request, extrapolation, formulation_type): - is_alternation = request.param != cooper_test_utils.AlternationType.FALSE + is_alternation = request.param != testing.AlternationType.FALSE if extrapolation and is_alternation: pytest.skip("Extrapolation is only supported for simultaneous updates.") @@ -92,7 +91,7 @@ def alternation_type(request, extrapolation, formulation_type): @pytest.fixture def unconstrained_cmp(device, num_variables): - cmp = cooper_test_utils.SquaredNormLinearCMP(num_variables=num_variables, device=device) + cmp = testing.SquaredNormLinearCMP(num_variables=num_variables, device=device) return cmp @@ -151,18 +150,16 @@ def cmp( cmp_kwargs[f"{prefix}_formulation_type"] = formulation_type cmp_kwargs[f"{prefix}_penalty_coefficient_type"] = penalty_coefficient_type - cmp = cooper_test_utils.SquaredNormLinearCMP(**cmp_kwargs) + cmp = testing.SquaredNormLinearCMP(**cmp_kwargs) return cmp @pytest.fixture def cooper_optimizer_no_constraint(unconstrained_cmp, params): - primal_optimizers = cooper_test_utils.build_primal_optimizers( + primal_optimizers = testing.build_primal_optimizers( params, primal_optimizer_kwargs=[{"lr": PRIMAL_LR} for _ in range(len(params))] ) - cooper_optimizer = cooper_test_utils.build_cooper_optimizer( - cmp=unconstrained_cmp, primal_optimizers=primal_optimizers - ) + cooper_optimizer = testing.build_cooper_optimizer(cmp=unconstrained_cmp, primal_optimizers=primal_optimizers) return cooper_optimizer @@ -173,11 +170,11 @@ def cooper_optimizer( primal_optimizer_kwargs = [{"lr": PRIMAL_LR}] if use_multiple_primal_optimizers: primal_optimizer_kwargs.append({"lr": 10 * PRIMAL_LR, "betas": (0.0, 0.0), "eps": 10.0}) - primal_optimizers = cooper_test_utils.build_primal_optimizers( + primal_optimizers = testing.build_primal_optimizers( params, extrapolation, primal_optimizer_kwargs=primal_optimizer_kwargs ) - cooper_optimizer = cooper_test_utils.build_cooper_optimizer( + cooper_optimizer = testing.build_cooper_optimizer( cmp=cmp, primal_optimizers=primal_optimizers, extrapolation=extrapolation, @@ -193,7 +190,7 @@ def cooper_optimizer( def penalty_updater(formulation_type): if formulation_type != cooper.AugmentedLagrangianFormulation: return None - penalty_updater = MultiplicativePenaltyCoefficientUpdater( + penalty_updater = cooper.multipliers.MultiplicativePenaltyCoefficientUpdater( growth_factor=PENALTY_GROWTH_FACTOR, violation_tolerance=PENALTY_VIOLATION_TOLERANCE ) return penalty_updater diff --git a/tests/pipeline/test_checkpoint.py b/tests/pipeline/test_checkpoint.py index 3058d9b7..8ef98141 100644 --- a/tests/pipeline/test_checkpoint.py +++ b/tests/pipeline/test_checkpoint.py @@ -7,7 +7,7 @@ import torch import cooper -from tests.helpers import cooper_test_utils, testing_utils +import testing DUAL_LR = 1e-2 @@ -33,7 +33,7 @@ def construct_cmp(multiplier_type, num_constraints, num_variables, device): A = torch.randn(num_constraints, num_variables, device=device, generator=generator) b = torch.randn(num_constraints, device=device, generator=generator) - return cooper_test_utils.SquaredNormLinearCMP( + return testing.SquaredNormLinearCMP( num_variables=num_variables, has_ineq_constraint=True, ineq_multiplier_type=multiplier_type, @@ -53,8 +53,8 @@ def test_checkpoint(multiplier_type, use_multiple_primal_optimizers, num_constra cmp = construct_cmp(multiplier_type, num_constraints, num_variables, device) - primal_optimizers = 
cooper_test_utils.build_primal_optimizers(list(model.parameters())) - cooper_optimizer = cooper_test_utils.build_cooper_optimizer( + primal_optimizers = testing.build_primal_optimizers(list(model.parameters())) + cooper_optimizer = testing.build_cooper_optimizer( cmp=cmp, primal_optimizers=primal_optimizers, dual_optimizer_kwargs={"lr": DUAL_LR} ) cooper_optimizer_class = type(cooper_optimizer) @@ -77,9 +77,9 @@ def test_checkpoint(multiplier_type, use_multiple_primal_optimizers, num_constra del cooper_optimizer_state_dict_100 del cmp_state_dict_100 - model_state_dict_100 = torch.load(os.path.join(tmpdirname, "model.pt")) - cooper_optimizer_state_dict_100 = torch.load(os.path.join(tmpdirname, "cooper_optimizer.pt")) - cmp_state_dict_100 = torch.load(os.path.join(tmpdirname, "cmp.pt")) + model_state_dict_100 = torch.load(os.path.join(tmpdirname, "model.pt"), weights_only=True) + cooper_optimizer_state_dict_100 = torch.load(os.path.join(tmpdirname, "cooper_optimizer.pt"), weights_only=True) + cmp_state_dict_100 = torch.load(os.path.join(tmpdirname, "cmp.pt"), weights_only=True) # ------------ Train for *another* 100 steps ------------ for _ in range(100): @@ -100,10 +100,10 @@ def test_checkpoint(multiplier_type, use_multiple_primal_optimizers, num_constra loaded_model.load_state_dict(model_state_dict_100) loaded_model.to(device=device) - loaded_primal_optimizers = cooper_test_utils.build_primal_optimizers(list(loaded_model.parameters())) + loaded_primal_optimizers = testing.build_primal_optimizers(list(loaded_model.parameters())) loaded_dual_optimizers = None if any(new_cmp.constraints()): - loaded_dual_optimizers = cooper_test_utils.build_dual_optimizers( + loaded_dual_optimizers = testing.build_dual_optimizers( dual_parameters=new_cmp.dual_parameters(), dual_optimizer_kwargs={"lr": DUAL_LR} ) @@ -120,6 +120,6 @@ def test_checkpoint(multiplier_type, use_multiple_primal_optimizers, num_constra # ------------ Compare checkpoint and loaded-then-trained objects ------------ # Compare 0-200 state_dicts versus the 0-100;100-200 state_dicts - assert testing_utils.validate_state_dicts(loaded_model.state_dict(), model_state_dict_200) - assert testing_utils.validate_state_dicts(loaded_cooper_optimizer.state_dict(), cooper_optimizer_state_dict_200) - assert testing_utils.validate_state_dicts(new_cmp.state_dict(), cmp_state_dict_200) + assert testing.validate_state_dicts(loaded_model.state_dict(), model_state_dict_200) + assert testing.validate_state_dicts(loaded_cooper_optimizer.state_dict(), cooper_optimizer_state_dict_200) + assert testing.validate_state_dicts(new_cmp.state_dict(), cmp_state_dict_200) diff --git a/tests/pipeline/test_convergence.py b/tests/pipeline/test_convergence.py index 8cdc0650..d19e2526 100644 --- a/tests/pipeline/test_convergence.py +++ b/tests/pipeline/test_convergence.py @@ -1,6 +1,6 @@ import torch -from tests.helpers import cooper_test_utils +import testing def test_convergence_no_constraint(unconstrained_cmp, params, cooper_optimizer_no_constraint): @@ -21,7 +21,7 @@ def test_convergence_with_constraint( for _ in range(2000): roll_kwargs = {"compute_cmp_state_kwargs": {"x": torch.cat(params)}} - if alternation_type == cooper_test_utils.AlternationType.PRIMAL_DUAL: + if alternation_type == testing.AlternationType.PRIMAL_DUAL: roll_kwargs["compute_violations_kwargs"] = {"x": torch.cat(params)} roll_out = cooper_optimizer.roll(**roll_kwargs) diff --git a/tests/pipeline/test_manual.py b/tests/pipeline/test_manual.py index a1c115a5..3c42fa0d 100644 --- 
a/tests/pipeline/test_manual.py +++ b/tests/pipeline/test_manual.py @@ -5,8 +5,7 @@ import torch import cooper -from cooper.multipliers import MultiplicativePenaltyCoefficientUpdater -from tests.helpers import cooper_test_utils +import testing PRIMAL_LR = 3e-2 DUAL_LR = 2e-1 @@ -51,14 +50,14 @@ def test_manual_step(self, extrapolation, alternation_type): The manual implementation assumes Stochastic Gradient Descent (SGD) is used for both the primal and dual optimizers. """ - if alternation_type == cooper_test_utils.AlternationType.PRIMAL_DUAL and self.is_indexed_multiplier: + if alternation_type == testing.AlternationType.PRIMAL_DUAL and self.is_indexed_multiplier: pytest.skip("Cannot test IndexedMultiplier with PRIMAL_DUAL alternation.") x = torch.nn.Parameter(torch.ones(self.num_variables, device=self.device)) optimizer_class = cooper.optim.ExtraSGD if extrapolation else torch.optim.SGD primal_optimizers = optimizer_class([x], lr=PRIMAL_LR) - cooper_optimizer = cooper_test_utils.build_cooper_optimizer( + cooper_optimizer = testing.build_cooper_optimizer( cmp=self.cmp, primal_optimizers=primal_optimizers, extrapolation=extrapolation, @@ -70,12 +69,12 @@ def test_manual_step(self, extrapolation, alternation_type): penalty_updater = None if self.is_augmented_lagrangian: - penalty_updater = MultiplicativePenaltyCoefficientUpdater( + penalty_updater = cooper.multipliers.MultiplicativePenaltyCoefficientUpdater( growth_factor=PENALTY_GROWTH_FACTOR, violation_tolerance=PENALTY_VIOLATION_TOLERANCE ) roll_kwargs = {"compute_cmp_state_kwargs": {"x": x}} - if alternation_type == cooper_test_utils.AlternationType.PRIMAL_DUAL: + if alternation_type == testing.AlternationType.PRIMAL_DUAL: roll_kwargs["compute_violations_kwargs"] = {"x": x} manual_x = torch.ones(self.num_variables, device=self.device) @@ -90,7 +89,7 @@ def test_manual_step(self, extrapolation, alternation_type): if self.is_augmented_lagrangian: penalty_updater.step(roll_out.cmp_state.observed_constraints) - if alternation_type == cooper_test_utils.AlternationType.PRIMAL_DUAL: + if alternation_type == testing.AlternationType.PRIMAL_DUAL: observed_multipliers = torch.cat(list(roll_out.dual_lagrangian_store.observed_multiplier_values())) else: observed_multipliers = torch.cat(list(roll_out.primal_lagrangian_store.observed_multiplier_values())) @@ -197,7 +196,7 @@ def _dual_lagrangian(self, x, multiplier, strict_features, penalty_coeff=None): return torch.sum(penalty_coeff[strict_features] * multiplier[strict_features] * violation) def _update_penalty_coefficients(self, x, x_prev, strict_features, alternation_type, penalty_coeff): - if alternation_type == cooper_test_utils.AlternationType.PRIMAL_DUAL: + if alternation_type == testing.AlternationType.PRIMAL_DUAL: strict_violation = self._violation(x, strict=True)[strict_features] else: strict_violation = self._violation(x_prev, strict=True)[strict_features] @@ -263,10 +262,10 @@ def _extragradient_roll(self, x, multiplier, features, strict_features): def manual_roll(self, x, multiplier, features, strict_features, alternation_type, penalty_coeff, extrapolation): if extrapolation: return self._extragradient_roll(x, multiplier, features, strict_features) - if alternation_type == cooper_test_utils.AlternationType.FALSE: + if alternation_type == testing.AlternationType.FALSE: return self._simultaneous_roll(x, multiplier, features, strict_features) - if alternation_type == cooper_test_utils.AlternationType.DUAL_PRIMAL: + if alternation_type == testing.AlternationType.DUAL_PRIMAL: return 
self._dual_primal_roll(x, multiplier, features, strict_features, penalty_coeff) - if alternation_type == cooper_test_utils.AlternationType.PRIMAL_DUAL: + if alternation_type == testing.AlternationType.PRIMAL_DUAL: return self._primal_dual_roll(x, multiplier, features, strict_features, penalty_coeff) raise ValueError(f"Unknown alternation type: {alternation_type}") diff --git a/tests/setup.cfg b/tests/setup.cfg deleted file mode 100644 index 1f826159..00000000 --- a/tests/setup.cfg +++ /dev/null @@ -1,2 +0,0 @@ -[tool:pytest] -norecursedirs=tests/helpers diff --git a/tests/test_cmp.py b/tests/test_cmp.py index 6f43bc72..906232b8 100644 --- a/tests/test_cmp.py +++ b/tests/test_cmp.py @@ -76,7 +76,10 @@ def test_lagrangian_store_backward_none(): def test_lagrangian_store_observed_multiplier_values(eq_constraint, ineq_constraint): lagrangian_store = cooper.LagrangianStore( - multiplier_values={eq_constraint: torch.tensor(1.0), ineq_constraint: torch.tensor(2.0)} + multiplier_values={ + eq_constraint: torch.tensor(1.0), + ineq_constraint: torch.tensor(2.0), + } ) observed_values = list(lagrangian_store.observed_multiplier_values()) assert observed_values == [torch.tensor(1.0), torch.tensor(2.0)] @@ -84,7 +87,10 @@ def test_lagrangian_store_observed_multiplier_values(eq_constraint, ineq_constra def test_lagrangian_store_observed_penalty_coefficient_values(eq_constraint, ineq_constraint): lagrangian_store = cooper.LagrangianStore( - penalty_coefficient_values={eq_constraint: torch.tensor(3.0), ineq_constraint: torch.tensor(4.0)} + penalty_coefficient_values={ + eq_constraint: torch.tensor(3.0), + ineq_constraint: torch.tensor(4.0), + } ) observed_values = list(lagrangian_store.observed_penalty_coefficient_values()) assert observed_values == [torch.tensor(3.0), torch.tensor(4.0)] @@ -149,7 +155,8 @@ def test_observed_strict_violations(cmp_state, eq_constraint): def test_observed_constraint_features(cmp_state, eq_constraint): constraint_state = cooper.ConstraintState( - violation=torch.tensor(0.0), constraint_features=torch.tensor(0, dtype=torch.long) + violation=torch.tensor(0.0), + constraint_features=torch.tensor(0, dtype=torch.long), ) cmp_state.observed_constraints[eq_constraint] = constraint_state constraint_features = list(cmp_state.observed_constraint_features()) @@ -227,7 +234,10 @@ def test_cmp_named_penalty_coefficients(cmp_instance, eq_constraint): cmp_instance._register_constraint("test_constraint", eq_constraint) named_penalty_coefficients = list(cmp_instance.named_penalty_coefficients()) assert len(named_penalty_coefficients) == 1 - assert named_penalty_coefficients[0] == ("test_constraint", eq_constraint.penalty_coefficient) + assert named_penalty_coefficients[0] == ( + "test_constraint", + eq_constraint.penalty_coefficient, + ) def test_cmp_dual_parameters(cmp_instance, eq_constraint): @@ -275,3 +285,32 @@ def test_repr(cmp_instance, eq_constraint): repr_str = repr(cmp_instance) assert "test_constraint" in repr_str assert cmp_instance.__class__.__name__ in repr_str + + +def test_sanity_check_cmp_state_loss(cmp_instance): + with pytest.raises(ValueError, match="The loss tensor must have a valid gradient."): + cmp_instance.sanity_check_cmp_state(cooper.CMPState(loss=torch.tensor(1.0))) + + +def test_sanity_check_cmp_state_violation(cmp_instance, eq_constraint): + cmp_instance._register_constraint("test_constraint", eq_constraint) + with pytest.raises(ValueError, match="The violation tensor of constraint .*"): + cmp_instance.sanity_check_cmp_state( + cooper.CMPState( + 
observed_constraints={eq_constraint: cooper.ConstraintState(violation=torch.tensor(1.0))}, + ) + ) + + +def test_sanity_check_cmp_state_strict_violation(cmp_instance, eq_constraint): + cmp_instance._register_constraint("test_constraint", eq_constraint) + violation = torch.tensor(1.0, requires_grad=True) + violation.backward() + with pytest.raises(ValueError, match=".has a non-null gradient."): + cmp_instance.sanity_check_cmp_state( + cooper.CMPState( + observed_constraints={ + eq_constraint: cooper.ConstraintState(violation=violation, strict_violation=violation) + }, + ) + ) diff --git a/tox.ini b/tox.ini deleted file mode 100644 index 8c5ecef7..00000000 --- a/tox.ini +++ /dev/null @@ -1,40 +0,0 @@ -[tox] -minversion = 3.9.0 -envlist = python{3.9, 3.10}-torch{13}-{linux,macos,windows}, python{3.9, 3.10, 3.11}-torch{20, 21}-{linux,macos,windows}, lint, mypy -isolated_build = true - -[gh-actions] -python = - 3.9: python3.9 - 3.10: python3.10, lint, mypy - 3.11: python3.11 - -[gh-actions:env] -PLATFORM = - ubuntu-latest: linux - macos-latest: macos - windows-latest: windows - -[testenv] -setenv = - PYTHONPATH = {toxinidir} -extras = tests -whitelist_externals = pytest -deps = - torch13: torch == 1.13.1 - torch20: torch == 2.0.0 - torch21: torch == 2.1.1 -commands = - pytest --basetemp={envtmpdir} - -[testenv:lint] -basepython = python3.10 -extras = dev -commands = - flake8 cooper --count --exit-zero --statistics - black --check --diff . - isort cooper tutorials tests - -[testenv:mypy] -basepython = python3.10 -commands = mypy cooper