Task denoising method dca (#431)
* Create alra.py

Add alra.py (includes existing bug).

* pre-commit

* import alra

* pre-commit

* set alra version

* split up alra oneliner

* debug

* fix syntax error

* pre-commit

* use dgCMatrix

* output is stored in obsm

* remove prints

* pre-commit

* Update alra.py

add to_csr() to fix coo matrix error

* fix csr casting

* Update alra.py

try adding custom exception to catch shape mismatch from ALRA

* pre-commit

* add ValueError

* pre-commit

* simplify ValueError to avoid errors

* pre-commit

* cast to array for MSE

Now getting an error in MSE; it seems this was already the case with the earlier code, but attempting to fix it regardless.

* pre-commit

* separate error line functions

It seems something about ALRA is failing tests. Separate out the obsm call to get a cleaner traceback.

* Remove to_array()

* pre-commit

* try casting to a matrix one more time

* note that Wes' ALRA fork must be used instead

* pre-commit

* source from wes' code

* fix URL

* shorten line lengths

* Check output is ndarray

* Fix typo

* Return dense data

* don't need tocsr now that the data is dense

* Return directly to denoised

* code cleanup

* Revert debugging

* Don't edit adata.obsm['train']

* access train_norm

* Add warning about editing adata.obsm['train']

* pre-commit

* check train and test are not modified

* pre-commit

* Retry ALRA on failure

* pre-commit

* Switch t(as.matrix()) order

* Check dense data

* Return sparse data

* Check input data is sparse

* Fix typo

* pre-commit

* Don't send the full AnnData to R

* Expect sparse input, dense array output

* train and test must be floats

* Convert back to float

* Fail on final attempt

* put the retry inside python

* Remove the retry from R

* pre-commit

* layers['counts'] might not be sparse

* pre-commit

* Log error each time

* import logging

* pre-commit

* Better way to check matrices

* pre-commit

* fix array equal comparison

* add explicit comment

* More explicit toarray

* Can't check for untouched train/test

* Don't import scprep

* Just use a fixed target_sum

* Sample data should match API

* pre-commit

* flake8

* no_denoising still needs to densify

* convert to csc

* pre-commit

* Convert to csr

* conversion of sparse doesn't work, try anndata

* accept sce

* pre-commit

* Convert to dense

* pre-commit

* Convert to dense

* pre-commit

* Try `.tocsr()`

* Create dca.py

* pre-commit

* Create dca.py

* pre-commit

* add dca

* add dca

* Update dca.py

* Update dca.py

Update the import statement for DCA. Note that the main function, DCA(), might need to share its name with the file, i.e. if the function is DCA(), the file might need to be DCA.py.

* pre-commit

* Update dca.py

* Update dca.py

* Delete dca.py

* Update requirements.txt

* Update __init__.py

* pre-commit

* Update dca.py

Try just importing dca

* pre-commit

* Update dca.py

* pre-commit

* put dca import inside method

* pre-commit

* Update dca.py

* Update requirements.txt

* pre-commit

* Create README.md

* Update README.md

* Create Dockerfile

* Create requirements.txt

* pre-commit

* Create requirements.txt

* pre-commit

* remove dca from python-extras readme

* fix image specification

* remove dca from here

* Update Dockerfile

* pin dca 0.3*

used ==, uncertain if = would've sufficed

* Update dca.py

* Update __init__.py

* Update requirements.txt

* Update README.md

* Update README.md

* Update README.md

* Update requirements.txt

* Update `check_version` api

* Require pyyaml==5.4.1 to prevent kopt error

Due to yaml/pyyaml#576

* pre-commit

* Fix keras version

* Update dca.py

Remove scprep normalization commands.
Make an adata2 object, which is an AnnData built from just adata.obsm['train'].

* pre-commit

* Update dca.py

* pre-commit

* Update dca.py

* pre-commit

* Update dca.py

* pre-commit

* Add test args

* fix thread count and pass epochs to dca

* pre-commit

* add in masking

* pre-commit

* Update README.md

* Update README.md

* add removezeros and insert_at functions

* pre-commit

* Update dca.py

* pre-commit

* Remove zero counts from train data

* Remove filtering from DCA

* Remove unused code

* pre-commit

* Don't need a line break

* Update utils.py

* pre-commit

* Use epochs if passed

* Fix metric descriptions

* don't compute coverage on non-test args

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Scott Gigante <scott.gigante@immunai.com>
Co-authored-by: Scott Gigante <scottgigante@users.noreply.github.com>
Co-authored-by: Scott Gigante <84813314+scottgigante-immunai@users.noreply.github.com>
5 people committed Jul 13, 2022
1 parent aa73981 commit 903469a
Showing 7 changed files with 78 additions and 3 deletions.
15 changes: 15 additions & 0 deletions docker/openproblems-python-tf2.4/Dockerfile
@@ -0,0 +1,15 @@
FROM singlecellopenproblems/openproblems:latest

ARG NB_USER="sagemaker-user"
ARG NB_UID="1000"
ARG NB_GID="100"

USER root
WORKDIR /

# Install Python packages
COPY ./docker/openproblems-python-tf2.4/requirements.txt ./requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

USER $NB_UID
WORKDIR /home/$NB_USER
14 changes: 14 additions & 0 deletions docker/openproblems-python-tf2.4/README.md
@@ -0,0 +1,14 @@
# openproblems-python-tf2.4 Docker image

Base image: singlecellopenproblems/openproblems

OS: Debian Stretch

Python: 3.8

Python packages:


* keras >=2.4,<2.6
* tensorflow >=2.4,<2.5
* dca
4 changes: 4 additions & 0 deletions docker/openproblems-python-tf2.4/requirements.txt
@@ -0,0 +1,4 @@
dca==0.3.*
keras>=2.4,<2.6 # pinned in dca
pyyaml==5.4.1 # pinned in #431
tensorflow==2.4.* # pinned in dca
6 changes: 3 additions & 3 deletions openproblems/tasks/denoising/README.md
@@ -8,10 +8,10 @@ A key challenge in evaluating denoising methods is the general lack of a ground

# The metrics

Metrics for data denoising aim to
Metrics for data denoising aim to assess denoising accuracy by comparing the denoised *training* set to the randomly sampled *test* set.

* **TODO**: TODO
* **TODO**: TODO
* **MSE**: The mean squared error between the denoised counts of the training dataset and the true counts of the test dataset after reweighting by the train/test ratio.
* **Poisson**: The Poisson log likelihood of observing the true counts of the test dataset given the distribution in the denoised dataset.
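
For reference, a rough sketch of how the reweighted MSE described above might be computed (a hypothetical helper, not the task's actual metric implementation; it assumes dense arrays and the 0.9/0.1 molecule split used by `split_data`):

```python
# Hedged sketch of the reweighted MSE metric; the helper name and the exact
# reweighting are assumptions, not the repository's implementation.
import numpy as np


def mse_reweighted(denoised_train, test, train_fraction=0.9):
    # rescale the denoised training counts to the sequencing depth of the test split
    scale = (1 - train_fraction) / train_fraction
    return float(np.mean((np.asarray(denoised_train) * scale - np.asarray(test)) ** 2))
```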

## API

6 changes: 6 additions & 0 deletions openproblems/tasks/denoising/datasets/utils.py
@@ -26,6 +26,12 @@ def split_data(
    X_train, X_test = molecular_cross_validation.util.split_molecules(
        X, 0.9, 0.0, random_state
    )
    # remove genes with zero counts in the training split
    is_missing = X_train.sum(axis=0) == 0
    X_train, X_test = X_train[:, ~is_missing], X_test[:, ~is_missing]

    adata = adata[:, ~is_missing].copy()
    adata.obsm["train"] = scipy.sparse.csr_matrix(X_train).astype(float)
    adata.obsm["test"] = scipy.sparse.csr_matrix(X_test).astype(float)

    return adata
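
As a toy illustration of the zero-gene filtering added above (not part of the commit; it uses small dense arrays, whereas the real code operates on the molecule-split count matrices):

```python
# Hypothetical example of the is_missing filtering pattern used in split_data.
import numpy as np

X_train = np.array([[1, 0, 3], [2, 0, 0]])
X_test = np.array([[0, 0, 1], [1, 0, 2]])

is_missing = X_train.sum(axis=0) == 0  # gene 1 has no counts in the train split
X_train = X_train[:, ~is_missing]      # shape (2, 2)
X_test = X_test[:, ~is_missing]        # test keeps the same gene subset
```
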
1 change: 1 addition & 0 deletions openproblems/tasks/denoising/methods/__init__.py
@@ -1,4 +1,5 @@
from .alra import alra
from .dca import dca
from .magic import magic
from .magic import magic_approx
from .no_denoising import no_denoising
35 changes: 35 additions & 0 deletions openproblems/tasks/denoising/methods/dca.py
@@ -0,0 +1,35 @@
from ....tools.decorators import method
from ....tools.utils import check_version

import scanpy as sc


def _dca(adata, test=False, epochs=None):
    if test:
        epochs = epochs or 30
    else:  # pragma: nocover
        epochs = epochs or 300
    from dca.api import dca

    # make an AnnData object from the train counts
    adata_train = sc.AnnData(adata.obsm["train"])
    # run DCA
    dca(adata_train, epochs=epochs)

    # store the denoised matrix
    adata.obsm["denoised"] = adata_train.X
    # check version of dca
    adata.uns["method_code_version"] = check_version("dca")
    return adata


@method(
    method_name="DCA",
    paper_name="Single-cell RNA-seq denoising using a deep count autoencoder",
    paper_url="https://www.nature.com/articles/s41467-018-07931-2",
    paper_year=2019,
    code_url="https://github.com/theislab/dca",
    image="openproblems-python-tf2.4",
)
def dca(adata, test=False, epochs=None):
    return _dca(adata, test=test, epochs=epochs)
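
A hedged usage sketch of the wrapper above (hypothetical, not from the commit): it builds a small AnnData with raw counts in `obsm["train"]`, which is the only input the wrapper reads; actually running it requires the `dca` package and its TensorFlow 2.4 dependencies from the openproblems-python-tf2.4 image.

```python
# Hypothetical usage of the DCA denoising wrapper; data and shapes are made up.
import anndata as ad
import numpy as np
import scipy.sparse

counts = scipy.sparse.csr_matrix(
    np.random.poisson(1.0, size=(50, 100)).astype(float)
)
adata = ad.AnnData(X=counts)
adata.obsm["train"] = counts.copy()

# With dca installed, this would train for ~30 epochs (test mode) and write the
# denoised matrix to adata.obsm["denoised"]:
# adata = dca(adata, test=True)
```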
