Fully Integrate SCDL into Geneformer #480

savitha-eng · 2024-11-26T08:34:09Z

Summary

This PR refactors the Geneformer SingleCellDataset class to integrate SingleCellMemMapDataset (SCDL). The goal is to streamline the dataset class and make it more readable.

Details

We make the following changes:

Input Format:
- The SingleCellDataset now assumes that the input data path is a directory in the SingleCellMemmap format.
- The SingleCellDataModule now assumes that the train, val, and test input paths are directories in the SingleCellMemmap format.
Get Item:
- _get_item() now uses the get_row function from SCDL, which eliminates the need to store and parse information in metadata.json.
Error Handling for Genes not in the Tokenizer Vocabulary:
- We add an optional parameter, bypass_tokenizer_vocab, to SingleCellDataset and SingleCellDataModule; it defaults to False. By default, an error is raised if a gene ID is not in the tokenizer vocabulary. To bypass this check, set bypass_tokenizer_vocab to True.
Error Handling for Cells with No Gene Expression Values:
- We raise an invalid-input error when a cell has no gene expression values (i.e., sc_dataset.scdl.get_item() returns [] for the gene data value).
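
The two error-handling rules above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: tokenize_gene_ids and the dict-based vocab are hypothetical stand-ins, and silently skipping unknown genes when bypass_tokenizer_vocab=True is an assumption about what "bypass" means here.

```python
from typing import Dict, List, Sequence


def tokenize_gene_ids(
    gene_ids: Sequence[str],
    vocab: Dict[str, int],
    bypass_tokenizer_vocab: bool = False,
) -> List[int]:
    """Map gene IDs to token IDs (hypothetical helper, for illustration only)."""
    # A cell with no expressed genes is treated as invalid input.
    if len(gene_ids) == 0:
        raise ValueError("Invalid input: cell has no gene expression values.")

    token_ids: List[int] = []
    for gene_id in gene_ids:
        token = vocab.get(gene_id)
        if token is not None:
            token_ids.append(token)
        elif not bypass_tokenizer_vocab:
            # Default behavior: unknown gene IDs are an error.
            raise ValueError(f"Gene ID {gene_id!r} is not in the tokenizer vocabulary.")
        # Assumption: with bypass_tokenizer_vocab=True, unknown genes are skipped.
    return token_ids
```

With a vocabulary of {"ENSG1": 5}, calling tokenize_gene_ids(["ENSG1", "UNKNOWN"], vocab) raises, while passing bypass_tokenizer_vocab=True returns only the token for ENSG1.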

Usage

From a user perspective, the main change is that single-cell h5ad files (or directories of h5ad files) must be converted to the SingleCellMemmap format.

For a single h5ad file, e.g. data.h5ad, run the following, where output_path is the path the SingleCellMemmap directory should be written to:
SingleCellMemMapDataset(output_path, "data.h5ad")
For a directory of h5ad files, run the convert_h5ad_to_scdl script (more information is available in the SCDL README).
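
For the single-file case, a thin wrapper around the constructor call above might look like the following. This is a sketch under assumptions: convert_single_h5ad is a hypothetical helper, the import path bionemo.scdl.io.single_cell_memmap_dataset is assumed from the SCDL package layout and should be checked against the installed version, and refusing to reuse an existing output directory is a choice made for this sketch, not a behavior described in this PR.

```python
from pathlib import Path


def convert_single_h5ad(h5ad_path: str, output_path: str):
    """Convert one .h5ad file to a SingleCellMemmap directory (hypothetical helper)."""
    h5ad = Path(h5ad_path)
    out = Path(output_path)

    # Fail early with a clear message instead of deep inside the loader.
    if h5ad.suffix != ".h5ad" or not h5ad.is_file():
        raise FileNotFoundError(f"Expected an existing .h5ad file, got: {h5ad}")
    if out.exists():
        raise FileExistsError(f"Output path already exists: {out}")

    # Imported lazily so the checks above run even without bionemo-scdl installed.
    # Assumption: this module path matches the installed SCDL package.
    from bionemo.scdl.io.single_cell_memmap_dataset import SingleCellMemMapDataset

    # Same argument order as in the usage line above: output path first, then the h5ad file.
    return SingleCellMemMapDataset(str(out), str(h5ad))
```

The resulting directory can then be passed as the data path to SingleCellDataset, or as the train, val, and test paths to SingleCellDataModule.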

Testing

We test that the updated SingleCellDataset produces the same output as the old dataset on synthetic samples and samples from the cellxsmall dataset. We also test for Megatron compatibility (as this dataset uses the MultiEpochDatasampler / Epoch Index) and for correct error handling of the above cases.
Tests for these changes can be run via:

pytest -v sub-packages/bionemo-geneformer/tests/bionemo/geneformer/test_dataset.py

Note that we have also updated the following test files to use the MemMap dataset format and to set bypass_tokenizer_vocab=True, because the cellxsmall dataset has a few genes that are not in the HuggingFace tokenizer vocab and the tests would otherwise error:
sub-packages/bionemo-geneformer/tests/bionemo/geneformer/test_model.py
scripts/singlecell/geneformer/test_train.py
sub-packages/bionemo-geneformer/tests/bionemo/geneformer/test_stop_and_go.py

savitha-eng · 2024-11-26T08:35:01Z

/build-ci

savitha-eng · 2024-11-26T21:58:43Z

/build-ci

savitha-eng · 2024-12-02T21:51:23Z

/build-ci

sub-packages/bionemo-core/src/bionemo/core/data/resources/scdl.yaml

sub-packages/bionemo-core/src/bionemo/core/data/resources/single_cell.yaml

savitha-eng · 2024-12-04T21:44:35Z

/build-ci

savitha-eng · 2024-12-04T21:49:44Z

\build-ci

savitha-eng · 2024-12-05T00:07:56Z

/build-ci

savitha-eng · 2024-12-05T19:05:08Z

/build-ci

sub-packages/bionemo-scdl/tests/bionemo/scdl/conftest.py

sub-packages/bionemo-geneformer/tests/bionemo/geneformer/scripts/test_pydantic_train.py

sub-packages/bionemo-geneformer/src/bionemo/geneformer/scripts/infer_geneformer.py

sub-packages/bionemo-geneformer/src/bionemo/geneformer/data/singlecell/dataset.py

trvachov · 2024-12-05T20:31:28Z

Looks fine overall, except there's one CLI arg where the boolean logic could be better explained, I think.

polinabinder1 · 2024-12-13T23:42:19Z

/build-ci

polinabinder1 · 2024-12-19T21:14:10Z

/build-ci

polinabinder1 · 2024-12-20T00:34:32Z

/build-ci

Savitha Srinivasan and others added 11 commits November 20, 2024 14:47

Change RowFeatureIndex and RowFeatureIndex tests to use a list of dictionaries instead of Pandas dataframes in the feature array

08ef851

Update load_h5ad to append features in dict format to the row feature index

c9ff683

Modify Single Cell Memmap Dataset unit tests to reflect changes

5f86da4

remove conversion to np.array in get_row for now

896bad0

Convert values and col indices to np array so that we're not returning a np memmap array to users

eb17845

Revert conversion to np array, and refactor num_vars_at_row to use internal state

1663903

Merge branch 'main' into savitha/scdl-performance-improvements

0497c98

Made changes requested in review.

7a43706

Merge branch 'savitha/scdl-performance-improvements' of github.com:NVIDIA/bionemo-fw-ea into savitha/scdl-performance-improvements

da395b4

Integrate SCDL into Geneformer, rebased on the latest changes in main

9e11ab8

Merge branch 'main' into savitha/integrate-scdl-geneformer-rebased

37de5d1

Savitha Srinivasan added 4 commits November 26, 2024 00:38

Tests for Geneformer SingleCellDataset

546f84e

Merge branch 'savitha/integrate-scdl-geneformer-rebased' of github.com:NVIDIA/bionemo-fw-ea into savitha/integrate-scdl-geneformer-rebased

0dcc56b

Data directory fixtures needed for pytest

9d4c6a4

Add bypass_tokenize_vocab to the arguments for this script

eea6b42

Savitha Srinivasan and others added 4 commits December 2, 2024 13:39

Changes to Inference tutorial notebook to support SCDL integrated Geneformer

e642bc9

modify dataset dir creation

d755901

all scdl integration changes

507e31b

Merge branch 'main' into savitha/integrate-scdl-geneformer-rebased

f6d9380

savitha-eng marked this pull request as ready for review December 2, 2024 23:55

savitha-eng requested review from jstjohn, malcolmgreaves, skothenhill-nv, DejunL, dorotat-nv, farhadrgh and guoqing-zhou as code owners December 2, 2024 23:55

jstjohn reviewed Dec 3, 2024

sub-packages/bionemo-core/src/bionemo/core/data/resources/scdl.yaml (outdated, resolved)

jstjohn reviewed Dec 3, 2024

sub-packages/bionemo-core/src/bionemo/core/data/resources/single_cell.yaml (outdated, resolved)

Updated documentation, removed refs to sc_memmap, & made changes requested in review

9846894

Merge branch 'main' into savitha/integrate-scdl-geneformer-rebased

6afed04

Signed-off-by: savitha-eng <savithas@nvidia.com>

Merge branch 'main' into savitha/integrate-scdl-geneformer-rebased

2e18cbb

trvachov requested changes Dec 5, 2024

polinabinder1 and others added 3 commits December 11, 2024 15:11

Merge branch 'main' into savitha/integrate-scdl-geneformer-rebased

dc093e2

Signed-off-by: polinabinder1 <pbinder@nvidia.com>

Make CLI argument for checking token vocab more understandable

13ec59f

merge main

1e9b58a

trvachov approved these changes Dec 13, 2024

polinabinder1 added 5 commits December 18, 2024 12:00

merge main

452dc6c

adding fixed length

7bdddb0

notebook updates

71bddc3

adding correct notebook

fdfc18c

Merge branch 'main' into savitha/integrate-scdl-geneformer-rebased

40e34cc

polinabinder1 added 2 commits December 19, 2024 16:09

Merge branch 'main' into savitha/integrate-scdl-geneformer-rebased

1cbbf05

test case fixes

696d42c

polinabinder1 approved these changes Dec 20, 2024

polinabinder1 enabled auto-merge (squash) December 20, 2024 00:34

jstjohn approved these changes Dec 20, 2024

polinabinder1 merged commit 30527b1 into main Dec 20, 2024
4 checks passed

polinabinder1 deleted the savitha/integrate-scdl-geneformer-rebased branch December 20, 2024 22:59
