Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fully Integrate SCDL into Geneformer #480

Merged
merged 32 commits into from
Dec 20, 2024

Conversation

savitha-eng
Copy link
Collaborator

Summary

In this PR we refactor the Geneformer SingleCellDataset class to integrate the SingleCellMemmapDataset(SCDL). The goal of this is to streamline and increase readability of the dataset class.

Details

We make the following changes:

  • Input Format:
    • The SingleCellDataset now assumes that the input path to the data is a directory formatted in the SingleCellMemmap format.
    • The SingleCellModule now assumes that the train, val, and test input paths are to directories that are formatted in the SingleCellMemmap format
  • Get Item:
    • _get_item() now leverages the get_row function from SCDL (so we eliminate the need to store and parse information in metadata.json)
  • Error Handling for Genes not in the Tokenizer Vocabulary:
    • We add an optional parameter to SingleCellDataset and SingleCellDataModule called bypass_tokenizer_vocab which is by default False. So by default, we throw an error if a gene ID is not in the tokenizer vocabulary. If a user wants to bypass this, they can change bypass_tokenizer_vocab to True.
  • Error Handling for Genes with Zero Expression Values:
    • We throw an invalid input error in the cases that certain cells have no gene expression values (i.e. sc_dataset.scdl.get_item() returns [] for the gene data value)

Usage

The main change from a user perspective is to ensure that they convert their single cell h5ad files (or directories of h5ad files) to SingleCellMemmap format.

  1. For a single h5ad file, i.e. data.h5ad, they can simply run the following, where output_path is the file path the SingleCellMemmap directory should be written to:
    SingleCellMemMapDataset(output_path, data.h5ad)
  2. For a directory of h5ad files, they can simply run the convert_h5ad_to_scdl script (more information available in the SCDL ReadMe).

Testing

We test that the updated SingleCellDataset produces the same output as the old dataset on synthetic samples and samples from the cellxsmall dataset. We also test for Megatron compatibility (as this dataset uses the MultiEpochDatasampler / Epoch Index) and for correct error handling of the above cases.
Tests for these changes can be run via:

pytest -vsub-packages/bionemo-geneformer/tests/bionemo/geneformer/test_dataset.py

Note that we have also updated the following test files to use the MemMap dataset format + set bypass_tokenizer_vocab=True in them, because the cellxsmall dataset does have a few genes not in the HuggingFace tokenizer vocab and so the tests will error otherwise:
sub-packages/bionemo-geneformer/tests/bionemo/geneformer/test_model.py
scripts/singlecell/geneformer/test_train.py
sub-packages/bionemo-geneformer/tests/bionemo/geneformer/test_stop_and_go.py

@savitha-eng
Copy link
Collaborator Author

/build-ci

@savitha-eng
Copy link
Collaborator Author

/build-ci

@savitha-eng
Copy link
Collaborator Author

/build-ci

@savitha-eng
Copy link
Collaborator Author

/build-ci

Signed-off-by: savitha-eng <savithas@nvidia.com>
@savitha-eng
Copy link
Collaborator Author

\build-ci

@savitha-eng
Copy link
Collaborator Author

/build-ci

@savitha-eng
Copy link
Collaborator Author

/build-ci

@trvachov
Copy link
Collaborator

trvachov commented Dec 5, 2024

Looks fine overall except there's one CLI arg where the understand the boolean logic can be better explained, I think.

@polinabinder1
Copy link
Collaborator

/build-ci

@polinabinder1
Copy link
Collaborator

/build-ci

@polinabinder1 polinabinder1 enabled auto-merge (squash) December 20, 2024 00:34
@polinabinder1
Copy link
Collaborator

/build-ci

@polinabinder1 polinabinder1 merged commit 30527b1 into main Dec 20, 2024
4 checks passed
@polinabinder1 polinabinder1 deleted the savitha/integrate-scdl-geneformer-rebased branch December 20, 2024 22:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants