-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fully Integrate SCDL into Geneformer #480
Merged
polinabinder1
merged 32 commits into
main
from
savitha/integrate-scdl-geneformer-rebased
Dec 20, 2024
Merged
Fully Integrate SCDL into Geneformer #480
polinabinder1
merged 32 commits into
main
from
savitha/integrate-scdl-geneformer-rebased
Dec 20, 2024
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…tionaries instead of Pandas dataframes in the feature array
…g a np memmap array to users
…IDIA/bionemo-fw-ea into savitha/scdl-performance-improvements
/build-ci |
…m:NVIDIA/bionemo-fw-ea into savitha/integrate-scdl-geneformer-rebased
/build-ci |
/build-ci |
jstjohn
reviewed
Dec 3, 2024
sub-packages/bionemo-core/src/bionemo/core/data/resources/scdl.yaml
Outdated
Show resolved
Hide resolved
jstjohn
reviewed
Dec 3, 2024
sub-packages/bionemo-core/src/bionemo/core/data/resources/single_cell.yaml
Outdated
Show resolved
Hide resolved
/build-ci |
Signed-off-by: savitha-eng <savithas@nvidia.com>
\build-ci |
/build-ci |
/build-ci |
trvachov
requested changes
Dec 5, 2024
sub-packages/bionemo-geneformer/tests/bionemo/geneformer/scripts/test_pydantic_train.py
Show resolved
Hide resolved
sub-packages/bionemo-geneformer/src/bionemo/geneformer/scripts/infer_geneformer.py
Outdated
Show resolved
Hide resolved
sub-packages/bionemo-geneformer/src/bionemo/geneformer/data/singlecell/dataset.py
Outdated
Show resolved
Hide resolved
Looks fine overall except there's one CLI arg where the understand the boolean logic can be better explained, I think. |
Signed-off-by: polinabinder1 <pbinder@nvidia.com>
/build-ci |
trvachov
approved these changes
Dec 13, 2024
/build-ci |
polinabinder1
approved these changes
Dec 20, 2024
/build-ci |
jstjohn
approved these changes
Dec 20, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary
In this PR we refactor the Geneformer
SingleCellDataset
class to integrate theSingleCellMemmapDataset
(SCDL). The goal of this is to streamline and increase readability of the dataset class.Details
We make the following changes:
SingleCellDataset
now assumes that the input path to the data is a directory formatted in theSingleCellMemmap
format.SingleCellMemmap
format_get_item()
now leverages the get_row function from SCDL (so we eliminate the need to store and parse information in metadata.json)bypass_tokenizer_vocab
which is by defaultFalse
. So by default, we throw an error if a gene ID is not in the tokenizer vocabulary. If a user wants to bypass this, they can changebypass_tokenizer_vocab
toTrue
.sc_dataset.scdl.get_item()
returns[]
for the gene data value)Usage
The main change from a user perspective is to ensure that they convert their single cell h5ad files (or directories of h5ad files) to SingleCellMemmap format.
data.h5ad
, they can simply run the following, whereoutput_path
is the file path the SingleCellMemmap directory should be written to:SingleCellMemMapDataset(output_path, data.h5ad)
convert_h5ad_to_scdl
script (more information available in the SCDL ReadMe).Testing
We test that the updated SingleCellDataset produces the same output as the old dataset on synthetic samples and samples from the cellxsmall dataset. We also test for Megatron compatibility (as this dataset uses the MultiEpochDatasampler / Epoch Index) and for correct error handling of the above cases.
Tests for these changes can be run via:
Note that we have also updated the following test files to use the MemMap dataset format + set bypass_tokenizer_vocab=True in them, because the cellxsmall dataset does have a few genes not in the HuggingFace tokenizer vocab and so the tests will error otherwise:
sub-packages/bionemo-geneformer/tests/bionemo/geneformer/test_model.py
scripts/singlecell/geneformer/test_train.py
sub-packages/bionemo-geneformer/tests/bionemo/geneformer/test_stop_and_go.py