WIP: update docs
grst committed Feb 9, 2025
1 parent 2be1e8f commit a1d509e
Showing 4 changed files with 32 additions and 1 deletion.
1 change: 1 addition & 0 deletions docs/api.rst
@@ -301,6 +301,7 @@ distance metrics
ir_dist.metrics.IdentityDistanceCalculator
ir_dist.metrics.LevenshteinDistanceCalculator
ir_dist.metrics.HammingDistanceCalculator
ir_dist.metrics.GPUHammingDistanceCalculator
ir_dist.metrics.AlignmentDistanceCalculator
ir_dist.metrics.FastAlignmentDistanceCalculator
ir_dist.metrics.TCRdistDistanceCalculator
1 change: 1 addition & 0 deletions docs/tutorials.rst
@@ -8,3 +8,4 @@ Tutorials
tutorials/tutorial_io.ipynb
tutorials/tutorial_3k_tcr.ipynb
tutorials/tutorial_5k_bcr.ipynb
tutorials/large-datasets.md
29 changes: 29 additions & 0 deletions docs/tutorials/large-datasets.md
@@ -0,0 +1,29 @@
# Working with >1M cells

This page is a work-in-progress collection of advice on how to scale the scirpy workflow beyond 1M cells.

## Use an up-to-date version

Scalability has been a major focus of recent developments in Scirpy. Make sure you use the latest version
when working with large datasets to take advantage of all speedups.

## Choose an appropriate distance metric for `pp.ir_dist`

Some distance metrics are significantly faster than others. Here are the distance metrics, roughly ordered by speed:

`identity` > `gpu_hamming` > `hamming` = `normalized_hamming` > `tcrdist` > `levenshtein` > `fastalignment` > `alignment`

`tcrdist`, `fastalignment` and `alignment` are conceptually very similar, but `tcrdist` is by far the fastest. For this
reason, we recommend `tcrdist` whenever you need a metric that takes a substitution matrix into account.
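
As a rough intuition for part of this ordering, the sketch below contrasts three of the metrics in plain Python. It is only an illustration under simplifying assumptions, not scirpy's implementation: the function names are hypothetical, and the return values follow the textbook definitions rather than scirpy's conventions. Identity needs only an equality test, Hamming a single pass over the positions, and Levenshtein a dynamic-programming table whose size grows with the product of the two sequence lengths.

```python
def identity_distance(a: str, b: str) -> int:
    """0 if the sequences are identical, 1 otherwise: a single equality check."""
    return 0 if a == b else 1


def hamming_distance(a: str, b: str) -> int:
    """Number of mismatching positions; only defined for equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of `a` and b[:j].
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # delete x
                    current[j - 1] + 1,  # insert y
                    previous[j - 1] + (x != y),  # substitute, or match at no cost
                )
            )
        previous = current
    return previous[-1]
```

For two CDR3 sequences of equal length that differ in one residue, both `hamming_distance` and `levenshtein_distance` return 1, but the latter fills a full table to get there.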

## Multi-machine parallelization with dask

## Using GPU acceleration for hamming distance

The Hamming distance metric supports GPU acceleration via [cupy](https://cupy.dev/).

First, install the optional `cupy` dependency:

```
!pip install scirpy[cupy]
```
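
To see why the Hamming distance lends itself to GPU acceleration, consider the array-based sketch below. It is a minimal illustration, not the code behind `gpu_hamming`: the helper names `encode` and `pairwise_hamming` and the `xp` parameter are assumptions made for this example. Equal-length sequences are encoded as a byte matrix, and all pairwise comparisons are then expressed as one broadcasted array operation, which is the kind of workload an array library can run on either CPU or GPU.

```python
import numpy as np


def encode(seqs):
    """Encode equal-length ASCII sequences as a 2D uint8 array, one row per sequence."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all sequences must have the same length")
    flat = np.frombuffer("".join(seqs).encode("ascii"), dtype=np.uint8)
    return flat.reshape(len(seqs), -1)


def pairwise_hamming(seqs, xp=np):
    """Return the (n, n) matrix of Hamming distances between all sequences.

    `xp` is the array module to compute with (NumPy by default).
    """
    encoded = xp.asarray(encode(seqs))
    # Broadcasting compares every row with every other row: shape (n, n, length).
    mismatches = encoded[:, None, :] != encoded[None, :, :]
    # Counting mismatching positions per pair gives the Hamming distance.
    return mismatches.sum(axis=-1)


distances = pairwise_hamming(["CASSL", "CASSF", "CATTF"])
```

On a machine with a working CUDA setup, passing `xp=cupy` should run the same comparison on the GPU, since CuPy mirrors the NumPy array API used here. Note that the intermediate array grows with the square of the number of sequences, so this sketch only suits small inputs. In scirpy itself, GPU acceleration is selected by passing the `gpu_hamming` metric to `pp.ir_dist`.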
2 changes: 1 addition & 1 deletion src/scirpy/ir_dist/__init__.py
@@ -61,7 +61,7 @@ def IrNeighbors(*args, **kwargs):
* `hamming` -- Hamming distance for CDR3 sequences of equal length.
See :class:`~scirpy.ir_dist.metrics.HammingDistanceCalculator`.
* `gpu_hamming` -- Hamming distance for CDR3 sequences of equal length calculated with a GPU.
-See \\:class:`~scirpy.ir_dist.metrics.GPUHammingDistanceCalculator`.
+See :class:`~scirpy.ir_dist.metrics.GPUHammingDistanceCalculator`.
* `normalized_hamming` -- Normalized Hamming distance (in percent) for CDR3 sequences of equal length.
See :class:`~scirpy.ir_dist.metrics.HammingDistanceCalculator`.
* `alignment` -- Distance based on pairwise sequence alignments using the