[WIP] Add Dask Implementation of PCA Functions #259
base: master
Conversation
- preprocessing.py: re-add the asarray_ndim calls to ensure input is either a dask or numpy array
- util.py: add checks to asarray_ndim to ensure that dask arrays are not converted to numpy arrays
So far I have tested these changes on a 4-node Dask distributed system, and they appear to be working properly. One issue I have noticed, however, is whenever I try using larger data sets with SVD. If I try to include more than 65536 variants in the data set for PCA, I get an error from the underlying Dask library. One workaround I found for this is to rechunk the Dask array such that all variants are included in a single chunk and only chunk in the samples dimension. Not sure if this is the best way to resolve the issue though.
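A minimal sketch of that rechunking workaround (the array shape and chunk sizes here are illustrative, assuming variants are along axis 0 and samples along axis 1):

```python
import numpy as np
import dask.array as da

# Illustrative variants x samples matrix, chunked in both dimensions.
gn = da.from_array(np.random.rand(1000, 40), chunks=(100, 10))

# Workaround described above: collapse the variants dimension into a
# single chunk and chunk only along the samples dimension.
gn2 = gn.rechunk({0: -1, 1: 'auto'})
```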
Hi @eauel, very cool to see this, thanks for making the PR. Unfortunately I'm about to go offline for a couple of weeks and so can't follow up in detail, but hopefully we can catch up when I'm back online. cc @mrocklin - thought you might be interested to see this. If you had any bandwidth, there are a couple of points where your advice would be much appreciated. The first is regarding the chunking that dask's SVD requires, see @eauel's comment above - does it require input data to be rechunked in any way? The second is regarding whether this PR might make use of NEP 18 instead of having conditional logic depending on the input array type, and if so what that would look like.
The first is regarding the chunking that dask's SVD requires, see @eauel's comment above - does it require input data to be rechunked in any way?
Yes, the most efficient algorithm would prefer to have all columns in one chunk. You might want to do something like the following:
x = x.rechunk({1: -1, 0: 'auto'})
Chunks of size (64, 64k) seem smallish to me, especially if the dtype is compact.
The second is regarding whether this PR might make use of NEP 18 instead of having conditional logic depending on the input array type, and if so what that would look like.
Perhaps. Dask array supports the NEP-18 __array_function__ protocol in master, but not yet in the latest release. Similarly, NumPy doesn't have it on by default yet (you have to set an environment variable). As long as you're comfortable with this feature only working on recent releases, then yes, this seems cleaner to me, though it depends on what you're looking for.
Note that I think only np.linalg.svd respects this protocol, not the scipy version.
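For illustration, with a NumPy and dask recent enough to have the protocol active, numpy functions called on a dask array dispatch to the dask implementations instead of converting the input to an in-memory array (the shape and chunking below are arbitrary):

```python
import numpy as np
import dask.array as da

# Tall-and-skinny array with a single chunk along the columns,
# which is the layout dask's SVD algorithm wants.
x = da.random.random((100, 20), chunks=(25, 20))

# With NEP-18 dispatch active, np.linalg.svd hands off to
# da.linalg.svd and returns lazy dask arrays, not numpy arrays.
u, s, v = np.linalg.svd(x)

# np.mean dispatches the same way.
m = np.mean(x)
```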
    self.std_ = da.std(gn, axis=1, keepdims=True)
else:
    self.mean_ = np.mean(gn, axis=1, keepdims=True)
    self.std_ = np.std(gn, axis=1, keepdims=True)
I would just use gn.mean(...), which should work in both cases.
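A sketch of that suggestion: the method calls are duck-typed, so the isinstance branch can collapse to a single code path (the helper name here is hypothetical, not from the PR):

```python
import numpy as np
import dask.array as da

def fit_stats(gn):
    # Hypothetical helper: works unchanged for numpy and dask inputs,
    # since both expose .mean() / .std() with the same signature.
    mean_ = gn.mean(axis=1, keepdims=True)
    std_ = gn.std(axis=1, keepdims=True)
    return mean_, std_

m_np, s_np = fit_stats(np.ones((4, 3)))            # numpy in, numpy out
m_da, s_da = fit_stats(da.ones((4, 3), chunks=2))  # dask in, lazy dask out
```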
    kwargs.pop('copy', False)
    a = da.array(a, **kwargs)
else:
    a = np.array(a, **kwargs)
I would raise an issue with Dask asking them to support this keyword.
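Until dask grows support for the copy keyword, the conditional in the diff amounts to something like this sketch (the helper name is hypothetical, and da.asarray is used here in place of da.array to avoid materialising the input):

```python
import numpy as np
import dask.array as da

def asarray_compat(a, **kwargs):
    # Hypothetical helper mirroring the PR's branch: dask's array
    # constructors don't accept numpy's copy= keyword, so it is
    # silently dropped for dask inputs.
    if isinstance(a, da.Array):
        kwargs.pop('copy', None)
        return da.asarray(a, **kwargs)
    return np.array(a, **kwargs)

x = asarray_compat(da.ones((3,), chunks=3), copy=True)  # stays a dask array
y = asarray_compat([1, 2, 3], copy=True)                # numpy path
```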
Hi @alimanfoo,
I have been working on adding the Dask implementations of SVD and randomized SVD to scikit-allel, and I wanted to post my current progress on here for feedback and suggestions.
This PR adds the ability to use the Dask versions of these functions whenever using Dask arrays (i.e. DaskGenotypeArray), and it will fall back to using numpy svd functions otherwise (when using ChunkedGenotypeArray or GenotypeArray, for example).

cc @ebegoli