
[WIP] Add Dask Implementation of PCA Functions #259

Open · wants to merge 6 commits into master

Conversation

@eauel commented Mar 23, 2019

Hi @alimanfoo,

I have been working on adding the Dask implementations of SVD and randomized SVD to scikit-allel, and I wanted to post my current progress here for feedback and suggestions.

This PR adds the ability to use the Dask versions of these functions whenever the input is a Dask array (e.g. DaskGenotypeArray), falling back to the numpy SVD functions otherwise (e.g. for ChunkedGenotypeArray or GenotypeArray).
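A minimal sketch of that dispatch pattern (the function name here is hypothetical, not the PR's actual code):

import dask.array as da
import numpy as np

def _svd_dispatch(gn):
    # Use dask's lazy, chunked SVD for dask arrays...
    if isinstance(gn, da.Array):
        return da.linalg.svd(gn)
    # ...and numpy's in-memory SVD otherwise.
    return np.linalg.svd(gn, full_matrices=False)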

cc @ebegoli

Eric Auel added 2 commits March 6, 2019 20:34
preprocessing.py:  Re-add the asarray_ndim calls to ensure input is either a dask or numpy array
util.py: add checks to asarray_ndim to ensure that dask arrays are not converted to numpy arrays
@eauel (Author) commented Mar 23, 2019

So far I have tested these changes on a 4-node Dask distributed cluster, and they appear to work properly. One issue I have noticed, however, involves larger data sets with SVD: if I include more than 65536 variants in the data set for PCA, I get an error from the underlying Dask tsqr() function. I believe this is because the function expects a single chunk along the variants dimension, and the chunk size I'm using on the transposed array is (64, 65536), so including more variants than this produces more than one chunk in that dimension, causing the error.

One workaround I found is to rechunk the Dask array so that all variants are included in a single chunk, chunking only along the samples dimension. I'm not sure whether this is the best way to resolve the issue, though. A sketch of that workaround is below.
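A sketch of the rechunking workaround, with hypothetical shapes:

import dask.array as da

# Transposed genotype data, shape (samples, variants); shapes hypothetical.
x = da.random.random((64, 131072), chunks=(64, 65536))

# Put all variants into a single chunk, chunking only along the
# samples dimension, before handing the array to dask's SVD.
x = x.rechunk({1: -1, 0: 'auto'})

u, s, v = da.linalg.svd(x)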

@alimanfoo (Contributor) commented

Hi @eauel, very cool to see this, thanks for making the PR.

Unfortunately I'm about to go offline for a couple of weeks and so can't follow up in detail, but hopefully we can catch up when I'm back online.

cc @mrocklin: thought you might be interested to see this. If you have any bandwidth, there are a couple of points where your advice would be much appreciated.

The first is regarding the chunking that dask's SVD requires (see @eauel's comment above): does it require input data to be rechunked in any way?

The second is regarding whether this PR might make use of NEP 18 instead of having conditional logic depending on the input array type, and if so what that would look like.


@mrocklin left a comment


> The first is regarding the chunking that dask's SVD requires (see @eauel's comment above): does it require input data to be rechunked in any way?

Yes, the most efficient algorithm would prefer to have all columns in one chunk. You might want to do something like the following:

x = x.rechunk({1: -1, 0: 'auto'})

Chunks of size (64, 64k) seem smallish to me, especially if the dtype is compact.
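For reference, a quick back-of-the-envelope check of that chunk size (dtype assumed int8 here; dask's default target chunk size is on the order of 100 MiB):

import numpy as np

# A (64, 65536) chunk of int8 is only 4 MiB.
nbytes = 64 * 65536 * np.dtype('int8').itemsize
print(nbytes / 2**20)  # 4.0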

> The second is regarding whether this PR might make use of NEP 18 instead of having conditional logic depending on the input array type, and if so what that would look like.

Perhaps. Dask array supports the NEP-18 __array_function__ protocol in master, but not yet in the latest release. Similarly, NumPy doesn't enable it by default yet (you have to set an environment variable). As long as you're comfortable with this feature only working on recent releases, then yes, this seems cleaner and reasonable to me, though it depends on what you're looking for.

Note that I think only np.linalg.svd respects this protocol, not the scipy version.
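For illustration, NEP-18 dispatch might look like this (assuming NumPy >= 1.16 with the NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1 environment variable set, and a dask version that implements __array_function__):

import dask.array as da
import numpy as np

x = da.random.random((10000, 64), chunks=(1000, 64))
# With the protocol enabled, the numpy function dispatches to
# dask's implementation and returns lazy dask arrays.
u, s, v = np.linalg.svd(x)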

        self.std_ = da.std(gn, axis=1, keepdims=True)
    else:
        self.mean_ = np.mean(gn, axis=1, keepdims=True)
        self.std_ = np.std(gn, axis=1, keepdims=True)


I would just use gn.mean(...), which should work in both cases.
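In other words, the whole conditional could collapse to a duck-typed form; a rough sketch, assuming the surrounding method context:

        # Array method calls dispatch correctly for both numpy and dask arrays.
        self.mean_ = gn.mean(axis=1, keepdims=True)
        self.std_ = gn.std(axis=1, keepdims=True)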

        # da.array() does not accept the 'copy' keyword, so drop it.
        kwargs.pop('copy', False)
        a = da.array(a, **kwargs)
    else:
        a = np.array(a, **kwargs)


I would raise an issue with Dask asking them to support this keyword.
