
[WIP] Add Dask Implementation of PCA Functions #259

Open · wants to merge 6 commits into master

Conversation

@eauel commented Mar 23, 2019

Hi @alimanfoo,

I have been working on adding the Dask implementations of SVD and randomized SVD to scikit-allel, and I wanted to post my current progress here for feedback and suggestions.

This PR adds the ability to use the Dask versions of these functions whenever the input is a Dask array (e.g. DaskGenotypeArray), falling back to the numpy SVD functions otherwise (e.g. for ChunkedGenotypeArray or GenotypeArray).
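A minimal sketch of that dispatch pattern (the function name here is hypothetical, not the PR's actual code):

import dask.array as da
import numpy as np

def _svd_dispatch(gn):
    # Use dask's lazy, chunked SVD for dask arrays...
    if isinstance(gn, da.Array):
        return da.linalg.svd(gn)
    # ...and numpy's in-memory SVD otherwise.
    return np.linalg.svd(gn, full_matrices=False)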

cc @ebegoli

Eric Auel added 2 commits March 6, 2019 20:34
preprocessing.py:  Re-add the asarray_ndim calls to ensure input is either a dask or numpy array
util.py: add checks to asarray_ndim to ensure that dask arrays are not converted to numpy arrays
@eauel (Author) commented Mar 23, 2019

So far I have tested these changes on a 4-node Dask distributed cluster, and they appear to work properly. One issue I have noticed, however, involves larger data sets with SVD: if I include more than 65536 variants in the data set for PCA, I get an error from the underlying Dask tsqr() function. I believe this is because the function expects a single chunk along the variants dimension, and the chunk size I'm using on the transposed array is (64, 65536), so including more variants than this produces more than one chunk in that dimension, causing the error.

One workaround I found is to rechunk the Dask array so that all variants are included in a single chunk, chunking only along the samples dimension. I'm not sure whether this is the best way to resolve the issue, though. A sketch of that workaround is below.
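A sketch of the rechunking workaround, with hypothetical shapes:

import dask.array as da

# Transposed genotype data, shape (samples, variants); shapes hypothetical.
x = da.random.random((64, 131072), chunks=(64, 65536))

# Put all variants into a single chunk, chunking only along the
# samples dimension, before handing the array to dask's SVD.
x = x.rechunk({1: -1, 0: 'auto'})

u, s, v = da.linalg.svd(x)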

@alimanfoo (Contributor) commented

Hi @eauel, very cool to see this, thanks for making the PR.

Unfortunately I'm about to go offline for a couple of weeks and so can't follow up in detail, but hopefully we can catch up when I'm back online.

cc @mrocklin: thought you might be interested to see this. If you have any bandwidth, there are a couple of points where your advice would be much appreciated.

The first is regarding the chunking that dask's SVD requires (see @eauel's comment above): does it require input data to be rechunked in any way?

The second is regarding whether this PR might make use of NEP 18 instead of having conditional logic depending on the input array type, and if so what that would look like.


@mrocklin left a comment


> The first is regarding the chunking that dask's SVD requires (see @eauel's comment above): does it require input data to be rechunked in any way?

Yes, the most efficient algorithm would prefer to have all columns in one chunk. You might want to do something like the following:

x = x.rechunk({1: -1, 0: 'auto'})

Chunks of size (64, 64k) seem smallish to me, especially if the dtype is compact.
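For reference, a quick back-of-the-envelope check of that chunk size (dtype assumed int8 here; dask's default target chunk size is on the order of 100 MiB):

import numpy as np

# A (64, 65536) chunk of int8 is only 4 MiB.
nbytes = 64 * 65536 * np.dtype('int8').itemsize
print(nbytes / 2**20)  # 4.0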

> The second is regarding whether this PR might make use of NEP 18 instead of having conditional logic depending on the input array type, and if so what that would look like.

Perhaps. Dask array supports the NEP-18 __array_function__ protocol in master, but not yet in the latest release. Similarly, NumPy doesn't enable it by default yet (you have to set an environment variable). As long as you're comfortable with this feature only working on recent releases, then yes, this seems cleaner and reasonable to me, though it depends on what you're looking for.

Note that I think only np.linalg.svd respects this protocol, not the scipy version.
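For illustration, NEP-18 dispatch might look like this (assuming NumPy >= 1.16 with the NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1 environment variable set, and a dask version that implements __array_function__):

import dask.array as da
import numpy as np

x = da.random.random((10000, 64), chunks=(1000, 64))
# With the protocol enabled, the numpy function dispatches to
# dask's implementation and returns lazy dask arrays.
u, s, v = np.linalg.svd(x)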

        self.std_ = da.std(gn, axis=1, keepdims=True)
    else:
        self.mean_ = np.mean(gn, axis=1, keepdims=True)
        self.std_ = np.std(gn, axis=1, keepdims=True)


I would just use gn.mean(...), which should work in both cases.
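In other words, the whole conditional could collapse to a duck-typed form; a rough sketch, assuming the surrounding method context:

        # Array method calls dispatch correctly for both numpy and dask arrays.
        self.mean_ = gn.mean(axis=1, keepdims=True)
        self.std_ = gn.std(axis=1, keepdims=True)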

        # da.array() does not accept the 'copy' keyword, so drop it.
        kwargs.pop('copy', False)
        a = da.array(a, **kwargs)
    else:
        a = np.array(a, **kwargs)


I would raise an issue with Dask asking them to support this keyword.
