
ENH Enables array_api for LinearDiscriminantAnalysis #102

Closed
wants to merge 9 commits

Conversation

thomasjpfan
Owner

Another example of using array_api with scikit-learn's LinearDiscriminantAnalysis.

There is a good 14x speed up when using CuPy compared to NumPy, as seen in this gist.

Most workarounds are similar to the ones in #99.

Owner Author

@thomasjpfan thomasjpfan left a comment


I left some comments highlighting decision points that are different from #99.

Comment on lines 120 to 126

if is_array_api:
    for i in range(classes.shape[0]):
        means[i, :] = np.mean(X[y == i], axis=0)
else:
    cnt = np.bincount(y)
    np.add.at(means, y, X)
    means /= cnt[:, None]
Owner Author

Since the array API does not have np.add.at, we use a for loop to do the same computation.
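As a runnable sketch (the toy data here is invented for illustration, not from the PR), the two branches compute the same class means:

```python
import numpy as np

# Toy data (hypothetical): 6 samples, 2 features, 3 classes.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0],
              [7.0, 8.0], [9.0, 10.0], [11.0, 12.0]])
y = np.array([0, 0, 1, 1, 2, 2])
n_classes = 3

# Loop version: only uses operations the array API standard provides.
means_loop = np.zeros((n_classes, X.shape[1]))
for i in range(n_classes):
    means_loop[i, :] = np.mean(X[y == i], axis=0)

# NumPy-only version: scatter-add per class, then divide by class counts.
means_at = np.zeros((n_classes, X.shape[1]))
np.add.at(means_at, y, X)
means_at /= np.bincount(y)[:, None]

assert np.allclose(means_loop, means_at)
```

The loop trades the single scatter-add for n_classes separate reductions, which is fine when the number of classes is small.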

Comment on lines 478 to 481

if is_array_api:
    svd = np.linalg.svd
else:
    svd = scipy.linalg.svd
Owner Author

This one is a bit interesting. Since I already have a wrapper for NumPy (_NumPyApiWrapper), it could make sense to make numpy.scipy == scipy to avoid this conditional.

Comment on lines 59 to 61

def astype(self, x, dtype, *args, **kwargs):
    # astype is not defined in the top level numpy namespace
    return x.astype(dtype, *args, **kwargs)
Owner Author

I wrap NumPy because I want np.astype to exist. In a sense, this makes NumPy more like the array API. The alternative is to add astype to the array_api.Array object, but that involves patching the object.
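A minimal sketch of that idea (this is not the PR's actual _NumPyApiWrapper; the class name and delegation behavior here are assumptions): wrap the NumPy namespace so astype exists as a function, delegating everything else to NumPy.

```python
import numpy

class NumPyApiWrapper:
    """Sketch: expose an array-API-style astype on top of plain NumPy."""

    def __getattr__(self, name):
        # Delegate everything we don't override to the numpy module.
        return getattr(numpy, name)

    def astype(self, x, dtype, *args, **kwargs):
        # astype is not defined in the top-level numpy namespace,
        # so forward to the array method instead.
        return x.astype(dtype, *args, **kwargs)

np = NumPyApiWrapper()
a = np.asarray([1, 2, 3])        # falls through to numpy.asarray
b = np.astype(a, numpy.float64)  # same spelling a wrapped array namespace uses
```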


ys_labels = set(chain.from_iterable((i for i in _unique_labels(y)) for y in ys))
Owner Author

@thomasjpfan Jan 24, 2022

The array API does not go down this code path, because Arrays are not hashable.

Although using set + arrays feels like an anti-pattern.

Comment on lines 38 to 41

def concatenate(self, arrays, *, axis=0, **kwargs):
    # ignore parameters that are not supported by the array API
    f = self._array_namespace.concat
    return f(arrays, axis=axis)
Owner Author

It's either this or what we see in https://github.com/scipy/scipy/pull/15395/files:

def _concatenate(arrays, axis):
    xp = _get_namespace(*arrays)
    if xp is np:
        return xp.concatenate(arrays, axis=axis)
    else:
        return xp.concat(arrays, axis=axis)

And importing _concatenate when needed.
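For comparison, here is a runnable version of that dispatch pattern, with _get_namespace stubbed out (the real helper would inspect __array_namespace__ on the arrays; this stand-in just returns NumPy):

```python
import numpy as np

def _get_namespace(*arrays):
    # Stand-in for the real helper, which would inspect
    # __array_namespace__ on the arrays; here we assume NumPy.
    return np

def _concatenate(arrays, axis):
    xp = _get_namespace(*arrays)
    if xp is np:
        # NumPy spells it concatenate ...
        return xp.concatenate(arrays, axis=axis)
    else:
        # ... while the array API standard spells it concat.
        return xp.concat(arrays, axis=axis)

out = _concatenate([np.ones((2, 2)), np.zeros((1, 2))], axis=0)
```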

Comment on lines 54 to 56

@property
def VisibleDeprecationWarning(self):
    return DeprecationWarning
Owner Author

I'm still unsure how I feel about this workaround.

warnings.warn("The priors do not sum to 1. Renormalizing", UserWarning)
self.priors_ = self.priors_ / self.priors_.sum()

# TODO: implement isclose in wrapper?
Owner Author

Need to implement our own isclose.
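A sketch of what that could look like, built only from operations the array API standard guarantees (the function name is my assumption, and it mirrors numpy.isclose's tolerance formula while omitting NaN/inf handling):

```python
import numpy as np

def _isclose(xp, a, b, rtol=1e-05, atol=1e-08):
    # Same tolerance formula as numpy.isclose: |a - b| <= atol + rtol * |b|.
    # Uses only abs, arithmetic, and comparisons, all available in the
    # array API standard. NaN/inf handling is omitted in this sketch.
    return xp.abs(a - b) <= (atol + rtol * xp.abs(b))

# With NumPy standing in for an array API namespace:
close = _isclose(np, np.asarray([1.0, 1.1]), np.asarray([1.0, 1.0]))
```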

@@ -110,11 +113,17 @@ def _class_means(X, y):

    means : array-like of shape (n_classes, n_features)
        Class means.
    """
    np, is_array_api = get_namespace(X)
    classes, y = np.unique(y, return_inverse=True)


unique is not in the array API (this would be unique_inverse).


Oh I see, you are using a wrapper below. IMO it would be better to use the array API APIs in the wrapper and wrap non-compatible NumPy conventions to match them, not the other way around.

Owner Author

Yea, going the other direction also works. I'll try it the other way around and see how it compares.

My guess is that it's fine.


if is_array_api:
    for i in range(classes.shape[0]):
        means[i, :] = np.mean(X[y == i], axis=0)


Should this be +=?

Owner Author

Using mean here avoids needing to divide by the count. (In the end, I think it's the same computation. It's basically a groupby + mean aggregation.)

@asmeurer Jan 26, 2022

Oh I see, you moved the aggregation from at into the mean calculation.

I think you could get rid of the loop with something like

np.sum(np.where(y[None] == np.arange(classes.shape[0])[:, None], X, np.asarray(0.)), axis=1)/cnt

except None indexing isn't in the spec yet (data-apis/array-api#360), so you'd have to use expand_dims. That's still not as "efficient" as the original because you are adding a lot of redundant 0s in the sum, so depending on how many classes there typically are the loop version might be better anyway (at least in the sense that it's more readable). Another thing to note is that not all array API modules are guaranteed to have boolean indexing (https://data-apis.org/array-api/latest/API_specification/indexing.html#boolean-array-indexing).

Also, I think cnt can be gotten from unique_counts or unique_all in the array API (include_counts in NumPy).
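The counts suggestion, sketched in both spellings (the NumPy form is shown and executed; the array API form is commented out because it depends on the namespace):

```python
import numpy as np

y = np.asarray([0, 0, 1, 2, 2, 2])

# NumPy spelling: counts come from return_counts=True.
values, cnt = np.unique(y, return_counts=True)

# Array API spelling (not executed here) would be:
#   values, cnt = xp.unique_counts(y)
```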

Owner Author

> Another thing to note is that not all array API modules are guaranteed to have boolean indexing

Thanks for the information! Some functionality would be hard to reproduce without boolean indexing. Looking into "Data-dependent output shape" more, I see that unique_all may also not work. This is a significant barrier for scikit-learn. Pragmatically, I would restrict the first version of array API support in scikit-learn to array modules that do support "Data-dependent output shape".

(I wanted to use nonzero to go from boolean index -> integer index -> take, but that also looks to fall under "Data-dependent output shape".)


Oh yeah of course unique wouldn't work in such APIs either. So there's no point in worrying about it here.

@thomasjpfan thomasjpfan closed this Jan 9, 2023