Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Masked select and Masked fill #429

Merged
merged 10 commits into from
Apr 8, 2020
Merged

[RFC] Masked select and Masked fill #429

merged 10 commits into from
Apr 8, 2020

Conversation

mratsim
Copy link
Owner

@mratsim mratsim commented Apr 5, 2020

This is the first step in supporting NumPy fancy/advanced indexing.
The second step being extending the [] and []= macros to automatically dispatch
to either of the index_select, masked_select, masked_fill or the about 10 other slicing variations that already exist.

Thanks Base2 Genomics for sponsoring the work (https://base2genomics.com/)

State

More tests to be added and documentation to be improved.
masked_axis_fill with a tensor instead of just a scalar which is very common for dataframes.

The current PR is currently more of a RFC to discuss proc names and clarify documentation.

API design

Turns out that the API is not straightforward at all and it's very easy to get confused, feedback welcome, that would also significantly help in improving the documentation:

masked_select

Inspiration Numpy: https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#boolean-array-indexing
image

and PyTorch: https://pytorch.org/docs/stable/torch.html#torch.masked_select

image

In Arraymancer:

test "Masked_select":
block: # Numpy reference doc
# https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#boolean-array-indexing
# select non NaN
# x = np.array([[1., 2.], [np.nan, 3.], [np.nan, np.nan]])
# x[~np.isnan(x)]
# array([ 1., 2., 3.])
let x = [[1.0, 2.0],
[Nan, 3.0],
[Nan, Nan]].toTensor
let r = x.masked_select(x.isNotNan)
let expected = [1.0, 2.0, 3.0].toTensor()
check: r == expected

Signature: masked_select(t, mask) -> 1D Tensor

Tricky part

This returns a 1D Tensor like Numpy and PyTorch. As shown in this Stack overflow thread: https://stackoverflow.com/questions/51586364/tf-boolean-mask2d-2d-gives-1d-result
some users may want to filter the non-zero from [[1,0],[2,3]] and obtain [[1], [2, 3]].
This is not a valid "dense" tensor as [1] does not have the same shape as [2, 3].

Tensorflow provides RaggedTensor and PyTorch is exploring NestedTensor (https://www.tensorflow.org/guide/ragged_tensor) for this case.

masked_axis_select

Inspiration Numpy: https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#boolean-array-indexing
image

and Tensorflow: https://www.tensorflow.org/api_docs/python/tf/boolean_mask
image

Tricky part

The mask is 1D and corresponds to axis indices that will be retained or dropped.

Tensorflow allows multi-dimensional mask, I don't know at all what kind of result I should expect from a 2D masking operation with an axis so I assume it's Python overloading flexibility and combining both will result in an error

image

masked_fill

Straightforward: take a mutable tensor and a mask of the same shape and if the mask element is true fill with value.

masked_axis_fill

Inspiration Numpy: https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#boolean-array-indexing
image
(from #400 (comment))

var a = [[-1, -2, 1],
[ 1, 2, 0],
[ 1, -1, 1]].toTensor
let cond = squeeze(a.sum(axis = 0) .> 1)
a.masked_axis_fill(cond, axis = 1, -10)
let expected = [[-1, -2, -10],
[ 1, 2, -10],
[ 1, -1, -10]].toTensor
check: a == expected

Tricky part

The mask is 1D and corresponds to axis indices that will be filled or skipped.

masked_filled_along_axis

Inspiration: my flawed initial understanding of Numpy boolean masking

It takes a N-D mask, iterate along the axis of a tensor and apply masked_fill on the axis slice.

block:
var a = [[-1, -2, 1],
[ 1, 2, 0],
[ 1, -1, 1]].toTensor
a.masked_fill_along_axis(a.sum(axis = 0) .> 1, axis = 0, -10)
let expected = [[-1, -2, -10],
[ 1, 2, -10],
[ 1, -1, -10]].toTensor
check: a == expected

Tricky part

Notice how masked_axis_fill and masked_filled_along_axis have very similar names, similar application but it's easy to confuse the axes:

a.masked_axis_fill(cond, axis = 1, -10)

a.masked_fill_along_axis(a.sum(axis = 0) .> 1, axis = 0, -10)

@brentp
Copy link
Contributor

brentp commented Apr 5, 2020

awesome!
I've never needed mask with > 1 dimension. And those "limitations" seem sane (I also never use ragged arrays).
I should clarify that it's not me personally, it's a small genomics company I co-founded that's sponsoring the work.

@Vindaar
Copy link
Collaborator

Vindaar commented Apr 6, 2020

Turns out that the API is not straightforward at all and it's very
easy to get confused, feedback welcome, that would also significantly
help in improving the documentation:

Heh, indeed. I remember learning fancy indexing in numpy the first
time... and by now I forgot the little intricacies again (which is very
telling).

masked_select

I think for masked_select we can stress (either by a change of name or
in the documentation) that it should be considered as an action on the
underlying raw data, which is stored in a 1D form anyways.

That way it isn't surprising that the shape information is not retained
and data is returned as a 1D tensor.

Supporting returning data in a non 1D form is imo another story and not
necessarily part of a PR that introduces masked_select. Unless
arraymancer support for sparse tensors I don't think it's really a
good fit. Every use case I have, in which I have ragged data is simply
not something I use arraymancer for. That's exactly the reason my
TimepixAnalysis
codebase uses arraymancer sparingly. The data I deal with is extremely
sparse. The Timepix is an ASIC with 256x256 pixels. The events I look at
usually have less than 400 active pixels. Thus, I have many
seq[seq[T]] data structures around. There's just no nice way of
packaging such data and that's life imo. Forcing a tool to work with
something it was never meant to doesn't sound like a great idea to me.

Regarding the name, I never quite understood why a proc like this in
numpy etc. is not just called filter? That's all it really does. It
takes conditions and returns elements that match the condition. I think
people in functional programming land came up with pretty good names for
this sort of stuff.

Sure, masked_select can be considered the underlying operation that
would make a filter operation work. But given that the usage of such a
proc would usually result from fancy indexing, thus avoiding names in
general, it would imo make sense to give the underlying proc a useful
name that doesn't make me feel like the proc is discouraged from use.

The only argument against a name change would be that people coming from
torch, numpy etc. would be confused by the name, imo.

masked_axis_select

Here again I feel like it makes sense that the input is a 1D tensor. To
me the name implies that the given mask works along an axis. An axis is
by definition 1 dimensional, no matter how many dimensions the objects
along that axis have (which of course are N - 1 dimensional for an ND
tensor).

Supporting multiple dimensions as an input to an "axis select" seems
weird. I would stick to well defined assumptions that don't leave room
for ambiguities. If a need for a multi dimensional selection were to
arise, I'd give that a different name.

This could also be filter_axis.

masked_fill

As you say, pretty straight forward.

Could also take inspiriation from functional programming and call it
apply, although in this case I feel like the term apply is less
clear than filter for masked_select.

masked_axis_fill

In the same vein, this could be apply_axis.

This makes sense to me for the same reason masked_axis_select does.

masked_filled_along_axis

This does indeed confuse me.

Is this not essentially the inverse of the theoretical
masked_axis_select for ND tensors?

And due to adding the axis argument, we essentially restrict the usage
of the proc.

This is really confusing me. I think the reason is that it breaks my
idea of well defined input -> output. Since I don't know clearly what
the mask argument represents if it's the result of a comparison, it
makes it super hard to reason about what the meaning of axis is going
to be, because as far as I understand it, axis does not refer to the
tensor acted upon, but rather on the mask.

I don't know, maybe I'm just not thinking straight:

If I understand correctly this does not provide functionality that could
not either be done via masked_fill or masked_axis_fill already.

One thing that comes to mind, to support:

a.masked_fill_along_axis(a.sum(axis = 0) .> 1, axis = 0, -10) 

for masked_axis_fill instead:

a.masked_axis_fill(a.sum(axis = 0) .> 1, axis = 0, -10) 

would be to allow ND tensors as arguments to masked_axis_fill, but if
and only if the the input represents a squeezable tensor. Just to
provide convenience for results like a.sum(axis = 0) .> 1, which (as
far as I can read from your example) do not return an (N-1)D tensor, but
rather an ND tensor with only one element along the reduced axis.

But maybe this could also happen as part of reducing functionality.

@mratsim
Copy link
Owner Author

mratsim commented Apr 7, 2020

PR ready:

  • I have added the possibility to fill an axis with a tensor which is a common usecase for dataframes:
    block: # Fill with tensor
    # import numpy as np
    # a = np.array([[-1, -2, 1], [1, 2, 0], [1, -1, 1]])
    # print(a.sum(axis=0) > 1)
    # a[:, a.sum(axis=0) > 1] = np.array([-10, -20, -30])[:, np.newaxis]
    # print(a)
    var a = [[-1, -2, 1],
    [ 1, 2, 0],
    [ 1, -1, 1]].toTensor
    let b = [-10, -20, -30].toTensor.unsqueeze(1)
    let cond = squeeze(a.sum(axis = 0) .> 1)
    a.masked_axis_fill(cond, axis = 1, b)
    let expected = [[-1, -2, -10],
    [ 1, 2, -20],
    [ 1, -1, -30]].toTensor
    check: a == expected
  • I have given an example of 2D masking for masked_fill_along_axis. AFAIk it's not possible to do the same with the other proc, if you know a Numpy function that does the same I would love to steal know the name:
    block: # Apply a 2D mask on a RGB image
    var checkered = [
    [ # Red
    [uint8 255, 255, 255, 255],
    [uint8 255, 255, 255, 255],
    [uint8 255, 255, 255, 255],
    [uint8 255, 255, 255, 255],
    ],
    [ # Green
    [uint8 255, 255, 255, 255],
    [uint8 255, 255, 255, 255],
    [uint8 255, 255, 255, 255],
    [uint8 255, 255, 255, 255],
    ],
    [ # Blue
    [uint8 255, 255, 255, 255],
    [uint8 255, 255, 255, 255],
    [uint8 255, 255, 255, 255],
    [uint8 255, 255, 255, 255],
    ]
    ].toTensor()
    var mask = [
    [false, true, false, true],
    [true, false, true, false],
    [false, true, false, true],
    [true, false, true, false]
    ].toTensor().unsqueeze(0)
    var expected = [
    [ # Red
    [uint8 255, 0, 255, 0],
    [uint8 0, 255, 0, 255],
    [uint8 255, 0, 255, 0],
    [uint8 0, 255, 0, 255],
    ],
    [ # Green
    [uint8 255, 0, 255, 0],
    [uint8 0, 255, 0, 255],
    [uint8 255, 0, 255, 0],
    [uint8 0, 255, 0, 255],
    ],
    [ # Blue
    [uint8 255, 0, 255, 0],
    [uint8 0, 255, 0, 255],
    [uint8 255, 0, 255, 0],
    [uint8 0, 255, 0, 255],
    ]
    ].toTensor()
    checkered.masked_fill_along_axis(mask, axis = 0, 0'u8)
    check: checkered == expected
  • Added a missing test for masked_fill

The next step would be to integrate those and index_select in the [] and []= macro.

@mratsim mratsim merged commit 532f45a into master Apr 8, 2020
@mratsim mratsim mentioned this pull request Apr 19, 2020
7 tasks
mratsim added a commit that referenced this pull request Apr 19, 2020
* index_select should use SomeInteger not SOmeNumber

* Overload index_select for arrays and sequences

* Masked Selector overload for openarrays

* Add masked overload for regular arrays and sequences

* Initial support of Numpy fancy indexing: index select

* Fix broadcast operators from #429 using deprecated syntax

* Stash dispatcher, working with types in macros is a minefield nim-lang/Nim#14021

* Masked indexing: closes #400, workaround nim-lang/Nim#14021

* Test for full masked fancy indexing

* Add index_fill

* Tensor mutation via fancy indexing

* Add tests for index mutation via fancy indexing

* Fancy indexing: supports broadcasting a value to a masked assignation

* Detect wrong mask or tensor axis length

* masked axis assign value test

* Add masked assign of broadcastable tensor

* Tag for changelog [skip ci]
@mratsim mratsim deleted the masked_select branch April 20, 2020 10:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants