[RFC] Masked select and Masked fill #429

mratsim · 2020-04-05T17:36:00Z

This is the first step in supporting NumPy fancy/advanced indexing.
The second step being extending the [] and []= macros to automatically dispatch
to either of the index_select, masked_select, masked_fill or the about 10 other slicing variations that already exist.

Thanks Base2 Genomics for sponsoring the work (https://base2genomics.com/)

State

More tests to be added and documentation to be improved.
masked_axis_fill with a tensor instead of just a scalar which is very common for dataframes.

The current PR is currently more of a RFC to discuss proc names and clarify documentation.

API design

Turns out that the API is not straightforward at all and it's very easy to get confused, feedback welcome, that would also significantly help in improving the documentation:

masked_select

Inspiration Numpy: https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#boolean-array-indexing

and PyTorch: https://pytorch.org/docs/stable/torch.html#torch.masked_select

In Arraymancer:

Arraymancer/tests/tensor/test_selectors.nim

Lines 43 to 57 in 7db6023

    
           test "Masked_select": 
        
             block: # Numpy reference doc 
        
                    # https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#boolean-array-indexing 
        
                    # select non NaN 
        
               # x = np.array([[1., 2.], [np.nan, 3.], [np.nan, np.nan]]) 
        
               # x[~np.isnan(x)] 
        
               # array([ 1.,  2.,  3.]) 
        
               let x = [[1.0, 2.0], 
        
                        [Nan, 3.0], 
        
                        [Nan, Nan]].toTensor 
        
               let r = x.masked_select(x.isNotNan) 
        
               let expected = [1.0, 2.0, 3.0].toTensor() 
        
               check: r == expected

Signature: masked_select(t, mask) -> 1D Tensor

Tricky part

This returns a 1D Tensor like Numpy and PyTorch. As shown in this Stack overflow thread: https://stackoverflow.com/questions/51586364/tf-boolean-mask2d-2d-gives-1d-result
some users may want to filter the non-zero from [[1,0],[2,3]] and obtain [[1], [2, 3]].
This is not a valid "dense" tensor as [1] does not have the same shape as [2, 3].

Tensorflow provides RaggedTensor and PyTorch is exploring NestedTensor (https://www.tensorflow.org/guide/ragged_tensor) for this case.

masked_axis_select

Inspiration Numpy: https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#boolean-array-indexing

and Tensorflow: https://www.tensorflow.org/api_docs/python/tf/boolean_mask

Tricky part

The mask is 1D and corresponds to axis indices that will be retained or dropped.

Tensorflow allows multi-dimensional mask, I don't know at all what kind of result I should expect from a 2D masking operation with an axis so I assume it's Python overloading flexibility and combining both will result in an error

masked_fill

Straightforward: take a mutable tensor and a mask of the same shape and if the mask element is true fill with value.

masked_axis_fill

Inspiration Numpy: https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#boolean-array-indexing

(from #400 (comment))

Arraymancer/tests/tensor/test_selectors.nim

Lines 91 to 102 in 7db6023

    
           var a = [[-1, -2, 1], 
        
                    [ 1,  2, 0], 
        
                    [ 1, -1, 1]].toTensor 
        
           let cond = squeeze(a.sum(axis = 0) .> 1) 
        
           a.masked_axis_fill(cond, axis = 1, -10) 
        
           let expected = [[-1, -2, -10], 
        
                           [ 1,  2, -10], 
        
                           [ 1, -1, -10]].toTensor 
        
           check: a == expected

Tricky part

The mask is 1D and corresponds to axis indices that will be filled or skipped.

masked_filled_along_axis

Inspiration: my flawed initial understanding of Numpy boolean masking

It takes a N-D mask, iterate along the axis of a tensor and apply masked_fill on the axis slice.

Arraymancer/tests/tensor/test_selectors.nim

Lines 105 to 116 in 7db6023

    
           block: 
        
             var a = [[-1, -2, 1], 
        
                      [ 1,  2, 0], 
        
                      [ 1, -1, 1]].toTensor 
        
             a.masked_fill_along_axis(a.sum(axis = 0) .> 1, axis = 0, -10) 
        
             let expected = [[-1, -2, -10], 
        
                             [ 1,  2, -10], 
        
                             [ 1, -1, -10]].toTensor 
        
             check: a == expected

Tricky part

Notice how masked_axis_fill and masked_filled_along_axis have very similar names, similar application but it's easy to confuse the axes:

Arraymancer/tests/tensor/test_selectors.nim

Line 96 in 7db6023

a.masked_axis_fill(cond, axis = 1, -10)

Arraymancer/tests/tensor/test_selectors.nim

Line 110 in 7db6023

a.masked_fill_along_axis(a.sum(axis = 0) .> 1, axis = 0, -10)

brentp · 2020-04-05T17:54:22Z

awesome!
I've never needed mask with > 1 dimension. And those "limitations" seem sane (I also never use ragged arrays).
I should clarify that it's not me personally, it's a small genomics company I co-founded that's sponsoring the work.

Vindaar · 2020-04-06T14:43:41Z

Turns out that the API is not straightforward at all and it's very
easy to get confused, feedback welcome, that would also significantly
help in improving the documentation:

Heh, indeed. I remember learning fancy indexing in numpy the first
time... and by now I forgot the little intricacies again (which is very
telling).

`masked_select`

I think for masked_select we can stress (either by a change of name or
in the documentation) that it should be considered as an action on the
underlying raw data, which is stored in a 1D form anyways.

That way it isn't surprising that the shape information is not retained
and data is returned as a 1D tensor.

Supporting returning data in a non 1D form is imo another story and not
necessarily part of a PR that introduces masked_select. Unless
arraymancer support for sparse tensors I don't think it's really a
good fit. Every use case I have, in which I have ragged data is simply
not something I use arraymancer for. That's exactly the reason my
TimepixAnalysis
codebase uses arraymancer sparingly. The data I deal with is extremely
sparse. The Timepix is an ASIC with 256x256 pixels. The events I look at
usually have less than 400 active pixels. Thus, I have many
seq[seq[T]] data structures around. There's just no nice way of
packaging such data and that's life imo. Forcing a tool to work with
something it was never meant to doesn't sound like a great idea to me.

Regarding the name, I never quite understood why a proc like this in
numpy etc. is not just called filter? That's all it really does. It
takes conditions and returns elements that match the condition. I think
people in functional programming land came up with pretty good names for
this sort of stuff.

Sure, masked_select can be considered the underlying operation that
would make a filter operation work. But given that the usage of such a
proc would usually result from fancy indexing, thus avoiding names in
general, it would imo make sense to give the underlying proc a useful
name that doesn't make me feel like the proc is discouraged from use.

The only argument against a name change would be that people coming from
torch, numpy etc. would be confused by the name, imo.

`masked_axis_select`

Here again I feel like it makes sense that the input is a 1D tensor. To
me the name implies that the given mask works along an axis. An axis is
by definition 1 dimensional, no matter how many dimensions the objects
along that axis have (which of course are N - 1 dimensional for an ND
tensor).

Supporting multiple dimensions as an input to an "axis select" seems
weird. I would stick to well defined assumptions that don't leave room
for ambiguities. If a need for a multi dimensional selection were to
arise, I'd give that a different name.

This could also be filter_axis.

`masked_fill`

As you say, pretty straight forward.

Could also take inspiriation from functional programming and call it
apply, although in this case I feel like the term apply is less
clear than filter for masked_select.

`masked_axis_fill`

In the same vein, this could be apply_axis.

This makes sense to me for the same reason masked_axis_select does.

`masked_filled_along_axis`

This does indeed confuse me.

Is this not essentially the inverse of the theoretical
masked_axis_select for ND tensors?

And due to adding the axis argument, we essentially restrict the usage
of the proc.

This is really confusing me. I think the reason is that it breaks my
idea of well defined input -> output. Since I don't know clearly what
the mask argument represents if it's the result of a comparison, it
makes it super hard to reason about what the meaning of axis is going
to be, because as far as I understand it, axis does not refer to the
tensor acted upon, but rather on the mask.

I don't know, maybe I'm just not thinking straight:

If I understand correctly this does not provide functionality that could
not either be done via masked_fill or masked_axis_fill already.

One thing that comes to mind, to support:

a.masked_fill_along_axis(a.sum(axis = 0) .> 1, axis = 0, -10)

for masked_axis_fill instead:

a.masked_axis_fill(a.sum(axis = 0) .> 1, axis = 0, -10)

would be to allow ND tensors as arguments to masked_axis_fill, but if
and only if the the input represents a squeezable tensor. Just to
provide convenience for results like a.sum(axis = 0) .> 1, which (as
far as I can read from your example) do not return an (N-1)D tensor, but
rather an ND tensor with only one element along the reduced axis.

But maybe this could also happen as part of reducing functionality.

mratsim · 2020-04-07T19:26:17Z

PR ready:

I have added the possibility to fill an axis with a tensor which is a common usecase for dataframes:

Arraymancer/tests/tensor/test_selectors.nim

Lines 121 to 141 in 38f048c

    
           block: # Fill with tensor 
        
             # import numpy as np 
        
             # a = np.array([[-1, -2, 1], [1, 2, 0], [1, -1, 1]]) 
        
             # print(a.sum(axis=0) > 1) 
        
             # a[:, a.sum(axis=0) > 1] = np.array([-10, -20, -30])[:, np.newaxis] 
        
             # print(a) 
        
             var a = [[-1, -2, 1], 
        
                      [ 1,  2, 0], 
        
                      [ 1, -1, 1]].toTensor 
        
             let b = [-10, -20, -30].toTensor.unsqueeze(1) 
        
             let cond = squeeze(a.sum(axis = 0) .> 1) 
        
             a.masked_axis_fill(cond, axis = 1, b) 
        
             let expected = [[-1, -2, -10], 
        
                             [ 1,  2, -20], 
        
                             [ 1, -1, -30]].toTensor 
        
             check: a == expected

I have given an example of 2D masking for masked_fill_along_axis. AFAIk it's not possible to do the same with the other proc, if you know a Numpy function that does the same I would love to ~~steal~~ know the name:

Arraymancer/tests/tensor/test_selectors.nim

Lines 157 to 209 in 38f048c

    
           block: # Apply a 2D mask on a RGB image 
        
             var checkered = [ 
        
               [ # Red 
        
                 [uint8 255, 255, 255, 255], 
        
                 [uint8 255, 255, 255, 255], 
        
                 [uint8 255, 255, 255, 255], 
        
                 [uint8 255, 255, 255, 255], 
        
               ], 
        
               [ # Green 
        
                 [uint8 255, 255, 255, 255], 
        
                 [uint8 255, 255, 255, 255], 
        
                 [uint8 255, 255, 255, 255], 
        
                 [uint8 255, 255, 255, 255], 
        
               ], 
        
               [ # Blue 
        
                 [uint8 255, 255, 255, 255], 
        
                 [uint8 255, 255, 255, 255], 
        
                 [uint8 255, 255, 255, 255], 
        
                 [uint8 255, 255, 255, 255], 
        
               ] 
        
             ].toTensor() 
        
             var mask = [ 
        
               [false, true, false, true], 
        
               [true, false, true, false], 
        
               [false, true, false, true], 
        
               [true, false, true, false] 
        
             ].toTensor().unsqueeze(0) 
        
             var expected = [ 
        
               [ # Red 
        
                 [uint8 255, 0, 255, 0], 
        
                 [uint8 0, 255, 0, 255], 
        
                 [uint8 255, 0, 255, 0], 
        
                 [uint8 0, 255, 0, 255], 
        
               ], 
        
               [ # Green 
        
                 [uint8 255, 0, 255, 0], 
        
                 [uint8 0, 255, 0, 255], 
        
                 [uint8 255, 0, 255, 0], 
        
                 [uint8 0, 255, 0, 255], 
        
               ], 
        
               [ # Blue 
        
                 [uint8 255, 0, 255, 0], 
        
                 [uint8 0, 255, 0, 255], 
        
                 [uint8 255, 0, 255, 0], 
        
                 [uint8 0, 255, 0, 255], 
        
               ] 
        
             ].toTensor() 
        
             checkered.masked_fill_along_axis(mask, axis = 0, 0'u8) 
        
             check: checkered == expected

Added a missing test for masked_fill

The next step would be to integrate those and index_select in the [] and []= macro.

* index_select should use SomeInteger not SOmeNumber * Overload index_select for arrays and sequences * Masked Selector overload for openarrays * Add masked overload for regular arrays and sequences * Initial support of Numpy fancy indexing: index select * Fix broadcast operators from #429 using deprecated syntax * Stash dispatcher, working with types in macros is a minefield nim-lang/Nim#14021 * Masked indexing: closes #400, workaround nim-lang/Nim#14021 * Test for full masked fancy indexing * Add index_fill * Tensor mutation via fancy indexing * Add tests for index mutation via fancy indexing * Fancy indexing: supports broadcasting a value to a masked assignation * Detect wrong mask or tensor axis length * masked axis assign value test * Add masked assign of broadcastable tensor * Tag for changelog [skip ci]

mratsim added 8 commits April 5, 2020 12:09

Create a new selectors file

5fc9d36

initial implementation of masked_select

f54b335

Implement masked_fill

bac8f5a

Add scalar comparison

f77b492

Add test_selectors file

d816e7b

Add some tests

9344295

Add masked_axis_select

6d7f787

Introduce masked_select similar to NumPy and PyTorch

7db6023

mratsim mentioned this pull request Apr 5, 2020

masked_select: Filtering tensors with a boolean mask tensor #400

Closed

mratsim added 2 commits April 7, 2020 21:05

Add test for 2D masking with fill along axis

e03336b

Allow filling with a tensor for masked_axis_fill + test masked_fill

38f048c

mratsim merged commit 532f45a into master Apr 8, 2020

mratsim added a commit that referenced this pull request Apr 19, 2020

Fix broadcast operators from #429 using deprecated syntax

bde1880

mratsim mentioned this pull request Apr 19, 2020

Implement Numpy fancy indexing #434

Merged

7 tasks

mratsim deleted the masked_select branch April 20, 2020 10:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Masked select and Masked fill #429

[RFC] Masked select and Masked fill #429

mratsim commented Apr 5, 2020 •

edited

Loading

brentp commented Apr 5, 2020

Vindaar commented Apr 6, 2020

mratsim commented Apr 7, 2020

	test "Masked_select":
	block: # Numpy reference doc
	# https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#boolean-array-indexing
	# select non NaN
	# x = np.array([[1., 2.], [np.nan, 3.], [np.nan, np.nan]])
	# x[~np.isnan(x)]
	# array([ 1., 2., 3.])
	let x = [[1.0, 2.0],
	[Nan, 3.0],
	[Nan, Nan]].toTensor

	let r = x.masked_select(x.isNotNan)

	let expected = [1.0, 2.0, 3.0].toTensor()
	check: r == expected

	var a = [[-1, -2, 1],
	[ 1, 2, 0],
	[ 1, -1, 1]].toTensor

	let cond = squeeze(a.sum(axis = 0) .> 1)
	a.masked_axis_fill(cond, axis = 1, -10)

	let expected = [[-1, -2, -10],
	[ 1, 2, -10],
	[ 1, -1, -10]].toTensor

	check: a == expected

	block:
	var a = [[-1, -2, 1],
	[ 1, 2, 0],
	[ 1, -1, 1]].toTensor

	a.masked_fill_along_axis(a.sum(axis = 0) .> 1, axis = 0, -10)

	let expected = [[-1, -2, -10],
	[ 1, 2, -10],
	[ 1, -1, -10]].toTensor

	check: a == expected

[RFC] Masked select and Masked fill #429

[RFC] Masked select and Masked fill #429

Conversation

mratsim commented Apr 5, 2020 • edited Loading

State

API design

masked_select

Tricky part

masked_axis_select

Tricky part

masked_fill

masked_axis_fill

Tricky part

masked_filled_along_axis

Tricky part

brentp commented Apr 5, 2020

Vindaar commented Apr 6, 2020

masked_select

masked_axis_select

masked_fill

masked_axis_fill

masked_filled_along_axis

mratsim commented Apr 7, 2020

mratsim commented Apr 5, 2020 •

edited

Loading

`masked_select`

`masked_axis_select`

`masked_fill`

`masked_axis_fill`

`masked_filled_along_axis`