Fix for integer eq_hist #121
Conversation
Previously there was an unintentional casting operation in `interpolate` that resulted in all calls to `eq_hist` taking the floating-point branch. This fixes that, allowing for better histogram equalization for integer arrays. Fixes #120. Note that I still think `how='log'` works better for highly skewed distributions.
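For context, a rough sketch of the idea (not datashader's actual implementation; the names here are illustrative): histogram equalization can use exact value counts when the input dtype is integral, rather than falling back to the binned floating-point path.

```python
import numpy as np

def eq_hist_sketch(data, nbins=256):
    """Illustrative histogram equalization with an integer fast path."""
    if np.issubdtype(data.dtype, np.integer):
        # Integer branch: exact counts over the distinct values.
        values, counts = np.unique(data, return_counts=True)
    else:
        # Float branch: approximate histogram over fixed bins, ignoring NaNs.
        counts, edges = np.histogram(data[~np.isnan(data)], bins=nbins)
        values = (edges[:-1] + edges[1:]) / 2.0
    cdf = counts.cumsum().astype(np.float64)
    cdf /= cdf[-1]                      # normalize the CDF to [0, 1]
    # Map each element to its equalized value via the CDF.
    return np.interp(data.astype(np.float64), values, cdf)
```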
```python
def _cbrt(data, mask=None):
    return _linear(data, mask) ** (1/3.)
```
This seems like an important and undesirable change -- whereas users could previously supply simple, user-meaningful code as the "how", e.g. `np.log1p` or `lambda x: x ** 0.42`, now they have to import `_linear` (which has a name that implies people shouldn't mess with it), and then come up with `lambda x, mask=None: _linear(x, mask) ** 0.42`. Is there no way to support masks that can separate these two concerns, maybe by having a simple type of `how` like the above that wouldn't handle the mask explicitly, and a more complex type, including `eq_hist`, that takes a separate mask? Or is the coercion to float something that would mess up other `how` options, not just `eq_hist`?
It's not the coercion to float that's the issue, it's the notion of missingness. We coerced to float before to fill missing values with `nan`, which were then mapped to `nan` by `log` and friends. This doesn't work for histogram equalization, so now we pass in the data, and a mask indicating where missing values are. This is the best way I could come up with, and seems perfectly reasonable to me. If you have a better idea, feel free to try it out.
Note that `_linear` is just the implementation of the linear transfer function, which I used to simplify implementing `log` and `cbrt`. There's no mandate on users to use it, and they shouldn't (that's why it's underscored). For your method, you'd do:
```python
def foo(data, mask):
    return np.where(mask, np.nan, data**0.42)
```
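To make the calling convention concrete, here's a hypothetical example (the zero-as-missing convention and the names are assumptions, not datashader's actual call site):

```python
import numpy as np

# Hypothetical aggregate: integer counts, with zeros standing in for "no data".
agg = np.array([[0, 1, 4],
                [9, 0, 16]])
mask = (agg == 0)                       # True where the pixel is missing

def foo(data, mask):
    return np.where(mask, np.nan, data ** 0.42)

out = foo(agg, mask)                    # missing cells become NaN; rest are floats
```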
Sure, the conversion to float is only a by-product of representing missingness, and the mask represents missingness explicitly, which is otherwise not possible for integers. In any case, for `foo` to have the same semantics as e.g. `cbrt`, wouldn't it have to be:
```python
def foo(data, mask):
    return np.where(mask, np.nan, data**0.42) if mask is not None else data**0.42
```
The awkwardness and non-obviousness of this is presumably why you used `_linear` as a shortcut in the code, but either way it's still awkward and non-obvious. If that's what users should do, then we should probably just put that code repetitively into `cbrt` and `log`, so that they act as good examples for people who look to see how to write their own `how` options.
Unfortunately, I can't think of a way around this problem that doesn't require passing in something extra alongside the data array -- only a float type can represent missingness directly, and once something's been converted to a float the information that it was ever integers is now only implicit in the actual data values. E.g. one could pass a flag saying the original datatype, so that the correct branch in `eq_hist` would be used, but then `np.log1p` wouldn't work as-is as it did before, and all the functions would have a non-obvious datatype argument that they are ignoring. So the code would be shorter, but not necessarily any more obvious. Grr!
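For concreteness, a rough sketch of that rejected alternative (hypothetical names, placeholder bodies): every `how` callable would accept the original dtype, but only `eq_hist` would actually consult it.

```python
import numpy as np

def log1p_how(data, orig_dtype):
    return np.log1p(data)               # ignores the dtype flag entirely

def eq_hist_how(data, orig_dtype):
    if np.issubdtype(orig_dtype, np.integer):
        # ...would take the exact integer branch here
        pass
    return data                         # placeholder for the real equalization
```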
> If that's what users should do, then we should probably just put that code repetitively into `cbrt` and `log`, so that they act as good examples for people who look to see how to write their own `how` options.
I guess better would be to move the implementation of `_linear` into a non-underscored (i.e., importable and reusable) helper function, e.g.:
```python
def nan_mask(data, mask=None):
    """Given matched-size data and Boolean mask arrays, return a floating-point
    version of the data with the masked values replaced with NaN."""
    return np.where(mask, np.nan, data) if mask is not None else data

def _linear(data, mask=None):
    return nan_mask(data, mask)
```
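If a helper like that were exported, the user-supplied `how` from the earlier example could avoid underscored internals entirely, e.g. (a sketch, assuming the `nan_mask` above were made public):

```python
# Hypothetical usage of the nan_mask helper sketched above:
how = lambda data, mask=None: nan_mask(data, mask) ** 0.42
```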
Actually not, because all applications in `interpolate` pass in both data and a mask, so the simple `where` implementation is good enough. Looking at the docstring for the function parameter, it specifies the desired behavior for a callable `how`, which a non-novice user should be able to meet. I'd rather not export such a simple helper function as part of the interface, because that would make it harder to change our own internal implementation if users come to depend on it.
Ok, sounds like the `transfer_fns` we supply should all use the simpler `where` implementation, not sharing code with `_linear` and not making `mask` optional, as a clear indication of how we expect people to write their own custom versions. Once that's done, I'm happy for this PR to be merged.
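For concreteness, a sketch of what that might look like (assuming a `log1p`-style log transfer; the actual function names and details may differ):

```python
import numpy as np

# Built-in transfer functions written the same way users would write their
# own: mask is required, and the `where` is inlined rather than shared.
def _log(data, mask):
    return np.where(mask, np.nan, np.log1p(data))

def _cbrt(data, mask):
    return np.where(mask, np.nan, data ** (1 / 3.0))
```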
The implementations are shared with `colorize`, which doesn't pass in a mask. This is why it's optional in our implementations, but not expected of users.
Well, can we change `colorize` to pass in a mask, then? I really want to minimize the actual and perceived complexity that the user has to deal with, and conveying that they don't have to support the optional argument (and under what conditions) is a complexity cost, just like handling the optional argument is a complexity cost.
I agree that the `log` version looks slightly better than the `eq_hist` version for the above example, but compared to e.g. a linear distribution, the plots are actually very similar to each other. For such similar plots, histogram equalization has the massive advantage of being non-parametric, i.e. it doesn't make any assumption about the form of the distribution, which is very important for visualizing unknown data (as is the main use case for the datashader library).

My guess is that most of the reason `log` looks somewhat better in the plots above is down to the colormap -- in the light-blue/dark-blue colormap, we can't perceive much difference between the various high values (dark blues), and so the high-population regions all look about the same as each other. In the `log` version, the high-population bits are mapped to a different part of the color range where we can perceive the differences better, but I would think that's just an indication that we should use a more perceptually uniform colormap, not a reason to use an a-priori arbitrary transform of the data. E.g. if we use a grey or viridis colormap, I think the `eq_hist` version looks better:

Some people could clearly prefer one over the other, of course, but at least I don't think there's any clear win for `log`, and I think at least the grey one shows that there is still structure in the hotspot regions that's not necessarily visible in the blue plots.
Taking this further, there are presumably subjective reasons we wouldn't necessarily want a purely perceptually uniform colormap -- it's very reasonable to want to reserve some good-sized chunk of the available dynamic range just for showing where the hottest hotspots are, particularly as in this case they are very, very hot (on a linear scale). E.g. the `log` versions of each plot make it more clear that L.A., Chicago, and NYC are the strongest bits. But highlighting those is probably something best done explicitly with a special colormap (e.g. with a special hot-spot color range at the top), not as an implicit outcome of a mismatch between the log scale and the nearly-but-not-quite-log data distribution as in this example.

The `eq_hist` version correctly maps those bits to just the very top of the color range, since only a few pixels approach the highest population counts, but because we might care a lot about those few pixels, we probably want a colormap that makes the very highest values clearly higher than those below them (which, after log or eq_hist normalization, may be very, very much higher than the next ones down). E.g. here's such a colormap, which shows that the information about the hottest spots hasn't been lost from the image representation, even if it's not visible in a particular colormap:
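For illustration, a rough matplotlib sketch of such a colormap (not the one used for the plots here): a grey ramp with roughly the top 5% of the normalized range reserved for a distinct hot-spot color.

```python
from matplotlib.colors import LinearSegmentedColormap

# Mostly a black-to-white ramp, with the top of the range mapped to red so
# the very hottest pixels stand out even after log or eq_hist normalization.
hotspot_cmap = LinearSegmentedColormap.from_list(
    "grey_with_hotspots",
    [(0.00, "black"),
     (0.95, "white"),
     (0.96, "red"),
     (1.00, "red")],
)
```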
`how` in `colorize` now takes data and mask, same as in `interpolate`.
Looks good now, thanks!