Fix for integer eq_hist #121
Conversation
Previously there was an unintentional casting operation in `interpolate` that resulted in all calls to `eq_hist` taking the floating-point branch. This fixes that, allowing for better histogram equalization for integer arrays. Fixes #120. Note that I still think `how='log'` works better for highly skewed distributions.
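For context, a rough sketch of the idea (not datashader's actual implementation; the names here are illustrative): histogram equalization can use exact value counts when the input dtype is integral, rather than falling back to the binned floating-point path.

```python
import numpy as np

def eq_hist_sketch(data, nbins=256):
    """Illustrative histogram equalization with an integer fast path."""
    if np.issubdtype(data.dtype, np.integer):
        # Integer branch: exact counts over the distinct values.
        values, counts = np.unique(data, return_counts=True)
    else:
        # Float branch: approximate histogram over fixed bins, ignoring NaNs.
        counts, edges = np.histogram(data[~np.isnan(data)], bins=nbins)
        values = (edges[:-1] + edges[1:]) / 2.0
    cdf = counts.cumsum().astype(np.float64)
    cdf /= cdf[-1]                      # normalize the CDF to [0, 1]
    # Map each element to its equalized value via the CDF.
    return np.interp(data.astype(np.float64), values, cdf)
```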
```python
def _cbrt(data, mask=None):
    return _linear(data, mask) ** (1/3.)
```
This seems like an important and undesirable change -- whereas users could previously supply simple, user-meaningful code as the "how", e.g. `np.log1p` or `lambda x: x ** 0.42`, now they have to import `_linear` (which has a name that implies people shouldn't mess with it), and then come up with `lambda x, mask=None: _linear(x, mask) ** 0.42`. Is there no way to support masks that can separate these two concerns, maybe by having a simple type of `how` like the above that wouldn't handle the mask explicitly, and a more complex type, including `eq_hist`, that takes a separate mask? Or is the coercion to float something that would mess up other `how` options, not just `eq_hist`?
It's not the coercion to float that's the issue, it's the notion of missingness. We coerced to float before to fill missing values with `nan`, which were then mapped to `nan` by `log` and friends. This doesn't work for histogram equalization, so now we pass in the data, and a mask indicating where missing values are. This is the best way I could come up with, and seems perfectly reasonable to me. If you have a better idea, feel free to try it out.
Note that `_linear` is just the implementation of the linear transfer function, which I used to simplify implementing `log` and `cbrt`. There's no mandate on users to use it, and they shouldn't (that's why it's underscored). For your method, you'd do:
```python
def foo(data, mask):
    return np.where(mask, np.nan, data**0.42)
```
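To make the calling convention concrete, here's a hypothetical example (the zero-as-missing convention and the names are assumptions, not datashader's actual call site):

```python
import numpy as np

# Hypothetical aggregate: integer counts, with zeros standing in for "no data".
agg = np.array([[0, 1, 4],
                [9, 0, 16]])
mask = (agg == 0)                       # True where the pixel is missing

def foo(data, mask):
    return np.where(mask, np.nan, data ** 0.42)

out = foo(agg, mask)                    # missing cells become NaN; rest are floats
```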
Sure, the conversion to float is only a by-product of representing missingness, and the mask represents missingness explicitly, which is otherwise not possible for integers. In any case, for `foo` to have the same semantics as e.g. `cbrt`, wouldn't it have to be:
```python
def foo(data, mask):
    return np.where(mask, np.nan, data**0.42) if mask is not None else data**0.42
```
The awkwardness and non-obviousness of this is presumably why you used `_linear` as a shortcut in the code, but either way it's still awkward and non-obvious. If that's what users should do, then we should probably just put that code repetitively into `cbrt` and `log`, so that they act as good examples for people who look to see how to write their own `how` options.
Unfortunately, I can't think of a way around this problem that doesn't require passing in something extra alongside the data array -- only a float type can represent missingness directly, and once something's been converted to a float the information that it was ever integers is now only implicit in the actual data values. E.g. one could pass a flag saying the original datatype, so that the correct branch in `eq_hist` would be used, but then `np.log1p` wouldn't work as-is as it did before, and all the functions would have a non-obvious datatype argument that they are ignoring. So the code would be shorter, but not necessarily any more obvious. Grr!
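For concreteness, a rough sketch of that rejected alternative (hypothetical names, placeholder bodies): every `how` callable would accept the original dtype, but only `eq_hist` would actually consult it.

```python
import numpy as np

def log1p_how(data, orig_dtype):
    return np.log1p(data)               # ignores the dtype flag entirely

def eq_hist_how(data, orig_dtype):
    if np.issubdtype(orig_dtype, np.integer):
        # ...would take the exact integer branch here
        pass
    return data                         # placeholder for the real equalization
```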
> If that's what users should do, then we should probably just put that code repetitively into `cbrt` and `log`, so that they act as good examples for people who look to see how to write their own `how` options.
I guess better would be to move the implementation of `_linear` into a non-underscored (i.e., importable and reusable) helper function, e.g.:
```python
def nan_mask(data, mask=None):
    """Given matched-size data and Boolean mask arrays, return a floating-point
    version of the data with the masked values replaced with NaN."""
    return np.where(mask, np.nan, data) if mask is not None else data

def _linear(data, mask=None):
    return nan_mask(data, mask)
```
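If a helper like that were exported, the user-supplied `how` from the earlier example could avoid underscored internals entirely, e.g. (a sketch, assuming the `nan_mask` above were made public):

```python
# Hypothetical usage of the nan_mask helper sketched above:
how = lambda data, mask=None: nan_mask(data, mask) ** 0.42
```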
Actually not, because all applications in `interpolate` pass in both data and a mask, so the simple `where` implementation is good enough. Looking at the docstring for the function parameter, it specifies the desired behavior for a callable `how`, which a non-novice user should be able to meet. I'd rather not export such a simple helper function as part of the interface, because that would make it harder to change our own internal implementation if users come to depend on it.
Ok, sounds like the `transfer_fns` we supply should all use the simpler `where` implementation, not sharing code with `_linear` and not making `mask` optional, as a clear indication of how we expect people to write their own custom versions. Once that's done, I'm happy for this PR to be merged.
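For concreteness, a sketch of what that might look like (assuming a `log1p`-style log transfer; the actual function names and details may differ):

```python
import numpy as np

# Built-in transfer functions written the same way users would write their
# own: mask is required, and the `where` is inlined rather than shared.
def _log(data, mask):
    return np.where(mask, np.nan, np.log1p(data))

def _cbrt(data, mask):
    return np.where(mask, np.nan, data ** (1 / 3.0))
```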
The implementations are shared with `colorize`, which doesn't pass in a mask. This is why it's optional in our implementations, but not expected of users.
Well, can we change `colorize` to pass in a mask, then? I really want to minimize the actual and perceived complexity that the user has to deal with, and conveying that they don't have to support the optional argument (and under what conditions) is a complexity cost, just like handling the optional argument is a complexity cost.
I agree that the `log` version looks slightly better than the `eq_hist` version for the above example, but compared to e.g. a linear distribution, the plots are actually very similar to each other. For such similar plots, histogram equalization has the massive advantage of being non-parametric, i.e. it doesn't make any assumption about the form of the distribution, which is very important for visualizing unknown data (as is the main use case for the datashader library).

My guess is that most of the reason `log` looks somewhat better in the plots above is down to the colormap -- in the light-blue/dark-blue colormap, we can't perceive much difference between the various high values (dark blues), and so the high-population regions all look about the same as each other. In the `log` version, the high-population bits are mapped to a different part of the color range where we can perceive the differences better, but I would think that's just an indication that we should use a more perceptually uniform colormap, not a reason to use an a-priori arbitrary transform of the data. E.g. if we use a grey or viridis colormap, I think the `eq_hist` version looks better:

Some people could clearly prefer one over the other, of course, but at least I don't think there's any clear win for `log`, and I think at least the grey one shows that there is still structure in the hotspot regions that's not necessarily visible in the blue plots.
Taking this further, there are presumably subjective reasons we wouldn't necessarily want a purely perceptually uniform colormap -- it's very reasonable to want to reserve some good-sized chunk of the available dynamic range just for showing where the hottest hotspots are, particularly as in this case they are very, very hot (on a linear scale). E.g. the `log` versions of each plot make it more clear that L.A., Chicago, and NYC are the strongest bits. But highlighting those is probably something best done explicitly with a special colormap (e.g. with a special hot-spot color range at the top), not as an implicit outcome of a mismatch between the log scale and the nearly-but-not-quite-log data distribution as in this example.

The `eq_hist` version correctly maps those bits to just the very top of the color range, since only a few pixels approach the highest population counts, but because we might care a lot about those few pixels, we probably want a colormap that makes the very highest values clearly higher than those below them (which, after log or eq_hist normalization, may be very, very much higher than the next ones down). E.g. here's such a colormap, which shows that the information about the hottest spots hasn't been lost from the image representation, even if it's not visible in a particular colormap:
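For illustration, a rough matplotlib sketch of such a colormap (not the one used for the plots here): a grey ramp with roughly the top 5% of the normalized range reserved for a distinct hot-spot color.

```python
from matplotlib.colors import LinearSegmentedColormap

# Mostly a black-to-white ramp, with the top of the range mapped to red so
# the very hottest pixels stand out even after log or eq_hist normalization.
hotspot_cmap = LinearSegmentedColormap.from_list(
    "grey_with_hotspots",
    [(0.00, "black"),
     (0.95, "white"),
     (0.96, "red"),
     (1.00, "red")],
)
```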
`how` in `colorize` now takes data and mask, same as in `interpolate`.
Looks good now, thanks!