Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for integer eq_hist #121

Merged
merged 2 commits into from
Mar 23, 2016
Merged

Fix for integer eq_hist #121

merged 2 commits into from
Mar 23, 2016

Conversation

jcrist
Copy link
Collaborator

@jcrist jcrist commented Mar 21, 2016

Previously there was an unintentional casting operation in interpolate that resulted in all calls to eq_hist taking the floating point branch. This fixes that, allowing for better histogram equalization for integer arrays. Fixes #120.

Note that I still think how='log' works better for highly skewed distributions.

screen shot 2016-03-21 at 11 32 48 am

Previously there was an unintentional casting operation in `interpolate`
that resulted in all calls to `eq_hist` taking the floating point
branch. This fixes that, allowing for better histogram equalization for
integer arrays.

def _cbrt(data, mask=None):
return _linear(data, mask) ** (1/3.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an important and undesirable change -- whereas users could previously supply simple, user-meaningful code as the "how", e.g. np.log1p or lambda x: x ** 0.42, now they have to import _linear (which has a name that implies people shouldn't mess with it), and then come up with lambda x, mask=None: _linear(x, mask) ** 0.42. Is there no way to support masks that can separate these two concerns, maybe by having a simple type of how like the above, that wouldn't handle the mask explicitly, and a more complex type that includes eq_hist that has a separate mask? Or is the coercion to float something that would mess up other how options, not just eq_hist?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not the coercion to float that's the issue, it's the notion of missingness. We coerced to float before to fill missing values with nan, which were then mapped to nan by log and friends. This doesn't work for histogram equalization, so now we pass in the data, and a mask indicating where missing values are. This is the best way I could come up with, and seems perfectly reasonable to me. If you have a better idea, feel free to try it out.

Note that _linear is just the implementation of the linear transfer function, which I used to simplify implementing log and cbrt. There's no mandate on users to use it, and they shouldn't (why it's underscored). For your method, you'd do:

def foo(data, mask):
    return np.where(mask, np.nan, data**0.42)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, the conversion to float is only a by-product of representing missingness, and the mask represents missingness explicitly, which is otherwise not possible for integers.

In any case, for foo to have the same semantics as e.g. cbrt, wouldn't it have to be:

def foo(data, mask):
    return np.where(mask, np.nan, data**0.42) if mask is not None else data

The awkwardness and non-obviousness of this is presumably why you used _linear as a shortcut in the code, but either way it's still awkward and non-obvious. If that's what users should do, then we should probably just put that code repetitively into cbrt and log, so that they act as good examples for people who look to see how to write their own how options.

Unfortunately, I can't think of a way around this problem that doesn't require passing in something extra alongside the data array -- only a float type can represent missingness directly, and once something's been converted to a float the information that it was ever integers is now only implicit in the actual data values. E.g. one could pass a flag saying the original datatype, so that the correct branch in eq_hist would be used, but then np.log1p wouldn't work as-is as it did before, and all the functions would have a non-obvious datatype argument that they are ignoring. So the code would be shorter, but not necessarily any more obvious. Grr!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's what users should do, then we should probably just put that code repetitively into cbrt and log, so that they act as good examples for people who look to see how to write their own how options.

I guess better would be to move the implementation of _linear into a non-underscored (i.e., importable and reusable) helper function, e.g.:

def nan_mask(data, mask=None):
    """Given matched-size data and Boolean mask arrays, return a floating-point version of the data with the masked values replaced with NaN"""
    return np.where(mask, np.nan, data) if mask is not None else data

def _linear(data, mask=None):
    return nan_mask(data,mask)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually not, because all applications in interpolate pass in both data and a mask, so the simple where implementation is good enough. Looking at the docstring for the function parameter, it specifies the desired behavior for a callable how, which a non-novice user should be able to meet. I'd rather not export such a simple helper function as part of the interface, because then it makes it harder to change our own internal implementation if users depend on it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, sounds like the transfer_fns we supply should all use the simpler where implementation, not sharing code with _linear and not making mask be optional, as a clear indication of how we expect people to write their own custom versions. Once that's done, I'm happy for this PR to be merged.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementations are shared with colorize, which doesn't pass in a mask. This is why it's optional in our implementations, but not expected of users.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, can we change colorize to pass in a mask, then? I really want to minimize the actual and perceived complexity that the user has to deal with, and conveying that they don't have to support the optional argument (and under what conditions) is a complexity cost, just like handling the optional argument is a complexity cost.

@jbednar
Copy link
Member

jbednar commented Mar 21, 2016

I agree that the log version looks slightly better than the eq_hist version for the above example, but compared to e.g. a linear distribution, the plots are actually very similar to each other. For such similar plots, histogram equalization has the massive advantage of being non-parametric, i.e. it doesn't make any assumption about the form of the distribution, which is very important for visualizing unknown data (as is the main use case for the datashader library).

My guess is that most of the reason log looks somewhat better in the plots above is down to the colormap -- in the light-blue/dark-blue colormap, we can't perceive much difference between the various high values (dark blues), and so the high-population regions all look about the same as each other. In the log version, the high-population bits are mapped to a different part of the color range where we can perceive the differences better, but I would think that's just an indication that we should use a more perceptually uniform color map, not a reason to use an a-priori arbitrary transform of the data. E.g. if we use a grey or viridis colormap, I think the eq_hist version looks better:

image

image

Some people could clearly prefer one over the other, of course, but at least I don't think there's any clear win for log, and I think at least the grey one shows that there is still structure in the hotspot regions that's not necessarily visible in the blue plots.

@jbednar
Copy link
Member

jbednar commented Mar 21, 2016

Taking this further, there are presumably subjective reasons we wouldn't necessarily want a purely perceptually uniform colormap -- it's very reasonable to want to reserve some good-sized chunk of the available dynamic range just for showing where the hottest hotspots are, particularly as in this case they are very, very hot (on a linear scale). E.g. the log versions of each plot make it more clear that L.A., Chicago, and NYC are the strongest bits. But highlighting those is probably something best done explicitly with a special colormap (e.g. with a special hot-spot color range at the top), not as an implicit outcome of a mismatch between the log scale and the nearly-but-not-quite-log data distribution as in this example. The eq_hist version correctly maps those bits to just the very top of the color range, since only a few pixels approach the highest population counts, but because we might care a lot about those few pixels, we probably want a colormap that makes the very highest values clearly higher than those below them (which after log or eq_hist normalization, may be very, very much higher than the next ones down).

E.g. here's such a colormap, which shows that the information about the hottest spots hasn't been lost from the image representation, even if it's not visible in a particular colormap:

image

`how` in `colorize` now takes data and mask, same as in interpolate.
@jcrist jcrist merged commit b26a5d4 into holoviz:master Mar 23, 2016
@jcrist jcrist deleted the eq_hist_fix branch March 23, 2016 21:04
@jbednar
Copy link
Member

jbednar commented Mar 24, 2016

Looks good now, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

eq_hist forces values to floats, hobbling equalization of integer counts
2 participants