Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify group, bin, and reduce. #272

Merged
merged 48 commits into from
Mar 28, 2021
Merged

Unify group, bin, and reduce. #272

merged 48 commits into from
Mar 28, 2021

Conversation

mbostock
Copy link
Member

@mbostock mbostock commented Mar 25, 2021

TODO

  • Automatic “Frequency” label (or “count”?)
  • Proportion (normalize to [0, 1]) (proportion without an input channel)
  • Weighted proportion (sum + normalize) (proportion with an input channel)
  • Weighted percent
  • Per-facet normalization (normalize: "facet") (proportion-facet)
  • Per-group normalization (normalize: "z") (proportion-z) (proportion-group)
  • Percent (normalize to [0, 100])
  • Reducers for the bin transform
  • Cumulative reducers (needed for binning at least)?
  • Respect the scale’s domain (will land fix group domain #271)

@Fil
Copy link
Contributor

Fil commented Mar 25, 2021

I've fixed (sometimes half-fixed) a few tests; here's what I've noticed:

  • the automatic label “Frequency ↑” has disappeared—should we add it to the axis or can we keep it in the transform? (mobyDickLetterFrequency, mobyDickFaceted, wordLengthMobyDick);
  • penguinSpeciesGroup and wordLengthMobyDick need normalize: true to work;
  • penguinSpeciesIsland needs to specify z: "island" (not automatically taken from fill: "island") for grouping to work properly
  • defaults would give the same as now? e.g. groupX => {y: "count"} if y is not specified, and {y: "sum"} if it is? (mobyDickFaceted, mobyDickLetterPosition, penguinSpeciesGroup, penguinSpeciesIsland, seattleTemperatureCell, wordLengthMobyDick)

@mbostock
Copy link
Member Author

mbostock commented Mar 25, 2021

Thanks for the help.

  • I think we could add some implicit labels from the reducer but in some cases we may not want it because it’s better to propagate the reduced dimension’s label, e.g., if the y-position is computed as the grouped sum of units, then we probably want “↑ units” as the implicit label. If the source dimension does not have a label then I think it could make sense to fall back to the reducer name, say “↑ Frequency” (or “↑ Count”?) or “↑ Frequency (%)”.
  • Yep, plan to implement normalization, probably by extending the reducer interface somehow so that it understands scopes (facet or z).
  • Oops, that’s an oversight I re-introduced in a81ac42. Need to bring back the firstof to compute the group dimension as the first of {z, fill, stroke}.
  • Yep, plan to add an arguments.length === 1 check to allow default outputs like you said.

@mbostock
Copy link
Member Author

Yep, plan to add an arguments.length === 1 check to allow default outputs like you said.

Oops, I don’t think this will work, e.g., Plot.groupX({y: "count"}) is a single argument, but representing outputs rather than options. I think it’s better to keep the outputs as explicitly required, for clarity.

@mbostock
Copy link
Member Author

I think we also want to change the design of the bin transform to support reducers and be consistent with the group transform.

@mbostock
Copy link
Member Author

Pretty close now, but I don’t like the way that normalize is implemented. It feels like it should be tied to the reducer, not a separate option, so that some reducers can be normalized while others are not. I’m not sure yet how to express this.

@mbostock mbostock changed the title Unify group and reduce. Unify group, bin, and reduce. Mar 26, 2021
@mbostock
Copy link
Member Author

mbostock commented Mar 26, 2021

This should be ready for review now! 👏

I think there’s still room for optimization with the bin transform: we compute the bins across the entire data, and then use them to subset the groups, but I suspect there’s a better data structure we could be using to make this faster. (Maybe sorted indexes?) I don’t think this optimization is necessary to land this PR; I’m just making a note for future work.

I also want to land #271 but I think we can do that separately. I will work on that after landing this. (Or you can work on it and I can help!)

@mbostock mbostock force-pushed the mbostock/group-reduce branch from 159ceb8 to 779fa07 Compare March 27, 2021 02:41
@mbostock
Copy link
Member Author

I added a percent: true option to scales that applies a transform: x => x * 100 and adds an “(%)” for implicit axis labels.

Copy link
Contributor

@Fil Fil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic work!

I suggest a few changes (but I can send them as a PR if you prefer?):

  • rename context to basis
  • test defined for selectFirst and selectLast

const value = maybeInput(name, inputs);
const reducer = maybeReduce(reduce, value);
const [output, setOutput] = lazyChannel(labelof(value, reducer.label));
let V, O, context;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let V, O, context;
let V, O, basis;

rename for consistency with https://github.com/observablehq/plot/pull/272/files#diff-a376dffc7b64d2f0f302cb41caadd7b4d800823b63f90a2a5b16927dfea785daR211 ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intentionally gave this a more generic name than basis because it’s up to the reducer to decide what it means. In all the current implementations, it’s a basis for normalization, but it could be anything. It’s an internal-only name though so it shouldn’t matter too much what the name is.

V = valueof(data, value);
O = setOutput([]);
if (reducer.scope === "data") {
context = reducer.reduce(range(data), V);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
context = reducer.reduce(range(data), V);
basis = reducer.reduce(range(data), V);

},
scope(scope, I) {
if (reducer.scope === scope) {
context = reducer.reduce(I, V);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
context = reducer.reduce(I, V);
basis = reducer.reduce(I, V);

}
},
reduce(I) {
O.push(reducer.reduce(I, V, context));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
O.push(reducer.reduce(I, V, context));
O.push(reducer.reduce(I, V, basis));


const reduceFirst = {
reduce(I, X) {
return X[I[0]];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return X[I[0]];
return X[I.find(i => defined(X[i]))];

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a similar issue in selectFirst/selectMinX, which can return a data point with null x.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there’s a TODO there:

// TODO If the value (for some required channel) is undefined, scan forward?

The challenge with the selectFirst and selectLast transform is that there isn’t an associated channel (or channels), so it’s unclear which channel you should be testing for defined-ness. We can test for defined-ness here but that would be inconsistent with the automatic behavior for z, fill, and stroke, and also inconsistent with selectFirst. So I think we should punt on this and solve it separately.


const reduceLast = {
reduce(I, X) {
return X[I[I.length - 1]];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return X[I[I.length - 1]];
return X[I.slice().reverse().find(i => defined(X[i]))];

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition the issues mentioned above, this could be made faster. 🙂

Fil added a commit that referenced this pull request Mar 27, 2021
…llowing NaN, null and undefined as (ordinal) classes

groups respect the domain option

fixes #52
fixes #45
fixes #255
supersedes #272
@Fil
Copy link
Contributor

Fil commented Mar 27, 2021

I have a doubt about the "proportion-group" reduction, which doesn't correspond to my intuition. I've got a complex example here https://next.observablehq.com/d/61ca1967e419b882#HELP_NEEDED

EDIT: Here's a simpler example. https://observablehq.com/d/9727c8201871b9cb

@mbostock
Copy link
Member Author

Z is category in that example (because of the fill channel).

@mbostock
Copy link
Member Author

mbostock commented Mar 28, 2021

Re. https://observablehq.com/d/9727c8201871b9cb, there’s no way to get what you want (other than faceting). The proportion-group scope only applies to the {z, fill, stroke} dimension (here fill):

for (const [, I] of maybeGroup(facet, G)) {
for (const o of outputs) o.scope("group", I);

The proportion you want is relative to the y dimension.

for (const [y, gg] of maybeGroup(I, Y)) {

It so happens that we group y and then x, so it would be possible to support “proportion-y”, but we’d have to invert the order of the loops to support “proportion-x”. Which… would be possible with some variable name dancing (remapping x to y when invoking groupn), but would be a little tricky to get right. This is part of the reason I was thinking that we should eliminate the secondary group dimension in favor of faceting (but that would also mean that you can’t have additional grouping within the facet).

@mbostock
Copy link
Member Author

I guess we could get rid of “proportion-group”, and just require people to use faceting and “proportion-facet”? I guess another option would be to rename to to “proportion-{channel}” so that it’s more explicit and predictable, but then we’ll have to do some fancy dynamic ordering of the loops to ensure that the proportion channel is the outermost.

@mbostock
Copy link
Member Author

Also, proportion-group has a different meaning for the bin transform, which makes it extra confusing. proportion-group groups by {z, fill, stroke} + y for binX, whereas for groupX it only groups by {z, fill, stroke}.

plot/src/transforms/bin.js

Lines 119 to 120 in f08e17f

for (const [k, g] of maybeGroup(I, K)) {
for (const o of outputs) o.scope("group", g);

@mbostock mbostock merged commit b5b9471 into main Mar 28, 2021
@mbostock mbostock deleted the mbostock/group-reduce branch March 28, 2021 01:04
@Fil
Copy link
Contributor

Fil commented Mar 28, 2021

Yes I was reaching the same conclusion as your comment. For proportion-group the relevant group should be y for groupY, x for groupX, and z for groupZ, or we should be explicit about the group (proportion-x, proportion-y…). When/if we want to do that we'll have to loop in a different order for each case. Thank you for the clarification!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants