[FEATURE] "group by" for cat axes / index based slicing #211

Open
andrzejnovak opened this issue Apr 30, 2021 · 13 comments
andrzejnovak commented Apr 30, 2021

I don't think this is currently implemented, but it would be super useful: it would allow merging samples that were processed separately.

I am imagining syntax like:
h[{'category': {'merged': ['sampleA', 'sampleB'], ...}}]

Also I thought

h[...,[0, 2], ...]

would work, but it doesn't seem to be possible currently.

andrzejnovak added the enhancement (New feature or request) label on Apr 30, 2021
andrzejnovak changed the title from [FEATURE] "group by" for cat axes to [FEATURE] "group by" for cat axes / index based slicing on May 1, 2021
@andrzejnovak
Member Author

@henryiii
Member

henryiii commented May 1, 2021

h2[...] = h1.view(flow=True) allows assignment. Histogram-to-histogram assignment would probably need some checking on the axes, which is not implemented yet.

@henryiii
Member

henryiii commented May 1, 2021

h[...,[0, 2], ...] would probably need (or be best with) support in Boost.Histogram, see scikit-hep/boost-histogram#296. @HDembinski, is this something that can be supported upstream? If we have to, we can implement it in boost-histogram via a workaround.

PS: Assuming this is for unordered axes only.

@henryiii
Member

henryiii commented May 1, 2021

PS: The issue that got opened and fixed in Boost.Histogram was for slicing on categorical axes, which enabled h[..., 0:2, ...] to work, but not selecting a subset of a categorical axis, as is requested here and mentioned on the original boost-histogram issue.

@andrzejnovak
Member Author

andrzejnovak commented May 1, 2021

OK, thanks. In case anyone stumbles on this, the following seems to be the workaround (thanks @henryiii):

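import hist
from hist import Hist

def groupby(h, groupmap, axis='dataset'):
    # Keep every axis except the one being regrouped, then append a new
    # growable categorical axis whose categories are the group names.
    new = Hist(
        *[ax for ax in h.axes if ax.name != axis],
        hist.axis.StrCategory(list(groupmap), name=axis, growth=True),
        storage=hist.storage.Weight(),  # assumes h also uses Weight storage
    )

    for group, cats in groupmap.items():
        # Sum the slices of all original categories that belong to this group.
        grouped = sum(h[{axis: cat}] for cat in cats)
        new[{axis: group}] = grouped.view(flow=True)
    return new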

@HDembinski
Member

Seems like a nice feature for boost-histogram.

@andrzejnovak
Member Author

Related issue: the new[{axis: name}] = grouped.view(flow=True) syntax fails when the growth axis dimensions don't match.

@henryiii
Member

henryiii commented May 18, 2021

> Seems like a nice feature for boost-histogram.

boost-histogram doesn't have named axes, so it wouldn't be as pretty, and would need another layer of wrapping in Hist anyway, just like fill, project, ... (not against it, but probably best to implement it here first)

> This new[{axis: name}] = grouped.view(flow=True) syntax fails when growth axis dimensions don't match.

How would it know what entries to add?

> h[...,[0, 2], ...]

This is almost implementable on top of scikit-hep/boost-histogram#576, save for the caveats mentioned there.

@andrzejnovak
Member Author

> How would it know what entries to add?

Admittedly I didn't think about it too deeply, but couldn't it just pad with zeros along the missing categorical entries? That should be equivalent to adding two histograms whose growth/cat axes have different entries.
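
To make the zero-padding idea concrete, here is a rough NumPy-only sketch. It is not how hist or boost-histogram handles this; the helper name `add_with_padding` and the (category, bin) array layout are assumptions made purely for illustration.

```python
import numpy as np


def add_with_padding(cats_a, counts_a, cats_b, counts_b):
    """Add two (category, bin) count arrays whose category lists differ.

    Hypothetical helper: categories missing from one input are treated
    as rows of zeros, so the result covers the union of both lists.
    """
    # Union of the category labels, keeping first-seen order.
    union = list(dict.fromkeys([*cats_a, *cats_b]))
    total = np.zeros((len(union), counts_a.shape[1]), dtype=float)
    for cats, counts in ((cats_a, counts_a), (cats_b, counts_b)):
        # Row positions of this input's categories inside the union.
        rows = [union.index(c) for c in cats]
        total[rows] += counts
    return union, total


cats, total = add_with_padding(
    ["A", "B"], np.array([[1.0, 2.0], [3.0, 4.0]]),
    ["B", "C"], np.array([[10.0, 20.0], [30.0, 40.0]]),
)
```

Here "B" appears in both inputs and is summed, while "A" and "C" each appear in only one input and are carried over unchanged.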

@nsmith-
Member

nsmith- commented Jan 5, 2023

Since we already know the new axis elements (the dictionary keys), I think we could have a workaround without growth as follows:

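import hist

def group(h: hist.Hist, oldname: str, newname: str, grouping: dict[str, list[str]]):
    # New histogram: the grouped axis first, then every axis except the old one.
    hnew = hist.Hist(
        hist.axis.StrCategory(grouping, name=newname),  # iterating the dict yields its keys
        *(ax for ax in h.axes if ax.name != oldname),
        storage=h._storage_type,
    )
    for i, indices in enumerate(grouping.values()):
        # Pick the member categories, sum over the old axis, and copy the
        # result (including flow bins) into entry i of the new axis.
        hnew.view(flow=True)[i] = h[{oldname: indices}][{oldname: sum}].view(flow=True)

    return hnew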

Note that the new axis is put at the beginning (for convenience in implementation). I couldn't find a public accessor for the storage type though.

An example

h = (
    hist.Hist.new
    .StrCat("abcde", name="letter")
    .Reg(10, 0, 1, name="number")
    .Double()
)

grouping = {
    "vowel": ["a", "e"],
    "consonant": ["b", "c", "d"],
}

print(group(h, "letter", "type", grouping))

returning

Hist(
  StrCategory(['vowel', 'consonant'], name='type', label='type'),
  Regular(10, 0, 1, name='number', label='number'),
  storage=Double())

@nsmith-
Member

nsmith- commented Oct 17, 2023

A small update to my previous comment: the workaround now needs h._storage_type() due to a warning about passing the type and not an instance. Perhaps we can have a public accessor for the storage type that is stable?

@henryiii
Member

henryiii commented Oct 17, 2023

Can't you use h.storage_type?

@nsmith-
Member

nsmith- commented Oct 17, 2023

Oops, guess it exists now!
