Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added ghead/gtail support for n>1 #5089

Merged
merged 17 commits into from
Aug 25, 2021
Merged

added ghead/gtail support for n>1 #5089

merged 17 commits into from
Aug 25, 2021

Conversation

ben-schwen
Copy link
Member

@ben-schwen ben-schwen commented Jul 30, 2021

Closes #5060

Adds Gforce support for ghead/gtail and n > 1.

Note: This leads to the change of test 3 test cases which explicitly checked the non support of ghead/gtail for n != 0.

@ben-schwen ben-schwen added feature request GForce issues relating to optimized grouping calculations (GForce) labels Jul 31, 2021
@jangorecki
Copy link
Member

jangorecki commented Aug 1, 2021

Nice, this change should nicely speed up db-benchmark group q8. https://h2oai.github.io/db-benchmark/

@mattdowle
Copy link
Member

Looks great!
I'm just looking at the lines of code, and the common lines of code in this file in general. I'm wondering why don't these functions gather the row numbers first, and then call subsetVectorRaw to get the result. That would collapse the lines of code in the file dramatically (all the type cases would go away) and if .SD has a lot of columns, then the processing of grpsize, ff etc to gather the row numbers need only happen once rather than for each column.

@ben-schwen
Copy link
Member Author

@mattdowle that sounds pretty reasonable as long as we do not introduce something like na.rm or na.omit for ghead/gtail, since then we would still need the cases?

A proof of concept (if I understood it right) can be found in

data.table/src/gsumm.c

Lines 1107 to 1123 in 27a829c

SEXP idx = allocVector(INTSXP, anslen);
int *idxp = INTEGER(idx);
int offset = 0;
for (int i=0; i<ngrp; ++i) {
const int thisgrpsize = grpsize[i];
for (int j=0; j<val; ++j) {
if (j > thisgrpsize-1) break;
int k = ff[i]+j-1;
if (isunsorted) k = oo[k]-1;
k = (irowslen == -1) ? k : irows[k]-1;
idxp[offset+j] = k+1;
}
if (val > thisgrpsize) offset += thisgrpsize; else offset += val;
}
SEXP ans = PROTECT(allocVector(TYPEOF(x), anslen));
copyMostAttrib(x, ans);
subsetVectorRaw(ans, x, idx, false);
cutting down the lines of code from >100 to <40. Same approach could be used for gfirst/gtail and gnthvalue.

@mattdowle mattdowle added this to the 1.14.1 milestone Aug 25, 2021
@codecov
Copy link

codecov bot commented Aug 25, 2021

Codecov Report

Merging #5089 (fde19a9) into master (b33dee6) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5089      +/-   ##
==========================================
- Coverage   99.53%   99.53%   -0.01%     
==========================================
  Files          76       76              
  Lines       14468    14448      -20     
==========================================
- Hits        14401    14381      -20     
  Misses         67       67              
Impacted Files Coverage Δ
R/data.table.R 99.94% <100.00%> (+<0.01%) ⬆️
src/gsumm.c 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b33dee6...fde19a9. Read the comment docs.

@mattdowle mattdowle merged commit 7f3fba9 into master Aug 25, 2021
@mattdowle mattdowle deleted the ghead branch August 25, 2021 08:05
@jangorecki
Copy link
Member

Looks great, do we have tests for n=0, n=NA and negative n?

@mattdowle
Copy link
Member

Hm. Good point. Looks like we don't, and they don't work.

@ben-schwen ben-schwen mentioned this pull request Oct 9, 2021
7 tasks
mattdowle added a commit that referenced this pull request Dec 3, 2021
@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request GForce issues relating to optimized grouping calculations (GForce)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GForce optimization for head/tail with arbitrary n
3 participants