[query] Improve _same to make exactly one pass over the data #13151

danking · 2023-06-07T19:39:18Z

This avoids an extra pass to find the failing rows and also avoids an extra pass if the globals depend on non-global data. In particular, this pipeline would run the column aggregations four times (IMO, at most twice is OK):

import hail as hl

mt = hl.utils.range_matrix_table(2,2)
mt = mt.annotate_entries(x = mt.row_idx * mt.col_idx)
# column aggregation: per-column mean of x
mt = mt.annotate_cols(mean_x = hl.agg.mean(mt.x))
mt = mt.annotate_entries(x = mt.x - mt.mean_x)
# compare the matrix table with itself
mt._same(mt)
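
For readers who want the "exactly one pass" idea in isolation, here is a minimal plain-Python sketch. It is not Hail's `_same` implementation; `same_one_pass` and `max_examples` are hypothetical names made up for illustration. The point is only that the verdict and the failing rows can be gathered in the same traversal, so there is no second pass to locate the failures.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel for a row that exists on only one side


def same_one_pass(left, right, max_examples=5):
    """Compare two row iterables in a single traversal.

    Hypothetical helper, not part of Hail. Returns a tuple
    (is_same, n_mismatches, examples) where examples holds up to
    max_examples entries of (row_index, left_row, right_row).
    """
    n_mismatches = 0
    examples = []
    pairs = zip_longest(left, right, fillvalue=_MISSING)
    for idx, (l, r) in enumerate(pairs):
        if l is _MISSING or r is _MISSING or l != r:
            n_mismatches += 1
            # Record the failing row now instead of re-scanning later.
            if len(examples) < max_examples:
                examples.append((
                    idx,
                    None if l is _MISSING else l,
                    None if r is _MISSING else r,
                ))
    return n_mismatches == 0, n_mismatches, examples


# One-shot iterators can only be consumed once, so this call
# cannot be making a second pass over either input.
is_same, n_bad, bad_rows = same_one_pass(iter([1, 2, 3]), iter([1, 5]))
```

Because the inputs above are one-shot iterators, a two-pass approach (check first, then find the failing rows) would have nothing left to read on the second pass.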


danking assigned chrisvittal Jun 7, 2023

chrisvittal approved these changes Jun 7, 2023


danking merged commit 62f606c into hail-is:main Jun 13, 2023
