feat: offset indices in sparse tensor #3725

Open
wants to merge 9 commits into
base: main

itzhakstern
Contributor

@itzhakstern itzhakstern commented Jan 26, 2025

Change the indices of sparse tensors from positions of non-zero elements to offsets.

For example, if we have a tensor:

[0, 29, 0, 0, 17, 6]

its sparse representation would be:

  • values: 29,17,6
  • indices: 1,4,5

Instead, we want to store the differences (offsets) between consecutive indices, with the first offset equal to the first index itself:

  • values: 29,17,6
  • indices: 1,3,1

This approach seems to significantly reduce the size of compressed Parquet files, as the indices diffs contain repetitive patterns, which compress well.
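
The conversion described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the function names `indices_to_offsets` and `offsets_to_indices` and the `u64` element type are assumptions made for the example, and the input indices are assumed to be sorted in strictly increasing order.

```rust
/// Convert positions of non-zero elements into offsets.
/// The first offset is the first position itself; each later offset is the
/// gap from the previous position. Assumes `indices` is sorted ascending
/// (hypothetical helper, not part of the PR).
fn indices_to_offsets(indices: &[u64]) -> Vec<u64> {
    let mut prev = 0u64;
    indices
        .iter()
        .map(|&idx| {
            let offset = idx - prev;
            prev = idx;
            offset
        })
        .collect()
}

/// Recover the original positions by taking a running sum of the offsets
/// (hypothetical helper, not part of the PR).
fn offsets_to_indices(offsets: &[u64]) -> Vec<u64> {
    offsets
        .iter()
        .scan(0u64, |acc, &off| {
            *acc += off;
            Some(*acc)
        })
        .collect()
}
```

For the tensor above, `indices_to_offsets(&[1, 4, 5])` yields `[1, 3, 1]`, and `offsets_to_indices(&[1, 3, 1])` restores `[1, 4, 5]`, so the transformation is lossless.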

@itzhakstern itzhakstern changed the title offset indices in sparse tensor feat: offset indices in sparse tensor Jan 26, 2025
@github-actions github-actions bot added the feat label Jan 26, 2025

codspeed-hq bot commented Jan 26, 2025

CodSpeed Performance Report

Merging #3725 will improve performance by 82.23%

Comparing itzhakstern:itzhaks/sparse-tensors-indices-to-offsets (5629ece) with main (86adc44)

Summary

⚡ 2 improvements
✅ 25 untouched benchmarks

Benchmarks breakdown

Benchmark                                    BASE      HEAD      Change
test_count[1 Small File]                     3.7 ms    3.4 ms    +10%
test_iter_rows_first_row[100 Small Files]    342.5 ms  188 ms    +82.23%


codecov bot commented Jan 26, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.86%. Comparing base (6deb87e) to head (5629ece).
Report is 1 commit behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3725      +/-   ##
==========================================
+ Coverage   77.31%   77.86%   +0.54%     
==========================================
  Files         734      735       +1     
  Lines       93737    93128     -609     
==========================================
+ Hits        72475    72513      +38     
+ Misses      21262    20615     -647     
Files with missing lines Coverage Δ
src/daft-core/src/array/ops/cast.rs 88.60% <100.00%> (+0.16%) ⬆️

... and 20 files with indirect coverage changes

Contributor

@raunakab raunakab left a comment

I'm still combing through the PR, but just have a minor nit that I'd like to get out of the way first.

non_zero_values.push(data);
non_zero_indices.push(indices);
non_zero_indices.push(ofssets_indices_arr);

Minor nit: typo here. ofssets_indices_arr should be offsets_indices_arr.

@raunakab
Contributor

raunakab commented Feb 5, 2025

@itzhakstern I know the PR is still in draft, but if you don't mind, could you please add a description on what this PR is trying to achieve?

I can gather that it enables storing index offsets in sparse tensors, but a succinct description would make the PR easier to follow for other reviewers as well. Thanks in advance!

@itzhakstern itzhakstern marked this pull request as ready for review February 6, 2025 07:37