
PERF: use sparse matrix for table.collapse(..., one_to_many=True) #884

Merged (12 commits) Dec 9, 2022

Conversation

@wasade wasade commented Dec 7, 2022

Avoid dense representation on one_to_many. This code path was implicated in woltka collapse.

Note this is a WIP; I am currently verifying the memory reduction with a woltka collapse run. However, the high memory requirement is almost certainly due to the dense representation previously used.

cc @antgonza @qiyunzhu

wasade commented Dec 7, 2022

Test failure is due to doc build with Sphinx on py3.7, attempting to work through it.

wasade commented Dec 7, 2022

...okay, now waiting on resources to test this.

@qiyunzhu, Antonio is out-of-office for a while. Is there any chance you could do a review of this PR? The change is relatively straightforward: instead of aggregating data in a dense numpy matrix, we aggregate into a "dict of keys" sparse matrix followed by conversion to compressed sparse column.
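The aggregation strategy described above can be sketched as follows. This is a hypothetical illustration, not the actual biom-format code: the `collapse_one_to_many` function, its signature, and the mapping structure are invented for the example; only the DOK-then-CSC pattern reflects the change in this PR.

```python
import numpy as np
from scipy.sparse import dok_matrix


def collapse_one_to_many(data, mapping, n_collapsed):
    """Sum feature rows into collapsed target rows using a sparse accumulator.

    data: (n_features, n_samples) array; mapping: feature index -> list of
    target indices (one feature may map to many targets). Illustrative only.
    """
    n_samples = data.shape[1]
    # "Dictionary of keys" sparse matrix: cheap incremental writes,
    # no dense (n_collapsed, n_samples) allocation.
    out = dok_matrix((n_collapsed, n_samples), dtype=np.float64)
    for feature_idx, targets in mapping.items():
        row = np.asarray(data[feature_idx]).ravel()
        cols = row.nonzero()[0]
        for target_idx in targets:
            for col in cols:
                out[target_idx, col] += row[col]
    # Convert once at the end to compressed sparse column for downstream use.
    return out.tocsc()


data = np.array([[1., 0., 2.],
                 [0., 3., 0.]])
mapping = {0: [0, 1], 1: [1]}  # feature 0 collapses into two targets
collapsed = collapse_one_to_many(data, mapping, 2)
```

The key point is that memory scales with the number of nonzero entries rather than with the full collapsed table.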

@qiyunzhu qiyunzhu left a comment


@wasade This looks impressive! I think it is highly appreciable that you are continuing to develop the BIOM package to enable high-performance table operations. The use of SciPy's sparse matrix could mean a significant advantage over other implementations (especially the R solutions).

I have one question concerning the data types: woltka's one-to-many collapsing can optionally divide read counts by feature counts. If the current method has a similar mechanism, it could mean that integers become floats. Will that cause any problem in BIOM, both practically and theoretically (since it is no longer a "contingency table")?

In line 2673, I think np.float should be okay. But just wanted to remind, if relevant, that you may consider casting it to a more explicit type (like np.float64, which is basically the same as Python float), and think a bit to make sure there isn't a concern of overflow in extreme cases (for example, if the data type before casting is np.uint64, since all numbers are non-negative).

wasade commented Dec 7, 2022

The current method supports divide, and that poses no problem for BIOM. Good call on float64; just pushed a change for that.

The divide may trigger an underflow rather than an overflow, in which case numpy would warn:

In [2]: v = np.float64(1e1000)

In [3]: v /= 1e1000
<ipython-input-3-4eac3ab25e8a>:1: RuntimeWarning: invalid value encountered in double_scalars
  v /= 1e1000

In [4]: v
Out[4]: nan
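The integer-to-float transition raised in the review can be illustrated with a minimal sketch. This is not biom-format code; the array values are made up to show how numpy's true division promotes unsigned integer counts to float64:

```python
import numpy as np

# Read counts stored as unsigned integers, as in a typical count table.
counts = np.array([10, 4, 6], dtype=np.uint64)
n_targets = np.uint64(4)

# True division of uint64 by uint64 promotes the result to float64,
# so an explicit cast to np.float64 matches what numpy produces anyway.
divided = counts / n_targets

print(divided.dtype)  # float64
print(divided)        # [2.5 1.  1.5]
```

Since all values are non-negative, the only hazard is loss of precision for counts above 2**53, where float64 can no longer represent every integer exactly.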

@qiyunzhu qiyunzhu left a comment


Looks great to me now!

wasade commented Dec 7, 2022

Thanks! My test is still running, so I will hold off on merging for the time being.

@wasade wasade merged commit 30dce98 into biocore:master Dec 9, 2022
@wasade wasade mentioned this pull request Dec 9, 2022