
PERF: use sparse matrix for table.collapse(..., one_to_many=True) #884

Merged (12 commits) Dec 9, 2022

Conversation

@wasade wasade commented Dec 7, 2022

Avoid dense representation on one_to_many. This code path was implicated in woltka collapse.

Note this is a WIP; I am currently verifying the memory reduction with a woltka collapse run. However, the high memory requirement is almost certainly due to the dense representation previously used.

cc @antgonza @qiyunzhu

wasade commented Dec 7, 2022

Test failure is due to doc build with Sphinx on py3.7, attempting to work through it.

wasade commented Dec 7, 2022

...okay, now waiting on resources to test this.

@qiyunzhu, Antonio is out-of-office for a while. Is there any chance you could do a review of this PR? The change is relatively straightforward: instead of aggregating data in a dense numpy matrix, we aggregate into a "dict of keys" sparse matrix followed by conversion to compressed sparse column.
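The aggregation strategy described above can be sketched as follows. This is a hypothetical illustration, not the actual biom-format code: the `collapse_one_to_many` function, its signature, and the mapping structure are invented for the example; only the DOK-then-CSC pattern reflects the change in this PR.

```python
import numpy as np
from scipy.sparse import dok_matrix


def collapse_one_to_many(data, mapping, n_collapsed):
    """Sum feature rows into collapsed target rows using a sparse accumulator.

    data: (n_features, n_samples) array; mapping: feature index -> list of
    target indices (one feature may map to many targets). Illustrative only.
    """
    n_samples = data.shape[1]
    # "Dictionary of keys" sparse matrix: cheap incremental writes,
    # no dense (n_collapsed, n_samples) allocation.
    out = dok_matrix((n_collapsed, n_samples), dtype=np.float64)
    for feature_idx, targets in mapping.items():
        row = np.asarray(data[feature_idx]).ravel()
        cols = row.nonzero()[0]
        for target_idx in targets:
            for col in cols:
                out[target_idx, col] += row[col]
    # Convert once at the end to compressed sparse column for downstream use.
    return out.tocsc()


data = np.array([[1., 0., 2.],
                 [0., 3., 0.]])
mapping = {0: [0, 1], 1: [1]}  # feature 0 collapses into two targets
collapsed = collapse_one_to_many(data, mapping, 2)
```

The key point is that memory scales with the number of nonzero entries rather than with the full collapsed table.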

@qiyunzhu qiyunzhu left a comment


@wasade This looks impressive! I think it is highly appreciable that you are continuing to develop the BIOM package to enable high-performance table operations. The use of SciPy's sparse matrix could mean a significant advantage over other implementations (especially the R solutions).

I have one question concerning the data types: woltka's one-to-many collapsing can optionally divide read counts by feature counts. If the current method has a similar mechanism, it could mean that integers become floats. Will that cause any problem in BIOM, both practically and theoretically (since it is no longer a "contingency table")?

In line 2673, I think np.float should be okay. But just wanted to remind, if relevant, that you may consider casting it to a more explicit type (like np.float64, which is basically the same as Python float), and think a bit to make sure there isn't a concern of overflow in extreme cases (for example, if the data type before casting is np.uint64, since all numbers are non-negative).

wasade commented Dec 7, 2022

The current method supports divide, and that poses no problem for BIOM. Good call on float64; just pushed a change for that.

The divide may trigger an underflow rather than an overflow, in which case numpy would warn:

In [2]: v = np.float64(1e1000)

In [3]: v /= 1e1000
<ipython-input-3-4eac3ab25e8a>:1: RuntimeWarning: invalid value encountered in double_scalars
  v /= 1e1000

In [4]: v
Out[4]: nan
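The integer-to-float transition raised in the review can be illustrated with a minimal sketch. This is not biom-format code; the array values are made up to show how numpy's true division promotes unsigned integer counts to float64:

```python
import numpy as np

# Read counts stored as unsigned integers, as in a typical count table.
counts = np.array([10, 4, 6], dtype=np.uint64)
n_targets = np.uint64(4)

# True division of uint64 by uint64 promotes the result to float64,
# so an explicit cast to np.float64 matches what numpy produces anyway.
divided = counts / n_targets

print(divided.dtype)  # float64
print(divided)        # [2.5 1.  1.5]
```

Since all values are non-negative, the only hazard is loss of precision for counts above 2**53, where float64 can no longer represent every integer exactly.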

@qiyunzhu qiyunzhu left a comment


Looks great to me now!

wasade commented Dec 7, 2022

Thanks! My test is still running, so I will hold off on merging for the time being.

@wasade wasade merged commit 30dce98 into biocore:master Dec 9, 2022
@wasade wasade mentioned this pull request Dec 9, 2022