
Faster merge #933

Merged — 5 commits merged into biocore:master on May 12, 2023
Conversation

wasade
Member

@wasade wasade commented May 11, 2023

Table._fast_merge was performing poorly on large tables. Here we revise the algorithm used.

With the initial list-based version, on large data, we get:

        User time (seconds): 2990.23
        Elapsed (wall clock) time (h:mm:ss or m:ss): 51:55.52
        Maximum resident set size (kbytes): 52802340
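The list-based implementation itself isn't shown in the thread; as a rough illustration of the general shape (a hypothetical sketch using scipy.sparse directly, not biom's actual code), a merge that accumulates (row, col) → value pairs in a Python dict looks like:

```python
from collections import defaultdict
import numpy as np
from scipy.sparse import csr_matrix

def merge_list_based(matrices):
    # Hypothetical sketch: accumulate every nonzero into a dict keyed by
    # coordinate, summing collisions, then materialize a CSR matrix.
    # Simple, but the per-element Python loop is slow on large inputs.
    acc = defaultdict(float)
    for m in matrices:
        coo = m.tocoo()
        for r, c, v in zip(coo.row, coo.col, coo.data):
            acc[(r, c)] += v
    rows, cols = zip(*acc.keys())
    return csr_matrix((list(acc.values()), (rows, cols)),
                      shape=matrices[0].shape)

a = csr_matrix(np.array([[1., 0.], [0., 2.]]))
b = csr_matrix(np.array([[3., 0.], [4., 0.]]))
merged = merge_list_based([a, b])
```

Overlapping coordinates are summed, so the result is equivalent to `a + b`.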

@wasade
Member Author

wasade commented May 12, 2023

Version 2, precomputing nnz:

        User time (seconds): 4814.36
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:22:09
        Maximum resident set size (kbytes): 67324388
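"Precomputing nnz" suggests sizing the output buffers up front from the summed nonzero counts of the inputs, avoiding repeated reallocation. A hedged sketch of that idea (hypothetical helper, not the PR's code):

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

def merge_prealloc(matrices):
    # Sketch: preallocate COO buffers from the summed nnz of the inputs,
    # copy each matrix's coordinates in as array slices (no per-element
    # Python loop), then let scipy collapse duplicate coordinates.
    total_nnz = sum(m.nnz for m in matrices)
    rows = np.empty(total_nnz, dtype=np.int64)
    cols = np.empty(total_nnz, dtype=np.int64)
    data = np.empty(total_nnz, dtype=np.float64)
    offset = 0
    for m in matrices:
        coo = m.tocoo()
        n = coo.nnz
        rows[offset:offset + n] = coo.row
        cols[offset:offset + n] = coo.col
        data[offset:offset + n] = coo.data
        offset += n
    merged = coo_matrix((data, (rows, cols)), shape=matrices[0].shape)
    merged.sum_duplicates()  # entries at the same coordinate are summed
    return merged.tocsr()

a = csr_matrix(np.array([[1., 0.], [0., 2.]]))
b = csr_matrix(np.array([[3., 0.], [4., 0.]]))
merged = merge_prealloc([a, b])
```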

@wasade
Member Author

wasade commented May 12, 2023

Third version:

        User time (seconds): 327.31
        Elapsed (wall clock) time (h:mm:ss or m:ss): 6:20.64
        Maximum resident set size (kbytes): 41402608

@wasade
Member Author

wasade commented May 12, 2023

...output is consistent with the existing method on large data:

In [1]: import biom

In [2]: exp = biom.load_table('/qmounts/qiita_data/BIOM/174098/57585_analysis_Metagenomic_Woltkav014DatabasescratchqpwoltkaWoLr2WoLr2
   ...: BIOMpergenebiom.biom')

In [3]: obs = biom.load_table('result.biom')

In [4]: exp
Out[4]: 3576008 x 1616 <class 'biom.table.Table'> with 454367286 nonzero entries (7% dense)

In [5]: obs
Out[5]: 3576008 x 1616 <class 'biom.table.Table'> with 454367286 nonzero entries (7% dense)

In [6]: exp.matrix_data.data[:10]
Out[6]: array([ 3.,  1.,  3.,  3.,  5.,  1., 33.,  7.,  2.,  1.])

In [7]: obs.matrix_data.data[:10]
Out[7]: array([ 3.,  1.,  3.,  3.,  5.,  1., 33.,  7.,  2.,  1.])

In [8]: import numpy as np

In [9]: np.allclose(exp.matrix_data.data, obs.matrix_data.data)
Out[9]: True

In [10]: np.allclose(exp.matrix_data.indptr, obs.matrix_data.indptr)
Out[10]: True

In [11]: np.allclose(exp.matrix_data.indices, obs.matrix_data.indices)
Out[11]: True
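The three checks above (data, indptr, indices) can be folded into one helper that compares two CSR matrices structurally (a convenience sketch, not part of the PR):

```python
import numpy as np
from scipy.sparse import csr_matrix

def csr_equal(a, b, rtol=1e-9):
    # Two CSR matrices match when they share a shape, an identical
    # sparsity structure (indptr and indices), and element-wise close data.
    return (a.shape == b.shape
            and np.array_equal(a.indptr, b.indptr)
            and np.array_equal(a.indices, b.indices)
            and np.allclose(a.data, b.data, rtol=rtol))

x = csr_matrix(np.array([[0., 1.], [2., 0.]]))
y = x.copy()
z = csr_matrix(np.array([[0., 1.], [0., 2.]]))  # different sparsity pattern
```

With biom tables, this would be applied to `table.matrix_data`, as in the session above.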

@wasade changed the title from "WIP, faster merge" to "Faster merge" on May 12, 2023
@wasade
Member Author

wasade commented May 12, 2023

cc @ahdilmore @antgonza

@antgonza
Contributor

Really nice! Thank you for working on this. Looks good to me, but just to confirm that I'm reading this correctly: if the user has n tables to merge into another table, the algorithm will perform much faster when all n tables are passed at once (via others) than when the tables are merged into it one by one; correct?

@wasade
Member Author

wasade commented May 12, 2023

That is true here, and it was true with the original implementation as well.
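The same pattern can be seen with plain scipy arithmetic: merging one by one repeatedly rescans the growing accumulator, while a single pass stacks every input's nonzeros once and collapses duplicates in one conversion. An illustrative sketch (the real cost profile of biom's merge may differ):

```python
from functools import reduce
import numpy as np
import scipy.sparse as sp

tables = [sp.random(500, 40, density=0.05, format='csr', random_state=i)
          for i in range(6)]

# One by one: each step allocates a new result and rescans the accumulator.
pairwise = reduce(lambda acc, t: acc + t, tables)

# All at once: concatenate every input's COO coordinates, then convert;
# COO -> CSR conversion sums entries that share a coordinate.
coos = [t.tocoo() for t in tables]
all_at_once = sp.coo_matrix(
    (np.concatenate([c.data for c in coos]),
     (np.concatenate([c.row for c in coos]),
      np.concatenate([c.col for c in coos]))),
    shape=tables[0].shape).tocsr()
```

Both routes produce the same matrix; they differ only in how many times the accumulated nonzeros are traversed.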

@wasade wasade merged commit b0e71a0 into biocore:master May 12, 2023