
Faster merge #933

Merged — 5 commits merged into biocore:master on May 12, 2023
Conversation

wasade
Member

@wasade wasade commented May 11, 2023

Table._fast_merge was performing poorly on large tables. Here we revise the algorithm used.

With the initial list-based version, on large data, we get:

        User time (seconds): 2990.23
        Elapsed (wall clock) time (h:mm:ss or m:ss): 51:55.52
        Maximum resident set size (kbytes): 52802340
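The list-based implementation itself isn't shown in the thread; as a rough illustration of the general shape (a hypothetical sketch using scipy.sparse directly, not biom's actual code), a merge that accumulates (row, col) → value pairs in a Python dict looks like:

```python
from collections import defaultdict
import numpy as np
from scipy.sparse import csr_matrix

def merge_list_based(matrices):
    # Hypothetical sketch: accumulate every nonzero into a dict keyed by
    # coordinate, summing collisions, then materialize a CSR matrix.
    # Simple, but the per-element Python loop is slow on large inputs.
    acc = defaultdict(float)
    for m in matrices:
        coo = m.tocoo()
        for r, c, v in zip(coo.row, coo.col, coo.data):
            acc[(r, c)] += v
    rows, cols = zip(*acc.keys())
    return csr_matrix((list(acc.values()), (rows, cols)),
                      shape=matrices[0].shape)

a = csr_matrix(np.array([[1., 0.], [0., 2.]]))
b = csr_matrix(np.array([[3., 0.], [4., 0.]]))
merged = merge_list_based([a, b])
```

Overlapping coordinates are summed, so the result is equivalent to `a + b`.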

@wasade
Member Author

wasade commented May 12, 2023

Version 2, precomputing nnz:

        User time (seconds): 4814.36
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:22:09
        Maximum resident set size (kbytes): 67324388
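"Precomputing nnz" suggests sizing the output buffers up front from the summed nonzero counts of the inputs, avoiding repeated reallocation. A hedged sketch of that idea (hypothetical helper, not the PR's code):

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

def merge_prealloc(matrices):
    # Sketch: preallocate COO buffers from the summed nnz of the inputs,
    # copy each matrix's coordinates in as array slices (no per-element
    # Python loop), then let scipy collapse duplicate coordinates.
    total_nnz = sum(m.nnz for m in matrices)
    rows = np.empty(total_nnz, dtype=np.int64)
    cols = np.empty(total_nnz, dtype=np.int64)
    data = np.empty(total_nnz, dtype=np.float64)
    offset = 0
    for m in matrices:
        coo = m.tocoo()
        n = coo.nnz
        rows[offset:offset + n] = coo.row
        cols[offset:offset + n] = coo.col
        data[offset:offset + n] = coo.data
        offset += n
    merged = coo_matrix((data, (rows, cols)), shape=matrices[0].shape)
    merged.sum_duplicates()  # entries at the same coordinate are summed
    return merged.tocsr()

a = csr_matrix(np.array([[1., 0.], [0., 2.]]))
b = csr_matrix(np.array([[3., 0.], [4., 0.]]))
merged = merge_prealloc([a, b])
```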

@wasade
Member Author

wasade commented May 12, 2023

Third version:

        User time (seconds): 327.31
        Elapsed (wall clock) time (h:mm:ss or m:ss): 6:20.64
        Maximum resident set size (kbytes): 41402608

@wasade
Member Author

wasade commented May 12, 2023

...output is consistent with the existing method on large data:

In [1]: import biom

In [2]: exp = biom.load_table('/qmounts/qiita_data/BIOM/174098/57585_analysis_Metagenomic_Woltkav014DatabasescratchqpwoltkaWoLr2WoLr2
   ...: BIOMpergenebiom.biom')

In [3]: obs = biom.load_table('result.biom')

In [4]: exp
Out[4]: 3576008 x 1616 <class 'biom.table.Table'> with 454367286 nonzero entries (7% dense)

In [5]: obs
Out[5]: 3576008 x 1616 <class 'biom.table.Table'> with 454367286 nonzero entries (7% dense)

In [6]: exp.matrix_data.data[:10]
Out[6]: array([ 3.,  1.,  3.,  3.,  5.,  1., 33.,  7.,  2.,  1.])

In [7]: obs.matrix_data.data[:10]
Out[7]: array([ 3.,  1.,  3.,  3.,  5.,  1., 33.,  7.,  2.,  1.])

In [8]: import numpy as np

In [9]: np.allclose(exp.matrix_data.data, obs.matrix_data.data)
Out[9]: True

In [10]: np.allclose(exp.matrix_data.indptr, obs.matrix_data.indptr)
Out[10]: True

In [11]: np.allclose(exp.matrix_data.indices, obs.matrix_data.indices)
Out[11]: True
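The three checks above (data, indptr, indices) can be folded into one helper that compares two CSR matrices structurally (a convenience sketch, not part of the PR):

```python
import numpy as np
from scipy.sparse import csr_matrix

def csr_equal(a, b, rtol=1e-9):
    # Two CSR matrices match when they share a shape, an identical
    # sparsity structure (indptr and indices), and element-wise close data.
    return (a.shape == b.shape
            and np.array_equal(a.indptr, b.indptr)
            and np.array_equal(a.indices, b.indices)
            and np.allclose(a.data, b.data, rtol=rtol))

x = csr_matrix(np.array([[0., 1.], [2., 0.]]))
y = x.copy()
z = csr_matrix(np.array([[0., 1.], [0., 2.]]))  # different sparsity pattern
```

With biom tables, this would be applied to `table.matrix_data`, as in the session above.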

@wasade changed the title from "WIP, faster merge" to "Faster merge" on May 12, 2023
@wasade
Member Author

wasade commented May 12, 2023

cc @ahdilmore @antgonza

@antgonza
Contributor

Really nice! Thank you for working on this. Looks good to me, but just to confirm that I'm reading this correctly: if the user has n tables to merge into another table, the algorithm will perform much faster when all n tables are passed at once (via others) than when the tables are merged into it one by one; correct?

@wasade
Member Author

wasade commented May 12, 2023

That is true here, and it was true with the original implementation as well.
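The same pattern can be seen with plain scipy arithmetic: merging one by one repeatedly rescans the growing accumulator, while a single pass stacks every input's nonzeros once and collapses duplicates in one conversion. An illustrative sketch (the real cost profile of biom's merge may differ):

```python
from functools import reduce
import numpy as np
import scipy.sparse as sp

tables = [sp.random(500, 40, density=0.05, format='csr', random_state=i)
          for i in range(6)]

# One by one: each step allocates a new result and rescans the accumulator.
pairwise = reduce(lambda acc, t: acc + t, tables)

# All at once: concatenate every input's COO coordinates, then convert;
# COO -> CSR conversion sums entries that share a coordinate.
coos = [t.tocoo() for t in tables]
all_at_once = sp.coo_matrix(
    (np.concatenate([c.data for c in coos]),
     (np.concatenate([c.row for c in coos]),
      np.concatenate([c.col for c in coos]))),
    shape=tables[0].shape).tocsr()
```

Both routes produce the same matrix; they differ only in how many times the accumulated nonzeros are traversed.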

@wasade wasade merged commit b0e71a0 into biocore:master May 12, 2023