Faster merge #933
Conversation
| Version | User time (seconds) | Elapsed (wall clock) time (h:mm:ss or m:ss) | Maximum resident set size (kbytes) |
| --- | --- | --- | --- |
| Version 2, precomputing nnz | 4814.36 | 1:22:09 | 67324388 |
| third version | 327.31 | 6:20.64 | 41402608 |
The output is consistent with the existing method on large data:

```python
In [1]: import biom

In [2]: exp = biom.load_table('/qmounts/qiita_data/BIOM/174098/57585_analysis_Metagenomic_Woltkav014DatabasescratchqpwoltkaWoLr2WoLr2BIOMpergenebiom.biom')

In [3]: obs = biom.load_table('result.biom')

In [4]: exp
Out[4]: 3576008 x 1616 <class 'biom.table.Table'> with 454367286 nonzero entries (7% dense)

In [5]: obs
Out[5]: 3576008 x 1616 <class 'biom.table.Table'> with 454367286 nonzero entries (7% dense)

In [6]: exp.matrix_data.data[:10]
Out[6]: array([ 3., 1., 3., 3., 5., 1., 33., 7., 2., 1.])

In [7]: obs.matrix_data.data[:10]
Out[7]: array([ 3., 1., 3., 3., 5., 1., 33., 7., 2., 1.])

In [8]: import numpy as np

In [9]: np.allclose(exp.matrix_data.data, obs.matrix_data.data)
Out[9]: True

In [10]: np.allclose(exp.matrix_data.indptr, obs.matrix_data.indptr)
Out[10]: True

In [11]: np.allclose(exp.matrix_data.indices, obs.matrix_data.indices)
Out[11]: True
```
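For an exact rather than approximate comparison, the same check can be expressed against the sparse matrix components directly; a minimal sketch, where `identical_sparse` is a hypothetical helper (not part of the biom API) and `exp`/`obs` are the tables loaded above:

```python
import numpy as np

def identical_sparse(exp, obs):
    # Exact comparison of the two tables' sparse matrices: same shape,
    # same sparsity structure (indptr/indices), and identical stored values.
    a, b = exp.matrix_data, obs.matrix_data
    return (a.shape == b.shape
            and np.array_equal(a.indptr, b.indptr)
            and np.array_equal(a.indices, b.indices)
            and np.array_equal(a.data, b.data))
```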
Really nice! Thank you for working on this. Looks good to me, but just to confirm and make sure I'm reading this correctly: if the user has n tables to be merged into another table, the algorithm will perform much faster if …

That is true here, and with the original implementation as well.
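For reference, here is a minimal usage sketch of the pattern under discussion, merging several tables pairwise through the public `Table.merge` API. The file names are hypothetical, and this only exercises the public call, not the `_fast_merge` internals revised in this PR:

```python
import biom

# Hypothetical file names; any BIOM tables with compatible axes would do.
paths = ['table_a.biom', 'table_b.biom', 'table_c.biom']
tables = [biom.load_table(p) for p in paths]

# Merge the remaining n-1 tables into the first, one pair at a time.
# Table.merge returns a new Table and leaves its inputs untouched.
merged = tables[0]
for other in tables[1:]:
    merged = merged.merge(other)

print(merged.shape)
```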
`Table._fast_merge` was performing poorly on large tables. Here we revise the algorithm used. List based, on large data, we get the timings reported at the top of this conversation.
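As an illustration of the general idea only (this is not the code in this PR), a list-based merge of two sparse matrices can accumulate coordinate lists in the union index space and build the result in a single pass. A minimal sketch with scipy, where the function name and signature are hypothetical:

```python
import numpy as np
from scipy.sparse import coo_matrix

def list_based_merge(a, a_row_ids, a_col_ids, b, b_row_ids, b_col_ids):
    """Illustrative sketch only: merge two sparse matrices into the union
    index space by accumulating coordinate lists, then constructing the
    result once. Not the implementation in this PR."""
    # Union of row/column ids, preserving first-seen order.
    row_ids = list(dict.fromkeys(list(a_row_ids) + list(b_row_ids)))
    col_ids = list(dict.fromkeys(list(a_col_ids) + list(b_col_ids)))
    row_idx = {r: i for i, r in enumerate(row_ids)}
    col_idx = {c: i for i, c in enumerate(col_ids)}

    # Accumulate (row, col, value) triples from both inputs, remapped
    # into the union index space, using plain Python lists.
    rows, cols, vals = [], [], []
    for mat, rids, cids in ((a, a_row_ids, a_col_ids), (b, b_row_ids, b_col_ids)):
        coo = mat.tocoo()
        rows.extend(row_idx[rids[i]] for i in coo.row)
        cols.extend(col_idx[cids[j]] for j in coo.col)
        vals.extend(coo.data)

    # Duplicate coordinates are summed when converting to CSR, which is one
    # way to obtain sum-on-overlap semantics for shared ids.
    merged = coo_matrix((vals, (rows, cols)),
                        shape=(len(row_ids), len(col_ids))).tocsr()
    return merged, row_ids, col_ids
```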