PERF: use sparse matrix for table.collapse(..., one_to_many=True) #884

Merged: 12 commits, Dec 9, 2022
10 changes: 5 additions & 5 deletions .github/workflows/python-package-conda.yml
@@ -13,7 +13,7 @@ jobs:
- name: flake8
uses: actions/setup-python@v2
with:
-python-version: '3.6'
+python-version: '3.10'
- name: install dependencies
run: python -m pip install --upgrade pip
- name: lint
@@ -28,14 +28,14 @@ jobs:
- uses: conda-incubator/setup-miniconda@v2
with:
auto-update-conda: true
-python-version: '3.6'
+python-version: '3.7'
- name: Install dependencies
shell: bash -l {0}
run: |
-conda create --yes -n env_name python=3.6
+conda create --yes -n env_name python=3.7
conda activate env_name
conda install --name env_name pip click numpy 'scipy>=1.3.1' pep8 flake8 coverage 'pandas>=0.20.0' nose 'h5py>=2.2.0' cython scikit-bio
-pip install sphinx==1.2.2 'docutils<0.14'
+pip install "sphinx<1.3" 'docutils<0.14' "jinja2<3.1.0"
pip install -e . --no-deps
- name: Build docs
shell: bash -l {0}
@@ -46,7 +46,7 @@ jobs:
build:
strategy:
matrix:
-python-version: ['3.6', '3.10']
+python-version: ['3.7', '3.10']
os: [ubuntu-latest, macos-latest]
runs-on: ${{ matrix.os }}

11 changes: 11 additions & 0 deletions ChangeLog.md
@@ -1,6 +1,17 @@
BIOM-Format ChangeLog
=====================

+biom 2.1.11-dev
+---------------
+
+Important:
+
+* Python 3.6 testing support has been removed.
+
+New features:
+
+* `table.collapse(..., one_to_many=True)` now uses a sparse matrix on construction, substantially reducing memory overhead [PR #884](https://github.com/biocore/biom-format/pull/884).
+
biom 2.1.11
-----------

53 changes: 34 additions & 19 deletions biom/table.py
@@ -182,7 +182,7 @@
from collections.abc import Hashable, Iterable
from numpy import ndarray, asarray, zeros, newaxis
from scipy.sparse import (coo_matrix, csc_matrix, csr_matrix, isspmatrix,
-vstack, hstack)
+vstack, hstack, dok_matrix)
import pandas as pd
import re
from biom.exception import (TableException, UnknownAxisError, UnknownIDError,
@@ -2607,27 +2607,21 @@ def collapse(self, f, collapse_f=None, norm=True, min_group_size=1,

# transpose is only necessary in the one-to-one case
# new_data_shape is only necessary in the one-to-many case
-# axis_slice is only necessary in the one-to-many case
+# axis_update is only necessary in the one-to-many case
def axis_ids_md(t):
return (t.ids(axis=axis), t.metadata(axis=axis))

if axis == 'sample':
transpose = True

-def new_data_shape(ids, collapsed):
-return (len(ids), len(collapsed))
-
-def axis_slice(lookup, key):
-return (slice(None), lookup[key])
+def axis_update(offaxis, onaxis):
+return (offaxis, onaxis)

elif axis == 'observation':
transpose = False

-def new_data_shape(ids, collapsed):
-return (len(collapsed), len(ids))
-
-def axis_slice(lookup, key):
-return (lookup[key], slice(None))
+def axis_update(offaxis, onaxis):
+return (onaxis, offaxis)

else:
raise UnknownAxisError(axis)
@@ -2670,11 +2664,15 @@ def axis_slice(lookup, key):

# We need to store floats, not ints, as things won't always divide
# evenly.
-dtype = float if one_to_many_mode == 'divide' else self.dtype
+dtype = np.float64 if one_to_many_mode == 'divide' else self.dtype

-new_data = zeros(new_data_shape(self.ids(self._invert_axis(axis)),
-new_md),
-dtype=dtype)
+if axis == 'observation':
+new_data = dok_matrix((len(self.ids(axis='sample')),
+len(new_md)),
+dtype=dtype)
+else:
+new_data = dok_matrix((len(self.ids(axis='observation')),
+len(new_md)), dtype=dtype)

# for each vector
# for each bin in the metadata
@@ -2697,11 +2695,23 @@ def axis_slice(lookup, key):
continue
except StopIteration:
break

+# TODO: refactor Table.collapse(..., one_to_many=True) so
+# writes into new_data are performed without regard to
+# the requested axis, and perform a single transpose at the
+# end. Right now we incur many calls to `axis_update` which
+# could be avoided. However, this refactor is likely
+# complex to do correctly, so punting for now as we don't
+# yet have data showing this is a real world performance
+# concern.
+column = idx_lookup[part]
if one_to_many_mode == 'add':
-new_data[axis_slice(idx_lookup, part)] += vals
+for vidx, v in enumerate(vals):
+new_data[vidx, column] += v
else:
-new_data[axis_slice(idx_lookup, part)] += \
-vals / md_count[id_]
+dv = md_count[id_]
+for vidx, v in enumerate(vals / dv):
+new_data[vidx, column] += v

if include_collapsed_metadata:
# reassociate pathway information
@@ -2713,6 +2723,11 @@ def axis_slice(lookup, key):
key=itemgetter(1))]

# convert back to self type
+if axis == 'observation':
+new_data = csr_matrix(new_data.T)
+else:
+new_data = csc_matrix(new_data)
+
data = self._conv_to_self_type(new_data)
else:
if collapse_f is None:
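
For readers who want to try the accumulation pattern outside of biom, here is a minimal standalone sketch of what the one-to-many hunks above appear to do: element-wise writes into a `dok_matrix`, followed by one conversion to CSR at the end. The function `collapse_one_to_many`, its parameters, and the `memberships` structure are hypothetical names made up for this illustration; they are not part of the biom API, and the real `Table.collapse` derives group membership from metadata instead. The `if v != 0` guard is an addition of this sketch, not something in the PR.

```python
import numpy as np
from scipy.sparse import csr_matrix, dok_matrix


def collapse_one_to_many(dense, memberships, n_groups, mode="add"):
    """Hypothetical helper, not biom's API.

    dense:       (n_ids, n_offaxis) array, one row per ID being collapsed
    memberships: for each row, the list of group indices it belongs to
    n_groups:    number of collapsed groups
    mode:        "add" copies a row into every group it belongs to,
                 "divide" splits the row evenly across its groups
    """
    n_offaxis = dense.shape[1]
    # Division does not always come out even, so use floats in that mode.
    dtype = np.float64 if mode == "divide" else dense.dtype
    # Accumulate as (off-axis, group); transpose once at the end.
    acc = dok_matrix((n_offaxis, n_groups), dtype=dtype)

    for row, groups in zip(dense, memberships):
        if not groups:
            continue  # a row with no group contributes nothing
        vals = row / len(groups) if mode == "divide" else row
        for column in groups:
            for vidx, v in enumerate(vals):
                if v != 0:  # assumption of this sketch: skip explicit zeros
                    acc[vidx, column] += v

    # One conversion to a compressed format, as in the last hunk.
    return csr_matrix(acc.T)


dense = np.array([[1, 2], [3, 4], [5, 6]])
memberships = [[0], [0, 1], [1]]  # the middle row belongs to both groups
print(collapse_one_to_many(dense, memberships, 2).toarray())
print(collapse_one_to_many(dense, memberships, 2, mode="divide").toarray())
```

The dense `zeros` array removed by this PR allocates every cell up front, while the DOK accumulator stores only the cells that are actually written, which is where the memory reduction mentioned in the ChangeLog would come from. The trade-off is that per-element writes are slower than a vectorized slice update, which is the concern the TODO comment in the diff raises.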