Deprecate -n and add --n-substeps #18

Merged 4 commits on Nov 29, 2022
16 changes: 12 additions & 4 deletions README.md
@@ -127,16 +127,16 @@ The methods can be used directly through the command line after install:
$ which ssu
/Users/<username>/miniconda3/envs/unifrac/bin/ssu
$ ssu --help
usage: ssu -i <biom> -o <out.dm> -m [METHOD] -t <newick> [-n threads] [-a alpha] [-f] [--vaw]
[--mode [MODE]] [--start starting-stripe] [--stop stopping-stripe] [--partial-pattern <glob>]
usage: ssu -i <biom> -o <out.dm> -m [METHOD] -t <newick> [-a alpha] [-f] [--vaw]
[--mode MODE] [--start starting-stripe] [--stop stopping-stripe] [--partial-pattern <glob>]
[--n-partials number_of_partitions] [--report-bare] [--format|-r out-mode]
[--n-substeps n] [--pcoa dims] [--diskbuf path]

-i The input BIOM table.
-t The input phylogeny in newick.
-m The method, [unweighted | weighted_normalized | weighted_unnormalized | generalized |
unweighted_fp32 | weighted_normalized_fp32 | weighted_unnormalized_fp32 | generalized_fp32].
-o The output distance matrix.
-n [OPTIONAL] The number of threads, default is 1.
-a [OPTIONAL] Generalized UniFrac alpha, default is 1.
-f [OPTIONAL] Bypass tips, reduces compute by about 50%.
--vaw [OPTIONAL] Variance adjusted, default is to not adjust for variance.
@@ -148,18 +148,26 @@ The methods can be used directly through the command line after install:
--start [OPTIONAL] If mode==partial, the starting stripe.
--stop [OPTIONAL] If mode==partial, the stopping stripe.
--partial-pattern [OPTIONAL] If mode==merge-partial, a glob pattern for partial outputs to merge.
--n-partials [OPTIONAL] If mode==partial-report, the number of partitions to compute.
--n-partials [OPTIONAL] If mode==partial-report, the number of partitions to compute.
--report-bare [OPTIONAL] If mode==partial-report, produce barebones output.
--n-substeps [OPTIONAL] Internally split the problem into n substeps for reduced memory footprint, default is 1.
--format|-r [OPTIONAL] Output format:
ascii : [DEFAULT] Original ASCII format.
hdf5 : HDF5 format. May be fp32 or fp64, depending on method.
hdf5_fp32 : HDF5 format, using fp32 precision.
hdf5_fp64 : HDF5 format, using fp64 precision.
--pcoa [OPTIONAL] Number of PCoA dimensions to compute (default: 10, do not compute if 0)
--diskbuf [OPTIONAL] Use a disk buffer to reduce memory footprint. Provide path to a fast partition (ideally NVMe).
-n [OPTIONAL] DEPRECATED, no-op.

Environment variables:
CPU parallelism is controlled by OMP_NUM_THREADS. If not defined, all detected cores will be used.
GPU offload can be disabled with UNIFRAC_USE_GPU=N. By default, if an NVIDIA GPU is detected, it will be used.
A specific GPU can be selected with ACC_DEVICE_NUM. If not defined, the first GPU will be used.

Citations:
For UniFrac, please see:
Sfiligoi et al. mSystems 2022; DOI: 10.1128/msystems.00028-22
McDonald et al. Nature Methods 2018; DOI: 10.1038/s41592-018-0187-8
Lozupone and Knight Appl Environ Microbiol 2005; DOI: 10.1128/AEM.71.12.8228-8235.2005
Lozupone et al. Appl Environ Microbiol 2007; DOI: 10.1128/AEM.01996-06
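The "Environment variables" part of the help text above moves run-time control out of the command line. As a rough illustration only, the sketch below shows one way a program could interpret the documented UNIFRAC_USE_GPU=N setting. The function names are hypothetical and are not taken from this PR; treating every value other than "N" as "enabled" is an assumption based solely on the help text.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper, not the library's implementation.
// Returns false only when the value is exactly "N", the one setting the
// help text documents as disabling GPU offload. An unset variable
// (nullptr) or any other value keeps the default behaviour (use a GPU
// if one is detected).
bool gpu_offload_requested(const char* value) {
    if (value == nullptr) {
        return true;  // variable not defined: keep the default
    }
    return std::string(value) != "N";
}

// Convenience wrapper that consults the real process environment.
bool gpu_offload_requested_from_env() {
    return gpu_offload_requested(std::getenv("UNIFRAC_USE_GPU"));
}
```

Splitting the decision from the std::getenv call keeps the logic testable without modifying the process environment.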
4 changes: 2 additions & 2 deletions src/biom.cpp
@@ -127,8 +127,8 @@ biom::biom() : has_hdf5_backing(false) {
}

// not using const on indices/indptr/data as the pointers are being borrowed
biom::biom(char** obs_ids_in,
char** samp_ids_in,
biom::biom(const char* const * obs_ids_in,
const char* const * samp_ids_in,
uint32_t* indices,
uint32_t* indptr,
double* data,
4 changes: 2 additions & 2 deletions src/biom.hpp
@@ -41,8 +41,8 @@ namespace su {
* @param n_samples number of samples
* @param nnz number of data points
*/
biom(char** obs_ids,
char** samp_ids,
biom(const char* const * obs_ids,
const char* const * samp_ids,
uint32_t* index,
uint32_t* indptr,
double* data,
40 changes: 25 additions & 15 deletions src/su.cpp
@@ -13,15 +13,15 @@
enum Format {format_invalid,format_ascii, format_hdf5_fp32, format_hdf5_fp64};

void usage() {
std::cout << "usage: ssu -i <biom> -o <out.dm> -m [METHOD] -t <newick> [-n threads] [-a alpha] [-f] [--vaw]" << std::endl;
std::cout << " [--mode [MODE]] [--start starting-stripe] [--stop stopping-stripe] [--partial-pattern <glob>]" << std::endl;
std::cout << "usage: ssu -i <biom> -o <out.dm> -m [METHOD] -t <newick> [-a alpha] [-f] [--vaw]" << std::endl;
std::cout << " [--mode MODE] [--start starting-stripe] [--stop stopping-stripe] [--partial-pattern <glob>]" << std::endl;
std::cout << " [--n-partials number_of_partitions] [--report-bare] [--format|-r out-mode]" << std::endl;
std::cout << " [--n-substeps n] [--pcoa dims] [--diskbuf path]" << std::endl;
std::cout << std::endl;
std::cout << " -i\t\tThe input BIOM table." << std::endl;
std::cout << " -t\t\tThe input phylogeny in newick." << std::endl;
std::cout << " -m\t\tThe method, [unweighted | weighted_normalized | weighted_unnormalized | generalized | unweighted_fp32 | weighted_normalized_fp32 | weighted_unnormalized_fp32 | generalized_fp32]." << std::endl;
std::cout << " -o\t\tThe output distance matrix." << std::endl;
std::cout << " -n\t\t[OPTIONAL] The number of threads, default is 1." << std::endl;
std::cout << " -a\t\t[OPTIONAL] Generalized UniFrac alpha, default is 1." << std::endl;
std::cout << " -f\t\t[OPTIONAL] Bypass tips, reduces compute by about 50%." << std::endl;
std::cout << " --vaw\t[OPTIONAL] Variance adjusted, default is to not adjust for variance." << std::endl;
@@ -36,16 +36,24 @@ void usage() {
std::cout << " --partial-pattern\t[OPTIONAL] If mode==merge-partial or check-partial, a glob pattern for partial outputs to merge." << std::endl;
std::cout << " --n-partials\t[OPTIONAL] If mode==partial-report, the number of partitions to compute." << std::endl;
std::cout << " --report-bare\t[OPTIONAL] If mode==partial-report, produce barebones output." << std::endl;
std::cout << " --n-substeps\t[OPTIONAL] Internally split the problem into n substeps for reduced memory footprint, default is 1." << std::endl;
std::cout << " --format|-r\t[OPTIONAL] Output format:" << std::endl;
std::cout << " \t\t ascii : [DEFAULT] Original ASCII format." << std::endl;
std::cout << " \t\t hdf5 : HDF5 format. May be fp32 or fp64, depending on method." << std::endl;
std::cout << " \t\t hdf5_fp32 : HDF5 format, using fp32 precision." << std::endl;
std::cout << " \t\t hdf5_fp64 : HDF5 format, using fp64 precision." << std::endl;
std::cout << " --pcoa\t[OPTIONAL] Number of PCoA dimensions to compute (default: 10, do not compute if 0)" << std::endl;
std::cout << " --diskbuf\t[OPTIONAL] Use a disk buffer to reduce memory footprint. Provide path to a fast partition (ideally NVMe)." << std::endl;
std::cout << " -n\t\t[OPTIONAL] DEPRECATED, no-op." << std::endl;
std::cout << std::endl;
std::cout << "Environment variables: " << std::endl;
std::cout << " CPU parallelism is controlled by OMP_NUM_THREADS. If not defined, all detected cores will be used." << std::endl;
std::cout << " GPU offload can be disabled with UNIFRAC_USE_GPU=N. By default, if an NVIDIA GPU is detected, it will be used." << std::endl;
std::cout << " A specific GPU can be selected with ACC_DEVICE_NUM. If not defined, the first GPU will be used." << std::endl;
std::cout << std::endl;
std::cout << "Citations: " << std::endl;
std::cout << " For UniFrac, please see:" << std::endl;
std::cout << " Sfiligoi et al. mSystems 2022; DOI: 10.1128/msystems.00028-22" << std::endl;
std::cout << " McDonald et al. Nature Methods 2018; DOI: 10.1038/s41592-018-0187-8" << std::endl;
std::cout << " Lozupone and Knight Appl Environ Microbiol 2005; DOI: 10.1128/AEM.71.12.8228-8235.2005" << std::endl;
std::cout << " Lozupone et al. Appl Environ Microbiol 2007; DOI: 10.1128/AEM.01996-06" << std::endl;
@@ -296,7 +304,7 @@ int mode_check_partial(const std::string &partial_pattern) {
int mode_partial(std::string table_filename, std::string tree_filename,
std::string output_filename, std::string method_string,
bool vaw, double g_unifrac_alpha, bool bypass_tips,
unsigned int nthreads, int start_stripe, int stop_stripe) {
unsigned int nsubsteps, int start_stripe, int stop_stripe) {
if(output_filename.empty()) {
err("output filename missing");
return EXIT_FAILURE;
@@ -329,7 +337,7 @@ int mode_partial(std::string table_filename, std::string tree_filename,
partial_mat_t *result = NULL;
compute_status status;
status = partial(table_filename.c_str(), tree_filename.c_str(), method_string.c_str(),
vaw, g_unifrac_alpha, bypass_tips, nthreads, start_stripe, stop_stripe, &result);
vaw, g_unifrac_alpha, bypass_tips, nsubsteps, start_stripe, stop_stripe, &result);
if(status != okay || result == NULL) {
fprintf(stderr, "Compute failed in partial: %s\n", compute_status_messages[status]);
exit(EXIT_FAILURE);
@@ -350,7 +358,7 @@ int mode_one_off(const std::string &table_filename, const std::string &tree_file
const std::string &output_filename, const std::string &format_str, Format format_val,
const std::string &method_string, unsigned int pcoa_dims,
bool vaw, double g_unifrac_alpha, bool bypass_tips,
unsigned int nthreads, const std::string &mmap_dir) {
unsigned int nsubsteps, const std::string &mmap_dir) {
if(output_filename.empty()) {
err("output filename missing");
return EXIT_FAILURE;
@@ -376,7 +384,7 @@ int mode_one_off(const std::string &table_filename, const std::string &tree_file
mat_t *result = NULL;

status = one_off(table_filename.c_str(), tree_filename.c_str(), method_string.c_str(),
vaw, g_unifrac_alpha, bypass_tips, nthreads, &result);
vaw, g_unifrac_alpha, bypass_tips, nsubsteps, &result);
if(status != okay || result == NULL) {
fprintf(stderr, "Compute failed in one_off: %s\n", compute_status_messages[status]);
exit(EXIT_FAILURE);
@@ -394,7 +402,7 @@ int mode_one_off(const std::string &table_filename, const std::string &tree_file
const char * mmap_dir_c = mmap_dir.empty() ? NULL : mmap_dir.c_str();

status = unifrac_to_file(table_filename.c_str(), tree_filename.c_str(), output_filename.c_str(),
method_string.c_str(), vaw, g_unifrac_alpha, bypass_tips, nthreads, format_str.c_str(),
method_string.c_str(), vaw, g_unifrac_alpha, bypass_tips, nsubsteps, format_str.c_str(),
pcoa_dims, mmap_dir_c);

if (status != okay) {
@@ -439,12 +447,14 @@ int main(int argc, char **argv){
return EXIT_SUCCESS;
}

unsigned int nthreads;
unsigned int nsubsteps;
std::string table_filename = input.getCmdOption("-i");
std::string tree_filename = input.getCmdOption("-t");
std::string output_filename = input.getCmdOption("-o");
std::string method_string = input.getCmdOption("-m");
std::string nthreads_arg = input.getCmdOption("-n");
// deprecated, but we still want to support it, even as a no-op
std::string nold_arg = input.getCmdOption("-n");
std::string nsubsteps_arg = input.getCmdOption("--n-substeps");
std::string gunifrac_arg = input.getCmdOption("-a");
std::string mode_arg = input.getCmdOption("--mode");
std::string start_arg = input.getCmdOption("--start");
@@ -457,10 +467,10 @@ int main(int argc, char **argv){
std::string pcoa_arg = input.getCmdOption("--pcoa");
std::string diskbuf_arg = input.getCmdOption("--diskbuf");

if(nthreads_arg.empty()) {
nthreads = 1;
if(nsubsteps_arg.empty()) {
nsubsteps = 1;
} else {
nthreads = atoi(nthreads_arg.c_str());
nsubsteps = atoi(nsubsteps_arg.c_str());
}

bool vaw = input.cmdOptionExists("--vaw");
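The hunk above converts --n-substeps with atoi, which returns 0 for non-numeric input and lets a negative value wrap to a very large unsigned int. The following is a minimal sketch of stricter parsing, not code from this PR: the helper name parse_n_substeps and the choice to reject 0 are assumptions made for illustration.

```cpp
#include <cerrno>
#include <cstdlib>
#include <limits>
#include <string>

// Hypothetical helper (not part of this PR).
// Returns the parsed substep count, 1 when the argument is absent
// (mirroring the default in the diff), or -1 when the text is not a
// positive integer that fits in unsigned int.
long long parse_n_substeps(const std::string& arg) {
    if (arg.empty()) {
        return 1;  // option not given: default of 1
    }
    // strtoull would accept leading whitespace and a sign, so require
    // that the text starts with a digit.
    if (arg[0] < '0' || arg[0] > '9') {
        return -1;
    }
    errno = 0;
    char* end = nullptr;
    const unsigned long long v = std::strtoull(arg.c_str(), &end, 10);
    if (errno == ERANGE || *end != '\0') {
        return -1;  // overflow or trailing garbage such as "3x"
    }
    if (v == 0 || v > std::numeric_limits<unsigned int>::max()) {
        return -1;  // zero substeps or a value too large for unsigned int
    }
    return static_cast<long long>(v);
}
```

A caller could print the usage text and exit when the helper returns -1, and otherwise cast the result to unsigned int.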
@@ -521,9 +531,9 @@ int main(int argc, char **argv){


if(mode_arg.empty() || mode_arg == "one-off")
return mode_one_off(table_filename, tree_filename, output_filename, format_arg, format_val, method_string, pcoa_dims, vaw, g_unifrac_alpha, bypass_tips, nthreads, diskbuf_arg);
return mode_one_off(table_filename, tree_filename, output_filename, format_arg, format_val, method_string, pcoa_dims, vaw, g_unifrac_alpha, bypass_tips, nsubsteps, diskbuf_arg);
else if(mode_arg == "partial")
return mode_partial(table_filename, tree_filename, output_filename, method_string, vaw, g_unifrac_alpha, bypass_tips, nthreads, start_stripe, stop_stripe);
return mode_partial(table_filename, tree_filename, output_filename, method_string, vaw, g_unifrac_alpha, bypass_tips, nsubsteps, start_stripe, stop_stripe);
else if(mode_arg == "merge-partial")
return mode_merge_partial(output_filename, format_val, pcoa_dims, partial_pattern, diskbuf_arg);
else if(mode_arg == "check-partial")
6 changes: 3 additions & 3 deletions src/test_su.cpp
@@ -509,8 +509,8 @@ void test_biom_constructor_from_sparse() {
uint32_t index[] = {2, 0, 1, 3, 4, 5, 2, 3, 5, 0, 1, 2, 5, 1, 2};
uint32_t indptr[] = {0, 1, 6, 9, 13, 15};
double data[] = {1., 5., 1., 2., 3., 1., 1., 4., 2., 2., 1., 1., 1., 1., 1.};
char* obs_ids[] = {"GG_OTU_1", "GG_OTU_2", "GG_OTU_3", "GG_OTU_4", "GG_OTU_5"};
char* samp_ids[] = {"Sample1", "Sample2", "Sample3", "Sample4", "Sample5", "Sample6"};
const char* obs_ids[] = {"GG_OTU_1", "GG_OTU_2", "GG_OTU_3", "GG_OTU_4", "GG_OTU_5"};
const char* samp_ids[] = {"Sample1", "Sample2", "Sample3", "Sample4", "Sample5", "Sample6"};

su::biom table = su::biom(obs_ids, samp_ids, index, indptr, data, 5, 6, 15);
_exercise_get_obs_data(table);
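The test change above declares the ID arrays as const char*[] so that string literals can be stored without discarding const, which matches the new const char* const * constructor parameters. The sketch below illustrates the idea in isolation: copy_ids and demo_ids are hypothetical names, not functions from this repository.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: copies n C strings into std::string objects.
// The parameter type promises that neither the pointer array nor the
// characters it points to are modified.
std::vector<std::string> copy_ids(const char* const* ids, std::size_t n) {
    std::vector<std::string> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        out.emplace_back(ids[i]);
    }
    return out;
}

std::vector<std::string> demo_ids() {
    // String literals are arrays of const char, so const char* is the
    // correct element type; char* here is ill-formed since C++11.
    const char* obs_ids[] = {"GG_OTU_1", "GG_OTU_2"};

    // A mutable array still works: char** converts implicitly to
    // const char* const* because const is added at every level.
    char name[] = "Sample1";
    char* mutable_ids[] = {name};

    std::vector<std::string> out = copy_ids(obs_ids, 2);
    const std::vector<std::string> extra = copy_ids(mutable_ids, 1);
    out.insert(out.end(), extra.begin(), extra.end());
    return out;
}
```

Accepting const char* const* therefore widens what callers can pass without requiring casts on either side.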
@@ -1838,7 +1838,7 @@ void test_bptree_cstyle_constructor() {
//11101000
bool structure[] = {true, true, true, false, true, false, false, false};
double lengths[] = {0, 0, 1, 0, 2, 0, 0, 0};
char* names[] = {"", "c", "123:foo; bar", "", "b", "", "", ""};
const char* names[] = {"", "c", "123:foo; bar", "", "b", "", "", ""};
su::BPTree tree = su::BPTree(structure, lengths, names, 8);

unsigned int exp_nparens = 8;
2 changes: 1 addition & 1 deletion src/tree.cpp
@@ -52,7 +52,7 @@ BPTree::BPTree(std::vector<bool> input_structure, std::vector<double> input_leng
index_and_cache();
}

BPTree::BPTree(const bool* input_structure, const double* input_lengths, char** input_names, const int n_parens) {
BPTree::BPTree(const bool* input_structure, const double* input_lengths, const char* const * input_names, const int n_parens) {
structure = std::vector<bool>();
lengths = std::vector<double>();
names = std::vector<std::string>();
2 changes: 1 addition & 1 deletion src/tree.hpp
@@ -51,7 +51,7 @@ namespace su {
* @param input_names A char* array of the names
* @param n_parens The length of the topology
*/
BPTree(const bool* input_structure, const double* input_lengths, char** input_names, const int n_parens);
BPTree(const bool* input_structure, const double* input_lengths, const char* const * input_names, const int n_parens);

/* postorder tree traversal
*