Skip to content

Latest commit

 

History

History
168 lines (135 loc) · 10.1 KB

3.run_smetana_detailed_interactions.md

File metadata and controls

168 lines (135 loc) · 10.1 KB

🕸️ 3. Use SMETANA detailed algorithm to predict metabolic interactions between species

Refer to the SMETANA documentation for more details regarding usage.

$ smetana -h

usage: smetana [-h] [-c COMMUNITIES.TSV] [-o OUTPUT] [--flavor FLAVOR]
               [-m MEDIA] [--mediadb MEDIADB]
               [-g | -d | -a ABIOTIC | -b BIOTIC] [-p P] [-n N] [-v] [-z]
               [--solver SOLVER] [--molweight] [--exclude EXCLUDE]
               [--no-coupling]
               MODELS [MODELS ...]

Calculate SMETANA scores for one or multiple microbial communities.

positional arguments:
  MODELS                
                        Multiple single-species models (one or more files).
                        
                        You can use wild-cards, for example: models/*.xml, and optionally protect with quotes to avoid automatic bash
                        expansion (this will be faster for long lists): "models/*.xml". 

optional arguments:
  -h, --help            show this help message and exit
  -c COMMUNITIES.TSV, --communities COMMUNITIES.TSV
                        
                        Run SMETANA for multiple (sub)communities.
                        The communities must be specified in a two-column tab-separated file with community and organism identifiers.
                        The organism identifiers should match the file names in the SBML files (without extension).
                        
                        Example:
                            community1	organism1
                            community1	organism2
                            community2	organism1
                            community2	organism3
                        
  -o OUTPUT, --output OUTPUT
                        Prefix for output file(s).
  --flavor FLAVOR       Expected SBML flavor of the input files (cobra or fbc2).
  -m MEDIA, --media MEDIA
                        Run SMETANA for given media (comma-separated).
  --mediadb MEDIADB     Media database file
  -g, --global          Run global analysis with MIP/MRO (faster).
  -d, --detailed        Run detailed SMETANA analysis (slower).
  -a ABIOTIC, --abiotic ABIOTIC
                        Test abiotic perturbations with given list of compounds.
  -b BIOTIC, --biotic BIOTIC
                        Test biotic perturbations with given list of species.
  -p P                  Number of components to perturb simultaneously (default: 1).
  -n N                  
                        Number of random perturbation experiments per community (default: 1).
                        Selecting n = 0 will test all single species/compound perturbations exactly once.
  -v, --verbose         Switch to verbose mode
  -z, --zeros           Include entries with zero score.
  --solver SOLVER       Change default solver (current options: 'gurobi', 'cplex').
  --molweight           Use molecular weight minimization (recomended).
  --exclude EXCLUDE     List of compounds to exclude from calculations (e.g.: inorganic compounds).
  --no-coupling         Don't compute species coupling scores.

🤝 Detailed algorithm

Refer to the methods sections of the SMETANA paper for details regarding the implementation of MILP probelms that are solved in order to compute the following:

  1. The species coupling score (SCS) measures the dependence of growth of species A on species B (SCSA,B)
    • calculated by enumerating all possible community member subsets where species A can grow, SCSA,B is the fraction of subsets where both species A and B can grow.
  2. The metabolite uptake score (MUS) measures the dependence of growth of species A on metabolite m (MUSA,m)
    • calculated by enumerating all possible metabolite requirement subsets where species A can grow, MUSA,m is the fraction of subsets where both species A grows and metabolite m is taken up.
  3. The metabolite production score (MPS) is a binary score indicating whether a given species B can produce metabolite m (MPS = 1) or not (MPS = 0) in the community of N members (MPSB,m)
  4. The SMETANA score ranges from 0 to 1
    • measures how strongly a receiver species relies on a donor species for a particular metabolite
    • SMETANAA,B,m = SCSA,B * MUSA,m * MPSB,m

Note: There may be equivalent solutions that satisfy the linear programming problems posed by the detailed and global algorithms. To explore the solution space run multiple simulations and then take averages. Use the --molweight flag to predict interactions on community-specific minimal media. Use the --zeros flag in order to accurately calculate averages across samples.

🤔 Your turn

Try running a few simulation using your community of GEMs. Assuming your models are in the SymbNET/models/$COMM folder

$ for i in {1..3}; do 
echo "Running simulation $i out of 3 ... "; 
smetana --flavor ucsd -o sim_${i} -v -d --molweight --zeros $ROOT/models/$COMM/*.xml;
done

Use the head command to inspect the generated detailed interactions

$ head sim_1_detailed.tsv 
community	medium	receiver	donor	compound	scs	mus	mps	smetana
all	minimal	ERR260172_bin.10.p.faa	ERR260172_bin.31.s.faa	M_12ppd__S_e	0.0	0.0	1	0.0
all	minimal	ERR260172_bin.10.p.faa	ERR260172_bin.31.s.faa	M_15dap_e	0.0	0.0	0	0.0
all	minimal	ERR260172_bin.10.p.faa	ERR260172_bin.31.s.faa	M_2pglyc_e	0.0	0.0	0	0.0
all	minimal	ERR260172_bin.10.p.faa	ERR260172_bin.31.s.faa	M_3amp_e	0.0	0.0	0	0.0
all	minimal	ERR260172_bin.10.p.faa	ERR260172_bin.31.s.faa	M_3cmp_e	0.0	0.0	0	0.0
all	minimal	ERR260172_bin.10.p.faa	ERR260172_bin.31.s.faa	M_3gmp_e	0.0	0.0	0	0.0
all	minimal	ERR260172_bin.10.p.faa	ERR260172_bin.31.s.faa	M_3hcinnm_e	0.0	0.0	0	0.0

If you successfully generated detailed predictions for your community, append these to the smet_all.tsv file, remember to gunzip it first! e.g.

$ gunzip $ROOT/data/smet_all.tsv.gz

Next loop through each of the detailed simulation output files, reformat them, and append to smet_all.tsv. This is the file that is loaded by the following R-markdown file for visualization.

$ while read file;do 
 sim=$(basename $file|sed 's/_detailed.tsv//g'|sed 's/sim_//g');
 cat $file|grep -v smetana|sed "s/^all/$COMM/g"|sed "s/^/$sim\t/g" >> $ROOT/data/smet_all.tsv;
 done< <(find . -name "sim*detailed.tsv"|grep -v simulations)

Check that you successfully reformatted and appended your simulations to the smet_all.tsv file with using the tail command:

$ tail $ROOT/data/smet_all.tsv
1	soil	minimal	ERR671933_bin.5.s.faa	ERR671933_bin.4.o.faa	M_val__L_e	0.5	0.0	0	0.0
1	soil	minimal	ERR671933_bin.5.s.faa	ERR671933_bin.4.o.faa	M_vanln_e	0.5	0.0	0	0.0
1	soil	minimal	ERR671933_bin.5.s.faa	ERR671933_bin.4.o.faa	M_xan_e	0.5	0.0	0	0.0
1	soil	minimal	ERR671933_bin.5.s.faa	ERR671933_bin.4.o.faa	M_xmp_e	0.5	0.0	0	0.0
1	soil	minimal	ERR671933_bin.5.s.faa	ERR671933_bin.4.o.faa	M_xtsn_e	0.5	0.0	0	0.0
1	soil	minimal	ERR671933_bin.5.s.faa	ERR671933_bin.4.o.faa	M_xyl3_e	0.5	0.0	0	0.0
1	soil	minimal	ERR671933_bin.5.s.faa	ERR671933_bin.4.o.faa	M_xyl__D_e	0.5	0.0	0	0.0
1	soil	minimal	ERR671933_bin.5.s.faa	ERR671933_bin.4.o.faa	M_xylan4_e	0.5	0.0	0	0.0
1	soil	minimal	ERR671933_bin.5.s.faa	ERR671933_bin.4.o.faa	M_xylb_e	0.5	0.0	0	0.0
1	soil	minimal	ERR671933_bin.5.s.faa	ERR671933_bin.4.o.faa	M_zn2_e	0.5	1.0	0	0.0

Do not worry if you run out of time or are unable to generate detailed interactions for any reason, you already have pre-computed results that you may use for visualization and discussion.

🍌 Pre-computed simulations

Unless you submit jobs in parallel on the cluster it will not be feasible to generate 100's of simulation results within the timeframe of this course. The following code demonstrates how the detailed interactions were precomputed for the different datasets. You do not need to generate results for each community, as all results are pre-computed in their respective directories.

The following shows how all detailed interactions were pre-computed, assuming your current folder contains your community's GEMs:

$ for i in {1..100}; do 
echo "Running simulation $i out of 100 ... "; 
smetana --flavor ucsd -o sim_${i} -v -d --molweight --zeros $ROOT/models/$COMM/*.xml;
done

💎 metaGEM usage

Even though SMETANA may run on a standard laptop machine depending on your community size, this approach is not practical for scaling up and simulating large numbers of communities, e.g. on the order of 10's of thousands. For such large scale analyses we developed the metaGEM pipeline, which uses the Snakemake workflow manager to submit parallelized jobs on the high performance computer cluster (HPCC). For example, to submit 10,000 SMETANA jobs each with 2 cores + 3 GB RAM and a 2 hour max time limit:

$ bash metaGEM.sh --task smetana --nJobs 10000 --cores 2 --mem 3 --hours 2

Note: You do not have metaGEM installed on your virtual machines, so you will not be able to run the command above. Refer to the metaGEM repo's quickstart, manual installation guide, or google colab notebook for setup instructions.

🦍 Discussion questions

  • How does the SMETANA detailed algorithm work at a high level?
  • What are the underlying assumptions?
  • What does each metric measure and how is it calculated?
    • SCS
    • MUS
    • MPS
    • SMETANA
  • What does a SMETANA score of 0 mean? What does a SMETANA score of 1 mean?
  • Why do we run multiple simulations and take averages of SMETANA scores? Why is it important to use the --zeros flag in this case?
  • How does choice of simulation media affect the SMETANA detailed algorithm output?
  • What does the --molweight flag do? How can it be useful?

Move on to exercise 4