Skip to content

Latest commit

 

History

History
200 lines (150 loc) · 10.6 KB

1.carve_models.md

File metadata and controls

200 lines (150 loc) · 10.6 KB

🪚 1. Use CarveMe to generate GEMs for a bacterial community

Refer to the CarveMe documentation for more details regarding advanced usage.

$ carve -h

usage: carve [-h] [--dna | --egg | --refseq] [--diamond-args DIAMOND_ARGS]
             [-r] [-o OUTPUT] [-u UNIVERSE | --universe-file UNIVERSE_FILE]
             [--cobra | --fbc2] [-n ENSEMBLE] [-g GAPFILL] [-i INIT]
             [--mediadb MEDIADB] [-v] [-d] [--soft SOFT] [--hard HARD]
             [--reference REFERENCE]
             INPUT [INPUT ...]

Reconstruct a metabolic model using CarveMe

positional arguments:
  INPUT                 Input (protein fasta file by default, see other options for details).
                        When used with -r an input pattern with wildcards can also be used.
                        When used with --refseq an NCBI RefSeq assembly accession is expected.

optional arguments:
  -h, --help            show this help message and exit
  --dna                 Build from DNA fasta file
  --egg                 Build from eggNOG-mapper output file
  --refseq              Download genome from NCBI RefSeq and build
  --diamond-args DIAMOND_ARGS
                        Additional arguments for running diamond
  -r, --recursive       Bulk reconstruction from folder with genome files
  -o OUTPUT, --output OUTPUT
                        SBML output file (or output folder if -r is used)
  -u UNIVERSE, --universe UNIVERSE
                        Pre-built universe model (default: bacteria)
  --universe-file UNIVERSE_FILE
                        Reaction universe file (SBML format)
  --cobra               Output SBML in old cobra format
  --fbc2                Output SBML in sbml-fbc2 format
  -n ENSEMBLE, --ensemble ENSEMBLE
                        Build model ensemble with N models
  -g GAPFILL, --gapfill GAPFILL
                        Gap fill model for given media
  -i INIT, --init INIT  Initialize model with given medium
  --mediadb MEDIADB     Media database file
  -v, --verbose         Switch to verbose mode
  -d, --debug           Debug mode: writes intermediate results into output files
  --soft SOFT           Soft constraints file
  --hard HARD           Hard constraints file
  --reference REFERENCE
                        Manually curated model of a close reference species.

🔐 Key points

  1. The top-down reconstruction approach
    • based on a universal and well-curated bacterial model, carves out a species specific model based on organism's genome.
  2. The BiGG database
    • connects protein sequences with standardized and curated metabolic reaction knowledgebase.
  3. The carving algorithm
    • is a mixed integer linear programming (MILP) formulation that maximizes presence of high genomic evidence reactions, minimizes the presence of low genomic evidence reactions, and enforces gapless pathways.
  4. The gap-filling algorithm
    • uses genomic evidence scores to prioritize and minimize the number of added reactions needed to support growth on a given media composition.

Refer to the methods sections of the CarveMe paper for details regarding the implementation of MILP probelms that are solved for carving, ensemble generation, and gap filling.

🕳️ Gap filling

We will use media composition files to gap filling our models; these are tab delimited text files that contain media recipes, i.e. lists of chemical compounds using BiGG database metabolite identifiers. There are examples of media files in the /data/ subfolder.

Use the head command to get an idea of what these media files look like

$ head -n 20 $ROOT/data/media_db.tsv 
medium	description	compound	name
M1	M1	3mb	3mb
M1	M1	4abz	4abz
M1	M1	ac	ac
M1	M1	btn	btn
M1	M1	but	but
M1	M1	ca2	ca2
M1	M1	cbl1	cbl1
M1	M1	cbl2	cbl2
M1	M1	cellb	cellb
M1	M1	cl	cl
M1	M1	cobalt2	cobalt2
M1	M1	cu2	cu2
M1	M1	cys__L	cys__L
M1	M1	fe2	fe2
M1	M1	fe3	fe3
M1	M1	fol	fol
M1	M1	fru	fru
M1	M1	glc__D	glc__D
M1	M1	h	h

Search the BiGG database to learn more about specific metabolites name and identifiers.

Note: Use the --verbose or -v flag to view detailed runlog showing number of gapfill-added reactions and metabolites. Additionally, gapfilled reactions can be identified by searching for the keyword "GAP_FILL". For more details see here.

📝 Solvers

The underlying algorithms behind CarveMe and SMETANA use linear programming to pose large optimization problems. To solve such problems, we need a powerful solver. Unfortunately, while the CPLEX solver is not open source, researchers can obtain a free academic initiative license from IBM to use the academic version of CPLEX. For more details on the topic refer to the following issues related to the usage of solvers: issue 1,issue 2,issue 3.

🤔 Your turn

For maximum efficiency, assign one genome to each of your group members and run CarveMe, e.g. to see genomes in your community

$ ls $ROOT/genomes/$COMM
ERR260255_bin.14.p.faa  ERR260255_bin.19.p.faa  ERR260255_bin.24.s.faa  ERR260255_bin.7.p.faa  ERR260255_bin.9.s.faa

Use the $MODEL variable to select your chosen genome, e.g. to choose filename ERR260255_bin.14.p

$ MODEL=ERR260255_bin.14.p

Run CarveMe on your model, in this example we use the M8 media composition from the media_db.tsv file for gapfilling. Use curly braces syntax {} to protect variables e.g.

$ carve -v --mediadb $ROOT/data/media_db.tsv -g M8 --cobra -o ${MODEL}.xml $ROOT/genomes/$COMM/${MODEL}.faa

Read about the Systems Biology Markup Language (SBML) standard to familiarize yourself with the GEM format output of CarveMe. Use less to view the contents of your generated model, press the q key to stop viewing the file.

$ less ${MODEL}.xml

If you successfully generated models for your community, swap these into the appropriate subdirectory within the /models/ folder. Make sure that the names of your newly created models match exactly with the pre-computed ones to ensure that the plotting script runs smoothly. Hint: use the cp, mv, and/or rm commands.

Remove original model, e.g.

$ rm $ROOT/models/$COMM/${MODEL}.xml

Replace with your newly carved model, e.g.

$ cp ${MODEL}.xml $ROOT/models/$COMM/${MODEL}.xml

Do not worry if you run out of time or are unable to generate models for any reason, you already have pre-computed results that you may use for visualization and discussion.

🍬 Pre-computed GEMs

The following code chunks show how each community of GEMs was pre-generated for your convenience using CarveMe. You do not need to generate results for each community, as all results are pre-computed in their respective directories.

Kefir GEMs

We have prior knowledge regarding the growth media of kefir microbes from this publication. The following code generates GEMs for kefir genomes, using milk composition to gapfill reactions with genomic evidence.

$ while read model;do 
carve -v --mediadb $ROOT/data/milk_composition.tsv -g MILK --cobra -o $ROOT/models/kefir/${model}.xml $ROOT/genomes/kefir/${model}.faa;
done< <(ls $ROOT/genomes/kefir/|sed 's/.faa//g')

To execute loops like these, make sure you copy all three lines to complete the loop. You can alternatively run such code on a single line, e.g. while read model;do carve -v --mediadb $ROOT/data/milk_composition.tsv -g MILK --cobra -o $ROOT/models/kefir/${model}.xml $ROOT/genomes/kefir/${model}.faa; done< <(ls $ROOT/genomes/kefir/|sed 's/.faa//g'). To learn more about for loops in bash see here.

Gut GEMs

We have prior knowledge regarding the growth media of gut microbes from this publication. The following code generates GEMs for gut genomes, using M8 gut microbiome media composition to gapfill reactions with genomic evidence.

$ while read model;do     
carve -v --mediadb $ROOT/data/media_db.tsv -g M8 --cobra -o $ROOT/models/gut_normal/${model}.xml $ROOT/genomes/gut_normal/${model}.faa;
done< <(ls $ROOT/genomes/gut_normal/|sed 's/.faa//g')

Soil GEMs

We do not have any knowledge regarding the growth media of soil microbes from this particular community. The following code generates GEMs for soil genomes without any gapfilling.

$ while read model;do     
carve -v --cobra -o $ROOT/models/soil/${model}.xml $ROOT/genomes/soil/${model}.faa;
done< <(ls $ROOT/genomes/soil/|sed 's/.faa//g')

💎 metaGEM usage

Even though CarveMe can run comfortably on a standard laptop machine, this approach is not practical for scaling up and reconstructing large numbers of metabolic models, e.g. on the order of 10's of thousands. For such large scale analyses we developed the metaGEM pipeline, which uses the Snakemake workflow manager to submit parallelized jobs on the high performance computer cluster (HPCC). For example, to submit 10,000 CarveMe jobs each with 2 cores + 3 GB RAM and a 1 hour max time limit:

$ bash metaGEM.sh --task carveme --nJobs 10000 --cores 2 --mem 3 --hours 1

Note: You do not have metaGEM installed on your virtual machines, so you will not be able to run the command above. Refer to the metaGEM repo's quickstart, manual installation guide, or google colab notebook for setup instructions.

🙋 Discussion questions

  • How does CarveMe generate metabolic models?
  • How does ORF-annotation of genome DNA influence model reconstruction?
  • What are the pros and cons of relying on the BiGG database for model reconstruction?
  • How does the choice of gap-filling media affect model reconstruction?
  • Why are there specialized model templates?
  • What is metaGEM?

Move on to exercise 2