Refer to the CarveMe documentation for more details regarding advanced usage.
$ carve -h
usage: carve [-h] [--dna | --egg | --refseq] [--diamond-args DIAMOND_ARGS]
[-r] [-o OUTPUT] [-u UNIVERSE | --universe-file UNIVERSE_FILE]
[--cobra | --fbc2] [-n ENSEMBLE] [-g GAPFILL] [-i INIT]
[--mediadb MEDIADB] [-v] [-d] [--soft SOFT] [--hard HARD]
[--reference REFERENCE]
INPUT [INPUT ...]
Reconstruct a metabolic model using CarveMe
positional arguments:
INPUT Input (protein fasta file by default, see other options for details).
When used with -r an input pattern with wildcards can also be used.
When used with --refseq an NCBI RefSeq assembly accession is expected.
optional arguments:
-h, --help show this help message and exit
--dna Build from DNA fasta file
--egg Build from eggNOG-mapper output file
--refseq Download genome from NCBI RefSeq and build
--diamond-args DIAMOND_ARGS
Additional arguments for running diamond
-r, --recursive Bulk reconstruction from folder with genome files
-o OUTPUT, --output OUTPUT
SBML output file (or output folder if -r is used)
-u UNIVERSE, --universe UNIVERSE
Pre-built universe model (default: bacteria)
--universe-file UNIVERSE_FILE
Reaction universe file (SBML format)
--cobra Output SBML in old cobra format
--fbc2 Output SBML in sbml-fbc2 format
-n ENSEMBLE, --ensemble ENSEMBLE
Build model ensemble with N models
-g GAPFILL, --gapfill GAPFILL
Gap fill model for given media
-i INIT, --init INIT Initialize model with given medium
--mediadb MEDIADB Media database file
-v, --verbose Switch to verbose mode
-d, --debug Debug mode: writes intermediate results into output files
--soft SOFT Soft constraints file
--hard HARD Hard constraints file
--reference REFERENCE
Manually curated model of a close reference species.
- The top-down reconstruction approach
- starts from a universal, well-curated bacterial model and carves out a species-specific model guided by the organism's genome.
- The BiGG database
- connects protein sequences with a standardized and curated knowledge base of metabolic reactions.
- The carving algorithm
- is a mixed integer linear programming (MILP) formulation that maximizes the presence of reactions with high genomic evidence, minimizes the presence of reactions with low genomic evidence, and enforces gapless pathways.
- The gap-filling algorithm
- uses genomic evidence scores to prioritize candidate reactions and minimizes the number of reactions added to support growth on a given media composition.
Refer to the methods sections of the CarveMe paper for details regarding the implementation of the MILP problems that are solved for carving, ensemble generation, and gap filling.
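As a rough orientation before reading the paper, the carving step described above can be sketched as the following simplified MILP. The notation is an assumption made for this sketch and is not copied from the CarveMe paper: $s_i$ is the genomic evidence score of reaction $i$ (positive for high evidence, negative for low evidence), $y_i$ is a binary variable marking whether reaction $i$ is kept, $v_i$ is its flux, $S$ is the stoichiometric matrix of the universal model, $\epsilon$ is a small positive flux threshold, and $M$ is a large flux bound. All reactions are assumed to be written as irreversible, and a minimum growth rate $v_{\min}$ is assumed.

```latex
\begin{aligned}
\max_{v,\,y}\quad & \sum_{i} s_i\, y_i \\
\text{s.t.}\quad  & S\,v = 0, \\
                  & v_{\text{growth}} \ge v_{\min}, \\
                  & \epsilon\, y_i \le v_i \le M\, y_i \quad \forall i, \\
                  & y_i \in \{0, 1\} \quad \forall i.
\end{aligned}
```

In this sketch, a kept reaction ($y_i = 1$) must carry at least a flux of $\epsilon$ at steady state, which is what rules out gaps, while a discarded reaction ($y_i = 0$) is forced to zero flux. The formulation actually solved by CarveMe differs in its details, so treat the paper as the authoritative source.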
We will use media composition files to gap-fill our models. These are tab-delimited text files that contain media recipes, i.e. lists of chemical compounds given as BiGG database metabolite identifiers. There are examples of media files in the /data/
subfolder.
Use the head
command to get an idea of what these media files look like:
$ head -n 20 $ROOT/data/media_db.tsv
medium description compound name
M1 M1 3mb 3mb
M1 M1 4abz 4abz
M1 M1 ac ac
M1 M1 btn btn
M1 M1 but but
M1 M1 ca2 ca2
M1 M1 cbl1 cbl1
M1 M1 cbl2 cbl2
M1 M1 cellb cellb
M1 M1 cl cl
M1 M1 cobalt2 cobalt2
M1 M1 cu2 cu2
M1 M1 cys__L cys__L
M1 M1 fe2 fe2
M1 M1 fe3 fe3
M1 M1 fol fol
M1 M1 fru fru
M1 M1 glc__D glc__D
M1 M1 h h
Search the BiGG database to learn more about specific metabolite names and identifiers.
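To explore a media database beyond its first 20 lines, the two helper functions below are a minimal sketch. The function names `list_media` and `medium_compounds` are hypothetical, and the sketch assumes the layout shown above: a tab-delimited file with one header line, the medium identifier in column 1, and the compound identifier in column 3.

```shell
# List the distinct medium identifiers (column 1) in a media database file.
# Assumes a tab-delimited file with a single header line.
list_media() {
    awk -F'\t' 'NR > 1 { print $1 }' "$1" | sort -u
}

# Print the compound identifiers (column 3) that belong to one medium.
# Usage: medium_compounds FILE MEDIUM_ID
medium_compounds() {
    awk -F'\t' -v id="$2" 'NR > 1 && $1 == id { print $3 }' "$1"
}

# Example usage (assumes $ROOT is set as elsewhere in this tutorial):
# list_media "$ROOT/data/media_db.tsv"
# medium_compounds "$ROOT/data/media_db.tsv" M8
```

Matching the medium identifier exactly in `awk`, rather than with `grep M1`, avoids also picking up media whose names merely contain the same characters.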
Note: Use the --verbose
or -v
flag to view a detailed run log showing the number of reactions and metabolites added by gap filling. Additionally, gap-filled reactions can be identified by searching the model for the keyword "GAP_FILL". For more details see here.
The algorithms underlying CarveMe and SMETANA use linear programming to formulate large optimization problems, and solving these requires a powerful solver. The CPLEX solver is not open source, but researchers can obtain a free license through the IBM Academic Initiative. For more details, refer to the following issues related to the usage of solvers: issue 1, issue 2, issue 3.
To save time, assign one genome to each member of your group and run CarveMe on it. To see the genomes in your community:
$ ls $ROOT/genomes/$COMM
ERR260255_bin.14.p.faa ERR260255_bin.19.p.faa ERR260255_bin.24.s.faa ERR260255_bin.7.p.faa ERR260255_bin.9.s.faa
Use the $MODEL
variable to select your chosen genome, e.g. to choose the genome ERR260255_bin.14.p (the file name without the .faa extension):
$ MODEL=ERR260255_bin.14.p
Run CarveMe on your model. In this example we use the M8 media composition from the media_db.tsv
file for gap filling. Use curly brace syntax {}
to delimit variable names, e.g.
$ carve -v --mediadb $ROOT/data/media_db.tsv -g M8 --cobra -o ${MODEL}.xml $ROOT/genomes/$COMM/${MODEL}.faa
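The following minimal sketch shows why the curly braces matter. The variable `MODEL_v2` and the `_v2.xml` suffix are hypothetical and exist only for this illustration. Without braces, the shell reads the longest valid variable name it can find, which is not always the one you meant.

```shell
MODEL=ERR260255_bin.14.p
MODEL_v2=some_other_value   # hypothetical second variable sharing the same prefix

# Without braces the shell expands the variable MODEL_v2, not MODEL.
echo "$MODEL_v2.xml"        # prints some_other_value.xml

# With braces the name ends at the closing brace, so MODEL is expanded
# and "_v2.xml" is appended as plain text.
echo "${MODEL}_v2.xml"      # prints ERR260255_bin.14.p_v2.xml
```

In `${MODEL}.xml` the braces are not strictly required, because a dot cannot be part of a variable name, but using them consistently avoids surprises when a suffix starts with a letter, digit, or underscore.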
Read about the Systems Biology Markup Language (SBML) standard to familiarize yourself with the GEM format output of CarveMe. Use less
to view the contents of your generated model, press the q
key to stop viewing the file.
$ less ${MODEL}.xml
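Scrolling through a large SBML file is slow, so a quick summary with `grep` can help. The sketch below uses two hypothetical helper functions. It assumes the model file is written with one XML element per line, so the numbers are line counts and only approximate the number of reactions. The "GAP_FILL" keyword is the one mentioned in the note above.

```shell
# Count lines that open a <reaction ...> element, as a rough model size.
count_reactions() {
    grep -c '<reaction ' "$1"
}

# Count lines that mention the GAP_FILL keyword.
count_gapfill_lines() {
    grep -c 'GAP_FILL' "$1"
}

# Example usage on the model generated above:
# count_reactions "${MODEL}.xml"
# count_gapfill_lines "${MODEL}.xml"
```

To see the matching lines themselves rather than a count, replace `grep -c` with `grep -n`.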
If you successfully generated models for your community, swap these into the appropriate subdirectory within the /models/
folder. Make sure that the names of your newly created models match exactly with the pre-computed ones to ensure that the plotting script runs smoothly. Hint: use the cp
, mv
, and/or rm
commands.
Remove original model, e.g.
$ rm $ROOT/models/$COMM/${MODEL}.xml
Replace with your newly carved model, e.g.
$ cp ${MODEL}.xml $ROOT/models/$COMM/${MODEL}.xml
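After swapping files, it can be worth confirming that every genome still has a model with the expected name. The helper below is a minimal sketch with a hypothetical function name. It assumes, as in the commands in this tutorial, that each genome file `NAME.faa` corresponds to a model file `NAME.xml`.

```shell
# Report genomes that have no matching model file.
# Usage: check_models GENOMES_DIR MODELS_DIR
check_models() {
    genomes_dir=$1
    models_dir=$2
    for faa in "$genomes_dir"/*.faa; do
        [ -e "$faa" ] || continue            # folder contains no .faa files
        id=$(basename "$faa" .faa)           # strip the folder and the .faa suffix
        [ -f "$models_dir/$id.xml" ] || echo "missing: $id.xml"
    done
}

# Example usage (assumes $ROOT and $COMM are set as elsewhere in this tutorial):
# check_models "$ROOT/genomes/$COMM" "$ROOT/models/$COMM"
```

No output means every genome has a model with a matching name.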
Do not worry if you run out of time or are unable to generate models for any reason. You already have pre-computed results that you can use for visualization and discussion.
The following code chunks show how each community of GEMs was pre-generated for your convenience using CarveMe. You do not need to generate results for each community, as all results are pre-computed in their respective directories.
We have prior knowledge regarding the growth media of kefir microbes from this publication. The following code generates GEMs for kefir genomes, using milk composition to gapfill reactions with genomic evidence.
$ while read model;do
carve -v --mediadb $ROOT/data/milk_composition.tsv -g MILK --cobra -o $ROOT/models/kefir/${model}.xml $ROOT/genomes/kefir/${model}.faa;
done< <(ls $ROOT/genomes/kefir/|sed 's/\.faa$//')
To execute loops like these, make sure you copy all three lines to complete the loop. You can alternatively run such code on a single line, e.g. while read model;do carve -v --mediadb $ROOT/data/milk_composition.tsv -g MILK --cobra -o $ROOT/models/kefir/${model}.xml $ROOT/genomes/kefir/${model}.faa; done< <(ls $ROOT/genomes/kefir/|sed 's/\.faa$//')
. To learn more about loops in bash see here.
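The loops in this section derive model names by listing a folder and editing the file names as text. An alternative sketch, using a hypothetical helper named `genome_ids`, relies on shell globbing and `basename` instead, which only ever removes a trailing `.faa` and also handles file names containing spaces.

```shell
# Print one model name per protein FASTA file in a folder,
# i.e. the file name without the folder and without the .faa suffix.
genome_ids() {
    for faa in "$1"/*.faa; do
        [ -e "$faa" ] || continue   # skip the literal pattern if nothing matches
        basename "$faa" .faa
    done
}

# Example usage with the same carve flags as the kefir loop
# (assumes $ROOT is set as elsewhere in this tutorial):
# genome_ids "$ROOT/genomes/kefir" | while read -r model; do
#     carve -v --mediadb "$ROOT/data/milk_composition.tsv" -g MILK --cobra \
#         -o "$ROOT/models/kefir/${model}.xml" "$ROOT/genomes/kefir/${model}.faa"
# done
```

Quoting the paths in the usage example keeps the command working even if `$ROOT` contains spaces.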
We have prior knowledge regarding the growth media of gut microbes from this publication. The following code generates GEMs for gut genomes, using the M8 gut microbiome media composition to gapfill reactions with genomic evidence.
$ while read model;do
carve -v --mediadb $ROOT/data/media_db.tsv -g M8 --cobra -o $ROOT/models/gut_normal/${model}.xml $ROOT/genomes/gut_normal/${model}.faa;
done< <(ls $ROOT/genomes/gut_normal/|sed 's/\.faa$//')
We do not have any knowledge regarding the growth media of soil microbes from this particular community. The following code generates GEMs for soil genomes without any gapfilling.
$ while read model;do
carve -v --cobra -o $ROOT/models/soil/${model}.xml $ROOT/genomes/soil/${model}.faa;
done< <(ls $ROOT/genomes/soil/|sed 's/\.faa$//')
Even though CarveMe runs comfortably on a standard laptop, this approach is not practical for reconstructing large numbers of metabolic models, e.g. on the order of tens of thousands. For such large-scale analyses we developed the metaGEM pipeline, which uses the Snakemake workflow manager to submit parallelized jobs to a high-performance computing cluster (HPCC). For example, to submit 10,000 CarveMe jobs, each with 2 cores, 3 GB of RAM, and a maximum run time of 1 hour:
$ bash metaGEM.sh --task carveme --nJobs 10000 --cores 2 --mem 3 --hours 1
Note: metaGEM is not installed on your virtual machines, so you will not be able to run the command above. Refer to the metaGEM repo's quickstart, manual installation guide, or Google Colab notebook for setup instructions.
- How does CarveMe generate metabolic models?
- How does ORF-annotation of genome DNA influence model reconstruction?
- What are the pros and cons of relying on the BiGG database for model reconstruction?
- How does the choice of gap-filling media affect model reconstruction?
- Why are there specialized model templates?
- What is metaGEM?