Linear programming and Markov Chain Monte Carlo sampling tools for thermodynamic analysis of metabolism.
- Background
- System requirements
- MDF and NEM analysis
- Hit-and-run analysis
- Case study: Syntrophic communities
- Author
Tools such as NET analysis (Kümmel et al. 2006, Zamboni et al. 2008) allow calculation of the minimum and maximum feasible metabolite concentrations in a metabolic network, while ensuring that all reactions have a negative Gibbs free energy change, or, equivalently, a positive thermodynamic driving force. Similarly, Max-min Driving Force analysis (MDF; Noor et al., 2014) is another linear programming approach to finding an optimal set of metabolite concentrations in a metabolic network. However, the aforementioned approaches only identify extreme concentration values, which is why this thermosampler
framework was developed to explore the full "solution space" of feasible metabolite concentrations (Fig 1). A random walk, or hit-and-run algorithm, allows assessment of metabolite concentration distributions through exploration of all combinations of concentrations that are thermodynamically feasible.
As an example, a clearly defined solution space was devised using a "toy" model. The toy model has two metabolites, A and B, and one reaction where A is transformed into B. The standard Gibbs free energy change was selected to be 5.705 kJ/mol, which corresponds to the constraint that [A] must be at least 10 times higher than [B]. Furthermore, [A] and [B] must be within the concentration range 0.01 to 1, and the sum of [A] and [B] must be 1 or lower. These three constraints define a solution space on the AB-plane that has the shape of a triangle (Fig 1). A real metabolic network has dozens of reactions and metabolites, and is therefore many magnitudes more complex and difficult to visualize.
A random walk applying the thermosampler
algorithm was performed within the triangular AB solution space using the sampling.py
script:
./sampling.py --reactions examples/toy.model.tab \
--std_drG examples/toy.model_drgs.tab \
--constraints examples/toy.concentrations.tab \
--max_conc 1 --steps 200 \
--outfile results/toy.sampling.tab
The results from the random walk were then plotted to yield the right panel in Fig 1:
examples/plot_toy_example.R
results/toy.sampling.pdf # Output PDF
In the thermosampler
algorithm, the feasibility concept and how to check it is very important, as it ensures that the random walk occurs within the solution space, and thereby defines the solution space itself. Feasibility consists of adherence of the metabolite concentration vector to several constraints:
- Thermodynamic driving forces of all reactions must be positive, based on the user-supplied model reactions and directions, as well as the user-supplied standard reaction Gibbs free energy changes.
- Metabolite concentration ratios must be within the user-supplied ranges.
- Total metabolite concentration sum must be below the user-defined value.
- Metabolite concentrations must be within the user-supplied ranges.
- Sums of groups of metabolite concentrations must be below the user-supplied values.
Several functions exist within sampling.py
to perform the feasibility checks, and are called by the "master" feasibility function named is_feasible
. It should be relatively simple to add additional feasibility checks to constrain the solution space further.
The thermosampler
algorithm starts at a feasible combination of concentrations within the solution space, i.e. one "feasible metabolite concentration set" (fMCS). This set is obtained through inefficient rejection sampling or from MDF analysis. More fMCSs are obtained along the random walk.
The thermosampler
algorithm works as follows, with concentrations in logarithmic space:
- Create a step direction vector of the same length as the concentrations vector, but with random values.
- Hone in on the edge of the solution space by determining the maximum and minimum allowable factor theta to apply to the step direction vector before adding it to the current concentrations vector. First the maximum step size without violating any of the initial concentration range bounds is added and feasibility is checked. If feasible, the edge has been found. If not, it is necessary to hone in on the edge of solution space by iteratively increasing the inner step size (starts at zero) and decreasing the outer step size (starts at the previously mentioned maximum), until the difference is smaller than a user-defined, small value.
- Once the difference between inner and outer step size is small enough, the inner step size is taken as "touching" the edge of the solution space, and thereby constituting the maximum feasible step size. The minimum feasible step size (can be negative) is gained from performing step 2 in the opposite direction. Thereby the minimum and maximum step size is determined.
- Perform one step in the random walk by sampling one step size factor theta from the previously calculated range, multiplying the step direction vector by that factor, and adding the product to the current metabolite concentrations vector. The feasibility of the new vector, i.e. the new metabolite concentration set, is checked. The algorithm stops and prints an error if an infeasible point is reached, which could happen for example if the solution space would be noncontinuous or concave. The error is however more likely to occur from infeasible model parameters (the input files and variables).
- Repeat from step 1 until the desired number of steps through the solution space have been performed.
Practical examples of using the thermosampler
algorithm with sampling.py
are shown below.
- Operating system with Unix-like terminal (Tested on Ubuntu 18.04.5 LTS, 20.04.2 LTS, and MacOS 12.1)
- bash (Tested with 3.2.57, 4.4.20 and 5.0.17)
- Python ≥ 3.7 (Tested with 3.7.6 and 3.10.0)
- R ≥ 4.1.1 (Tested with 4.1.1)
- GNU parallel (Tested with 20161222 and 20220122)
- Python libraries: numpy, pandas, scipy, tqdm, joblib
- R libraries: doMC, foreach, ggridges, optparse, scales, tidyverse
It is recommended to run the Python scripts within a virtual environment. The virtual environment can be created using the following command:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
On Windows systems, replace "python3" with "py", and activate the virtual environment using the command venv\Scripts\activate.bat
. Please note that this software has not been tested on Windows systems.
For improved performance om Mac systems with M1 processors, numpy can be built with support for the Accelerate framework. This can lead to roughly a doubling of performance in hit-and-run sampling. For more information, see this Stack Overflow question.
mdf.py
performs Max-min Driving Force (MDF; Noor et al., 2014) and Network-Embedded
MDF (NEM; Asplund-Samuelsson et al., 2018) analysis. Run ./mdf.py -h
to list all options of the script.
./mdf.py --min_conc 0.0000001 --max_conc 0.1 \
--constraints examples/E_coli.concentrations.tab \
--ratios examples/E_coli.Lys_opt_ratios.tab \
--pathway examples/E_coli.Lys_pathway.txt \
examples/E_coli.model.tab \
examples/E_coli.model_drgs.tab \
results/E_coli_Lys_nem.csv
./mdf.py --min_conc 0.0000001 --max_conc 0.1 \
--constraints examples/Synechocystis.concentrations.tab \
--ratios examples/Synechocystis.Lys_opt_ratios.tab \
--pathway examples/Synechocystis.Lys_pathway.txt \
examples/Synechocystis.model.tab \
examples/Synechocystis.model_drgs.tab \
results/Synechocystis_Lys_nem.csv
./mdf.py --min_conc 0.0000001 --max_conc 0.1 \
--constraints examples/tca.concentrations.tab \
--ratios examples/tca.ratios.tab examples/tca.model.tab \
examples/tca.model_drgs.tab results/tca_mdf.csv
sampling.py
performs hit-and-run sampling of feasible metabolite concentration sets by a random walk through the solution space. Run ./sampling.py -h
to list all options of the script.
./sampling.py --reactions examples/tca.model.tab \
--std_drG examples/tca.model_drgs.tab \
--outfile results/tca.mdf_sampling.tab \
--constraints examples/tca.concentrations.tab \
--concs results/tca_mdf.csv --steps 1000 --mdf 1.3
Here we sample TCA cycle metabolite concentrations five times, computed in parallel and always starting from the MDF, saving every 10th step. In the first sampling all reactions must have a driving force > 0 kJ/mol, while in the second sampling all reactions must have a driving force > 1.18 kJ/mol, which is 90% of the MDF.
# Sample 5 replicates of 10000 steps in parallel at feasible delta G
for i in {1..5}; do
echo "./sampling.py --reactions examples/tca.model.tab \
--std_drG examples/tca.model_drgs.tab \
--ratios examples/tca.ratio_range.tab \
--outfile results/tca_sampling/tca.Feasible.${i}.tab \
--constraints examples/tca.concentrations.tab \
--concs results/tca_mdf.csv --steps 10000 --max_conc 1.2 -n 10"
done | parallel --no-notice --jobs 5
# Sample 5 replicates of 10000 steps in parallel at optimum delta G (90% of MDF)
for i in {1..5}; do
echo "./sampling.py --reactions examples/tca.model.tab \
--std_drG examples/tca.model_drgs.tab \
--ratios examples/tca.ratio_range.tab \
--outfile results/tca_sampling/tca.Optimal.${i}.tab \
--constraints examples/tca.concentrations.tab \
--concs results/tca_mdf.csv --steps 10000 --max_conc 1.2 \
--mdf 1.18 -n 10"
done | parallel --no-notice --jobs 5
stoich.py
converts a model in reaction table format to a model in stoichiometric matrix format. Run ./stoich.py -h
to list all options of the script. This representation of the model is needed for plotting in the next step.
./stoich.py examples/tca.model.tab results/tca.stoich.tab
plot_samples.R
creates principal component analysis (PCA), concentration, and driving force plots from hit-and-run sampling experiments. Run ./plot_samples.R -h
to list all options of the script. All sampling results files used as input must be stored without any other files in one directory (--indir
option). Sampling results files must have the <path>/<indir>/<experiment>.<group>.<replicate>.tab
name format (even if there is only one group and/or only one replicate).
./plot_samples.R -i results/tca_sampling -S results/tca.stoich.tab \
-G examples/tca.model_drgs.tab -c examples/tca.concentrations.tab \
-x C00001 -o results/tca_sampling_plots
Plotting yields the following output files:
results/tca_sampling_plots.sampling_pca.png
results/tca_sampling_plots.sampling_pca_combo.png
results/tca_sampling_plots.sampling_concs.pdf
results/tca_sampling_plots.sampling_concs_combo.pdf
results/tca_sampling_plots.sampling_dfs.pdf
results/tca_sampling_plots.sampling_dfs_combo.pdf
results/tca_sampling_plots.concs.pca.pdf
results/tca_sampling_plots.dfs.pca.pdf
The combo
tag indicates that replicates have been combined in the plot and are not displayed individually, as in the other variant of each plot type. Combining multiple replicates improves coverage of the solution space.
results/tca_sampling_plots.sampling_dfs_combo.pdf |
---|
Then, driving forces are plotted with replicates combined to yield a smoother representation of the thermodynamic solution space. It appears that R01082, i.e. fumarate hydratase, is somewhat of a thermodynamic bottleneck in this experiment. As for the concentrations, the Kolmogorov-Smirnov D statistic ("Difference") is used to highlight differences in the distribution shapes. |
The plot_samples.R
script can be supplied with a config file to filter, order, and rename reactions and metabolites. The example config file examples/tca.plot_config.tab
is provided. Let's look at the contents:
cat examples/tca.plot_config.tab | column -tn -s $'\t'
Id Name Type
R00351 Citrate synthase Reaction
R01325 Citrate hydro-lyase Reaction
R01900 Isocitrate hydro-lyase Reaction
R00709 Isocitrate dehydrogenase Reaction
R08549 2OG dehydrogenase Reaction
R00405 Succinyl-CoA synthetase Reaction
M00148 Succinate dehydrogenase Reaction
R01082 Fumarate hydratase Reaction
R00342 Malate dehydrogenase Reaction
C00122 Fumarate Metabolite
C00149 Malate Metabolite
C00311 Isocitrate Metabolite
C00036 Oxaloacetate Metabolite
C00042 Succinate Metabolite
Feasible Free Group
Optimal MDF Group
The config file has three columns:
Id
is the identifier used for reactions, metabolites, or groups in the model and results files.Name
is the alternative name we want to display in the plots, replacing the identifiers that are otherwise displayed.Type
is one ofReaction
,Metabolite
, orGroup
, and defines what variable the identifier and name refers to. At least one line for each of the three types must be present in the config file to create meaningful plots and avoid crashes.
The exact order of reactions, metabolites, and groups in the config file is used to order the plots in the output accordingly. Thus, we use the plot config file to get nicely ordered TCA cycle plots with meaningful names:
./plot_samples.R -i results/tca_sampling -S results/tca.stoich.tab \
-G examples/tca.model_drgs.tab -c examples/tca.concentrations.tab \
-C examples/tca.plot_config.tab -o results/tca_cfg/tca_cfg_sampling_plots
These are the plots made with the help of the config file:
results/tca_cfg/tca_cfg_sampling_plots.sampling_pca.png
results/tca_cfg/tca_cfg_sampling_plots.sampling_pca_combo.png
results/tca_cfg/tca_cfg_sampling_plots.sampling_concs.pdf
results/tca_cfg/tca_cfg_sampling_plots.sampling_concs_combo.pdf
results/tca_cfg/tca_cfg_sampling_plots.sampling_dfs.pdf
results/tca_cfg/tca_cfg_sampling_plots.sampling_dfs_combo.pdf
results/tca_cfg/tca_cfg_sampling_plots.concs.pca.pdf
results/tca_cfg/tca_cfg_sampling_plots.dfs.pca.pdf
The concentrations now display only the five selected metabolites, C00122, C00149, C00311, C00036, and C00042, using their names Fumarate, Malate, Isocitrate, Oxaloacetate, and Succinate. Note that the groups of samples have been renamed to Free and MDF.
The driving forces are now displayed under more informative reaction names ordered according to their position in the TCA cycle.
The metabolite concentrations and reaction driving forces PCA plots show the data of interest according to the definitions in the plot config file.
One application of thermosampler is the study of the metabolite concentrations and thermodynamics of metabolic reactions in microbial communities. Syntrophic propionate oxidation and methanogenesis was considered in this case study.
A model was prepared for a syntrophic propionate oxidizing bacterium (SPOB) based on Candidatus Syntrophopropionicum ammoniitolerans (Fig 2). The model was parameterized using data from literature and Equilibrator.
Model component | File |
---|---|
Reactions describing metabolic network | data/methylmalonyl.model.tab |
Stoichiometric matrix from stoich.py |
data/methylmalonyl.stoich.tab |
Standard reaction Gibbs free energy changes | data/methylmalonyl.model_drGs.tab |
Default metabolite concentration ranges | data/methylmalonyl.concentrations.tab |
Fixed concentration ratios for MDF | data/methylmalonyl.ratios.tab |
Concentration ratio ranges for sampling | data/methylmalonyl.ratio_range.tab |
Fig 2. Model of syntrophic propionate oxidizing bacterium. Protons (H+) are not shown. Fdred, reduced ferredoxin; Fdox, oxidized ferredoxin; MK, menaquinone. |
Fixed extracellular high and low propionate (input metabolite) and acetate (output metabolite) concentrations were defined to allow thermosampler analysis regarding the thermodynamic effects of such conditions.
Concentrations | File |
---|---|
Default | data/mmcoa_fixed_prp_ac/methylmalonyl.concentrations.free.tab |
High acetate | data/mmcoa_fixed_prp_ac/methylmalonyl.concentrations.acHi.tab |
Low propionate | data/mmcoa_fixed_prp_ac/methylmalonyl.concentrations.prpLo.tab |
High prp/Low ac | data/mmcoa_fixed_prp_ac/methylmalonyl.concentrations.prpHi_acLo.tab |
Low prp/High ac | data/mmcoa_fixed_prp_ac/methylmalonyl.concentrations.prpLo_acHi.tab |
The SPOB model was used to sample thermodynamically feasible metabolite concentrations for all the conditions above as described in this Bash script:
source/methylmalonyl_fixed_prp_ac.sh
The results were plotted using plot_samples.R
and yielded the following output:
Full dataset plots.
# Random walk PCA
results/mmcoa_fixed_prp_ac_plots.sampling_pca_combo.png
results/mmcoa_fixed_prp_ac_plots.sampling_pca.png
# Concentrations and driving forces PCA
results/mmcoa_fixed_prp_ac_plots.concs.pca.pdf
results/mmcoa_fixed_prp_ac_plots.dfs.pca.pdf
# Concentration distributions
results/mmcoa_fixed_prp_ac_plots.sampling_concs_combo.pdf
results/mmcoa_fixed_prp_ac_plots.sampling_concs.pdf
# Driving force distributions
results/mmcoa_fixed_prp_ac_plots.sampling_dfs_combo.pdf
results/mmcoa_fixed_prp_ac_plots.sampling_dfs.pdf
Reduced dataset plots.
# Random walk PCA
results/mmcoa_fixed_prp_ac_plots_reduced.sampling_pca_combo.png
results/mmcoa_fixed_prp_ac_plots_reduced.sampling_pca.png
# Concentrations and driving forces PCA
results/mmcoa_fixed_prp_ac_plots_reduced.concs.pca.pdf
results/mmcoa_fixed_prp_ac_plots_reduced.dfs.pca.pdf
# Concentration distributions
results/mmcoa_fixed_prp_ac_plots_reduced.sampling_concs_combo.pdf
results/mmcoa_fixed_prp_ac_plots_reduced.sampling_concs.pdf
# Driving force distributions
results/mmcoa_fixed_prp_ac_plots_reduced.sampling_dfs_combo.pdf
results/mmcoa_fixed_prp_ac_plots_reduced.sampling_dfs.pdf
The SPOB was given company by an acetoclastic methanogen and a hydrogenotrophic methanogen to produce a multi-organism syntrophic model. To distinguish the different organisms' reactions, a prefix was introduced; po_
for the SPOB, and ac_
and hm_
for the acetoclastic and hydrogenotrophic methanogens. The three organisms affected eachother through the overlap in metabolites, i.e. hydrogen, formate, acetate, methane, and carbon dioxide. The model was parameterized using data from literature and Equilibrator.
Model component | File |
---|---|
Reactions describing metabolic network | data/syntrophic.model.tab |
Stoichiometric matrix from stoich.py |
data/syntrophic.stoich.tab |
Standard reaction Gibbs free energy changes | data/syntrophic.model_drGs.tab |
Default metabolite concentration ranges | data/syntrophic.concentrations.tab |
Individual concentration sums for each organism | data/syntrophic.conc_sums.tab |
Fixed concentration ratios for MDF | data/syntrophic.ratios.tab |
Concentration ratio ranges for sampling | data/syntrophic.ratio_range.tab |
Additionally, a file with a fixed high methane concentration was prepared to make contrast with free methane concentration thermodynamics:
data/syntrophic.concentrations.high_CH4.tab
The syntrophic model was used to sample thermodynamically feasible metabolite concentrations in free and high methane concentration conditions, as described in these Bash scripts:
source/syntrophic_2x10x10M.sh
source/syntrophic_2x10x10M.high_CH4.sh
The results were plotted using plot_samples.R
and yielded the following output:
Methane comparison plots.
# Random walk PCA
results/syntrophic_CH4.sampling_pca_combo.png
results/syntrophic_CH4.sampling_pca.png
# Concentrations and driving forces PCA
results/syntrophic_CH4.concs.pca.pdf
results/syntrophic_CH4.dfs.pca.pdf
# Concentration distributions
results/syntrophic_CH4.sampling_concs_combo.pdf
results/syntrophic_CH4.sampling_concs.pdf
# Driving force distributions
results/syntrophic_CH4.sampling_dfs_combo.pdf
results/syntrophic_CH4.sampling_dfs.pdf
Johannes Asplund-Samuelsson (johannes.aspsam@gmail.com)