The MFD Atlas

tbnj edited this page Oct 21, 2024 · 13 revisions

Microflora Danica is the atlas of the environmental microbiomes of Denmark. It relies on a large-scale dataset encompassing 10,686 shotgun metagenomes and 449 full-length 16S and 18S rRNA collections, linked to a detailed 5-level habitat ontology. The manuscript shows that although human-disturbed habitats have high alpha diversity, the same species recur across them, revealing hidden homogeneity and underlining the importance of natural systems for total species (gamma) diversity. In-depth studies of nitrifiers, a functional group closely linked to climate change, challenge existing perceptions of habitat preference and reveal several novel nitrifiers that are more abundant than canonical nitrifiers. Together, the Microflora Danica dataset provides an unprecedented resource and the foundation for answering fundamental questions in microbial ecology: what drives microbial diversity, distribution and function?

Background

In 1752, in the spirit of the Age of Enlightenment, King Frederik V of Denmark commissioned Georg C. Oeder to undertake the Flora Danica project, initiating an “Opus Incomparibile” that took 122 years to complete and became one of the most remarkable works in natural history. The resulting atlas comprised 3000 botanical engravings of flowers and plants across 17 volumes. In 2019, we initiated the Microflora Danica (MFD) project with the aim of cataloguing the microbiomes of Denmark.

The results of Microflora Danica

Section I: The Microflora Danica dataset

The MFD dataset includes 10,686 shotgun metagenomes and 449 full-length 16S and 18S rRNA collections, linked to a detailed 5-level habitat ontology. The full-length collections (14.9 million bacterial rRNA operons, 6.4 million bacterial V1-V8 UMI sequences, and 13.4 million eukaryotic rRNA operons) were augmented with publicly available datasets, yielding a combined dataset an order of magnitude larger than SILVA 138.1. This combined dataset was used to generate the Microflora Global (MFG) database of 350,815 species representatives (98.7% nucleotide identity) using the AutoTax framework. The metagenomic samples have an average sequencing depth of 4.5 Gbp, totalling 48.2 Tbp for the entire project.

With a total geographic area of 42,952 km², the sampling density in MFD is roughly 1 metagenomic sample per 4 km². The samples cover the main land use/land cover (LULC) categories as defined by Basemap04: intensive agriculture (23,457.7 km²; 54.4 %), built and infrastructure (6,010.1 km²; 13.9 %), forest (5,713.6 km²; 13.3 %), nature areas (3,977.1 km²; 9.2 %), extensive agriculture (3,590.7 km²; 8.3 %) and streams and lakes (1,211.3 km²; 2.8 %). The main numbers, the sampling map and the Sankey plot of the habitat breakdown can be reproduced using the following GitHub repo. The shotgun metagenomes were assembled and binned, yielding a total of 19,253 metagenome-assembled genomes (MAGs) of at least MIMAG medium quality, representing 5,518 species groups at 95% average nucleotide identity (ANI).
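The headline figures in this section can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in the text; the variable names are illustrative. Because the mean depth of 4.5 Gbp is itself rounded, the total it implies is only expected to land close to the reported 48.2 Tbp, not to match it exactly.

```python
# Figures quoted in Section I.
total_area_km2 = 42_952
n_metagenomes = 10_686
mean_depth_gbp = 4.5  # rounded mean sequencing depth per metagenome

# Average area represented by one metagenomic sample.
km2_per_sample = total_area_km2 / n_metagenomes

# Project total implied by the rounded mean depth (1 Tbp = 1,000 Gbp).
approx_total_tbp = n_metagenomes * mean_depth_gbp / 1_000

print(f"{km2_per_sample:.2f} km2 per metagenomic sample")
print(f"{approx_total_tbp:.1f} Tbp implied by the rounded mean depth")
```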

Section II: The Danish microbiome is diverse and mirrors the global microbiome

The Microflora Danica full-length 16S rRNA dataset contains 21.3 million sequences representing 168,938 species-level (98.7%) OTUs. Pan-habitat rarefactions of the 16S rRNA sequences indicate that the data capture Denmark’s dominant species in the investigated habitats. The taxonomic diversity, measured as the number of species representatives in the Danish habitats under investigation, was quantified using rarefaction (interpolation) and prediction (extrapolation) with Hill numbers of order q, as explained in this repo. The species richness (q = 0) was estimated to be 195,384, of which 75,661 species can be considered common or typical (q = 1). This sets a minimum estimate of the total free-living bacterial gamma diversity of Denmark.
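For readers unfamiliar with Hill numbers, the following minimal sketch computes the plain (empirical) Hill number of order q from a vector of taxon counts. It does not perform the rarefaction or extrapolation used for the estimates above, and the function name and example counts are hypothetical. With q = 0 every observed taxon counts equally (richness), whereas q = 1 weights taxa by their relative abundance and so reflects the number of common taxa.

```python
import math

def hill_number(counts, q):
    """Empirical Hill number of order q for a list of taxon counts."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    if q == 1:
        # The general formula is undefined at q = 1; its limit is the
        # exponential of the Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** q for p in props) ** (1 / (1 - q))

# Hypothetical counts for five taxa in one sample.
counts = [50, 30, 10, 5, 5]
for q in (0, 1, 2):
    print(q, round(hill_number(counts, q), 2))
```

For a perfectly even community all orders coincide with the number of taxa; the more uneven the community, the further the q = 1 and q = 2 values fall below the richness.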

The sequences were also compared to public databases, demonstrating that we had uncovered significant phylogenetic novelty at lower taxonomic levels (species and genus), while the higher ranks were already well covered. We also evaluated the coverage of the Microflora Global (MFG) database and its ability to classify 16S rRNA amplicons and metagenomic reads. This showed significant improvements over state-of-the-art databases such as GreenGenes2, GTDB, and SILVA, not only for data from the MFD project but also for data from the Global Prokaryote Consensus (GPC) project. Finally, we investigated the diversity of microeukaryotes in the MFD habitats based on 13.4 million full-length 18S rRNA sequences representing 12,440 species-level (99.0%) OTUs. The analyses can be reproduced using the following GitHub repo.

Section III: Alpha and gamma diversity as habitat management discriminators

By combining the full-length 16S rRNA sequences with the 16S rRNA profiles derived from the metagenomes, we investigated three types of diversity: alpha (how many taxa do we see in a sample?), beta (how different is the taxon composition between samples?), and gamma diversity (how big is the total taxon pool across a group of samples?).

We show that samples from disturbed or highly disturbed habitats have higher alpha diversity than samples from less disturbed habitats - an observation that has also been reported at the continental and global scale. These observations contradict what is observed in analyses of above-ground biodiversity. However, if the focus is shifted to gamma diversity, the disturbed habitats are associated with lower overall diversity - reflecting what is observed above ground. This is a result of community homogenisation in disturbed habitats - a conclusion that is also supported by the analysis of within-group beta diversity. These results lead us to recommend gamma diversity as a metric for monitoring microbial diversity, e.g. when evaluating the effect of land use perturbation or climatic change. This analysis can be reproduced using the following GitHub repo.
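The apparent paradox of high alpha but low gamma diversity can be illustrated with a toy example. The sketch below uses made-up taxon sets (not MFD data) and a hypothetical helper, alpha_gamma, that returns the mean per-sample richness and the size of the pooled taxon set for a group of samples.

```python
from statistics import mean

def alpha_gamma(samples):
    """Return (mean alpha, gamma) for a list of per-sample taxon sets."""
    alpha = mean(len(s) for s in samples)   # mean richness per sample
    gamma = len(set().union(*samples))      # richness of the pooled group
    return alpha, gamma

# Made-up data: disturbed samples are rich but nearly identical,
# natural samples are poorer but share few taxa with one another.
disturbed = [{"a", "b", "c", "d"}, {"a", "b", "c", "d"}, {"a", "b", "c", "e"}]
natural = [{"f", "g"}, {"h", "i"}, {"j", "k", "f"}]

for name, group in (("disturbed", disturbed), ("natural", natural)):
    alpha, gamma = alpha_gamma(group)
    print(f"{name}: mean alpha = {alpha:.2f}, gamma = {gamma}")
```

In this toy case the disturbed group has the higher mean alpha diversity, yet the natural group has the larger pooled taxon set, because its samples overlap less. That reduced overlap is what a higher within-group beta diversity captures.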

Section IV: Convergence of supervised and unsupervised habitat descriptors

The Microflora Danica dataset presents a unique opportunity to investigate how well microbial communities corroborate habitat ontologies defined by macroflora and abiotic observations. In general, we found good separation between the different MFDO1 habitats in ordination space (with the exception of MFDO1 “Bogs, mires and fens”). Distance decay analysis of the prokaryotic communities showed that spatial autocorrelation was negligible at distances above 10 km. Based on this finding, we computed a spatially thinned dataset of 2,122 samples at the MFDO1 level using the official 10 km reference grid of Denmark. Clustering based on between-group beta diversity of samples with the same MFDO1 classification largely captured the expected relationships between the different habitats.
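Spatial thinning on a regular grid can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the project: coordinates are assumed to be projected and given in metres, at most one sample is kept per combination of grid cell and habitat label, and the first sample encountered is the one retained. The field names and habitat labels are hypothetical.

```python
def thin_by_grid(samples, cell_size_m=10_000):
    """Keep the first sample seen for each (grid cell, habitat) pair."""
    kept = {}
    for s in samples:
        # Index of the grid cell that contains the sample.
        cell = (int(s["x"] // cell_size_m), int(s["y"] // cell_size_m))
        # setdefault leaves an already stored sample untouched.
        kept.setdefault((cell, s["habitat"]), s)
    return list(kept.values())

# Hypothetical samples with projected coordinates in metres.
samples = [
    {"id": 1, "x": 1_000, "y": 2_000, "habitat": "habitat_A"},
    {"id": 2, "x": 9_000, "y": 9_500, "habitat": "habitat_A"},  # same cell and habitat as 1
    {"id": 3, "x": 9_000, "y": 9_500, "habitat": "habitat_B"},  # same cell, other habitat
    {"id": 4, "x": 12_000, "y": 2_000, "habitat": "habitat_A"},  # neighbouring cell
]

thinned = thin_by_grid(samples)
print([s["id"] for s in thinned])
```

Keeping the first sample is an arbitrary choice made for brevity; a random draw per cell would serve the same purpose of limiting spatial autocorrelation.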

To take advantage of the high-resolution MFD ontology, we used the prokaryotic 16S rRNA gene fragment counts as predictor variables in a random forest habitat classification across all MFD ontology levels and taxonomic ranks (phylum to genus). The analysis can be reproduced using the following GitHub repo. Some habitats were difficult to model, e.g. agricultural fields, where a large proportion of taxa was shared, while the habitats with the highest model scores harboured more specialised microbiomes (e.g. “Saltwater” and “Wastewater”).

Section V: Core genera are abundant and prevalent across habitats in Denmark

We used the metagenome-derived 16S rRNA gene community profiles to identify core community members at each of the 5 levels of the habitat ontology, based on mean relative abundance and habitat-specific prevalence. The median size and the cumulative relative abundance of the core community varied with increasing specificity of the habitat classifications. The results align with findings reported for terrestrial ecosystems across the Earth at similar habitat specificity.

Across the full habitat ontology, 565 core genera were identified. Habitat-specific core genera were primarily found in habitats associated with strongly selective environmental gradients (e.g. halotolerance) or in well-defined habitats such as biogas systems. Where such gradients are absent, genera tend to be prevalent across multiple habitats, as exemplified by the lack of unique genera for different crop types in agricultural fields, which corroborates the modelling results from Section IV. The scripts for performing this analysis can be found in the following GitHub repo.
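A core-community filter that combines mean relative abundance and prevalence can be sketched as below. The thresholds (a mean relative abundance of at least 0.1 % and presence in at least 50 % of samples) are placeholders chosen for illustration, not the cut-offs used in the study, and the function and genus names are hypothetical.

```python
def core_genera(profiles, min_mean_abundance=0.001, min_prevalence=0.5):
    """Return genera passing both thresholds within one habitat.

    profiles: one dict per sample, mapping genus -> relative abundance.
    """
    n = len(profiles)
    genera = set().union(*profiles)  # union of the dict keys
    core = []
    for genus in sorted(genera):
        values = [p.get(genus, 0.0) for p in profiles]
        mean_abundance = sum(values) / n
        prevalence = sum(v > 0 for v in values) / n  # fraction of samples
        if mean_abundance >= min_mean_abundance and prevalence >= min_prevalence:
            core.append(genus)
    return core

# Hypothetical relative-abundance profiles for four samples of one habitat.
profiles = [
    {"genus_A": 0.40, "genus_B": 0.0004, "genus_C": 0.20},
    {"genus_A": 0.35, "genus_B": 0.0006, "genus_D": 0.05},
    {"genus_A": 0.50, "genus_B": 0.0005},
    {"genus_A": 0.45, "genus_D": 0.03},
]

core = core_genera(profiles)
print(core)
```

Here genus_B is prevalent but too rare, and genus_C is abundant in one sample but not prevalent, so neither passes both criteria. Running the same filter per habitat and comparing the resulting lists is one way to separate habitat-specific core genera from those shared across habitats.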

Section VI: MAGs from Danish habitats double the known species fraction in metagenomes

The Microflora Danica metagenomic dataset contains a substantial amount of microbial novelty and reveals a high proportion of non-prokaryotic DNA that could be a key resource for future analyses. Furthermore, the shallow shotgun metagenomes were used for metagenome-assembled genome (MAG) recovery. A total of 19,253 MAGs were obtained, which de-replicated into 5,518 non-redundant (species-level) MAGs, including 422 MAGs that were high quality by MIMAG guidelines. Most of the MAGs represent novel prokaryotic species from different branches of the tree of life, as presented in the following GitHub repo. The MAGs recovered in this project also offer insight into Danish microorganisms, as they double the known species fraction in the metagenomes. However, despite this progress, a large part of the Danish prokaryotic community remains unrepresented. This highlights the importance of further large-scale sampling efforts for completing the tree of life. The analysis can be reproduced using the following GitHub repo.
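De-replication at a species-level ANI threshold can be illustrated with a simple greedy clustering. This is a sketch of the general idea only and not the tool or settings used in the project: it assumes the genomes are already ordered from best to worst quality and that pairwise ANI values are available in a lookup table. The function name, genome names and ANI values are hypothetical.

```python
def dereplicate(genomes, ani, threshold=95.0):
    """Greedy species-level clustering of genomes.

    genomes: genome names, assumed sorted from best to worst quality.
    ani: dict mapping frozenset({a, b}) -> ANI in percent; pairs missing
         from the table are treated as unrelated.
    Returns a dict mapping each representative to its cluster members.
    """
    clusters = {}
    for genome in genomes:
        for rep in clusters:
            if ani.get(frozenset((genome, rep)), 0.0) >= threshold:
                clusters[rep].append(genome)  # join the first matching cluster
                break
        else:
            clusters[genome] = [genome]       # no match: new representative
    return clusters

# Hypothetical MAGs and pairwise ANI values.
genomes = ["mag1", "mag2", "mag3", "mag4"]
ani = {
    frozenset(("mag1", "mag2")): 98.2,
    frozenset(("mag1", "mag3")): 80.1,
    frozenset(("mag3", "mag4")): 96.0,
}

clusters = dereplicate(genomes, ani)
print(len(clusters), "species-level representatives:", list(clusters))
```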

Section VII: Distribution patterns of canonical and novel nitrifiers across Danish habitats

Nitrification, the aerobic conversion of ammonium to nitrite and nitrate, is an essential step in the biogeochemical nitrogen cycle and has a significant impact on ecosystem health. To investigate the diversity, distribution and novelty of nitrifiers across Denmark, we improved gene-based search models by incorporating the nitrification marker genes from the recovered MFD MAGs. GraftM packages and nitrifier gene abundances can be found at the MFD Zenodo repo. Using the updated search models, we detected distinct differences in the nitrifier communities between Danish habitats. The analysis can be reproduced using the following GitHub repo. Investigation of the recovered MFD MAGs revealed several novel groups of potential nitrifiers that appear abundant and widespread in the natural habitats of Denmark. Our results underline the importance of studying the distribution and diversity of both canonical and novel nitrifiers, and how they respond to changes in environmental factors.