A key feature of MicrobiomeDB is an automated pipeline that loads data from microbiome experiments using the standard Biological Observation Matrix (.biom) format as input. This file format can be produced for any experiment processed with the widely used software suites QIIME and Mothur, and it is also the standard format used by both the Earth Microbiome Project and the Human Microbiome Project. Relative abundance data are extracted from the .biom file and mapped by ID to the GreenGenes database to retrieve full 16S sequences, NCBI taxon identifiers, and taxonomic strings. Alpha-diversity metrics are pre-calculated, and these data are loaded together into the database.
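To make the extraction step concrete, the following is a minimal sketch, not the MicrobiomeDB pipeline itself. It assumes the input is a sparse BIOM 1.0 table in its JSON form, and it uses the Shannon index as one example of an alpha-diversity metric; the abstract does not say which metrics are pre-calculated. The function names and the toy table are hypothetical.

```python
import json
import math


def load_sparse_biom(text):
    """Parse a sparse BIOM 1.0 JSON document into {sample_id: {taxon_id: count}}."""
    doc = json.loads(text)
    if doc.get("matrix_type") != "sparse":
        raise ValueError("this sketch only handles sparse BIOM 1.0 tables")
    row_ids = [row["id"] for row in doc["rows"]]      # observations (taxa / OTUs)
    col_ids = [col["id"] for col in doc["columns"]]   # samples
    counts = {sample: {} for sample in col_ids}
    # Each sparse entry is [row index, column index, count].
    for row_idx, col_idx, value in doc["data"]:
        counts[col_ids[col_idx]][row_ids[row_idx]] = value
    return counts


def relative_abundance(sample_counts):
    """Convert raw counts for one sample into fractions that sum to 1."""
    total = sum(sample_counts.values())
    if total == 0:
        return {}
    return {taxon: n / total for taxon, n in sample_counts.items()}


def shannon_index(sample_counts):
    """Shannon diversity H = -sum(p * ln p) over taxa with nonzero abundance."""
    fractions = relative_abundance(sample_counts).values()
    return -sum(p * math.log(p) for p in fractions if p > 0)


# Toy table (hypothetical IDs): two taxa, two samples.
EXAMPLE_BIOM = json.dumps({
    "format": "Biological Observation Matrix 1.0.0",
    "matrix_type": "sparse",
    "shape": [2, 2],
    "rows": [{"id": "otu_1", "metadata": None}, {"id": "otu_2", "metadata": None}],
    "columns": [{"id": "sample_A", "metadata": None}, {"id": "sample_B", "metadata": None}],
    "data": [[0, 0, 50], [1, 0, 50], [0, 1, 10]],
})

table = load_sparse_biom(EXAMPLE_BIOM)
abundances = {sample: relative_abundance(c) for sample, c in table.items()}
diversity = {sample: shannon_index(c) for sample, c in table.items()}
```

In this toy table, sample_A is split evenly between two taxa and sample_B contains a single taxon, so sample_A has the higher Shannon index. The lookup against GreenGenes by ID is not shown, since it depends on the reference files the pipeline actually uses.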
Although taxa abundance tables and diversity metrics are useful, the real power of MicrobiomeDB comes from the fact that all the 'metadata' terms used by the experimenter to describe each sample are also extracted from the .biom file. These terms are mapped to the MIxS ontology; unmapped terms are manually curated and used to expand a custom, MIxS-compliant ontology tree. This rich, structured sample description is used to generate an ISA.tab file that is then loaded into MicrobiomeDB. Combined with the extensive web toolkit and infrastructure developed by EuPathDB, this gives the user a web interface for interrogating complex, even massive-scale, microbiome studies using metadata queries. The query results are then visualized using Shiny app plug-ins available directly in the browser.
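The term-mapping step can be illustrated with a short sketch under stated assumptions. The lookup table below is a made-up stand-in for the ontology, the normalization rule is an assumption, and the function names are hypothetical; the actual mapping and curation workflow is not described in this abstract.

```python
def normalize(term):
    """Reduce a free-text metadata label to a comparable key (assumed rule)."""
    return term.strip().lower().replace(" ", "_").replace("-", "_")


def map_metadata_terms(sample_metadata, ontology_lookup):
    """Split one sample's metadata into ontology-mapped and unmapped terms.

    Unmapped terms are returned unchanged so that a curator can review them
    and decide whether to extend the ontology tree.
    """
    mapped, unmapped = {}, {}
    for term, value in sample_metadata.items():
        key = normalize(term)
        if key in ontology_lookup:
            mapped[ontology_lookup[key]] = value
        else:
            unmapped[term] = value
    return mapped, unmapped


# Illustrative lookup only: normalized label -> ontology term name.
ONTOLOGY_LOOKUP = {
    "collection_date": "collection_date",
    "lat_lon": "lat_lon",
    "sampling_date": "collection_date",  # hypothetical synonym
}

sample_metadata = {
    "Collection Date": "2015-06-01",
    "lat-lon": "39.95 N 75.16 W",
    "DietType": "vegan",
}

mapped, unmapped = map_metadata_terms(sample_metadata, ONTOLOGY_LOOKUP)
```

Here "Collection Date" and "lat-lon" resolve to ontology terms, while "DietType" falls into the unmapped set that would go to manual curation.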
In its current state, MicrobiomeDB is a 'first-pass' example of microbiome data mining. We envision significantly expanding our pipeline to load additional 16S rRNA databases, metadata that describe taxa (e.g. basic microbiological properties), and bacterial metabolic pathway databases (e.g. KEGG), among other resources. Although the experimental datasets currently loaded are from 16S rRNA marker gene sequencing, our pipeline would also accommodate similarly formatted taxa abundance data from shotgun metagenomic studies, and future functionality could allow loading tables of bacterial gene expression data derived from these studies. With these additions, we hope to develop a full-featured, open-source platform for a systems biology view of microbial communities.