Home
GDCtools is a set of open-source, config-file-driven Python and UNIX CLI utilities for interacting with the NIH/NCI Genomics Data Commons (GDC) and automating the data cleansing, aggregation and reporting steps that are common to most data-driven science projects. It grew from efforts at the Broad Institute to adapt the GDAC Firehose pipeline and portal, developed in TCGA, to use the GDC as their primary source of data, but aims to go well beyond that.

By wrapping the GDC API in a set of rigorously defined and domain-aware tools, GDCtools lets users interact with the GDC in idioms familiar to them as biomedical researchers and informaticians, rather than as web or database programmers. This can make it simpler to search and retrieve either legacy or harmonized data and metadata from the GDC, and can shorten the learning curve and reduce staffing needs, while providing indispensable features like:

- turnkey creation of date-stamped snapshots of data;
- aggregating multiple samples into a single bolus for ready consumption by scientific algorithms;
- ensuring that samples are identifiable by project (e.g. restoring TCGA ids to SNP6 segments);
- sample report generation;
- sample freeze list (load file) creation, for either on-premise or cloud storage (e.g. Google);
- aggregate cohort construction (e.g. combining TCGA STAD + ESCA cohorts into STES, with just 1 line in a config file);
- retrieving an entire project or just 1 case, with equal ease;
- easily combining data across multiple projects, e.g. TCGA and CPTAC.

All of this is provided within a well-tested, object-oriented framework that users can readily comprehend and extend. For more information, see the README, this pictorial overview, and the documentation given in this Wiki.
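To give a sense of what wrapping the GDC API saves, the sketch below builds a raw query against the public GDC `files` endpoint by hand. It is not GDCtools code and does not reflect GDCtools internals: the function name `build_files_query` and the chosen fields are illustrative assumptions, and the request is only constructed, not sent.

```python
import json
from urllib.parse import urlencode

# Public GDC REST endpoint for file metadata.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_files_query(project_id, data_category, size=10):
    """Return a GET URL listing files for one project and data category.

    Hypothetical helper for illustration only; not part of GDCtools.
    """
    # The GDC API expects filters as a nested JSON expression of operators.
    filters = {
        "op": "and",
        "content": [
            {
                "op": "in",
                "content": {
                    "field": "cases.project.project_id",
                    "value": [project_id],
                },
            },
            {
                "op": "in",
                "content": {
                    "field": "data_category",
                    "value": [data_category],
                },
            },
        ],
    }
    params = {
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,cases.submitter_id",
        "format": "JSON",
        "size": str(size),
    }
    return GDC_FILES_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    # Example: copy number files for the TCGA stomach adenocarcinoma project.
    print(build_files_query("TCGA-STAD", "Copy Number Variation", size=5))
```

Even this small query requires knowing the endpoint, the filter grammar and the field names, which is the kind of web-programming detail that GDCtools aims to hide behind its config files and command line tools.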
Corresponding Author: Michael S. Noble (mnoble@broadinstitute.org)
Contributing Authors: Timothy DeFreitas (timdef@broadinstitute.org)
David Heiman (dheiman@broadinstitute.org)