This repository contains tools associated with the DHART (Digital HeART) portal, e.g. data retrieval from GEO, semi-automated metadata preparation, data wrangling, etc.
The DHART is based on the gEAR framework, developped by the Institute for Genome Sciences.
For Docker (Singularity) usage, see Containers.
If installing the package, first see Dependencies. Create a virtual environment, clone the repository and install dhart-utils:
# create virtual environment (Python >=3.9)
python3 -m venv /path/to/virtual/environment
# activate the environment
source /path/to/virtual/environment/bin/activate
# clone the repository
git clone https://github.com/eboileau/dhart-utils.git
cd dhart-utils
# install
pip3 install --upgrade pip setuptools wheel
pip3 --verbose install . 2>&1 | tee install.log
This script is used to download GEO series, prepare the metadata for ingestion into DHART, and data files for wrangling.
Call the program with --geo
for a single GEO identifier, or with --gfile
for many identifiers e.g. ACCESSION.txt.
To see the full list of parameters, call the program on the command line without any option: accession
.
This script is a data wrangling wrapper to prepare and reformat GEO supplementary files (output of accession
) into H5AD for input into the DHART portal.
The program currently only handles single-cell RNA-Seq and bulk RNA-Seq, and adheres to the gEAR standard formatting guidelines. To enable some of the portal functionalities, use --replicate-column
and --cluster-column
if relevant.
Note As DHART is being actively developed, formatting guidelines may change in the future!
The program handles two basic --input-format
: TXT (e.g. TXT, TSV, or CSV) and MEX, and each has is own set of options.
For the portal, single-cell RNA-Seq data is NOT normalized, but bulk RNA-Seq must be normalized (use --do-normalise
).
For plate-based single-cell RNA-Seq ( e.g. with well numbers as observations), pass --txt-first-row-names
, in particular if there is a
header for gene ids/symbol column. In general, however, avoid using headers! To see the full list of parameters, call the program on the command line without any option: wrangling
.
Unless you are using Docker (or Singularity), installing dhart-utils also requires a working R >4.1.x installation with packages listed in R-dependencies.r
Pinned version of selected Python packages are listed in requirements.txt for reproducible installation.
Note: Depending on your setup, environment variables may need to be updated, e.g. export LD_LIBRARY_PATH="/path/to/your/R/installation/lib/R/lib:$LD_LIBRARY_PATH"
after installing rpy2.
Docker images allow to deploy the package resources without any dependency tree issues. However, dhart-utils images are not yet published (hosted in an image registry) or integrated to GitHub actions, and thus needs to be built from the Dockerfile.
Assuming that Docker is installed, execute the following command from the root directory:
- accession:
docker build --rm -t accession:latest -f docker/accession/Dockerfile .
- wrangling:
docker build --rm -t wrangling:latest -f docker/wrangling/Dockerfile .
where the tag latest can be changed.
Once the image is built, the Docker run
command can be used to create a writeable container and start it:
- accession:
docker run accession:latest
(without options, shows help message and exits, equivalent to [--help/-h])
docker run -v "`pwd`/local":/out accession:latest -f /out/ACCESSION.txt -d /out --logging-level INFO --log-file /out/accession.log
Here, we use volume driver options, where the first field is an existing directory on the host machine (a directory named local under the current directory, where we have placed required input files e.g. ACCESSION.txt), and the second field is the path where the file or directory are mounted in the container. Options with input files or paths that are passed to the program must be relative to the mounted directory in the container.
- wrangling:
docker run wrangling:latest
(without options, shows help message and exits, equivalent to [--help/-h])
docker run -v /mnt/smb/prj/DHART/DATA/GSEXXXXX1:/out wrangling:latest --dest /out --name docker -fmt TXT --txt-gene-cols gene_id --gene-var gene_symbol --do-normalise --normalization-method TMM --logging-level INFO --log-file /out/wrangling.log
Here, the host machine's directory is /mnt/smb/prj/DHART/DATA/GSEXXXXX1, which must be the path to the directory where GEO files reside, including output from accession and/or pre-processed bulk RNA-seq count matrix (the path specified by [--dest]).
We strongly recommend to use Singularity. However, at present, we do not have recipes to build Singularity images, but they can be built from
Docker. First create a tarball of the Docker image, e.g. docker save 5fad7a46ecfd -o docker/accession/accession.tar
Build the singularity image singularity build accession.sif docker-archive:///prj/DHART/dhart-utils/local/accession.tar
Use the --sandbox
option as singularity build --sandbox accession docker-archive:///prj/DHART/dhart-utils/local/accession.tar
for development, by accessing the container singularity shell --bind /prj/DHART/dhart-utils/local/:/local accession
.
We use the --bind
to mount volumes, as for Docker above.
Data can only be read if it conforms to one of the provided data schemes. Using the --normalization-method
flag, a normalization method can be specified. TMM
, DESeq
and TPM
are all valid normalization methods, TMM
being the default. When processing data using TPM normalization, gene lengths must be provided separately. See below for details.
The file observations.tab.gz (output form accession
) and the bulk data in a file named GSEXXXXXX_read_count
(whith the correct GEO identifier, and any TXT extension) needs to be submitted for each dataset.
Note: The input read count matrix e.g. GSEXXXXXX_read_count.csv.gz
MUST be prepared by you. This is NOT automatically done by accession
or wrangling
. In some cases, data downloaded from GEO by accession
will be ready to input, but you must first ensure that it satisfies the input format.
Data schmeme choices include the following:
- 'gene_id': Includes just the gene ID, this is later used to look up the gene symbol. Both fields are then written to the H5AD file.
- 'gene_symbol': Includes just the gene symbol.
- 'both': Both the gene symbol and gene ID are already present in the input scheme, both will be written to the H5AD file.
Note: The column header for extra columns must be empty!
- gene_id
<sample#1> | ... | <sample#n> | |
---|---|---|---|
ENS... | 1.343 | ... | 1.472 |
ENS... | 2.872 | ... | 3.971 |
ENS... | 1.341 | ... | 1.349 |
- gene_symbol
<sample#1> | ... | <sample#n> | |
---|---|---|---|
A... | 1.343 | ... | 1.472 |
B... | 2.872 | ... | 3.971 |
C... | 1.341 | ... | 1.349 |
- both
<sample#1> | ... | <sample#n> | ||
---|---|---|---|---|
ENS... | A... | 1.343 | ... | 1.472 |
ENS... | B... | 2.872 | ... | 3.971 |
ENS... | C... | 1.341 | ... | 1.349 |
Currently no test are implemented. We're working on adding integration/regression tests baed on some sample data. Some minimal test data for bulk under /prj/DHART/DATA/test_data/RNAseq/GSEXXXXX1.
This project adheres to Semantic Versioning. See CHANGELOG.
GEO classes and parsing functions in geo_utils
are based on GEOparse, which does
not seem to be maintained anymore. These were modified to deal with GEO SERIES and SAMPLES metadata only.
Interacting with GEO/SRA online (data download) is done with pysradb.
Bulk RNA-seq normalization relies on standard functions implemented in the R packages edgeR and DESeq2.