diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 65489b5c..7fe6a604 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -100,7 +100,7 @@ When you're ready to contribute code to address an open issue, please follow the Our continuous integration (CI) testing runs [a number of checks](https://github.com/comorment/containers/actions) for each pull request on [GitHub Actions](https://github.com/features/actions). You can run most of these tests locally, which is something you should do *before* opening a PR to help speed up the review process and make it easier for us. - And finally, please update the [CHANGELOG](https://github.com/comorment/containers/blob/main/CHANGELOG.md) with notes on your contribution in the "Unreleased" section at the top. + And finally, please update the [CHANGELOG](CHANGELOG.md) with notes on your contribution in the "Unreleased" section at the top. After all of the above checks have passed, you can now open [a new GitHub pull request](https://github.com/comorment/containers/pulls). Make sure you have a clear description of the problem and the solution, and include a link to relevant issues. @@ -109,3 +109,119 @@ When you're ready to contribute code to address an open issue, please follow the +## Information for developers + +The list of tools included in the different Dockerfiles and installer bash scripts for each container +is provided [here](docker/README.md). Please keep this up to date when pushing new container builds. + +### Sphinx + +We use sphinx to generate online documentation from README.md files of this repository. +This uses [MyST](https://myst-parser.readthedocs.io) package to generate links in the documentation. 
+Here are a few rules that we follow across ``.md`` files to make it work well:
+
+* use full path to the file in this repository
+
+### Folder structure
+
+These folders are relevant to the users:
+* ``docs`` folder contains user documentation
+* ``usecases`` folder contains extended examples / tutorials
+* ``singularity`` folder contains pre-built containers
+* ``reference`` folder contains reference data used in use-cases
+* ``scripts`` folder contains pipelines such as ``gwas.py`` and ``pgs-toolkit``, as well as other helper scripts.
+
+These folders are relevant to developers:
+* ``docker`` folder contains several ``Dockerfile`` files (container definitions)
+and relevant shell scripts (in ``docker/scripts/``) used within those Dockerfiles. Unit tests validating functionality of the resulting containers are available in the ``tests`` folder.
+* ``sphinx-docs`` provides scripts used to build sphinx documentation.
+
+### Note about NREC machine
+
+We use an NREC machine to develop and build containers.
+The NREC machine has a small local disk (~20 GB) and a larger external volume attached (~400 GB).
+If you use the NREC machine, it's important not to store large data or install large software in your home folder, which is located on the small disk,
+using ``/nrec/projects space`` instead:
+
+```
+Filesystem                             Size  Used Avail Use% Mounted on
+/dev/sda1                               20G  9.6G  9.7G  50% /
+/dev/mapper/nrec_extvol-comorment      393G  346G   28G  93% /nrec/projects
+/dev/mapper/nrec_extvol_2-comorment_2  935G  609G  279G  69% /nrec/space
+```
+
+Both docker and singularity were configured to avoid placing cached files on the local file system.
+For docker this involves changing ``/etc/docker/daemon.json`` file by adding this: + +``` +{ + "data-root": "/nrec/projects/docker_root" +} +``` + +(as described ; you may use ``docker info`` command to check the data-root) + +For singularity, the configuration is described here +and it was done for the root user by adding the following line into /etc/environment + +``` +export SINGULARITY_CACHEDIR="/nrec/projects/singularity_cache" +``` + +Common software, such as git-lfs, is installed to /nrec/projects/bin. +Therefore it's reasonable for all users of the NREC comorment instance +to add this folder to the path by changing ``~/.bashrc`` and ``~/.bash_profile``. + +``` +export PATH="/nrec/projects/bin:$PATH" +``` + +A cloned version of comorment repositories is available here: + +``` +/nrec/projects/github/comorment/containers +/nrec/projects/github/comorment/reference +``` + +Feel free to change these folders and use git pull / git push. TBD: currently the folder is cloned as 'ofrei' user - I'm not sure if it will actually work to pull & push. But let's figure this out. + +### Testing container builds + +Some basic checks for the functionality of the different container builds are provided in ``/tests/``, implemented in Python. +The tests can be executed using the [Pytest](https://docs.pytest.org) testing framework. 
+ +To install Pytest in the current Python environment, issue: + +``` +pip install pytest # --user optional +``` + +New virtual environment using [conda](https://docs.conda.io/en/latest/index.html): + +``` +conda create -n pytest python=3 pytest -y # creates env "pytest" +conda activate pytest # activates env "pytest" +``` + +Then, all checks can be executed by issuing: + +``` +cd +py.test -v tests # with verbose output +``` + +Checks for individual containers (e.g., ``gwas.sif``) can be executed by issuing: + +``` +py.test -v tests/test_.py +``` + +Note that the proper container files (*.sif files) corresponding to the different test scripts must exist in ``/singularity/>``, +not only git LFS pointer files. + +### Git clone ignoring LFS + +See [stackoverflow.com/questions/42019529/how-to-clone-pull-a-git-repository-ignoring-lfs](https://stackoverflow.com/questions/42019529/how-to-clone-pull-a-git-repository-ignoring-lfs) +``` +GIT_LFS_SKIP_SMUDGE=1 git clone git@github.com:comorment/containers.git +``` diff --git a/README.md b/README.md index 87fc6c97..155fba22 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,9 @@ -# CoMorMent Containers +# COSGAP: A COntainerized Statistical Genetics Analysis Pipelines -The goal of the [CoMorMent](https://www.comorment.uio.no) containers repository at is to distribute tools for GWAS and post-GWAS analysis in CoMorMent project ([comorment.eu](https://comorment.eu)). +The goal of this github repository () is to distribute software tools for statistical genetics analysis, alongside with their respective reference data and scripts ("analysis pipelines") to facilitate application of these tools. The scope of this project is currently limited to genome-wide association studies (GWAS) and post-GWAS statistical-genetics analyses, including polygenic scoring (PGS). 
This project builds on earlier work by [Tryggve consortium](https://neic.no/tryggve/), +with most recent major development done as part of the CoMorMent EU H2020 project ([comorment.eu](https://comorment.eu)). For more information see our [preprint](https://arxiv.org/abs/2212.14103) manuscript, [this presentation](https://www.youtube.com/watch?v=msegdR2vJZs) on PGC WWL meeting (Feb 9, 2024), or our online documentation [here](https://comorment-containers.readthedocs.io/en/latest/). + +For an overview of available software, see [here](docs/README.md). Most of these tools are packaged into singularity containers () and shared in the [singularity](https://github.com/comorment/containers/tree/main/singularity) folder of this repository. You can download individual containers using github's ``Download`` button, or clone the entire repository from command line as described in the [Getting started](#getting-started) section below. @@ -14,6 +17,8 @@ More extensive use cases of containers, focusing on real data analysis, are prov The history of changes is available in the [CHANGELOG.md](./CHANGELOG.md) file. +If you would like to contribute to developing these containers, please see the [CONTRIBUTING](CONTRIBUTING.md) file. + Additional tools are available in separate repositories: * - LD score regression diff --git a/docker/README.md b/docker/README.md index 0ff65287..35c20496 100644 --- a/docker/README.md +++ b/docker/README.md @@ -1,32 +1,3 @@ -# Docker - -This repository is used to develop and document [Singularity](https://sylabs.io) or [Apptainer](https://apptainer.org) containers with various software and analytical tools for GWAS and post-GWAS analysis via [Docker](https://www.docker.com). 
- -## Getting started - -For new users we recommend to go over introductory instructions in [docs/singularity/hello.md](./../docs/singularity/hello.md), which explain the basic usage of singularity containers, using a minimalistic example (singularity container with ``plink`` binary). - -If you would like to contribute to developing these containers, please see the [CONTRIBUTING](./../CONTRIBUTING.md) file. - -For a tutorial on GWAS with synthetic data, see [usecases/gwas_demo.md](./../usecases/gwas_demo.md). - -### Prerequisites (to running tutorials) - -NOTE: This is out of date. Confer [usecases/README.md](./../usecases/README.md). - -* download container files shared on the [Google Drive](https://drive.google.com/drive/folders/1mfxZJ-7A-4lDlCkarUCxEf2hBIxQGO69?usp=sharing). -* download ``comorment_ref.tar.gz`` file from the above Google Drive folder, extract it with ``tar -xzvf comorment_ref.tar.gz`` command, - and create an environmental variable ``COMORMENT_REF`` pointing to the folder containing extracted ``comorment_ref.tar.gz`` data. - If you want to see the content of ``comorment_ref.tar.gz`` without downloading and extracting, - you may take a quick look [here](https://github.com/norment/comorment_data). This is a private repository, and you need to get access. - Please contact Oleksandr and Bayram by e-mail and send us your github user name. If you don't have it, create one [here](http://github.com/join). -* create an empty folder called ``data``, for storing the results and intermediate files produced by running containers. - (most instructinos mount this folder like this: ``-B data:/data``). - -## Description of available containers - -The detailed description of the available container [files](https://github.com/comorment/containers/tree/main/singularity) provided in this repository are found [here](./../docs/singularity/README.md). 
- ## Software versions Below is the list of tools included in the different Dockerfiles and installer bash scripts for each container. diff --git a/docs/README.md b/docs/README.md index 1d4eb4df..b8044459 100644 --- a/docs/README.md +++ b/docs/README.md @@ -2,8 +2,27 @@ ## Singularity -Brief descriptions of the available software container builds are available at [singularity](./singularity/README.md). +The list of all tools is provided on [this](/docs/software_list.md) page. +This software is organized into the following containers: +* [hello.sif](/docs/singularity/hello.md) - a simple container for demo purpose, allowing to experiment with singularity features +* [gwas.sif](/docs/singularity/gwas.md) - multiple tools (released as binaries/executables) for imputation and GWAS analysis +* [python3.sif](/docs/singularity/python3.md) - python3 environment with pre-installed modules and tools +* [r.sif](/docs/singularity/r.md) - R 4.0.5 environment with rareGWAMA, GenomicSEM, TwoSampleMR and GSMR packages installed (plus some standard R packages) -## Specifications +All containers have a common set of linux tools like ``gzip``, ``tar``, ``parallel``, etc. +Please [open an issue](https://github.com/comorment/containers/issues/new) if you'd like to add more of such basic tools, or if you would like to update some software to a newer version. 
+ +## Data Format Specifications + +To improve interoperability between different tools we developed the following data format specification: + +* [Genotypes data](/docs/specifications/geno_specification.md) +* [Phenotypes data](/docs/specifications/pheno_specification.md) +* [GWAS Summary Statistics](/docs/specifications/sumstats_specification.md) + +These format specifications are applicable to various scripts, released in this repository, including + +* [gwas.py](/scripts/gwas/README.md) - pipeline for GWAS analysis +* [LDpred2](/scripts/pgs/LDpred2/README.md) - command-line wrapper around LDpred2 +* [pgs_toolkit](/scripts/pgs/pgs_toolkit/README.md) - pipeline for PGS analysis -Specifications of the input data format for GWAS analysis, recommended in CoMorMent projects are documented at [specifications](./specifications/README.md) diff --git a/docs/singularity/README.md b/docs/singularity/README.md deleted file mode 100644 index 80be277e..00000000 --- a/docs/singularity/README.md +++ /dev/null @@ -1,12 +0,0 @@ -# Singularity - -Here is a brief overview of available container [files](./../../singularity/README.md) (more info in the links below): - -* [hello.sif](hello.md) - a simple container for demo purpose, allowing to experiment with singularity features -* [gwas.sif](gwas.md) - multiple tools (released as binaries/executables) for imputation and GWAS analysis -* [python3.sif](python3.md) - python3 environment with pre-installed modules and tools -* [r.sif](r.md) - R 4.0.5 environment with rareGWAMA, GenomicSEM, TwoSampleMR and GSMR packages installed (plus some standard R packages) - -All containers have a common set of linux tools like ``gzip``, ``tar``, ``parallel``, etc. -Please open an issue if you'd like to add more of such basic tools. -Please [let us know](https://github.com/comorment/containers/issues/new) if you face any problems. 
diff --git a/docs/specifications/README.md b/docs/specifications/README.md deleted file mode 100644 index 05501d94..00000000 --- a/docs/specifications/README.md +++ /dev/null @@ -1,16 +0,0 @@ -# Specifications - -These are the specifications of the input data format for GWAS anslysis, recommended in CoMorMent projects. -Current version: ``v0.9.1``. Further changes from this version will be documented. - -For details, please see these documents: - -* [Genotypes](geno_specification.md) -* [Phenotypes](pheno_specification.md) -* [Summary statistics](sumstats_specification.md) - - -## Change log - -* ``v0.9`` - first version of this document -* ``v0.9.1`` - specify case/control coding and rename COLUMN->FIELD in the dictionary file diff --git a/docs/specifications/geno_specification.md b/docs/specifications/geno_specification.md index 5969e7e9..fef9f05b 100644 --- a/docs/specifications/geno_specification.md +++ b/docs/specifications/geno_specification.md @@ -1,4 +1,4 @@ -# Genotypes +# Genotype data spec We expect imputed genotype data, which may be split into multiple *cohorts* at each site. For example, MoBa imputed genotype data is currently split into three cohorts, one per genotype array: GSA, OMNI and HCE. @@ -47,3 +47,7 @@ Currently we do not require ``IID`` values to be unique across cohorts. At of now, we only support the analysis for autosomes (chr 1..22). Support for other chromosomes will came later. We expect the same set of individuals across all autosomes (chr 1..22). 
+ +## Change log + +* ``v0.9`` - first version of this document diff --git a/docs/specifications/pheno_specification.md b/docs/specifications/pheno_specification.md index db8a7e8a..67646842 100644 --- a/docs/specifications/pheno_specification.md +++ b/docs/specifications/pheno_specification.md @@ -1,4 +1,4 @@ -# Phenotypes and covariates +# Phenotypes and covariates spec For phenotypes and covariates, we expect the data to be organized in a single delimiter-separated file (hereinafter referred to as *phenotype file*), with rows corresponding to individuals, and columns corresponding to relevant variables of interest or covariates. @@ -75,3 +75,8 @@ PC2,CONTINUOUS,2nd principal component PC3,CONTINUOUS,3rd principal component ... ``` + +## Change log + +* ``v0.9`` - first version of this document +* ``v0.9.1`` - specify case/control coding and rename COLUMN->FIELD in the dictionary file diff --git a/docs/specifications/sumstats_specification.md b/docs/specifications/sumstats_specification.md index 66c7d5d3..e6a5be06 100644 --- a/docs/specifications/sumstats_specification.md +++ b/docs/specifications/sumstats_specification.md @@ -1,4 +1,4 @@ -# Summary statistics +# Summary statistics spec The results of GWAS are represented as summary statistics, with the following columns: @@ -55,3 +55,7 @@ If you need these columns for ``regenie`` analysis consider also running ``plink | Z | ? | Z | Z | Z | OK | | FRQ | FRQ_A_NNN | FRQ | EAF | FRQ | keep "FRQ" which makes more sense for non-EUR populations | | missing | ? 
| missing | EAF_1KG | missing | not needed | + +## Change log + +* ``v0.9`` - first version of this document diff --git a/reference/examples/gsmr/.gitattributes b/reference/examples/gsmr/.gitattributes new file mode 100644 index 00000000..1a4ee854 --- /dev/null +++ b/reference/examples/gsmr/.gitattributes @@ -0,0 +1,2 @@ +reference/examples/gsmr/gsmr_example.recode.log filter=lfs diff=lfs merge=lfs -text +gsmr_example.recode.log filter=lfs diff=lfs merge=lfs -text diff --git a/sphinx-docs/source/docs/singularity/README.md b/sphinx-docs/source/docs/singularity/README.md deleted file mode 100644 index 21a8c5cd..00000000 --- a/sphinx-docs/source/docs/singularity/README.md +++ /dev/null @@ -1,2 +0,0 @@ -```{include} ../../../../docs/singularity/README.md -``` \ No newline at end of file diff --git a/sphinx-docs/source/docs/specifications/README.md b/sphinx-docs/source/docs/specifications/README.md deleted file mode 100644 index 093eef1c..00000000 --- a/sphinx-docs/source/docs/specifications/README.md +++ /dev/null @@ -1,2 +0,0 @@ -```{include} ../../../../docs/specifications/README.md -``` \ No newline at end of file diff --git a/sphinx-docs/source/index.rst b/sphinx-docs/source/index.rst index ebcdb076..7c2cefdd 100644 --- a/sphinx-docs/source/index.rst +++ b/sphinx-docs/source/index.rst @@ -3,6 +3,51 @@ You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. + Here is an overview of how we may want the doc's TOC to be + (As a starting point we may have only root-level listed in the docs, and everything else expanding upon user clicking on that node) + + Introduction + Getting started (hello.sif) + Full installation + documentation specific to each container (including containers shared in other repositories) + hello.sif + Software list + gwas.sif + Software list, including refs to documentation + python.sif + Software list + R.sif + Software list + For more docuemtation on LDpred2 see [scripts] folder. 
+ ldsc.sif
+ * reference overview
+ * usage example
+ HDL.sif (external repo)
+ MAGMA.sif
+ MiXeR.sif
+ ...
+ specification of data formats
+ geno
+ pheno
+ sumstats
+ reference data
+ For tool-specific reference, see [container documentation]
+ * opensnp dataset
+ * summary statistics (HEIGHT, L/R handedness, ...)
+ * ? 1kG files if needed
+ scripts (tools / toolkits / pipelines)
+ gwas
+ usage example (with data included in the repo, i.e. opensnp dataset)
+ pgs_toolkit
+ this supports usage from python
+ ldpred2
+ usecases / tutorials (UKB, MoBa, ADNI, ..)
+ can be README files, but also jupyter notebooks
+ API usage
+ pgs_toolkit
+ Contributing / dev instructions (wiki-like content)
+ Internal usage (p33/p697/Tryggve collaborators)
+
 Welcome to the CoMorMent-container's documentation!
 ===================================================
diff --git a/usecases/bolt_out/.gitattributes b/usecases/bolt_out/.gitattributes
new file mode 100644
index 00000000..caa29b73
--- /dev/null
+++ b/usecases/bolt_out/.gitattributes
@@ -0,0 +1,2 @@
+example_3chr.log filter=lfs diff=lfs merge=lfs -text
+myld.log filter=lfs diff=lfs merge=lfs -text
diff --git a/usecases/gwas_demo/.gitattributes b/usecases/gwas_demo/.gitattributes
new file mode 100644
index 00000000..4dca367d
--- /dev/null
+++ b/usecases/gwas_demo/.gitattributes
@@ -0,0 +1,18 @@
+run2_chr3.log filter=lfs diff=lfs merge=lfs -text
+run2_PHENO.regenie.log filter=lfs diff=lfs merge=lfs -text
+run1_CASE2.regenie.log filter=lfs diff=lfs merge=lfs -text
+run1_CASE.regenie.log filter=lfs diff=lfs merge=lfs -text
+run1_chr1.log filter=lfs diff=lfs merge=lfs -text
+run1.log filter=lfs diff=lfs merge=lfs -text
+run1.regenie.step1.log filter=lfs diff=lfs merge=lfs -text
+run1_CASE2.plink2.log filter=lfs diff=lfs merge=lfs -text
+run1_chr3.log filter=lfs diff=lfs merge=lfs -text
+run2_chr1.log filter=lfs diff=lfs merge=lfs -text
+run2_PHENO2.plink2.log filter=lfs diff=lfs merge=lfs -text
+run2_PHENO2.regenie.log filter=lfs diff=lfs merge=lfs -text
+run1_CASE.plink2.log filter=lfs diff=lfs merge=lfs -text
+run1_chr2.log filter=lfs diff=lfs merge=lfs -text
+run2_chr2.log filter=lfs diff=lfs merge=lfs -text
+run2.regenie.step1.log filter=lfs diff=lfs merge=lfs -text
+run2.log filter=lfs diff=lfs merge=lfs -text
+run2_PHENO.plink2.log filter=lfs diff=lfs merge=lfs -text
diff --git a/usecases/saige_out/.gitattributes b/usecases/saige_out/.gitattributes
new file mode 100644
index 00000000..30222b4c
--- /dev/null
+++ b/usecases/saige_out/.gitattributes
@@ -0,0 +1 @@
+out_vcf.log filter=lfs diff=lfs merge=lfs -text
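
Review note on the LFS-related changes in this diff: the CONTRIBUTING.md text says the tests need real ``*.sif`` files rather than git LFS pointer files, and the new ``.gitattributes`` entries put more files under LFS. The sketch below is one possible way to detect leftover pointer files before running the tests. It is not part of this commit; the helper names ``is_lfs_pointer`` and ``find_lfs_pointers`` are hypothetical, and the check relies only on the fact that a Git LFS pointer file starts with a ``version https://git-lfs.github.com/spec/v1`` line.

```python
from pathlib import Path

# First line of a Git LFS pointer file, as defined by the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` starts like a Git LFS pointer instead of real content.

    Hypothetical helper for illustration; not part of the repository's tests.
    """
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def find_lfs_pointers(folder, pattern="*.sif"):
    """List files under `folder` matching `pattern` that are still LFS pointers."""
    return sorted(p for p in Path(folder).rglob(pattern) if is_lfs_pointer(p))
```

Calling ``find_lfs_pointers("singularity")`` from the repository root should return an empty list once the container files have been fetched. After a clone made with ``GIT_LFS_SKIP_SMUDGE=1``, the LFS-tracked files are expected to show up in that list until ``git lfs pull`` is run.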