py-graph-imputation is the successor of GRIMM, written in Python and based on NetworkX.
pip install py-graph-imputation
For an example, get example-conf-data.zip
Unzip the folder so it appears as:
conf
|-- README.md
`-- minimal-configuration.json
data
|-- freqs
|   `-- CAU.freqs.gz
`-- subjects
    `-- donor.csv
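Before running the commands that follow, it can help to confirm that the unzipped layout matches the tree above. The following is a minimal sketch using only the standard library. The helper name `missing_inputs` is hypothetical and is not part of py-graph-imputation; the path list is copied from the tree above.

```python
from pathlib import Path

# Paths copied from the example layout shown above.
EXPECTED_INPUTS = (
    "conf/minimal-configuration.json",
    "data/freqs/CAU.freqs.gz",
    "data/subjects/donor.csv",
)


def missing_inputs(root="."):
    """Return the expected input files that are absent under `root`.

    Hypothetical helper for illustration; not part of py-graph-imputation.
    """
    base = Path(root)
    return [rel for rel in EXPECTED_INPUTS if not (base / rel).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing input files:", ", ".join(missing))
    else:
        print("All example inputs are in place.")
```

Run it from the directory where the archive was unzipped; an empty result means the three example inputs are where the configuration expects them.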
>>> from graph_generation.generate_hpf import produce_hpf
>>> produce_hpf(conf_file='conf/minimal-configuration.json')
****************************************************************************************************
Conversion to HPF file based on following configuration:
Population: ['CAU']
Frequency File Directory: data/freqs
Output File: output/hpf.csv
****************************************************************************************************
Reading Frequency File: data/freqs/CAU.freqs.gz
Writing hpf File: output/hpf.csv
This will produce the files which will be used for graph generation:
output
|-- hpf.csv # CSV file of Haplotype, Population, Freq
`-- pop_counts_file.txt # Size of each population
>>> from grim.grim import graph_freqs
>>> graph_freqs(conf_file='conf/minimal-configuration.json')
****************************************************************************************************
Performing graph generation based on following configuration:
Population: ['CAU']
Freq File: output/hpf.csv
Freq Trim Threshold: 1e-05
****************************************************************************************************
This will produce the following files:
output
`-- csv
    |-- edges.csv
    |-- info_node.csv
    |-- nodes.csv
    `-- top_links.csv
>>> from grim.grim import impute
>>> impute(conf_file='conf/minimal-configuration.json')
****************************************************************************************************
Performing imputation based on:
Population: ['CAU']
Priority: {'alpha': 0.4999999, 'eta': 0, 'beta': 1e-07, 'gamma': 1e-07, 'delta': 0.4999999}
UNK priority: SR
Epsilon: 0.001
Plan B: True
Number of Results: 10
Number of Population Results: 100
Nodes File: output/csv/nodes.csv
Top Links File: output/csv/edges.csv
Input File: data/subjects/donor.csv
Output UMUG Format: True
Output UMUG Freq Filename: output/don.umug
Output UMUG Pops Filename: output/don.umug.pops
Output Haplotype Format: True
Output HAP Freq Filename: output/don.pmug
Output HAP Pops Filename: output/don.pmug.pops
Output Miss Filename: output/don.miss
Output Problem Filename: output/don.problem
Factor Missing Data: 0.0001
Loci Map: {'A': 1, 'B': 2, 'C': 3, 'DQB1': 4, 'DRB1': 5}
Plan B Matrix: [[[1, 2, 3, 4, 5]], [[1, 2, 3], [4, 5]], [[1], [2, 3], [4, 5]], [[1, 2, 3], [4], [5]], [[1], [2, 3], [4], [5]], [[1], [2], [3], [4], [5]]]
Pops Count File: output/pop_counts_file.txt
Use Pops Count File: output/pop_counts_file.txt
Number of Options Threshold: 100000
Max Number of haplotypes in phase: 100
Save space mode: False
****************************************************************************************************
0 Subject: D1 8400 haplotypes
0 Subject: D1 6028 haplotypes
0.09946062499999186
This will produce files in the output directory as:
├── output
│   ├── don.miss # Cases that failed imputation (e.g. incorrect typing etc.)
│   ├── don.pmug # Phased imputation as PMUG GL String
│   ├── don.pmug.pops # Population for Phased Imputation
│   ├── don.problem # List of errors
│   ├── don.umug # Unphased imputation as UMUG GL String
│   └── don.umug.pops # Population for Unphased Imputation
How to develop on the project locally.
- Make sure the following prerequisites are installed:
  - git
  - python >= 3.8
  - build tools, e.g. make
- Clone the repository locally:
    git clone git@github.com:nmdp-bioinformatics/py-graph-imputation.git
    cd py-graph-imputation
- Create a virtual environment by running make venv:
    > make venv
    python3 -m venv venv --prompt py-graph-imputation-venv
    =====================================================================
    To activate the new virtual environment, execute the following from your shell
    source venv/bin/activate
- Source the virtual environment:
    source venv/bin/activate
- The development workflow is driven through the Makefile. Use make to list all targets:
    > make
    clean                remove all build, test, coverage and Python artifacts
    clean-build          remove build artifacts
    clean-pyc            remove Python file artifacts
    clean-test           remove test and coverage artifacts
    lint                 check style with flake8
    behave               run the behave tests, generate and serve report
    pytest               run tests quickly with the default Python
    test                 run all(BDD and unit) tests
    coverage             check code coverage quickly with the default Python
    dist                 builds source and wheel package
    docker-build         build a docker image for the service
    docker               build a docker image for the service
    install              install the package to the active Python's site-packages
    venv                 creates a Python3 virtualenv environment in venv
    activate             activate a virtual environment. Run `make venv` before activating.
- Install all the development dependencies. This installs packages from all requirements-*.txt files:
    make install
- Package module files go in the grim directory:
    grim
    |-- __init__.py
    |-- grim.py
    `-- imputation
        |-- __init__.py
        |-- cutils.pyx
        |-- cypher_plan_b.py
        |-- cypher_query.py
        |-- impute.py
        `-- networkx_graph.py
- Run all tests with make test, or run a subset with make behave or make pytest.
- Run make lint to run the linter and black formatter.
- Use python app.py to run the Flask service app in debug mode. The service will be available at http://localhost:8080/
- Use make docker-build to build a docker image using the current Dockerfile.
- make docker will build and run the docker image with the service. The service will be available at http://localhost:8080/
From the main directory of the repo run:
scripts/build-imputation-validation.sh
This will prepare and load frequency data into the graph and run imputation on a sample set of subjects.
The execution is driven by the configuration file: conf/minimal-configuration.json
It takes input from this file:
data/subjects/donor.csv
It generates an output directory with these contents:
output
├── don.miss
├── don.pmug
├── don.pmug.pops
├── don.problem
├── don.umug
└── don.umug.pops
The .problem file contains cases that failed due to serious errors (e.g., invalid HLA).
The .miss file contains cases where there was no output possible given the input, frequencies and configuration options.
The .pmug file contains the Phased Multi-locus Unambiguous Genotypes.
The .umug file contains the Un-phased Multi-locus Unambiguous Genotypes.
The format of both files is (csv):
- id
- genotype - in glstring format
- frequency
- rank
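The four-column layout above can be read with the standard csv module. This is a minimal sketch, assuming comma-separated rows with no header line and columns in the order listed. The `ImputedGenotype` type, the `read_imputation` and `split_phased` helpers, and the sample rows (including their frequency and rank values) are made up for illustration and are not actual output from the tool. Splitting relies on the GL String convention that `+` separates the two copies in a genotype and `~` joins alleles on one haplotype.

```python
import csv
import io
from typing import NamedTuple


class ImputedGenotype(NamedTuple):
    """One row of a .pmug or .umug file (hypothetical type for illustration)."""
    subject_id: str
    genotype: str  # GL String
    frequency: float
    rank: int


def read_imputation(handle):
    """Yield one ImputedGenotype per CSV row; assumes there is no header row."""
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        subject_id, genotype, frequency, rank = row
        yield ImputedGenotype(subject_id, genotype, float(frequency), int(rank))


def split_phased(glstring):
    """Split a phased GL String into haplotypes, each a list of alleles."""
    return [haplotype.split("~") for haplotype in glstring.split("+")]


# Made-up sample rows in the documented column order; not real tool output.
SAMPLE_PMUG = (
    "D1,A*01:01~B*08:01~DRB1*03:01+A*02:01~B*07:02~DRB1*15:01,0.75,1\n"
    "D1,A*01:01~B*07:02~DRB1*15:01+A*02:01~B*08:01~DRB1*03:01,0.25,2\n"
)

if __name__ == "__main__":
    for record in read_imputation(io.StringIO(SAMPLE_PMUG)):
        print(record.subject_id, record.rank, record.frequency)
        for haplotype in split_phased(record.genotype):
            print("   ", haplotype)
```

To read a real file, pass an open file handle (for example `open("output/don.pmug", newline="")`) instead of the in-memory sample. For .umug rows the genotype is unphased, so `split_phased` does not apply; the row parsing is the same.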
The .pmug.pops and .umug.pops files contain the corresponding population assignments.
The format of the .pops files is (csv):
- id
- pop1
- pop2
- frequency
- rank
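The .pops layout above can be grouped by subject in a few lines. This is a minimal sketch, assuming comma-separated rows with no header line and columns in the order listed. The `PopAssignment` type, the `read_pops` helper, and the sample rows are hypothetical and are not actual output from the tool.

```python
import csv
import io
from collections import defaultdict
from typing import NamedTuple


class PopAssignment(NamedTuple):
    """One population pair for a subject (hypothetical type for illustration)."""
    pop1: str
    pop2: str
    frequency: float
    rank: int


def read_pops(handle):
    """Group .pops rows by subject id; assumes there is no header row."""
    by_subject = defaultdict(list)
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        subject_id, pop1, pop2, frequency, rank = row
        by_subject[subject_id].append(
            PopAssignment(pop1, pop2, float(frequency), int(rank))
        )
    return dict(by_subject)


# Made-up sample rows in the documented column order; not real tool output.
SAMPLE_POPS = "D1,CAU,CAU,1.0,1\nD2,CAU,CAU,1.0,1\n"

if __name__ == "__main__":
    for subject_id, assignments in read_pops(io.StringIO(SAMPLE_POPS)).items():
        for a in assignments:
            print(subject_id, a.pop1, a.pop2, a.frequency, a.rank)
```

Keeping the rows grouped by id makes it straightforward to look up the population pairs reported for a subject alongside that subject's rows in the matching .pmug or .umug file.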