First working version #5

Merged
merged 155 commits into from
Jan 7, 2021
Commits
029a578
init classes
murphycj2 Oct 15, 2020
8a11288
common args
murphycj2 Oct 19, 2020
d6cd47a
add some util functions
murphycj2 Oct 19, 2020
f7b6c54
very basic cli finished
murphycj2 Oct 19, 2020
da48a87
some cli error checking
murphycj2 Oct 19, 2020
a514d80
parse sample info
murphycj2 Oct 20, 2020
9801756
rename a arg
murphycj2 Oct 20, 2020
684870e
some commetns
murphycj2 Oct 20, 2020
2bf914b
add vcf option
murphycj2 Oct 20, 2020
3ab9168
parse vcf
murphycj2 Oct 20, 2020
3ba80a6
automated search for sample bam file
murphycj2 Oct 20, 2020
d5e2992
basic extraction module
murphycj2 Oct 20, 2020
b68ee17
add some options
murphycj2 Oct 20, 2020
0b4c00a
standardize sex nomeclature
murphycj2 Oct 21, 2020
9e538ed
some bug fixes to get extract to work
murphycj2 Oct 27, 2020
ab7031a
add fasta file as option
murphycj2 Oct 27, 2020
06b12df
add overwrite options
murphycj2 Oct 27, 2020
ae85b39
store more data in the extraction data
murphycj2 Oct 27, 2020
eb88569
pre compute minor allele freq
murphycj2 Oct 28, 2020
b7be49d
ignore ds store files
murphycj2 Oct 28, 2020
c5132e0
extract return samples
murphycj2 Oct 28, 2020
bdf0183
get genotyp info per pileup site
murphycj2 Oct 28, 2020
bd214c9
bug fix
murphycj2 Oct 28, 2020
20858fa
store metrics in sampel class
murphycj2 Oct 28, 2020
606f5b1
initial minor contamination computation
murphycj2 Oct 28, 2020
9d15ee1
basic major contamination
murphycj2 Oct 28, 2020
71bb270
cleaning up of minor contamination
murphycj2 Oct 28, 2020
c1b8c93
cleanup extract and add min coverage
murphycj2 Oct 28, 2020
d2fbfab
remove title file input, change patient to group
murphycj2 Nov 13, 2020
08a678f
make cli the entry point, some bug fixes for unittesting
murphycj2 Nov 17, 2020
2bdecc1
cleanup for unittesting
murphycj2 Nov 17, 2020
4246f7a
remove find title file alignment
murphycj2 Nov 17, 2020
d10411b
bug in extraction
murphycj2 Nov 17, 2020
8bc4669
add more tests
murphycj2 Nov 18, 2020
8693283
bug in variant position
murphycj2 Nov 18, 2020
ea5b370
bug in major contamination calc
murphycj2 Nov 18, 2020
8b360e1
contamination unittests
murphycj2 Nov 18, 2020
ab4e0bb
move some code from extract to sample
murphycj2 Nov 19, 2020
d68917d
update test with more data
murphycj2 Nov 20, 2020
bd7ccdf
improved handling of sample data load
murphycj2 Nov 20, 2020
796f9f8
basic discordance calculation
murphycj2 Nov 20, 2020
2f61f3a
fix variant position
murphycj2 Nov 20, 2020
12b76c3
update args
murphycj2 Nov 20, 2020
04d1301
function to laod samples from db
murphycj2 Nov 20, 2020
7299498
bug in saving pileup info to file
murphycj2 Nov 23, 2020
01e3a66
properly compare with existing samples in db
murphycj2 Nov 23, 2020
9d3fd67
store sample as dict not list
murphycj2 Nov 23, 2020
1cc1f0e
missed converting sample list to dict
murphycj2 Nov 23, 2020
8b9e4f8
cleanup
murphycj2 Nov 23, 2020
f8baf5e
get expected match/mismatch for genotyping
murphycj2 Nov 23, 2020
599f75e
add json input, modify input name
murphycj2 Nov 23, 2020
ef36634
modify input name
murphycj2 Nov 23, 2020
2dc6e14
check input for genotype comparison
murphycj2 Nov 23, 2020
88d274b
bug in genotype db comparison, function to write output files
murphycj2 Nov 23, 2020
3938d87
round the contamination level
murphycj2 Nov 24, 2020
20b6dbb
base contamination class to save output
murphycj2 Nov 24, 2020
1c0ca7a
remove base class, move to dataframe code
murphycj2 Nov 25, 2020
b43ea58
baisc plots for contamination
murphycj2 Nov 25, 2020
faa9430
cleanup
murphycj2 Nov 25, 2020
178aa75
improve major/minor contamination figures
murphycj2 Dec 3, 2020
85e486c
let user set thresholds via args
murphycj2 Dec 3, 2020
d6a9341
basic heatmap for genotype comparison
murphycj2 Dec 3, 2020
f3f1e49
change heatmap color
murphycj2 Dec 3, 2020
93dc05f
basic working version of sex mismatch
murphycj2 Dec 3, 2020
3a1d2f0
cleanup
murphycj2 Dec 10, 2020
acabc86
fix package install and requirments
murphycj2 Dec 10, 2020
fea64d2
Update requirements.txt
murphycj2 Dec 10, 2020
8522e48
bug in supplying args
murphycj2 Dec 10, 2020
774bcbb
update to args
murphycj2 Dec 10, 2020
3449236
bug, handle case no bed file
murphycj2 Dec 10, 2020
3227114
fix case no coverage in pileup
murphycj2 Dec 10, 2020
a4403d3
let user specify just sample name if already in db
murphycj2 Dec 10, 2020
a433131
misc
murphycj2 Dec 10, 2020
b48f70d
Update minor_contamination.py
murphycj2 Dec 10, 2020
ad6bab3
add multiprocessing
murphycj2 Dec 10, 2020
0bcd000
Update minor_contamination.py
murphycj2 Dec 10, 2020
f8f2575
dont run extraction automatically, separate it from the other tools
murphycj2 Dec 10, 2020
0f17325
bug loading samples
murphycj2 Dec 11, 2020
755caa5
fix minor contamination plot
murphycj2 Dec 11, 2020
f30466e
bug in minor cont plot
murphycj2 Dec 11, 2020
92a2b4e
better y axis range for contamination plots
murphycj2 Dec 11, 2020
817e9e6
more imformative heatmap plot
murphycj2 Dec 11, 2020
5a88f37
round discordance rate
murphycj2 Dec 11, 2020
5a1062e
initial docs
murphycj2 Dec 15, 2020
c1d0422
extract documentation
murphycj2 Dec 15, 2020
6d5104f
introduction documentaiton
murphycj2 Dec 15, 2020
0383733
don't plot if too many samples
murphycj2 Dec 15, 2020
5cf0ae2
cleanup figures
murphycj2 Dec 16, 2020
3ae1df5
test new pileup function
murphycj2 Dec 18, 2020
e76605b
misc
murphycj2 Dec 18, 2020
2b85650
bug in pileup
murphycj2 Dec 18, 2020
13395c0
bug in pileup
murphycj2 Dec 18, 2020
0c8c80b
cleanup
murphycj2 Dec 18, 2020
ae12b93
bug in pileup pos, cleanup
murphycj2 Dec 18, 2020
720273d
convert cols to int
murphycj2 Dec 18, 2020
454b797
Update extraction.md
murphycj2 Dec 21, 2020
c0dc859
Update introduction.md
murphycj2 Dec 21, 2020
89c3dc1
increase precision
murphycj2 Dec 21, 2020
0fd1788
properly handle coverage of 0
murphycj2 Dec 21, 2020
f4e11a9
bug getting coverage
murphycj2 Dec 21, 2020
1d9cd41
if not running extract, then only need sample name
murphycj2 Dec 21, 2020
f608768
remove pysamstats stuff, simplify min baseq
murphycj2 Dec 22, 2020
11003ba
bug in ignore overlap
murphycj2 Dec 22, 2020
7afed20
comments
murphycj2 Dec 22, 2020
c6b4c85
cleanup
murphycj2 Dec 22, 2020
0ecd7e6
count Ns
murphycj2 Dec 22, 2020
250ca46
disable rounding of metrics
murphycj2 Dec 23, 2020
e5f9650
test old method of minor contamination
murphycj2 Dec 23, 2020
4055659
test pileup
murphycj2 Dec 23, 2020
700a08e
cleanup extraction
murphycj2 Dec 24, 2020
21b948c
add parallelization to genotype
murphycj2 Dec 24, 2020
14a94cb
bugs
murphycj2 Dec 24, 2020
e883095
bug in loading samples
murphycj2 Dec 24, 2020
52e076e
cleanup
murphycj2 Dec 24, 2020
b8ca81f
option to set default genotype
murphycj2 Dec 24, 2020
eee55f2
bug fix
murphycj2 Dec 24, 2020
74d69eb
bug
murphycj2 Dec 24, 2020
e6e9efc
added comments
murphycj2 Jan 5, 2021
519da57
genotype docs
murphycj2 Jan 5, 2021
f4da649
misc
murphycj2 Jan 5, 2021
1593705
dont fix heatmap scale
murphycj2 Jan 5, 2021
328eb1c
zmin/zmax cli arguments
murphycj2 Jan 5, 2021
9711556
bug
murphycj2 Jan 5, 2021
150b256
resize heatmap with sample imbalance
murphycj2 Jan 6, 2021
ee97705
bug
murphycj2 Jan 6, 2021
409b1ef
Update genotype.py
murphycj2 Jan 6, 2021
c0b8c08
Update genotype.py
murphycj2 Jan 6, 2021
bd59341
improve heamap plot
murphycj2 Jan 6, 2021
304a4c1
cleanup
murphycj2 Jan 6, 2021
95dd6f8
update doc figures for genotype
murphycj2 Jan 6, 2021
f0a5ed5
cli flag bug for major and minor
murphycj2 Jan 6, 2021
880420c
reorder major statistics, round metrics for plots
murphycj2 Jan 6, 2021
07362ac
minor and major docs
murphycj2 Jan 6, 2021
323024c
Update introduction.md
murphycj2 Jan 6, 2021
daa9786
update docs
murphycj2 Jan 6, 2021
e4d6640
coverage threshold for sexmismatch,
murphycj2 Jan 6, 2021
70efa99
rename in sample class
murphycj2 Jan 6, 2021
8e4f196
update docs with new sample property names
murphycj2 Jan 6, 2021
da8c834
handle sex is unknown
murphycj2 Jan 6, 2021
d1a347f
linting
murphycj2 Jan 6, 2021
67e66eb
bugs in handling samples
murphycj2 Jan 6, 2021
4ba62b8
cleaner code
murphycj2 Jan 6, 2021
ad5e5cc
bug when region_counts is none
murphycj2 Jan 6, 2021
52a11e6
unit tests and test data
murphycj2 Jan 6, 2021
6d0ca2b
Update README.rst
murphycj2 Jan 6, 2021
9db7e22
Rename README.rst to README.md
murphycj2 Jan 6, 2021
3163ff9
Update README.md
murphycj2 Jan 6, 2021
fd4bcb3
Update README.md
murphycj2 Jan 6, 2021
1b77270
update repo in setup
murphycj2 Jan 7, 2021
99a361d
Merge branch 'first-working-version' of https://github.com/msk-access…
murphycj2 Jan 7, 2021
bb85ff2
fix cli bug
murphycj2 Jan 7, 2021
21db76f
fix setup bug
murphycj2 Jan 7, 2021
0c3c69b
v0.1.1, update other requirments
murphycj2 Jan 7, 2021
131af3b
Update setup.py
murphycj2 Jan 7, 2021
9a44f2e
update versions in other places
murphycj2 Jan 7, 2021
2 changes: 2 additions & 0 deletions .gitignore
@@ -2,6 +2,8 @@
__pycache__/
*.py[cod]
*$py.class
*.DS_Store
*.pk

# C extensions
*.so
2 changes: 1 addition & 1 deletion HISTORY.rst
@@ -2,7 +2,7 @@
History
=======

-0.1.0 (2019-12-06)
+0.1.1 (2021-01-07)
------------------

* First release on PyPI.
14 changes: 14 additions & 0 deletions README.md
@@ -0,0 +1,14 @@
# biometrics

Package to generate sample based biometrics

[![Build Status](https://travis-ci.com/msk-access/biometrics.svg?branch=master)](https://travis-ci.com/msk-access/biometrics) [![PyPi](https://img.shields.io/pypi/v/biometrics.svg?)](https://pypi.python.org/pypi/biometrics)

* Free software: Apache Software License 2.0
* Documentation: https://msk-access.gitbook.io/biometrics/

## Installation

From pypi:

`pip install biometrics`
37 changes: 0 additions & 37 deletions README.rst

This file was deleted.

2 changes: 1 addition & 1 deletion biometrics/__init__.py
@@ -2,4 +2,4 @@

__author__ = """Ronak Shah"""
__email__ = 'rons.shah@gmail.com'
-__version__ = '0.1.0'
+__version__ = '0.1.1'
312 changes: 311 additions & 1 deletion biometrics/biometrics.py
100644 → 100755
@@ -1 +1,311 @@
"""Main module."""
import os
import glob

import pandas as pd

from biometrics.sample import Sample
from biometrics.extract import Extract
from biometrics.genotype import Genotyper
from biometrics.minor_contamination import MinorContamination
from biometrics.major_contamination import MajorContamination
from biometrics.sex_mismatch import SexMismatch
from biometrics.utils import standardize_sex_nomenclature, exit_error


def write_to_file(args, data, basename):
    """
    Generic function to save output to a file.
    """

    outdir = os.path.abspath(args.outdir)

    outpath = os.path.join(outdir, basename + '.csv')
    data.to_csv(outpath, index=False)

    if args.json:
        outpath = os.path.join(outdir, basename + '.json')
        data.to_json(outpath)


def run_extract(args, samples):
    """
    Extract the pileup and region information from the samples. Then
    save to the database.
    """

    extractor = Extract(args=args)
    samples = extractor.extract(samples)

    return samples


def run_sexmismatch(args, samples):
    """
    Find any sex mismatches and save the output.
    """

    sex_mismatch = SexMismatch(args.coverage_threshold)

    results = sex_mismatch.detect_mismatch(samples)
    write_to_file(args, results, 'sex_mismatch')


def run_minor_contamination(args, samples):
    """
    Compute minor contamination and save the output and figure.
    """

    minor_contamination = MinorContamination(threshold=args.minor_threshold)
    samples = minor_contamination.estimate(samples)

    data = minor_contamination.to_dataframe(samples)
    write_to_file(args, data, 'minor_contamination')

    if args.plot:
        if len(samples) > 1000:
            print('WARNING - Turning off plotting functionality. You are trying to plot more than 1000 samples, which is too cumbersome.')
        else:
            minor_contamination.plot(data, args.outdir)

    return samples


def run_major_contamination(args, samples):
    """
    Compute major contamination and save the output and figure.
    """

    major_contamination = MajorContamination(threshold=args.major_threshold)
    samples = major_contamination.estimate(samples)

    data = major_contamination.to_dataframe(samples)
    write_to_file(args, data, 'major_contamination')

    if args.plot:
        if len(samples) > 1000:
            print('WARNING - Turning off plotting functionality. You are trying to plot more than 1000 samples, which is too cumbersome.')
        else:
            major_contamination.plot(data, args.outdir)

    return samples


def run_genotyping(args, samples):
    """
    Run the genotyper and save the output and figure.
    """

    genotyper = Genotyper(
        no_db_compare=args.no_db_compare,
        discordance_threshold=args.discordance_threshold,
        threads=args.threads,
        zmin=args.zmin,
        zmax=args.zmax)
    data = genotyper.genotype(samples)

    write_to_file(args, data, 'genotype_comparison')

    if args.plot:
        if len(samples) > 1000:
            print('WARNING - Turning off plotting functionality. You are trying to plot more than 1000 samples, which is too cumbersome.')
        else:
            genotyper.plot(data, args.outdir)

    return samples


def load_input_sample_from_db(sample_name, database):
    """
    Loads the given sample (that the user specified via the CLI) from
    the database.
    """

    extraction_file = os.path.join(database, sample_name + '.pk')

    if not os.path.exists(extraction_file):
        exit_error(
            'Could not find: {}. Please rerun the extraction step.'.format(
                extraction_file))

    sample = Sample(query_group=False)
    sample.load_from_file(extraction_file)

    return sample


def load_database_samples(database, existing_samples):
    """
    Loads any samples that are already present in the database AND
    which were not specified as input via the CLI.
    """

    samples = {}

    for pickle_file in glob.glob(os.path.join(database, '*pk')):

        sample_name = os.path.basename(pickle_file).replace('.pk', '')

        if sample_name in existing_samples:
            continue

        sample = Sample(db=database, query_group=True)
        sample.load_from_file(extraction_file=pickle_file)

        samples[sample.sample_name] = sample

    return samples


def get_samples_from_input(input, database, extraction_mode):
    """
    Parse the sample information from the user-supplied CSV file(s).
    """

    samples = {}

    for fpath in input:

        table = pd.read_csv(fpath, sep=',')

        # check if some required columns are present

        if 'sample_bam' not in table.columns:
            exit_error(
                'Input file does not have the \'sample_bam\' column.')

        if 'sample_name' not in table.columns:
            exit_error('Input does not have \'sample_name\' column.')

        # one dict per CSV row, keyed by column name
        rows = table.to_dict(orient='records')

        for row in rows:

            if not extraction_mode:
                # if not running extract tool, then just need to get
                # the sample name

                sample_name = row['sample_name']

                sample = load_input_sample_from_db(sample_name, database)
                samples[sample.sample_name] = sample

                continue

            # parse in the input; optional columns default to None

            sample = Sample(
                sample_name=row['sample_name'],
                sample_bam=row['sample_bam'],
                sample_group=row.get('sample_group'),
                sample_type=row.get('sample_type'),
                sample_sex=standardize_sex_nomenclature(row.get('sample_sex')),
                db=database)

            samples[sample.sample_name] = sample

    return samples


def get_samples_from_bam(args):
    """
    Parse the sample information the user supplied via the CLI.
    """

    samples = {}

    for i, sample_bam in enumerate(args.sample_bam):

        sample_sex = standardize_sex_nomenclature(
            args.sample_sex[i] if args.sample_sex is not None else None)
        sample_name = args.sample_name[i] if args.sample_name is not None else None
        sample_group = args.sample_group[i] \
            if args.sample_group is not None else None
        sample_type = args.sample_type[i] \
            if args.sample_type is not None else None

        sample = Sample(
            sample_bam=sample_bam, sample_group=sample_group,
            sample_name=sample_name, sample_type=sample_type,
            sample_sex=sample_sex, db=args.database)

        samples[sample.sample_name] = sample

    return samples


def get_samples_from_name(sample_names, database):
    """
    Load the samples the user named via the CLI from the database.
    """

    samples = {}

    for sample_name in sample_names:
        sample = load_input_sample_from_db(sample_name, database)
        samples[sample.sample_name] = sample

    return samples


def get_samples(args, extraction_mode=False):
    """
    Parse the sample information the user supplied via the CLI.
    """

    samples = {}

    if args.input:
        samples.update(get_samples_from_input(
            args.input, args.database, extraction_mode))

    if extraction_mode:
        if args.sample_bam:
            samples.update(get_samples_from_bam(args))
    else:
        if args.sample_name:
            samples.update(get_samples_from_name(
                args.sample_name, args.database))

        for sample_name in samples.keys():
            extraction_file = os.path.join(args.database, sample_name + '.pk')
            samples[sample_name].load_from_file(extraction_file)

        existing_samples = set(samples.keys())

        if not args.no_db_compare:
            samples.update(load_database_samples(
                args.database, existing_samples))

    return samples


def create_outdir(outdir):
    os.makedirs(outdir, exist_ok=True)


def run_biometrics(args):
    """
    Decide what tool to run based on CLI input.
    """

    extraction_mode = args.subparser_name == 'extract'

    samples = get_samples(args, extraction_mode=extraction_mode)

    # if not extraction_mode and args.plot:

    if extraction_mode:
        create_outdir(args.database)
        run_extract(args, samples)
    elif args.subparser_name == 'sexmismatch':
        create_outdir(args.outdir)
        run_sexmismatch(args, samples)
    elif args.subparser_name == 'minor':
        create_outdir(args.outdir)
        run_minor_contamination(args, samples)
    elif args.subparser_name == 'major':
        create_outdir(args.outdir)
        run_major_contamination(args, samples)
    elif args.subparser_name == 'genotype':
        create_outdir(args.outdir)
        run_genotyping(args, samples)
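
Note on the database lookup: `load_database_samples` relies on a naming convention in which each extracted sample is stored as `<sample_name>.pk` inside the database directory, and samples already supplied on the CLI are skipped. Below is a minimal sketch of that name resolution, assuming only the convention visible in this diff. `list_database_sample_names` is a hypothetical helper and not part of the package. It also swaps in a `*.pk` pattern and `os.path.splitext` in place of the `*pk` pattern and `replace('.pk', '')` used in the PR, so that only the trailing extension is stripped.

```python
import glob
import os
import tempfile


def list_database_sample_names(database, existing_samples=()):
    """Return the sample names stored as <name>.pk in `database`,
    skipping any name already present in `existing_samples`.

    Hypothetical helper for illustration; not part of biometrics.
    """
    names = []
    for pickle_file in sorted(glob.glob(os.path.join(database, '*.pk'))):
        # Strip only the trailing extension, so a file such as
        # 'run.pk1.pk' yields the sample name 'run.pk1'.
        sample_name, _ = os.path.splitext(os.path.basename(pickle_file))
        if sample_name in existing_samples:
            continue
        names.append(sample_name)
    return names


if __name__ == '__main__':
    # Build a throwaway database directory with two extractions
    # and one unrelated file.
    with tempfile.TemporaryDirectory() as database:
        for fname in ('sampleA.pk', 'sampleB.pk', 'notes.txt'):
            open(os.path.join(database, fname), 'w').close()
        print(list_database_sample_names(database, {'sampleA'}))
```

Called with the names already loaded from the CLI, the helper returns only the remaining database entries, which mirrors the `existing_samples` check in the PR.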