All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Updated the data type for the index pointer in the `LDMatrix` object to be `int64`. `int32` does not work well for very large datasets with millions of variants and it causes overflow errors.
- Updated the way we determine the `pandas` chunksize when converting from `plink` tables to `zarr`.
- Simplified the way we compute the quantization scale in `model_utils`.
- Fixed a major bug in how the LD window thresholds passed to `plink1.9` are computed.
- Fixed in-place `fillna` in `from_plink_table` in `LDMatrix` to conform to the latest `pandas` API.
- Updated `run_shell_script` to check for and capture errors.
- Refactored code to slightly reduce import/load times.
- Cleaned up `load_data` method of `LDMatrix` and subsumed functionality in `load_rows`.
- Fixed bugs in `match_snp_tables`.
- Fixed bugs and re-wrote how the `block` LD estimator is computed using both the `plink` and `xarray` backends.
- Updated `from_plink_table` method in `LDMatrix` to handle cases where boundaries are different from what `plink` computes.
- Fixed bug in `symmetrize_ut_csr_matrix` utility functions.
- Changed default storage data type for LD matrices to `int16`.
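
The `int64` index pointer change noted above can be pictured with a small standalone sketch. It is not magenpy code: the per-row counts are hypothetical, and it only shows why the running totals of a CSR-style index pointer stop fitting in a signed 32-bit integer once the number of stored entries passes roughly 2.1 billion.

```python
import ctypes
from itertools import accumulate

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold

# Hypothetical number of stored LD entries per row (illustrative values only).
entries_per_row = [1_500_000_000, 1_000_000_000, 500_000_000]

# A CSR index pointer is the running total of stored entries, starting at 0.
indptr = [0, *accumulate(entries_per_row)]

# Python integers are arbitrary precision, so these totals are exact.
needs_int64 = indptr[-1] > INT32_MAX

# Forcing the same totals into 32 bits shows the wraparound to negative offsets.
wrapped = [ctypes.c_int32(v).value for v in indptr]

print(indptr)
print(wrapped)
print(needs_int64)
```

Once an offset wraps to a negative number, slicing the data array by row no longer points at the right entries, which is the kind of failure a 64-bit index pointer avoids.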
- Added extra validation checks in `LDMatrix` to ensure that the index pointer is formatted correctly.
- `LDLinearOperator` class to allow for efficient linear algebra operations on the LD matrix without representing the full symmetric matrix in memory.
- Added utility methods to `LDMatrix` class to allow for computing eigenvalues, performing SVD, etc.
- Added `Spectral properties` to the attributes of LD matrices.
- Added support to slice/retrieve entries of LD matrix by using SNP rsIDs.
- Added support to reading LD matrices from AWS s3 storage.
- Added utility method to detect if a file contains header information.
- Added utility method to generate overlapping windows over a sequence.
- Added `compute_extremal_eigenvalues` to allow the user to compute extremal (minimum and maximum) eigenvalues of LD matrices.
- Added the utility function `combine_ld_matrices` to allow for combining LD matrices from different sources.
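
As a rough illustration of the "overlapping windows over a sequence" utility listed above, here is a minimal sketch. The function name `overlapping_windows` and its parameters are hypothetical and are not taken from magenpy; the actual utility may differ in signature and edge-case handling.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def overlapping_windows(seq: Sequence[T], window_size: int, step: int) -> Iterator[Sequence[T]]:
    """Yield slices of `seq` of length `window_size`, advancing by `step`.

    Consecutive windows overlap whenever step < window_size. The last
    window may be shorter if the sequence length does not line up.
    """
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    if len(seq) == 0:
        return

    start = 0
    while True:
        yield seq[start:start + window_size]
        # Stop once the current window has reached the end of the sequence.
        if start + window_size >= len(seq):
            break
        start += step


windows = list(overlapping_windows(list(range(7)), window_size=4, step=2))
print(windows)
```

With a step smaller than the window size, each element near a window boundary appears in two windows, which is useful when a computation needs context on both sides of a boundary.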
- Updated the logic for `detect_outliers` in phenotype transforms to actually reflect the function name (before it was returning true for inliers...).
- Updated `quantize` and `dequantize` to minimize data copying as much as possible.
- Updated `LDMatrix.load_rows()` method to minimize data copying.
- Fixed bug in `LDMatrix.n_neighbors` implementation.
- Updated `dask` version in `requirements.txt` to avoid installing `dask-expr`.
- Added `get_peak_memory_usage` to `system_utils` to inspect peak memory usage of a process.
- Placeholder method to perform QC on `SumstatsTable` objects (needs to be implemented still).
- New attached dataset for long-range LD regions.
- New method in SumstatsTable to impute rsID (if missing).
- Preliminary support for matching with CHR+POS in SumstatsTable (still needs more work).
- `LDMatrix` updates:
  - New method to filter long-range LD regions.
  - New method to prune LD matrix.
  - New algorithm for symmetrizing upper triangular and block diagonal LD matrices.
    - Much faster and more memory efficient than using `scipy`.
    - New `LDMatrix` class has efficient data loading in `.load_data` method.
    - We still retain `load_rows` because it is useful for loading a subset of rows.
- Fixed `manhattan` plot implementation to support various new features.
- Added a warning when accessing `csr_matrix` property of `LDMatrix` when it hasn't been loaded previously.
- `reset_mask` method for magenpy `LDMatrix`.
- `Dockerfile`s for both `cli` and `jupyter` modes.
- A helper script to convert LD matrices from old format to new format.
- Fixed bugs in how covariates are processed in `SampleTable`.
- Fixed bugs / issues in implementation of GWAS with `xarray` backend.
- Streamlined implementation of `manhattan` plotting function.
A large scale restructuring of the code base to improve efficiency and usability.
- Bug fixes across the entire code base.
- Simulator classes have been renamed from `GWASimulator` to `PhenotypeSimulator`.
- Moved plotting script to its own separate module.
- Updated some method names / commandline flags to be consistent throughout.
- Basic integration testing with `pytest` and GitHub workflows.
- Documentation for the entire package using `mkdocs`.
- Integration testing / automating building with GitHub workflows.
- New implementation of the LD matrix that uses CSR matrix data structures.
- Quantization / float precision specification when storing LD matrices.
- Allow user to specify Compressor / Compressor options for Zarr storage.
- New implementation of `magenpy_simulate` script.
  - Allow users to set random seed.
  - Now accept `--prop-causal` instead of specifying full mixing proportions.
- Tried to incorporate `genome_build` into various data structures. This will be useful in the future to ensure consistent genome builds across different data types.
- Allow user to pass various metadata to `magenpy_ld` to save information about dataset characteristics.
- New sumstats parsers:
  - Saige sumstats format.
  - plink1.9 sumstats format.
  - GWAS Catalog sumstats format.
- Chained transform function for transforming phenotypes.
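
To make the "quantization / float precision" entry above more concrete, here is a minimal sketch of symmetric quantization of a correlation value in [-1, 1] to a 16-bit integer and back. The helper names `quantize_r` and `dequantize_r` and the choice of scale are assumptions made for illustration; magenpy's own `quantize` and `dequantize` may compute the scale differently.

```python
INT16_MAX = 32767  # largest value of a signed 16-bit integer


def quantize_r(r: float) -> int:
    """Map a correlation in [-1, 1] to an integer in [-32767, 32767]."""
    clipped = max(-1.0, min(1.0, r))  # guard against values slightly out of range
    return round(clipped * INT16_MAX)


def dequantize_r(q: int) -> float:
    """Map a stored 16-bit integer back to an approximate correlation."""
    return q / INT16_MAX


r = 0.123456
q = quantize_r(r)
r_back = dequantize_r(q)

# The round trip is lossy, but the error is bounded by half a quantization step.
max_error = 0.5 / INT16_MAX
print(q, r_back, abs(r - r_back) <= max_error)
```

Under this scheme each entry takes 2 bytes instead of the 4 or 8 needed for a float, at the cost of an absolute error of at most about 1.5e-5 per entry.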
- Removed the `--fast-math` compiler flag due to concerns about numerical precision (e.g. Beware of fast-math).
- Updated implementation of `SumstatsParser` class to allow user to specify `read_csv_kwargs` at the point of instantiation.
- Updated plink executors to propagate the error messages to the user.
- Updated `merge_snp_tables` to allow for merges on columns other than `SNP`.
- Refactored, cleaned, and updated the implementation of the `AnnotationMatrix` class.
- Fixed bug in `GWADataLoader.split_by_samples()`: Need to perform `deepcopy`, otherwise splitting would not work properly.
- Updated `read_annotations` method in `GWADataLoader` to work with the latest `AnnotationMatrix` interfaces.
- Fixed bug in the `manhattan` plotting function.
- Added parsers for functional annotations and annotation files. Mainly support LDSC annotation format for now.
- Added a utility method to `GWADataLoader` called `align_with` to streamline aligning `GWADataLoader` objects across SNP and sample dimensions.
- Added utility methods for flattening the LD matrix in `LDMatrix`.
- Added a method to perform matrix-vector multiplication in `LDMatrix`.
- Added a method to perform block-wise iteration in the `LDMatrix` class.
- Fixed bug in implementation of `identify_mismatched_snps`.
- Fixed bugs in handling of missing information in LD matrix.
- Fixed bug in handling of covariates in `SampleTable`.
- Updated `README` file to remove line indicators `>>>` from sample code.
- Added the reference allele `A2` to the output of the `true_beta_table` in `GWASimulator`.
- Fixed a bug in the phenotype likelihood inference in `SampleTable`.
- Changed the implementation of the `merge_snp_tables` utility function to check for BOTH reference and alternative alleles.
- Modified implementation of `score` method of `GWADataLoader` to correct potential issues with the BETAS being for a subset of the chromosomes.
- A utility method to `GenotypeMatrix` called `estimate_memory_allocation`. This should allow the user to gauge the memory resources required to interact with the genotype files.
- Fixed bug in computing minor allele frequency with `plink`.
- Added the reference allele `A2` to the list of attributes of `LDMatrix`.
- Added `effect_sign` as a property of `SumstatsTable`.
- Fixed implementation of `merge_snp_tables` to detect allele differences that are not flips between `A1`/`A2`.
- Improved implementation of `.score` method of `xarrayGenotypeMatrix`.
- Added `tqdm` progress bars when processing multiple files/chromosomes in `GWADataLoader`.
- Added `min_maf` and `min_mac` flags in `magenpy_ld` and `magenpy_simulate`.
- Lowered default threshold for LD shrinkage to 1e-3.
- Bug fix in `SampleTable`.
- Utility function to compute the genomic control or lambda factor.
- A method to set the causal SNPs directly in `GWASimulator`.
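
For readers unfamiliar with the genomic control (lambda) factor mentioned above, the following sketch shows the standard definition: the median of the observed chi-squared statistics divided by the median of a chi-squared distribution with one degree of freedom. The function name is hypothetical and this is not magenpy's implementation.

```python
from statistics import NormalDist, median


def genomic_control_lambda(chi2_stats):
    """Median of observed chi-squared statistics over the expected median.

    The expected median of a 1-df chi-squared variable is the square of the
    75th percentile of the standard normal distribution (about 0.4549).
    """
    expected_median = NormalDist().inv_cdf(0.75) ** 2
    return median(chi2_stats) / expected_median


# Illustrative statistics whose median is roughly twice the expected value.
example_stats = [0.2, 0.91, 5.0]
print(genomic_control_lambda(example_stats))
```

A value close to 1 suggests the test statistics are well calibrated, while larger values indicate inflation, for example from population structure or polygenicity.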
- Fixed bugs in `manhattan` plotting function.
- Added alternative ways to derive Chi-Squared statistic from other summary stats (e.g. p-value).
- Give user more fine-grained control on what to reset in the `.simulate()` method of `GWASimulator`.
- Modified the LD score computation method to allow for aggregating LD scores by functional category or annotations.
- Streamlining module import structure to speed up loading.
- Fixed bugs in rechunking logic when computing LD matrices using `xarray`/`dask`.
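
The entry above about deriving the Chi-Squared statistic from other summary statistics can be sketched as follows, assuming a two-sided p-value from a 1-df test. The function names are hypothetical and the code is only a standalone illustration of the underlying relationships, not magenpy's implementation.

```python
from statistics import NormalDist


def chi2_from_pvalue(p_value: float) -> float:
    """Chi-squared (1 df) statistic implied by a two-sided p-value.

    Uses the lower tail p/2 of the standard normal; squaring removes the
    sign, so this equals the square of the upper-tail quantile.
    """
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(p_value / 2.0)
    return z * z


def chi2_from_beta_se(beta: float, se: float) -> float:
    """Chi-squared (1 df) statistic from an effect size and its standard error."""
    return (beta / se) ** 2


print(chi2_from_pvalue(0.05))
print(chi2_from_beta_se(0.02, 0.01))
```

Note that the p-value route loses the sign of the effect, so the effect direction has to come from another column when it is needed.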
- A new attached dataset of GWAS summary statistics for standing height from the fastGWA database.
- Updated the data harmonization method in `GWADataLoader` to ensure that all data sources have the same set of chromosomes.
- Bug fixes in `SumStatsTable`, `GWASimulator`, and `GWADataLoader`.
- New methods to split `GWADataLoader` objects by chromosome and by samples. The latter should come in handy for splitting the samples for training, validation and testing.
- Updated implementation of the shrinkage estimator of LD to align it more closely with the original formulation in Wen and Stephens (2010) and implementations in RSS software.
- Fixed various bugs and errors in the code.
- Added proper handling for the slice objects in `plot_ld_matrix`.
- Added classes encapsulating data structures and methods for:
  - Genotype matrices: `GenotypeMatrix`
  - Sample tables: `SampleTable`
  - Summary statistics table: `SumstatsTable`
- Added a new `stats` submodule that implements utilities and functions to compute various statistics, including `ld` (SNP correlation matrix), `h2` (heritability), `score`, `transforms`, `variant` statistics, and `gwa` (genome-wide association testing).
- Added a modular class for summary statistics parsers `SumstatsParser`.
- Added modular interfaces for `executors`, representing external software, such as `plink`.
- Added support for window size specifications using number of SNPs and distance in kilobases.
- Added `CHANGELOG.md` to track the latest changes and updates to `magenpy`.
- Refactored the `GWADataLoader` class to utilize the new data structures.
- Updated plotting functions/utilities.
- Updated documentation in README file.
- Updated implementation of `MulticohortGWASimulator` (still incomplete).
- Refactored the code for `pypi` package release.
- Added license, `.toml` file, and `MANIFEST.in`.
- Updated `setup.py` to prepare for the package release.
- Updated `README.md` to add basic documentation.