This repository contains the code and data used to prepare the figures and the manuscript.
A manuscript describing the results of this work has been posted on bioRxiv (Preprint) and is also available in the folder BioRxiv_manuscript.
Evolutionary studies were conducted using sequence data from the Ensembl and RefSeq databases.
The protein sequences can be retrieved from the link below:
Nucleotide sequences from Ensembl can be downloaded from the same link. For RefSeq entries, nucleotide sequences were retrieved using the NCBI Entrez Programming Utilities.
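As an illustration of the RefSeq retrieval step, the sketch below builds a request for the E-utilities `efetch` endpoint and downloads FASTA records. It is a minimal example using only the Python standard library, not necessarily the script used for this work; the function names `build_efetch_url` and `fetch_fasta` and the accession in the usage note are assumptions made for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Public NCBI E-utilities endpoint for fetching records.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accessions, db="nuccore"):
    """Return an efetch URL requesting FASTA records for the given accessions."""
    params = {
        "db": db,                      # nucleotide database
        "id": ",".join(accessions),    # comma-separated accession list
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_fasta(accessions, db="nuccore"):
    """Download the records as FASTA text (requires network access)."""
    with urlopen(build_efetch_url(accessions, db)) as response:
        return response.read().decode("utf-8")
```

For example, `fetch_fasta(["NM_000039"])` would request a single nucleotide record; the accession here is only a placeholder and is not taken from the dataset. For large batches, NCBI asks users to limit their request rate, so a short pause between calls is advisable.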
Contains the phylogenetic reconstruction of apoA-I obtained with IQ-TREE (APOA1_phylogeny.treefile) and the evolutionary rates inferred with HyPhy. The file evolution_dataset.csv contains all the data used for visualizations.
To compute the aggregation propensity of each protein sequence in our dataset, we used TANGO with default settings. The file aprs_dataset.csv contains all the aggregation-prone regions predicted for the apoA-I sequences.
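To make the notion of an aggregating region concrete, the sketch below extracts contiguous stretches of residues from a per-residue aggregation score profile. The cutoffs (a score above 5 % over at least 5 consecutive residues) follow a convention often applied to TANGO output; they are an assumption here and may differ from the criteria used to build aprs_dataset.csv. The function name `find_aprs` is hypothetical.

```python
def find_aprs(scores, threshold=5.0, min_length=5):
    """Return 1-based inclusive (start, end) ranges of residues whose score
    stays above `threshold` for at least `min_length` consecutive positions."""
    regions = []
    start = None  # first residue of the run currently being tracked
    for i, score in enumerate(scores, start=1):
        if score > threshold:
            if start is None:
                start = i
        elif start is not None:
            # The run ended at residue i - 1; keep it if long enough.
            if i - start >= min_length:
                regions.append((start, i - 1))
            start = None
    # Handle a run that extends to the last residue.
    if start is not None and len(scores) - start + 1 >= min_length:
        regions.append((start, len(scores)))
    return regions
```

Given the profile `[0, 0, 10, 20, 30, 40, 50, 0, 9, 9, 0]`, the function keeps residues 3 to 7 and discards the two-residue stretch at positions 9 to 10 because it is shorter than `min_length`.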
Gaussian network model fluctuations (apoa1_msf.csv) and weighted contact numbers (apoa1.wcn.csv) were computed with ProDy and with a custom script from clauswilke/proteinER, respectively. We used CamSol (structurally corrected protein solubility prediction) and ZipperDB (a database of fibril-forming protein segments) to assess how the apoA-I structure (link) contributes to its aggregation tendency (camsol_solubility.txt and zipperdb.csv).
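For readers unfamiliar with weighted contact numbers, the sketch below implements the usual definition, in which the WCN of residue i is the sum over all other residues j of 1 / r_ij^2, with r_ij the distance between the two residues. It is a plain-Python illustration, not the proteinER script itself; the choice of one coordinate per residue (for example the C-alpha atom or the side-chain centroid) and the function name `weighted_contact_numbers` are assumptions.

```python
def weighted_contact_numbers(coords):
    """Compute WCN_i = sum_{j != i} 1 / r_ij^2 for each residue.

    coords: list of (x, y, z) tuples, one representative point per residue.
    """
    wcn = []
    for i, (xi, yi, zi) in enumerate(coords):
        total = 0.0
        for j, (xj, yj, zj) in enumerate(coords):
            if i == j:
                continue  # a residue does not contribute to its own WCN
            dist_sq = (xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2
            total += 1.0 / dist_sq
        wcn.append(total)
    return wcn
```

Residues buried in a densely packed region accumulate many short distances and therefore receive a high WCN, whereas exposed or flexible residues receive a low one. The double loop is quadratic in the number of residues, which is acceptable for a single protein chain.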
We used FoldX to calculate the predicted thermodynamic destabilization caused by each possible amino acid substitution in the apoA-I sequence (foldx_dataset.csv), and automated this task with the MutateX pipeline.
~/mutatex/bin/mutatex apoa1-hdl.pdb --foldx-binary ~/foldx5Linux64.tar__0/foldx --rotabase rotabase.txt --np 4 --binding-energy --foldx-log --clean deep --compress
Pathogenicity scores (rhapsody_dataset.csv) were calculated with the Rhapsody server. ApoA-I natural variants were extracted from gnomAD. The file variants_dataset.csv contains all the data used for visualizations.
Code and datasets used for visualization, together with the .svg figure files.