This README describes the scripts used for the sequence analysis in:
Major antigenic site B of human influenza H3N2 viruses has an evolving local fitness landscape
This study aims to understand how the local fitness landscape of antigenic site B in human H3N2 HA has evolved over the past 50 years. This repository describes the analysis of the deep mutational scanning experiment, which focuses on HA1 residues 156, 158, 159, 190, 193, and 196 in six different genetic backgrounds: A/Hong Kong/1/1968 (HK68), A/Bangkok/1/1979 (Bk79), A/Beijing/353/1989 (Bei89), A/Moscow/10/1999 (Mos99), A/Brisbane/10/2007 (Bris07), and A/North Dakota/26/2016 (NDako16).
- All raw sequencing reads, which can be downloaded from the NIH SRA database (PRJNA563320), should be placed in the fastq/ folder. The filename for read 1 should match those described in ./data/SampleID.tsv. The filename for read 2 should be the same as that for read 1, except that "R1" is replaced by "R2".
- ./data/SampleID.tsv: Describes the sample identity for each fastq file
- ./Fasta/RefSeq.fa: Reference (wild type) nucleotide sequences for the sequencing data
- ./data/WTseq.tsv: Amino acids of the wild type sequences at residues 156, 158, 159, 190, 193, and 196
- ./Fasta/HumanH3N2_All_2018.aln.gz: Full-length HA protein sequences from human H3N2 downloaded from GISAID
- ./result/AllDKLInfo_2018.csv: KL distance computed by Armita Nourmohammad
- ./result/pairwiseparameters_nonlinearity.csv: Parameters for additive fitness effect and pairwise epistatic interaction from a nonlinear model, computed by Jakub Otwinowski (see ./modelepistasis.ipynb)
- ./result/pairwiseparameters_linearity.csv: Parameters for additive fitness effect and pairwise epistatic interaction from a linear model, computed by Jakub Otwinowski (see ./modelepistasis.ipynb)
- ./data/WTheatmap.tsv: A list of wild type residue pairs for heatmap plotting
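
The read 1 / read 2 naming convention described above can be sketched as a small helper. This is only an illustration and not part of the repository: the function name `read2_name` and the example filename are hypothetical, and the sketch assumes that "R1" occurs exactly once in each read 1 filename.

```python
def read2_name(read1_name: str) -> str:
    """Derive the read 2 filename from a read 1 filename by swapping "R1" for "R2".

    Assumption: the tag "R1" appears exactly once in the filename, so the
    replacement is unambiguous. Anything else is rejected rather than guessed.
    """
    if read1_name.count("R1") != 1:
        raise ValueError(f"expected exactly one 'R1' in {read1_name!r}")
    return read1_name.replace("R1", "R2")


# Hypothetical filename, used for illustration only.
print(read2_name("Sample_S1_L001_R1_001.fastq.gz"))
```

Rejecting filenames with zero or several "R1" tags avoids silently pairing a read 1 file with the wrong read 2 file.
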
- ./script/EpiB_fastq_to_fitness.py: Convert raw reads to variant counts and fitness measures
- Input files:
- Raw sequencing reads in fastq/ folder
- ./data/SampleInfo.tsv
- ./Fasta/RefSeq.fa
- Output files:
- result/EpiB_MultiMutLib_*.tsv
- ./script/EpiB_clean_mut.py: Filter mutants of interest
- Input files:
- result/EpiB_MultiMutLib_*.tsv
- Output files:
- result/EpiB_Index_*.tsv
- ./script/combine_data.jl: Re-calculate mutant fitness. Written by Jakub Otwinowski
- Input files:
- result/EpiB_Index_*.tsv
- Output files:
- ./script/EpiB_fit_to_pref.py: Preference normalization
- Input files:
- Output files:
- ./script/EpiB_PrefEvol.py: Extract the amino acid sequences of HA residues 156, 158, 159, 190, 193, and 196 from naturally occurring strains
- ./script/EpiB_AnalyzeParam.py: Classify the parameters for the additive fitness effect and pairwise epistatic effect into "positive" or "negative" based on the 95% confidence interval
- ./script/EpiB_seq_comparison.py: Compute the pairwise sequence identities among strains
- Input files:
- Output files:
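
As an illustration of what computing pairwise sequence identities among strains involves, here is a minimal sketch. It is not the code in ./script/EpiB_seq_comparison.py. It assumes the sequences are already aligned to equal length and that identity is the fraction of matching positions. The function names and the toy sequences are hypothetical.

```python
from itertools import combinations
from typing import Dict, Tuple


def sequence_identity(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions at which two sequences carry the same character."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


def pairwise_identities(seqs: Dict[str, str]) -> Dict[Tuple[str, str], float]:
    """Identity for every unordered pair of named sequences."""
    return {
        (name_a, name_b): sequence_identity(seqs[name_a], seqs[name_b])
        for name_a, name_b in combinations(seqs, 2)
    }


# Toy sequences for illustration only; these are not real HA sequences.
toy = {"strainA": "AAAA", "strainB": "AAAT", "strainC": "TTTT"}
for pair, identity in pairwise_identities(toy).items():
    print(pair, identity)
```

In this sketch gap characters are compared like any other character. Whether alignment gaps should count as mismatches or be skipped is a separate choice that the sketch does not settle.
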
- ./script/Plot_CompareRep.R: Compare mutant fitness (i.e. enrichment ratio) from replicates
- Input files:
- Output files:
- ./script/Plot_CompareLib.R: Compare mutant fitness from different genetic backgrounds
- Input files:
- Output files:
- ./script/EpiB_SeqLogGen.py: Generate sequence logos based on mutant preference
- Input files:
- Output files:
- result/seqlogo_*.fa
- graph/seqlogo_*.png
- ./script/EpiB_network.py: Plot fitness landscape (network graph)
- Input files:
- Output files:
- dot/EvoNetwork_*.dot
- dot/EvoNetwork_*.png
- ./script/Plot_TrackPref.R: Plot the normalized preference of naturally occurring sequences in different genetic backgrounds
- Input files:
- Output files:
- ./script/Plot_Inf_class_summary.R: Plot the distribution of "positive", "negative", and "mixed" parameters as pie charts
- Input files:
- Output files:
- ./script/Plot_Inf_heatmap_overall.R: Plot a heatmap summarizing the number of "positive" and "negative" parameters among all six genetic backgrounds of interest
- Input files:
- Output files:
- ./script/Plot_Inf_heatmap_specific.R: Plot a heatmap showing the number of "positive" and "negative" parameters in each of the six genetic backgrounds of interest
- Input files:
- Output files:
- ./script/Plot_seq_dist.R: Plot the relationship between the pairwise correlation of fitness landscapes and pairwise sequence identity
- Input files:
- Output files:
- ./script/Plot_TrackFreq.R: Plot the frequency of different haplotypes over time
- Input files:
- Output files:
- ./script/Plot_KLdist.R: Plot the relationship between KL distance and preference in different genetic backgrounds
- Input files:
- Output files:
- ./graph/KLdist_vs_pref*.png
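
The classification of parameters as "positive" or "negative" based on the 95% confidence interval, as described for ./script/EpiB_AnalyzeParam.py, can be illustrated with a short sketch. This is one plausible reading and not the script's actual logic: it assumes a parameter is "positive" when its whole interval lies above zero and "negative" when its whole interval lies below zero. The function name and the "unclassified" label for intervals that include zero are hypothetical.

```python
def classify_parameter(ci_lower: float, ci_upper: float) -> str:
    """Classify a parameter from the bounds of its 95% confidence interval.

    Assumption: the sign is only called when the interval excludes zero.
    """
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    if ci_lower > 0:
        return "positive"
    if ci_upper < 0:
        return "negative"
    # Hypothetical label: the interval includes zero, so no sign is assigned.
    return "unclassified"


# Illustrative interval bounds only; these are not values from the study.
for bounds in [(0.1, 0.5), (-0.5, -0.1), (-0.2, 0.3)]:
    print(bounds, classify_parameter(*bounds))
```
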