Processing

The raw data were quantified with Salmon using both the PAO1 and PA14 references. For more information on the raw data, see Doing et al., with source code here. The datasets containing all strains aligned against the PAO1 reference and the PA14 reference are here.

To determine which samples are more PAO1-like versus PA14-like, we use the median expression of accessory genes. In our exploratory analysis, we found that samples labeled as PAO1 in the SRA annotations had high PAO1 accessory gene expression, whereas samples labeled as PA14 by SRA had high PA14 accessory gene expression.

The plot below shows the median expression of PAO1-only genes (PAO1 accessory genes) on the x-axis and the median expression of PA14-only genes (PA14 accessory genes) on the y-axis. Each point is a strain or sample.

all_samples
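As a minimal sketch of how the per-sample median in this plot could be computed, the snippet below assumes an expression matrix with samples as rows and genes as columns. The function name, the gene names, and all values are hypothetical and are not taken from the notebooks.

```python
import pandas as pd


def median_accessory_expression(expression: pd.DataFrame, accessory_genes: list) -> pd.Series:
    """Return the per-sample median expression across the given accessory genes."""
    # Keep only the accessory genes that are actually present as columns
    present = [gene for gene in accessory_genes if gene in expression.columns]
    # axis=1 takes the median across genes, giving one value per sample (row)
    return expression[present].median(axis=1)


# Toy data (made up): rows are samples, columns are genes
toy_expression = pd.DataFrame(
    {
        "geneA": [100.0, 0.0],
        "geneB": [50.0, 0.0],
        "geneC": [80.0, 2.0],
        "core1": [10.0, 12.0],
    },
    index=["sample_1", "sample_2"],
)
toy_medians = median_accessory_expression(toy_expression, ["geneA", "geneB", "geneC"])
print(toy_medians)
```

Running the same function once with the PAO1 accessory genes and once with the PA14 accessory genes would give the x and y coordinates for each sample.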

A sample is considered PAO1-like if the median expression of PA14 accessory genes is 0 and the median expression of PAO1 accessory genes is > 5. Similarly, a sample is considered PA14-like if the median expression of PA14 accessory genes is > 5 and the median expression of PAO1 accessory genes is 0.
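The rule above can be written as a small function. This is only an illustrative sketch: the function name and the "unassigned" label are hypothetical, and the cutoff is left as a parameter rather than hard-coded.

```python
def classify_sample(pao1_median: float, pa14_median: float, cutoff: float) -> str:
    """Label a sample from its median PAO1 and PA14 accessory gene expression."""
    # PAO1-like: PAO1 accessory genes are expressed, PA14 accessory genes are silent
    if pao1_median > cutoff and pa14_median == 0:
        return "PAO1-like"
    # PA14-like: the mirror-image condition
    if pa14_median > cutoff and pao1_median == 0:
        return "PA14-like"
    # Anything else is left without a label (hypothetical name)
    return "unassigned"


print(classify_sample(40.0, 0.0, cutoff=5))
print(classify_sample(0.0, 12.0, cutoff=5))
print(classify_sample(3.0, 0.0, cutoff=5))
```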

A threshold of 25 MR normalized estimated counts is used, based on our analysis in 1_decide_threshold.ipynb. The goal of that notebook was to define a threshold for deciding whether a sample is PAO1-like or not (and likewise, whether a sample is PA14-like or not). We used known labels from SRA to do this. Specifically, we examined the distribution of PAO1 samples versus non-PAO1 samples (see histogram plots) and defined the threshold as a value that separates the two distributions. We use this threshold in 2_create_compendia.ipynb to partition the gene expression data into PAO1 and PA14 compendia, because we found that a threshold of 0 MR normalized estimated counts included some samples that SRA labels as other strains.
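One way to see the effect of a candidate threshold is to count, per SRA label, how many samples pass it. The sketch below is not the code from 1_decide_threshold.ipynb; the function name and the toy values are made up for illustration.

```python
import pandas as pd


def count_above_threshold(medians: pd.Series, sra_labels: pd.Series, threshold: float) -> pd.Series:
    """Count, per SRA label, the samples whose median accessory expression exceeds the threshold."""
    # Both Series are assumed to share the same sample index
    passing = medians > threshold
    return sra_labels[passing].value_counts()


# Toy data (made up): median PAO1 accessory expression and SRA labels per sample
toy_index = ["s1", "s2", "s3", "s4"]
toy_medians = pd.Series([120.0, 90.0, 3.0, 1.0], index=toy_index)
toy_labels = pd.Series(["PAO1", "PAO1", "PAK", "Clinical"], index=toy_index)

counts_at_0 = count_above_threshold(toy_medians, toy_labels, threshold=0)
counts_at_25 = count_above_threshold(toy_medians, toy_labels, threshold=25)
print(counts_at_0)
print(counts_at_25)
```

In this toy example the low-expressing PAK and Clinical samples pass a threshold of 0 but not a threshold of 25, which mirrors the behavior described above.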

Using a threshold of 0 MR normalized estimated counts, the PAO1 binned compendium contains samples that SRA labeled as PAK or Clinical.

pao1_compendium_0thresdhold

The same holds for the PA14 compendium.

pa14_compendium_0thresdhold

Looking at the distribution of median accessory gene expression, these non-PAO1 SRA-labeled samples (i.e., PAK, Clinical) have very low expression (brown) compared to the PAO1-labeled samples (light purple/pink). A similar trend is seen when comparing the non-PA14-labeled samples (brown) with the PA14 samples (dark purple).

pao1_dist_0thresdhold

pa14_dist_0thresdhold

Note: To generate the above figures, set the parameters same_threshold = 0 and opp_threshold = 25 in the 2_create_compendia notebook. You will also need to comment out the assertion statement assert len(shared_pao1_pa14_binned_ids) == 0. Then run 3_validate_compendia.
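To illustrate why that assertion can fail with these settings, here is a sketch under one plausible reading of the two parameters: same_threshold is a lower bound on a sample's own-strain accessory expression and opp_threshold is an upper bound on the other strain's. The comparison directions, the function name, and the toy values are assumptions and may differ from the notebook.

```python
import pandas as pd


def bin_samples(pao1_acc_median: pd.Series, pa14_acc_median: pd.Series,
                same_threshold: float, opp_threshold: float):
    """Return (pao1_ids, pa14_ids) as sets of sample ids."""
    # Assumed reading: own accessory expression above same_threshold,
    # opposite accessory expression below opp_threshold
    pao1_mask = (pao1_acc_median > same_threshold) & (pa14_acc_median < opp_threshold)
    pa14_mask = (pa14_acc_median > same_threshold) & (pao1_acc_median < opp_threshold)
    return set(pao1_acc_median[pao1_mask].index), set(pa14_acc_median[pa14_mask].index)


# Toy medians (made up) for three samples
toy_pao1 = pd.Series([100.0, 0.0, 10.0], index=["s1", "s2", "s3"])
toy_pa14 = pd.Series([0.0, 80.0, 10.0], index=["s1", "s2", "s3"])

# Loose setting: s3 has low expression of both gene sets and lands in both bins
loose_pao1, loose_pa14 = bin_samples(toy_pao1, toy_pa14, same_threshold=0, opp_threshold=25)
shared_loose = loose_pao1 & loose_pa14

# Stricter setting: no sample can satisfy both conditions at once
strict_pao1, strict_pa14 = bin_samples(toy_pao1, toy_pa14, same_threshold=25, opp_threshold=25)
shared_strict = strict_pao1 & strict_pa14
assert len(shared_strict) == 0
```

Under this reading, a sample can only fall into both bins when same_threshold is smaller than opp_threshold, which is the case for the settings in the note.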

Using a threshold of 25, we get the following plots, which correspond to the final compendia used in our analysis. As a check, our PAO1 compendium contains 890 samples and the PA14 compendium contains 505 samples. These numbers are close to the numbers of samples that SRA annotates as PAO1 and PA14.

pao1_compendium_25thresdhold

pa14_compendium_25thresdhold

pao1_dist_25thresdhold

pa14_dist_25thresdhold
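The comparison between compendium membership and SRA annotations described above could be tabulated as in the sketch below. The function name, the row and column names, and the toy labels are hypothetical; this is not the code from 3_validate_compendia.

```python
import pandas as pd


def compare_bins_to_sra(binned: pd.Series, sra_labels: pd.Series) -> pd.DataFrame:
    """Cross-tabulate the compendium each sample was binned into against its SRA label."""
    return pd.crosstab(binned, sra_labels, rownames=["binned_as"], colnames=["sra_label"])


# Toy data (made up): which compendium each sample landed in, and its SRA label
toy_index = ["s1", "s2", "s3", "s4"]
toy_binned = pd.Series(["PAO1", "PAO1", "PA14", "PA14"], index=toy_index)
toy_sra = pd.Series(["PAO1", "PAO1", "PA14", "Clinical"], index=toy_index)

toy_table = compare_bins_to_sra(toy_binned, toy_sra)
print(toy_table)
```

Off-diagonal counts in such a table point to samples whose binned compendium disagrees with the SRA label.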