Name
archive
0_format_expression_data.ipynb
0_format_expression_data.py
1_decide_threshold.ipynb
1_decide_threshold.py
2_create_compendia.ipynb
2_create_compendia.py
3_validate_compendia.ipynb
3_validate_compendia.py
4_viz_compendia_pca.ipynb
4_viz_compendia_pca.py
MR_median_acc_expression_pa14_compendium_0threshold.svg
MR_median_acc_expression_pa14_compendium_25threshold.svg
MR_median_acc_expression_pao1_compendium_0threshold.svg
MR_median_acc_expression_pao1_compendium_25threshold.svg
Medium annotations.csv
PA14TableVF1.csv
PAO1TableVF1.csv
README.md
compendia_gene_function.svg
compendia_kegg.svg
compendia_media.svg
composition_of_compendia.ipynb
composition_of_compendia.py
dist_median_acc_expression_pa14_compendium_0threshold.svg
dist_median_acc_expression_pa14_compendium_25threshold.svg
dist_median_acc_expression_pao1_compendium_0threshold.svg
dist_median_acc_expression_pao1_compendium_25threshold.svg
gene_function_legend.tsv
media_legend.tsv
median_acc_expression.tsv
pa14_acc_gene_ids.tsv
pa14_compendium_pa14_ref_pca.svg
pa14_compendium_pao1_ref_pca.svg
pa_pa14_ref_pca.svg
pa_pao1_ref_pca.svg
pao1_acc_gene_ids.tsv
pao1_compendium_pa14_ref_pca.svg
pao1_compendium_pao1_ref_pca.svg
pathway_legend.tsv
prebinned_compendia_acc_expression.tsv

README.md

Processing

The raw data was quantified with Salmon using both PAO1 and PA14 references. For more information on the raw data, see Doing et al., with source code here. The datasets containing all strains aligned against the PAO1 reference and the PA14 reference are here.

To determine which samples are more PAO1-like versus PA14-like, we use the median expression of accessory genes. In our exploratory analysis, we found that samples labeled as PAO1 based on SRA annotations had high PAO1 accessory gene expression, whereas samples labeled as PA14 by SRA had high PA14 accessory gene expression.

See the plot below, which shows the median expression of PAO1-only genes (PAO1 accessory genes) on the x-axis and the median expression of PA14-only genes (PA14 accessory genes) on the y-axis. Each point is a strain or sample.

A sample is considered PAO1-like if the median expression of PA14 accessory genes is 0 and the median expression of PAO1 accessory genes is > 5. Similarly, a sample is considered PA14-like if the median expression of PA14 accessory genes is > 5 and the median expression of PAO1 accessory genes is 0.
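The rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook code: the function names, the `high_cutoff` parameter, and the toy gene IDs and values are all hypothetical, and a single dictionary stands in for a sample that in practice is quantified against each reference separately.

```python
from statistics import median


def median_accessory_expression(sample_expression, accessory_gene_ids):
    """Return the median expression of the accessory genes measured in one sample."""
    values = [
        sample_expression[gene_id]
        for gene_id in accessory_gene_ids
        if gene_id in sample_expression
    ]
    if not values:
        raise ValueError("none of the accessory genes were measured in this sample")
    return median(values)


def label_sample(pao1_acc_median, pa14_acc_median, high_cutoff=5.0):
    """Apply the rule stated above to one sample's two medians (hypothetical helper)."""
    if pa14_acc_median == 0 and pao1_acc_median > high_cutoff:
        return "PAO1-like"
    if pao1_acc_median == 0 and pa14_acc_median > high_cutoff:
        return "PA14-like"
    return "unassigned"


# Toy sample: made-up gene IDs and values, for illustration only.
toy_sample = {
    "pao1_only_a": 12.0,
    "pao1_only_b": 8.0,
    "pa14_only_a": 0.0,
    "pa14_only_b": 0.0,
}
pao1_median = median_accessory_expression(toy_sample, ["pao1_only_a", "pao1_only_b"])
pa14_median = median_accessory_expression(toy_sample, ["pa14_only_a", "pa14_only_b"])
toy_label = label_sample(pao1_median, pa14_median)
```

Samples that satisfy neither condition are returned as "unassigned" here. That label is a choice made for this sketch; the README does not name that case.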

A threshold of 25 MR normalized estimated counts is used, based on our analysis in 1_decide_threshold.ipynb. The goal of that notebook was to define a threshold for deciding whether a sample is PAO1-like or not (likewise, whether a sample is PA14-like or not). We used known labels from SRA to do this. Specifically, we examined the distribution of PAO1 samples versus non-PAO1 samples (see histogram plots) and chose a threshold that separates the two distributions. We use this threshold in 2_create_compendia.ipynb to partition the gene expression data into PAO1 and PA14 compendia, because we found that a threshold of 0 MR normalized estimated counts let in some samples with other SRA labels.
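One way to check a candidate threshold against the SRA labels is to count which labels pass it. The sketch below is only an illustration of that check: `labels_passing_threshold` is a hypothetical helper, and the sample values are invented rather than taken from the compendium.

```python
from collections import Counter


def labels_passing_threshold(samples, threshold):
    """Count SRA labels among samples whose median accessory expression exceeds threshold."""
    return Counter(label for label, value in samples if value > threshold)


# Toy (SRA label, median PAO1 accessory expression) pairs; values are invented.
toy_samples = [
    ("PAO1", 140.0),
    ("PAO1", 95.0),
    ("PAO1", 60.0),
    ("PAK", 4.0),
    ("Clinical", 11.0),
    ("PA14", 0.0),
]

for candidate in (0, 25):
    print(candidate, dict(labels_passing_threshold(toy_samples, candidate)))
```

With these toy values, a threshold of 0 lets the PAK and Clinical samples through, while a threshold of 25 keeps only the PAO1-labeled samples.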

Using a threshold of 0 MR normalized estimated counts, within the PAO1 binned compendium there are samples that SRA labeled as PAK or Clinical.

The same is seen for the PA14 compendium.

The distribution of median accessory gene expression shows that these non-PAO1 SRA-labeled samples (i.e. PAK, Clinical) have very low expression (brown) compared to the PAO1-labeled samples (light purple/pink). A similar trend is seen when comparing the non-PA14-labeled samples (brown) with the PA14 samples (dark purple).

Note: To generate the above figures, set the parameters same_threshold = 0 and opp_threshold = 25 in the 2_create_compendia notebook. You will also need to comment out the assertion statement: assert len(shared_pao1_pa14_binned_ids) == 0. Then run 3_validate_compendia.
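The sketch below suggests why the assertion has to be disabled for that setting. It assumes, without confirmation from the notebook, that a sample is binned as PAO1 when its PAO1 accessory median is above same_threshold and its PA14 accessory median is below opp_threshold, and symmetrically for PA14. The `bin_samples` function and the toy medians are hypothetical.

```python
def bin_samples(medians, same_threshold, opp_threshold):
    """Split samples into PAO1 and PA14 bins from their two accessory medians.

    medians maps sample id -> (median PAO1 accessory, median PA14 accessory).
    The binning rule is an assumption made for this sketch.
    """
    pao1_binned_ids = {
        sample_id
        for sample_id, (pao1_acc, pa14_acc) in medians.items()
        if pao1_acc > same_threshold and pa14_acc < opp_threshold
    }
    pa14_binned_ids = {
        sample_id
        for sample_id, (pao1_acc, pa14_acc) in medians.items()
        if pa14_acc > same_threshold and pao1_acc < opp_threshold
    }
    return pao1_binned_ids, pa14_binned_ids


# Invented medians for three toy samples.
toy_medians = {
    "sample_a": (120.0, 0.0),
    "sample_b": (0.0, 90.0),
    "sample_c": (6.0, 3.0),
}

# Both thresholds at 25: no sample can land in both bins.
pao1_ids, pa14_ids = bin_samples(toy_medians, same_threshold=25, opp_threshold=25)
shared_pao1_pa14_binned_ids = pao1_ids & pa14_ids
assert len(shared_pao1_pa14_binned_ids) == 0

# same_threshold = 0: a weakly expressed sample can satisfy both rules at once.
loose_pao1_ids, loose_pa14_ids = bin_samples(toy_medians, same_threshold=0, opp_threshold=25)
loose_shared_ids = loose_pao1_ids & loose_pa14_ids
```

Under this assumed rule, sample_c falls into both bins when same_threshold is 0, which is the situation the assertion guards against.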

Using a threshold of 25, we get the following plots, which correspond to the final compendia used in our analysis. As a check, our PAO1 compendium contains 890 samples and the PA14 compendium contains 505 samples. These numbers are close to the numbers of samples that SRA annotates as PAO1 and PA14.
