
Initial benchmark analysis #22

Merged: 17 commits into master, Sep 21, 2020
Conversation

@jashapiro (Member) commented Sep 9, 2020:

The benchmark analysis notebook here currently contains investigation of the count patterns seen in the cDNA vs. full transcriptome reference indexes, in service of the discussion in #4. I expect to add analysis of the decoy sequences soon (some parts of which can be seen here), but I thought it was worth getting this into a PR in its current state.

I initially tried to use some S3 libraries within R, but those did not seem to work the way I wanted, so I resorted to simply downloading the result files. To make that easier, I launch the docker image with the command described in README.md to import the local AWS tokens for access to S3 (the docker image includes the aws-cli tools). To run this notebook therefore requires that you already have run aws setup on your local system with credentials that allow read access to s3://nextflow-ccdl-results.

@jashapiro (Member, Author) commented:

I have added some analysis of the decoy indexes to this notebook, which were more interesting than I expected! I am going to request @envest for review to start, but I think @cgreene and @jaclyn-taroni might want to have a look at the results as well.

Most intriguing results are right at the end, where I compare the full transcriptome to coding-only* indexes and notice that you get much more concordant results between the two when you include decoy sequences. Or I did something funny, which is always possible.

\* "Coding only" includes pseudogenes, for some reason. Ask Ensembl.

The easiest way to view the notebook is probably http://htmlpreview.github.io/?https://github.com/AlexsLemonade/alsf-scpca/blob/jashapiro/benchmark-analysis/workflows/benchmarks/benchmark-analysis.nb.html#Comparing_decoy_indexes

@jashapiro jashapiro requested a review from envest September 10, 2020 18:58
@envest (Contributor) left a comment:
Overall I agree with the "go ahead with the full decoy" conclusion. For deeper understanding of the differences due to different indices, I think it may be worth including some additional samples, especially from another data contributor.

(Not taking sides on indexes vs. indices...) 😝

### Get Annotations from AnnotationHub

```{r}
hub = AnnotationHub::AnnotationHub(ask = FALSE)
```
Contributor:

Notebook output had this warning:

```
Warning: call dbDisconnect() when finished working with a connection
snapshotDate(): 2020-04-27
```

Member Author:

I've seen that occasionally, but I can't pin it down, so I am ignoring it. I don't love AnnotationHub (not the speediest), but it does seem to be nicely reliable.

ENSG00000197563: PIGN Phosphatidylinositol Glycan Anchor Biosynthesis Class N
ENSG00000173559: NABP1 Nucleic Acid Binding Protein 1

I don't really know what to do with this list... it seems like few genes will have large differences, but a few differences are enormous!
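The shape of that list can be sketched with toy numbers (all gene IDs and values below are made up, not the notebook's data): most genes agree closely between the two indexes, while a handful diverge by orders of magnitude.

```{r}
# Toy sketch (made-up values): per-gene mean counts from two indexes
counts <- data.frame(
  gene_id = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"),
  cdna    = c(100, 52, 3, 1200),
  txome   = c(98, 55, 900, 2)
)
# Pseudocount of 1 avoids division by zero for unexpressed genes
counts$log2_ratio <- log2((counts$txome + 1) / (counts$cdna + 1))
# Flag genes differing by more than 4-fold in either direction
big_diff <- counts[abs(counts$log2_ratio) > 2, ]
```

Here `ENSG_C` and `ENSG_D` would be flagged while the others pass quietly, which is the "few genes, enormous differences" pattern.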
Contributor:

Hmm... If some ENGSs really are problematic, I would expect that the problem is consistent across samples within and across data providers. Worth checking out some more samples?

Member Author:

Yeah, some of these seem consistent between the two samples, but I have no idea more generally.

```{r}
  geom_point()
```

Interestingly, the correlation between cDNA and txome expression values (for coding genes) using the full decoy seems substantially better than with the no decoy sequences!
Contributor:

Does this higher correlation hold for sample 834?

Member Author:

It does!
No decoy: [scatter plot omitted]
Full decoy: [scatter plot omitted]

```{r}
library(tximport)

# set seed
set.seed(2020)
```
Contributor:

🔥

```{r}
  geom_point()
```
Correlation is very good for genes present in both, with expression more often lower in the transcriptome set, which may make sense in the case of multimapping introduced by the larger number of potential targets.
Interesting though that there are some genes which do not appear to be expressed in the cDNA set that have expression in the txome.
Contributor:

Hmmm what's going on here?

Member Author:

Pseudomapping is a bit of a black box?
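One way to at least enumerate those genes for a closer look would be something along these lines (a toy sketch with made-up object names and values, not the notebook's actual data):

```{r}
# Toy sketch (made-up values): per-gene mean expression from each index
gene_means <- data.frame(
  gene_id    = c("ENSG_A", "ENSG_B", "ENSG_C"),
  cdna_mean  = c(5.2, 0.0, 0.0),
  txome_mean = c(5.0, 3.7, 0.0)
)
# Genes with zero cDNA expression but nonzero txome expression
txome_only <- subset(gene_means, cdna_mean == 0 & txome_mean > 0)
```

In this toy case only `ENSG_B` is flagged; on the real data, the flagged set would be the candidates for checking against annotation (pseudogenes, non-coding transcripts, etc.).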

Co-authored-by: Steven Foltz <stevenmasonfoltz@gmail.com>
@jaclyn-taroni jaclyn-taroni self-requested a review September 17, 2020 19:18
@jaclyn-taroni (Member) left a comment:

This looks good to me in general! I had a couple documentation comments that should be addressed before merging, but no need for me to re-review before merging.

I do agree with @envest's sentiment:

For deeper understanding of the differences due to different indices, I think it may be worth including some additional samples, especially from another data contributor.

@@ -0,0 +1,15 @@
## Docker

Running the benchmark analysis notebook should be done via docker, with the following command to mount the local directory as the home direcory and pass in the current user's AWS credentials and launch RStudio.
Member:

Suggested change
Running the benchmark analysis notebook should be done via docker, with the following command to mount the local directory as the home direcory and pass in the current user's AWS credentials and launch RStudio.
Running the benchmark analysis notebook should be done via docker, with the following command to mount the local directory as the home directory and pass in the current user's AWS credentials and launch RStudio.

Member:

Can you add a brief reference to .aws like you have in workflows/images/scpca-r/README.md and/or to running aws setup ahead of time? Quoting you from your initial comment filing the PR:

To run this notebook therefore requires that you already have run aws setup on your local system with credentials that allow read access to s3://nextflow-ccdl-results

Comment on lines 110 to 126
Combine all cell QC stats into a single data frame
```{r}
sce_cell_qc <- purrr::map_df(sces,
~ as.data.frame(colData(.x)) %>%
tibble::rownames_to_column(var = "cell_id"),
.id = "quant_id") %>%
dplyr::left_join(quant_info, by = c("quant_id" = "quant_dir"))
```


Combine all feature QC stats into a single data frame.
```{r}
sce_feature_qc <- purrr::map_df(sces,
~ as.data.frame(rowData(.x)) %>%
tibble::rownames_to_column(var = "gene_id"),
.id = "quant_id") %>%
dplyr::left_join(quant_info, by = c("quant_id" = "quant_dir"))
```

Member:

Very minor quibble - the spacing on these chunks seems odd to me.

Comment on lines +49 to +55
```{r}
quant_info <- data.frame(quant_dir = quant_dirs, info = quant_dirs) %>%
  tidyr::separate(info, sep = "[-]",
                  into = c("sample", "index_type")) %>%
  tidyr::separate(index_type,
                  into = c("index_content", "kmer", "decoy"),
                  extra = "drop") %>%
  dplyr::mutate(kmer = stringr::str_remove(kmer, "k"))
```
Member:

Nifty!
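To illustrate what that parsing yields, here is a sketch with made-up directory names (the real quant dirs on S3 may be formatted differently):

```{r}
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical quant directory names, purely for illustration
quant_dirs <- c("834-cdna_k31_full", "905-txome_k31_no")

quant_info <- data.frame(quant_dir = quant_dirs, info = quant_dirs) %>%
  # split sample from index description on the hyphen
  tidyr::separate(info, sep = "[-]",
                  into = c("sample", "index_type")) %>%
  # split the index description on non-alphanumeric characters (here "_")
  tidyr::separate(index_type,
                  into = c("index_content", "kmer", "decoy"),
                  extra = "drop") %>%
  # strip the leading "k" so kmer is a bare number
  dplyr::mutate(kmer = stringr::str_remove(kmer, "k"))
# yields columns: sample = "834"/"905", index_content = "cdna"/"txome",
# kmer = "31", decoy = "full"/"no"
```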

And we have them... interestingly, the best correlation is between no decoys and the partial decoy: the full decoy index is more different by this measure.
This leads me to the preliminary conclusion that the partial index (which notably uses a slightly different set of transcripts at this stage, due to not being built locally) may not be worth pursuing, as it seems to make very little difference.

Don't forget to look at your data!
Member:

😂 💯


Interestingly, the correlation between cDNA and txome expression values (for coding genes) using the full decoy seems substantially better than with the no decoy sequences!

This seems to me to be a good argument for using the full decoy, if it is less sensitive to the chosen transcript list.
Member:

Can we stick a sessionInfo() here? I know this should always be run in a container, but this practice has been helpful for tracking down potential weirdness in the past.
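For reference, that would just be a final chunk along these lines:

```{r}
# Record R version and loaded package versions for reproducibility
sessionInfo()
```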

@jashapiro:

I do agree with @envest's sentiment:

For deeper understanding of the differences due to different indices, I think it may be worth including some additional samples, especially from another data contributor.

When we have some, we can repeat?

@jaclyn-taroni:

👍 works for me! Are the samples you've included so far the same cancer type? I was under the (perhaps mistaken) impression that more than one cancer type is part of this particular project, and I'm curious whether we'd see differences or less consistency across cancer types, or whether it would be "weird transcript is weird" uniformly. This is a musing beyond the scope of this pull request.

@jashapiro:

At this point, all the same type. We were supposed to get the other type, but there seems to be a delay...

@jashapiro jashapiro merged commit 7c3cde0 into master Sep 21, 2020
@jashapiro jashapiro deleted the jashapiro/benchmark-analysis branch October 22, 2021 18:46