Initial benchmark analysis #22
I have added some analysis of the decoy indexes to this notebook, which were more interesting than I expected! I am going to request @envest for review to start, but I think @cgreene and @jaclyn-taroni might want to have a look at the results as well. Most intriguing results are right at the end, where I compare the full transcriptome to coding-only* indexes and notice that you get much more concordant results between the two when you include decoy sequences. Or I did something funny, which is always possible.
The easiest way to view the notebook is probably http://htmlpreview.github.io/?https://github.com/AlexsLemonade/alsf-scpca/blob/jashapiro/benchmark-analysis/workflows/benchmarks/benchmark-analysis.nb.html#Comparing_decoy_indexes
Overall I agree with the "go ahead with the full decoy" conclusion. For deeper understanding of the differences due to different indices, I think it may be worth including some additional samples, especially from another data contributor.
(Not taking sides on indexes vs. indices...) 😝
### Get Annotations from AnnotationHub

```{r}
hub = AnnotationHub::AnnotationHub(ask = FALSE)
Notebook output had this warning:
Warning: call dbDisconnect() when finished working with a connection
snapshotDate(): 2020-04-27
I've seen that occasionally, but I can't pin it down, so I am ignoring it. I don't love AnnotationHub (not the speediest), but it does seem to be nicely reliable.
ENSG00000197563: PIGN Phosphatidylinositol Glycan Anchor Biosynthesis Class N
ENSG00000173559: NABP1 Nucleic Acid Binding Protein 1

I don't really know what to do with this list... it seems like few genes will have large differences, but a few differences are enormous!
Hmm... If some ENSGs really are problematic, I would expect that the problem is consistent across samples within and across data providers. Worth checking out some more samples?
Yeah, some of these seem consistent between the two samples, but I have no idea more generally.
geom_point()
```

Interestingly, the correlation between cDNA and txome expression values (for coding genes) using the full decoy seems substantially better than with the no decoy sequences!
Does this higher correlation hold for sample 834?
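One way to check this per sample would be to compute the correlation separately for each sample and decoy setting. The sketch below is only an illustration: the `expr` table, its column names, the sample labels, and the `cor_by_group()` helper are all hypothetical, and the numbers are invented toy values that say nothing about the real results.

```r
# Toy long-format table: one row per gene per sample per decoy setting.
# All values are invented for illustration only.
expr <- data.frame(
  sample = rep(c("sample_a", "sample_b"), each = 8),
  decoy  = rep(rep(c("no", "full"), each = 4), times = 2),
  cdna   = rep(c(1, 2, 3, 4), times = 4),
  txome  = c(1, 3, 2, 4,  1, 2, 3, 4,  2, 1, 3, 4,  1, 2, 4, 3)
)

# Spearman correlation between cDNA and txome values within each
# sample/decoy combination (hypothetical helper, not from the notebook)
cor_by_group <- function(df) {
  groups <- split(df, list(df$sample, df$decoy), drop = TRUE)
  data.frame(
    group = names(groups),
    rho = vapply(
      groups,
      function(g) cor(g$cdna, g$txome, method = "spearman"),
      numeric(1)
    ),
    row.names = NULL
  )
}

cor_table <- cor_by_group(expr)
print(cor_table)
```

With the real data, the same grouping would show whether the full decoy index gives the higher cDNA/txome correlation in both samples or only in one.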
library(tximport)

# set seed
set.seed(2020)
🔥
geom_point()
```
Correlation is very good for genes present in both, with expression more often lower in the transcriptome set, which may make sense in the case of multimapping introduced by the larger number of potential targets.
Interesting though that there are some genes which do not appear to be expressed in the cDNA set that have expression in the txome.
Hmmm what's going on here?
Pseudomapping is a bit of a black box?
Co-authored-by: Steven Foltz <stevenmasonfoltz@gmail.com>
This looks good to me in general! I had a couple documentation comments that should be addressed before merging, but no need for me to re-review before merging.
I do agree with @envest's sentiment:
For deeper understanding of the differences due to different indices, I think it may be worth including some additional samples, especially from another data contributor.
@@ -0,0 +1,15 @@
## Docker

Running the benchmark analysis notebook should be done via docker, with the following command to mount the local directory as the home direcory and pass in the current user's AWS credentials and launch RStudio.
- Running the benchmark analysis notebook should be done via docker, with the following command to mount the local directory as the home direcory and pass in the current user's AWS credentials and launch RStudio.
+ Running the benchmark analysis notebook should be done via docker, with the following command to mount the local directory as the home directory and pass in the current user's AWS credentials and launch RStudio.
Can you add a brief reference to `.aws` like you have in `workflows/images/scpca-r/README.md` and/or to running `aws setup` ahead of time? Quoting you from your initial comment filing the PR:
> To run this notebook therefore requires that you already have run `aws setup` on your local system with credentials that allow read access to `s3://nextflow-ccdl-results`
Combine all cell QC stats into a single data frame
```{r}
sce_cell_qc <- purrr::map_df(sces,
~ as.data.frame(colData(.x)) %>%
tibble::rownames_to_column(var = "cell_id"),
.id = "quant_id") %>%
dplyr::left_join(quant_info, by = c("quant_id" = "quant_dir"))
```


Combine all feature QC stats into a single data frame.
```{r}
sce_feature_qc <- purrr::map_df(sces,
~ as.data.frame(rowData(.x)) %>%
tibble::rownames_to_column(var = "gene_id"),
.id = "quant_id") %>%
dplyr::left_join(quant_info, by = c("quant_id" = "quant_dir"))
Very minor quibble - the spacing on these chunks seems odd to me.
quant_info <- data.frame (quant_dir = quant_dirs, info = quant_dirs) %>%
tidyr::separate(info, sep = "[-]",
into = c("sample", "index_type")) %>%
tidyr::separate(index_type,
into = c("index_content", "kmer", "decoy"),
extra = "drop") %>%
dplyr::mutate(kmer = stringr::str_remove(kmer, "k"))
Nifty!
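For anyone reading along, here is a rough base-R sketch of what that chunk appears to do, without the tidyverse dependencies. The directory names are hypothetical and only assume a `<sample>-<content>_<kmer>_<decoy>...` layout; `parse_quant_dir()` is an illustrative helper, not something in the notebook. One difference to be aware of: this version splits on the first `-` only, whereas the `tidyr::separate()` call splits on every `-`.

```r
# Hypothetical directory names in the assumed
# <sample>-<content>_<kmer>_<decoy>... form
quant_dirs <- c("sample01-cdna_k31_no_sa", "sample01-txome_k31_full_sa")

parse_quant_dir <- function(quant_dir) {
  # Split at the first "-" into the sample id and the index description
  parts <- regmatches(quant_dir, regexpr("-", quant_dir), invert = TRUE)[[1]]
  # Split the index description on runs of non-alphanumeric characters
  fields <- strsplit(parts[2], "[^[:alnum:]]+")[[1]]
  data.frame(
    quant_dir = quant_dir,
    sample = parts[1],
    index_content = fields[1],
    kmer = sub("k", "", fields[2], fixed = TRUE),  # drop the "k" prefix
    decoy = fields[3],                             # any later fields are ignored
    stringsAsFactors = FALSE
  )
}

# One row per quantification directory
quant_info <- do.call(rbind, lapply(quant_dirs, parse_quant_dir))
print(quant_info)
```

Keeping `quant_dir` as a column is what lets the later `left_join()` calls attach this metadata to the per-cell and per-feature QC tables.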
And we have them... interestingly, the best correlation is between no decoys and the partial decoy: the full decoy index is more different by this measure.
This leads me to the preliminary conclusion that the partial index (which notably uses a slightly different set of transcripts at this stage, due to not being built locally) may not be worth pursuing, as it seems to make very little difference.

Don't forget to look at your data!
😂 💯
Interestingly, the correlation between cDNA and txome expression values (for coding genes) using the full decoy seems substantially better than with the no decoy sequences!

This seems to me to be a good argument for using the full decoy, if it is less sensitive to the chosen transcript list.
Can we stick a `sessionInfo()` here? I know this should always be run in a container, but this practice has been helpful for tracking down potential weirdness in the past.
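Something along these lines as the final chunk would do it. This is only a minimal sketch; the `session` variable name is arbitrary, and in the notebook it would sit inside a regular R chunk.

```r
# Record the R version, platform, and attached/loaded package versions
# so the rendered notebook documents the environment it was run in
session <- sessionInfo()
print(session)
```

Since the rendered HTML is committed alongside the notebook, the printed output gives a record of package versions even if the Docker image changes later.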
When we have some, we can repeat?
👍 works for me! Are the samples you've included so far the same cancer type? I was under the perhaps mistaken impression that there was more than one cancer type as part of this particular project and I'm curious if we'd see differences/less consistency across cancer types or if it would be like "weird transcript is weird" uniformly. This is a musing beyond the scope of this pull request.
At this point, all the same type. We were supposed to get the other type, but there seems to be a delay...
and rerun to keep html in sync
The benchmark analysis notebook here currently contains investigation of the count patterns seen in the cDNA vs. full transcriptome reference indexes, in service of the discussion in #4. I expect to add analysis of the decoy sequences soon (some parts of which can be seen here), but I thought it was worth getting this into a PR in its current state.
I initially tried to use some S3 libraries within R, but those did not seem to work the way I wanted, so I resorted to simply downloading the result files. To make that easier, I launch the docker image with the command described in README.md to import the local AWS tokens for access to S3 (the docker image includes the aws-cli tools). To run this notebook therefore requires that you already have run `aws setup` on your local system with credentials that allow read access to `s3://nextflow-ccdl-results`