
Initial benchmark analysis #22

Merged: 17 commits into master, Sep 21, 2020
Conversation

@jashapiro (Member) commented Sep 9, 2020:

The benchmark analysis notebook here currently contains investigation of the count patterns seen in the cDNA vs. full transcriptome reference indexes, in service of the discussion in #4. I expect to add analysis of the decoy sequences soon (some parts of which can be seen here), but I thought it was worth getting this into a PR in its current state.

I initially tried to use some S3 libraries within R, but those did not seem to work the way I wanted, so I resorted to simply downloading the result files. To make that easier, I launch the docker image with the command described in README.md to import the local AWS tokens for access to S3 (the docker image includes the aws-cli tools). To run this notebook therefore requires that you already have run aws setup on your local system with credentials that allow read access to s3://nextflow-ccdl-results.

@jashapiro (Member, Author) commented:

I have added some analysis of the decoy indexes to this notebook, which were more interesting than I expected! I am going to request @envest for review to start, but I think @cgreene and @jaclyn-taroni might want to have a look at the results as well.

Most intriguing results are right at the end, where I compare the full transcriptome to coding-only* indexes and notice that you get much more concordant results between the two when you include decoy sequences. Or I did something funny, which is always possible.

\* "Coding only" includes pseudogenes, for some reason. Ask Ensembl.

The easiest way to view the notebook is probably http://htmlpreview.github.io/?https://github.com/AlexsLemonade/alsf-scpca/blob/jashapiro/benchmark-analysis/workflows/benchmarks/benchmark-analysis.nb.html#Comparing_decoy_indexes

@jashapiro jashapiro requested a review from envest September 10, 2020 18:58
@envest (Contributor) left a comment:
Overall I agree with the "go ahead with the full decoy" conclusion. For deeper understanding of the differences due to different indices, I think it may be worth including some additional samples, especially from another data contributor.

(Not taking sides on indexes vs. indices...) 😝

### Get Annotations from AnnotationHub

```{r}
hub = AnnotationHub::AnnotationHub(ask = FALSE)
```
Contributor:

Notebook output had this warning:

```
Warning: call dbDisconnect() when finished working with a connection
snapshotDate(): 2020-04-27
```

Member Author:

I've seen that occasionally, but I can't pin it down, so I am ignoring it. I don't love AnnotationHub (not the speediest), but it does seem to be nicely reliable.

ENSG00000197563: PIGN Phosphatidylinositol Glycan Anchor Biosynthesis Class N
ENSG00000173559: NABP1 Nucleic Acid Binding Protein 1

I don't really know what to do with this list... it seems like few genes will have large differences, but a few differences are enormous!
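The shape of that list can be sketched with toy numbers (all gene IDs and values below are made up, not the notebook's data): most genes agree closely between the two indexes, while a handful diverge by orders of magnitude.

```{r}
# Toy sketch (made-up values): per-gene mean counts from two indexes
counts <- data.frame(
  gene_id = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"),
  cdna    = c(100, 52, 3, 1200),
  txome   = c(98, 55, 900, 2)
)
# Pseudocount of 1 avoids division by zero for unexpressed genes
counts$log2_ratio <- log2((counts$txome + 1) / (counts$cdna + 1))
# Flag genes differing by more than 4-fold in either direction
big_diff <- counts[abs(counts$log2_ratio) > 2, ]
```

Here `ENSG_C` and `ENSG_D` would be flagged while the others pass quietly, which is the "few genes, enormous differences" pattern.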
Contributor:

Hmm... If some ENGSs really are problematic, I would expect that the problem is consistent across samples within and across data providers. Worth checking out some more samples?

Member Author:

Yeah, some of these seem consistent between the two samples, but I have no idea more generally.

```{r}
  geom_point()
```

Interestingly, the correlation between cDNA and txome expression values (for coding genes) using the full decoy seems substantially better than with the no decoy sequences!
Contributor:

Does this higher correlation hold for sample 834?

Member Author:

It does!
No decoy: [scatter plot omitted]
Full decoy: [scatter plot omitted]

```{r}
library(tximport)

# set seed
set.seed(2020)
```
Contributor:

🔥

```{r}
  geom_point()
```
Correlation is very good for genes present in both, with expression more often lower in the transcriptome set, which may make sense in the case of multimapping introduced by the larger number of potential targets.
Interesting though that there are some genes which do not appear to be expressed in the cDNA set that have expression in the txome.
Contributor:

Hmmm what's going on here?

Member Author:

Pseudomapping is a bit of a black box?
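One way to at least enumerate those genes for a closer look would be something along these lines (a toy sketch with made-up object names and values, not the notebook's actual data):

```{r}
# Toy sketch (made-up values): per-gene mean expression from each index
gene_means <- data.frame(
  gene_id    = c("ENSG_A", "ENSG_B", "ENSG_C"),
  cdna_mean  = c(5.2, 0.0, 0.0),
  txome_mean = c(5.0, 3.7, 0.0)
)
# Genes with zero cDNA expression but nonzero txome expression
txome_only <- subset(gene_means, cdna_mean == 0 & txome_mean > 0)
```

In this toy case only `ENSG_B` is flagged; on the real data, the flagged set would be the candidates for checking against annotation (pseudogenes, non-coding transcripts, etc.).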

Co-authored-by: Steven Foltz <stevenmasonfoltz@gmail.com>
@jaclyn-taroni jaclyn-taroni self-requested a review September 17, 2020 19:18
@jaclyn-taroni (Member) left a comment:

This looks good to me in general! I had a couple documentation comments that should be addressed before merging, but no need for me to re-review before merging.

I do agree with @envest's sentiment:

For deeper understanding of the differences due to different indices, I think it may be worth including some additional samples, especially from another data contributor.

@@ -0,0 +1,15 @@
## Docker

Running the benchmark analysis notebook should be done via docker, with the following command to mount the local directory as the home direcory and pass in the current user's AWS credentials and launch RStudio.
Member:

Suggested change
Running the benchmark analysis notebook should be done via docker, with the following command to mount the local directory as the home direcory and pass in the current user's AWS credentials and launch RStudio.
Running the benchmark analysis notebook should be done via docker, with the following command to mount the local directory as the home directory and pass in the current user's AWS credentials and launch RStudio.

Member:

Can you add a brief reference to .aws like you have in workflows/images/scpca-r/README.md and/or to running aws setup ahead of time? Quoting you from your initial comment filing the PR:

To run this notebook therefore requires that you already have run aws setup on your local system with credentials that allow read access to s3://nextflow-ccdl-results

Comment on lines 110 to 126
Combine all cell QC stats into a single data frame
```{r}
sce_cell_qc <- purrr::map_df(sces,
~ as.data.frame(colData(.x)) %>%
tibble::rownames_to_column(var = "cell_id"),
.id = "quant_id") %>%
dplyr::left_join(quant_info, by = c("quant_id" = "quant_dir"))
```


Combine all feature QC stats into a single data frame.
```{r}
sce_feature_qc <- purrr::map_df(sces,
~ as.data.frame(rowData(.x)) %>%
tibble::rownames_to_column(var = "gene_id"),
.id = "quant_id") %>%
dplyr::left_join(quant_info, by = c("quant_id" = "quant_dir"))
```

Member:

Very minor quibble - the spacing on these chunks seems odd to me.

Comment on lines +49 to +55
```{r}
quant_info <- data.frame(quant_dir = quant_dirs, info = quant_dirs) %>%
  tidyr::separate(info, sep = "[-]",
                  into = c("sample", "index_type")) %>%
  tidyr::separate(index_type,
                  into = c("index_content", "kmer", "decoy"),
                  extra = "drop") %>%
  dplyr::mutate(kmer = stringr::str_remove(kmer, "k"))
```
Member:

Nifty!
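To illustrate what that parsing yields, here is a sketch with made-up directory names (the real quant dirs on S3 may be formatted differently):

```{r}
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical quant directory names, purely for illustration
quant_dirs <- c("834-cdna_k31_full", "905-txome_k31_no")

quant_info <- data.frame(quant_dir = quant_dirs, info = quant_dirs) %>%
  # split sample from index description on the hyphen
  tidyr::separate(info, sep = "[-]",
                  into = c("sample", "index_type")) %>%
  # split the index description on non-alphanumeric characters (here "_")
  tidyr::separate(index_type,
                  into = c("index_content", "kmer", "decoy"),
                  extra = "drop") %>%
  # strip the leading "k" so kmer is a bare number
  dplyr::mutate(kmer = stringr::str_remove(kmer, "k"))
# yields columns: sample = "834"/"905", index_content = "cdna"/"txome",
# kmer = "31", decoy = "full"/"no"
```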

And we have them... interestingly, the best correlation is between no decoys and the partial decoy: the full decoy index is more different by this measure.
This leads me to the preliminary conclusion that the partial index (which notably uses a slightly different set of transcripts at this stage, due to not being built locally) may not be worth pursuing, as it seems to make very little difference.

Don't forget to look at your data!
Member:

😂 💯


Interestingly, the correlation between cDNA and txome expression values (for coding genes) using the full decoy seems substantially better than with the no decoy sequences!

This seems to me to be a good argument for using the full decoy, if it is less sensitive to the chosen transcript list.
Member:

Can we stick a sessionInfo() here? I know this should always be run in a container, but this practice has been helpful for tracking down potential weirdness in the past.
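For reference, that would just be a final chunk along these lines:

```{r}
# Record R version and loaded package versions for reproducibility
sessionInfo()
```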

@jashapiro:

I do agree with @envest's sentiment:

For deeper understanding of the differences due to different indices, I think it may be worth including some additional samples, especially from another data contributor.

When we have some, we can repeat?

@jaclyn-taroni:

👍 works for me! Are the samples you've included so far the same cancer type? I was under the (perhaps mistaken) impression that more than one cancer type is part of this particular project, and I'm curious whether we'd see differences or less consistency across cancer types, or whether it would be "weird transcript is weird" uniformly. This is a musing beyond the scope of this pull request.

@jashapiro:

At this point, all the same type. We were supposed to get the other type, but there seems to be a delay...

@jashapiro jashapiro merged commit 7c3cde0 into master Sep 21, 2020
@jashapiro jashapiro deleted the jashapiro/benchmark-analysis branch October 22, 2021 18:46