Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add processing information to QC report #29

Merged
merged 13 commits into from
Sep 2, 2021

Conversation

allyhawkins
Copy link
Member

Closes #17. This PR adds in basic processing information to the beginning of the QC report using generate_qc_report(). As it is right now, I grab any metadata available from the sce as a data frame, remove any empty rows, and then print it out in a basic table. Because we have some types of tools that currently do not have metadata in them, I have also allowed for the possibility of empty metadata and it will print out a statement that no processing information is available.

What I have not done here is modify any of the row names from the original names in the metadata table or any of the values that are grabbed directly from the metadata which originates from the json files in alevin-fry. I wasn't sure if we also wanted to include all of the information that is present in the metadata. I was hoping to get some initial feedback first in terms of how much we should change from the original metadata table and if we want to do the changing here or in the original read_alevin_metadata function instead to allow this to be more flexible with multiple tools. Although I am assuming at this point we will for the most part be using Alevin-fry.

Some specific pieces of metadata I was thinking we should change was the af permit type and af resolution to just be filter method and resolution. I also wasn't sure if we wanted to change the index names at this point for external users or just keep them as they are. Please let me know if there is also other processing information not in the metadata that you think we should be including here as well.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the approach you have taken here is okay, but I am a bit worried about its opacity. That is, it isn't clear from the transformations and table creation what metadata we are reporting. I think I would generally favor a more explicit approach, where we decide exactly which values we want to display, give them good names (probably more than removing underscores), and then create the table for display. This could also make it a bit easier to pass in other data (via params, or other sources) that might be worth reporting but isn't captured in the sce metadata.


```{r echo=FALSE}
# extract sce metadata containing processing information as table
processing_info <- sce@metadata %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably works most of the time, but it is dangerous, as sce is not defined in this notebook.

I also switched to use the metadata() function, as I think this is a bit more standard than directly accessing the slot name.

Suggested change
processing_info <- sce@metadata %>%
processing_info <- metadata(params$sce) %>%

```{r echo=FALSE}
# extract sce metadata containing processing information as table
processing_info <- sce@metadata %>%
as.data.frame() %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works for the current structure of the metadata, but I am a bit worried about it, as it is entirely possible that we could end up with a metadata element that is not a single value. We also might have some metadata values we don't want to display.

@allyhawkins
Copy link
Member Author

I think I would generally favor a more explicit approach, where we decide exactly which values we want to display, give them good names (probably more than removing underscores), and then create the table for display. This could also make it a bit easier to pass in other data (via params, or other sources) that might be worth reporting but isn't captured in the sce metadata.

I can definitely make this more explicit based on what we agree needs to go into this table. I am going to operate on the assumption that this is only being used for samples being processed with Alevin-fry right now and based on looking at both the QC reports from Alevin and Cellranger it looks like we might want a table that gives some version and parameter information. This would include the following:

  • Salmon Version
  • Alevin-fry Version
  • Index
  • Filtering Method (knee vs. unfiltered-pl)
  • Resolution
  • Transcripts Counted (spliced vs. unspliced or do we want to refer to this as a different name?)

It also looks like there might be usefulness in having a sample specific table with information about each specific sample that doesn't quite fall under the processing category including the following:

  • Sample ID
  • 10X Kit
  • Number of Reads
  • Number of Mapped Reads
  • Number of Cells Reported by Alevin-fry

If you could let me know what categories you like or don't like/ if you have any additional ideas that would be great and then I can go ahead and create the explicit tables that show these.

@jashapiro
Copy link
Member

If you could let me know what categories you like or don't like/ if you have any additional ideas that would be great and then I can go ahead and create the explicit tables that show these.

These look good to me. As I said, nothing we can't add or subtract later!

A question I have is about the read counts that you listed in the second section: where are the read counts coming from? Are they as reported by alevin? If so, I think having something about the mapping rate might be worthwhile as well. Some of this is bleeding into QC, but if we are just using the data that comes out of Alevin and clearly state that, I think it should be fine. We can report the more limited data.

I do think "transcripts counted" is a bit of an odd phrase, but I'm having trouble with a better compact phrase. Maybe just something boolean, like "Unspliced transcripts included" with a value of true or false/yes or no? I don't necessarily like that better.

@allyhawkins
Copy link
Member Author

@jashapiro, I adjusted these to make things more explicit so we are labeling them how we want and grabbing the fields that we want. I also added in the addition of a sample summary table that has information on the number of reads and mapped reads as I said in my previous comment. To get these values, I am pulling them directly from the Alevin output. In the aux_info/meta_info.json file it stores the number of reads processed and the number of reads mapped. This file is present for both alevin and alevin-fry. I adjusted the read_alevin_metadata function accordingly to grab that metadata that is being used to build the table. I didn't do much in terms of formatting or making things pretty but figured that could come later.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. Most of my suggestions are pretty minor, but some should streamline things a bit.

One thing that is not in my line comments is that (for the second review in a row) I am going to suggest sentence case for the table row labels; I find it a bit easier to read. But I can be overruled as long as we are consistent.

R/read_alevin.R Outdated
@@ -199,6 +208,8 @@ read_alevin_metadata <- function(quant_dir){
meta$af_resolution <- quant_info[['resolution_strategy']]
meta$af_tx2gene <- cmd_info[['tgMap']]
meta$usa_mode <- quant_info[['usa_mode']]
meta$num_cells <- quant_info[['num_quantified_cells']]
meta$version_10X <- names(cmd_info)[[3]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to trust the order here to stay stable. Is there some other way we can get this information? I think this also would miss 10Xv3 vs. 10Xv3.1, which I think we will want to report. I don't really want to make it an argument to this function in particular, but adding to the SCE metadata later should not be difficult.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is something I was struggling with. I forgot to mention this in my previous comment, but I agree that this isn't the most stable method of getting the 10X version moving forward. This was the only way I could get it using the alevin output which is all we have currently as input to the read_alevin function where the metadata is added to the sce. Moving forward, I do think the best place to get it will actually be from the scpca-library-metadata.tsv sheet that we have that becomes the metadata map in the nextflow workflow. Because we plan on creating the QC report from the workflow, another option is to actually use the 10X version as input to the function along with the sample ID? Unless you have other ideas on how to grab this information I am open to suggestions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could make it an optional argument to read_alevin... I think there are a number of places that it could go, but that probably makes the most sense, so it ends up in the SCE metadata if desired.

R/read_alevin.R Outdated
@@ -199,6 +208,8 @@ read_alevin_metadata <- function(quant_dir){
meta$af_resolution <- quant_info[['resolution_strategy']]
meta$af_tx2gene <- cmd_info[['tgMap']]
meta$usa_mode <- quant_info[['usa_mode']]
meta$num_cells <- quant_info[['num_quantified_cells']]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably add a prefix here to specify that these are the alevin fry counts, as we will probably have other cell counts later that might end up added in.

Suggested change
meta$num_cells <- quant_info[['num_quantified_cells']]
meta$af_num_cells <- quant_info[['num_quantified_cells']]

inst/rmd/qc_report.rmd Outdated Show resolved Hide resolved
inst/rmd/qc_report.rmd Outdated Show resolved Hide resolved
Comment on lines 44 to 46
# remove underscores from rownames for table
rownames(processing_info) = str_replace_all(rownames(processing_info), "\\.", " ")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use a tibble, we should be able to remove this.



```{r echo=FALSE}
sample_information <- data.frame(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same suggestion to use a tibble here.

Suggested change
sample_information <- data.frame(
sample_information <- data.frame(

inst/rmd/qc_report.rmd Outdated Show resolved Hide resolved
inst/rmd/qc_report.rmd Outdated Show resolved Hide resolved
Comment on lines 40 to 41
# remove any empty slots before making table
na.omit()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should not remove empty slots. Seeing them as NA or NULL is a good sanity check which we would otherwise potentially miss.

inst/rmd/qc_report.rmd Outdated Show resolved Hide resolved
R/read_alevin.R Outdated
@@ -39,7 +40,8 @@
read_alevin <- function(quant_dir,
intron_mode = FALSE,
usa_mode = FALSE,
which_counts = c("spliced", "unspliced")){
which_counts = c("spliced", "unspliced"),
version_10x = NULL){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know everything has been 10X so far, but let's make this variable name more generic? Maybe something like kit_version, or tech_version? The values we are using do encode that it is 10X, so I don't think it should be ambiguous at all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I updated it to be tech_version. Do we want the name in the table to also be Tech version or would we prefer something like Technology, Technology version, orTechnology used?

t() %>%
# remove any empty slots before making table
na.omit()
t()

if(!is.null(sample_information)){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After removing the na.omit step, I think this if is no longer needed.

DESCRIPTION Outdated Show resolved Hide resolved
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I have a number of minor comments, but the overall content seems good to me. We will certainly be iterating on the report a bunch, so I don't think we need to get everything perfect at this stage, but getting the extra info into the metadata is going to be important for other steps later.

@@ -59,6 +60,7 @@ import_quant_data <- function(quant_dir,
usa_mode = FALSE,
filter = FALSE,
fdr_cutoff = 0.01,
tech_version = NULL,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll admit, I don't love tech_version but I can't come up with anything better, so let's go with it.

R/read_alevin.R Outdated Show resolved Hide resolved
inst/rmd/qc_report.rmd Outdated Show resolved Hide resolved
inst/rmd/qc_report.rmd Outdated Show resolved Hide resolved


```{r echo=FALSE}
sample_information <- tibble::tibble(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want a subheading of some kind here? I think we probably also want to switch the order of these two tables. I feel like people will want the read count info before they care about the details of our processing.

allyhawkins and others added 2 commits September 2, 2021 14:11
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
@allyhawkins allyhawkins merged commit fc23948 into main Sep 2, 2021
@allyhawkins allyhawkins deleted the allyhawkins/add-processing-info branch September 2, 2021 20:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add processing information to QC report
2 participants