-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add processing information to QC report #29
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the approach you have taken here is okay, but I am a bit worried about its opacity. That is, it isn't clear from the transformations and table creation what metadata we are reporting. I think I would generally favor a more explicit approach, where we decide exactly which values we want to display, give them good names (probably more than removing underscores), and then create the table for display. This could also make it a bit easier to pass in other data (via params
, or other sources) that might be worth reporting but isn't captured in the sce
metadata.
inst/rmd/qc_report.rmd
Outdated
|
||
```{r echo=FALSE} | ||
# extract sce metadata containing processing information as table | ||
processing_info <- sce@metadata %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably works most of the time, but it is dangerous, as sce
is not defined in this notebook.
I also switched to use the metadata()
function, as I think this is a bit more standard than directly accessing the slot name.
processing_info <- sce@metadata %>% | |
processing_info <- metadata(params$sce) %>% |
inst/rmd/qc_report.rmd
Outdated
```{r echo=FALSE} | ||
# extract sce metadata containing processing information as table | ||
processing_info <- sce@metadata %>% | ||
as.data.frame() %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works for the current structure of the metadata, but I am a bit worried about it, as it is entirely possible that we could end up with a metadata element that is not a single value. We also might have some metadata values we don't want to display.
I can definitely make this more explicit based on what we agree needs to go into this table. I am going to operate on the assumption that this is only being used for samples being processed with Alevin-fry right now and based on looking at both the QC reports from Alevin and Cellranger it looks like we might want a table that gives some version and parameter information. This would include the following:
It also looks like there might be usefulness in having a sample specific table with information about each specific sample that doesn't quite fall under the processing category including the following:
If you could let me know what categories you like or don't like/ if you have any additional ideas that would be great and then I can go ahead and create the explicit tables that show these. |
These look good to me. As I said, nothing we can't add or subtract later! A question I have is about the read counts that you listed in the second section: where are the read counts coming from? Are they as reported by alevin? If so, I think having something about the mapping rate might be worthwhile as well. Some of this is bleeding into QC, but if we are just using the data that comes out of Alevin and clearly state that, I think it should be fine. We can report the more limited data. I do think "transcripts counted" is a bit of an odd phrase, but I'm having trouble with a better compact phrase. Maybe just something boolean, like "Unspliced transcripts included" with a value of true or false/yes or no? I don't necessarily like that better. |
@jashapiro, I adjusted these to make things more explicit so we are labeling them how we want and grabbing the fields that we want. I also added in the addition of a sample summary table that has information on the number of reads and mapped reads as I said in my previous comment. To get these values, I am pulling them directly from the Alevin output. In the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. Most of my suggestions are pretty minor, but some should streamline things a bit.
One thing that is not in my line comments is that (for the second review in a row) I am going to suggest sentence case for the table row labels; I find it a bit easier to read. But I can be overruled as long as we are consistent.
R/read_alevin.R
Outdated
@@ -199,6 +208,8 @@ read_alevin_metadata <- function(quant_dir){ | |||
meta$af_resolution <- quant_info[['resolution_strategy']] | |||
meta$af_tx2gene <- cmd_info[['tgMap']] | |||
meta$usa_mode <- quant_info[['usa_mode']] | |||
meta$num_cells <- quant_info[['num_quantified_cells']] | |||
meta$version_10X <- names(cmd_info)[[3]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to trust the order here to stay stable. Is there some other way we can get this information? I think this also would miss 10Xv3
vs. 10Xv3.1
, which I think we will want to report. I don't really want to make it an argument to this function in particular, but adding to the SCE metadata later should not be difficult.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is something I was struggling with. I forgot to mention this in my previous comment, but I agree that this isn't the most stable method of getting the 10X version moving forward. This was the only way I could get it using the alevin output which is all we have currently as input to the read_alevin
function where the metadata is added to the sce. Moving forward, I do think the best place to get it will actually be from the scpca-library-metadata.tsv
sheet that we have that becomes the metadata map in the nextflow workflow. Because we plan on creating the QC report from the workflow, another option is to actually use the 10X version as input to the function along with the sample ID? Unless you have other ideas on how to grab this information I am open to suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could make it an optional argument to read_alevin
... I think there are a number of places that it could go, but that probably makes the most sense, so it ends up in the SCE metadata if desired.
R/read_alevin.R
Outdated
@@ -199,6 +208,8 @@ read_alevin_metadata <- function(quant_dir){ | |||
meta$af_resolution <- quant_info[['resolution_strategy']] | |||
meta$af_tx2gene <- cmd_info[['tgMap']] | |||
meta$usa_mode <- quant_info[['usa_mode']] | |||
meta$num_cells <- quant_info[['num_quantified_cells']] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably add a prefix here to specify that these are the alevin fry counts, as we will probably have other cell counts later that might end up added in.
meta$num_cells <- quant_info[['num_quantified_cells']] | |
meta$af_num_cells <- quant_info[['num_quantified_cells']] |
inst/rmd/qc_report.rmd
Outdated
# remove underscores from rownames for table | ||
rownames(processing_info) = str_replace_all(rownames(processing_info), "\\.", " ") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use a tibble, we should be able to remove this.
inst/rmd/qc_report.rmd
Outdated
|
||
|
||
```{r echo=FALSE} | ||
sample_information <- data.frame( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same suggestion to use a tibble here.
sample_information <- data.frame( | |
sample_information <- data.frame( |
inst/rmd/qc_report.rmd
Outdated
# remove any empty slots before making table | ||
na.omit() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should not remove empty slots. Seeing them as NA or NULL is a good sanity check which we would otherwise potentially miss.
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
R/read_alevin.R
Outdated
@@ -39,7 +40,8 @@ | |||
read_alevin <- function(quant_dir, | |||
intron_mode = FALSE, | |||
usa_mode = FALSE, | |||
which_counts = c("spliced", "unspliced")){ | |||
which_counts = c("spliced", "unspliced"), | |||
version_10x = NULL){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know everything has been 10X so far, but let's make this variable name more generic? Maybe something like kit_version
, or tech_version
? The values we are using do encode that it is 10X, so I don't think it should be ambiguous at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I updated it to be tech_version
. Do we want the name in the table to also be Tech version
or would we prefer something like Technology
, Technology version
, orTechnology used
?
inst/rmd/qc_report.rmd
Outdated
t() %>% | ||
# remove any empty slots before making table | ||
na.omit() | ||
t() | ||
|
||
if(!is.null(sample_information)){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After removing the na.omit
step, I think this if
is no longer needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. I have a number of minor comments, but the overall content seems good to me. We will certainly be iterating on the report a bunch, so I don't think we need to get everything perfect at this stage, but getting the extra info into the metadata is going to be important for other steps later.
@@ -59,6 +60,7 @@ import_quant_data <- function(quant_dir, | |||
usa_mode = FALSE, | |||
filter = FALSE, | |||
fdr_cutoff = 0.01, | |||
tech_version = NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll admit, I don't love tech_version
but I can't come up with anything better, so let's go with it.
inst/rmd/qc_report.rmd
Outdated
|
||
|
||
```{r echo=FALSE} | ||
sample_information <- tibble::tibble( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want a subheading of some kind here? I think we probably also want to switch the order of these two tables. I feel like people will want the read count info before they care about the details of our processing.
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
Closes #17. This PR adds in basic processing information to the beginning of the QC report using
generate_qc_report()
. As it is right now, I grab any metadata available from the sce as a data frame, remove any empty rows, and then print it out in a basic table. Because we have some types of tools that currently do not have metadata in them, I have also allowed for the possibility of empty metadata and it will print out a statement that no processing information is available.What I have not done here is modify any of the row names from the original names in the metadata table or any of the values that are grabbed directly from the metadata which originates from the json files in alevin-fry. I wasn't sure if we also wanted to include all of the information that is present in the metadata. I was hoping to get some initial feedback first in terms of how much we should change from the original metadata table and if we want to do the changing here or in the original
read_alevin_metadata
function instead to allow this to be more flexible with multiple tools. Although I am assuming at this point we will for the most part be using Alevin-fry.Some specific pieces of metadata I was thinking we should change was the
af permit type
andaf resolution
to just befilter method
andresolution
. I also wasn't sure if we wanted to change the index names at this point for external users or just keep them as they are. Please let me know if there is also other processing information not in the metadata that you think we should be including here as well.