Add processing information to QC report #29

allyhawkins · 2021-08-05T18:28:21Z

Closes #17. This PR adds in basic processing information to the beginning of the QC report using generate_qc_report(). As it is right now, I grab any metadata available from the sce as a data frame, remove any empty rows, and then print it out in a basic table. Because we have some types of tools that currently do not have metadata in them, I have also allowed for the possibility of empty metadata and it will print out a statement that no processing information is available.

What I have not done here is modify any of the row names from the original names in the metadata table or any of the values that are grabbed directly from the metadata which originates from the json files in alevin-fry. I wasn't sure if we also wanted to include all of the information that is present in the metadata. I was hoping to get some initial feedback first in terms of how much we should change from the original metadata table and if we want to do the changing here or in the original read_alevin_metadata function instead to allow this to be more flexible with multiple tools. Although I am assuming at this point we will for the most part be using Alevin-fry.

Some specific pieces of metadata I was thinking we should change was the af permit type and af resolution to just be filter method and resolution. I also wasn't sure if we wanted to change the index names at this point for external users or just keep them as they are. Please let me know if there is also other processing information not in the metadata that you think we should be including here as well.

jashapiro

I think that the approach you have taken here is okay, but I am a bit worried about its opacity. That is, it isn't clear from the transformations and table creation what metadata we are reporting. I think I would generally favor a more explicit approach, where we decide exactly which values we want to display, give them good names (probably more than removing underscores), and then create the table for display. This could also make it a bit easier to pass in other data (via params, or other sources) that might be worth reporting but isn't captured in the sce metadata.

jashapiro · 2021-08-05T18:35:48Z

inst/rmd/qc_report.rmd

+
+```{r echo=FALSE}
+# extract sce metadata containing processing information as table
+processing_info <- sce@metadata %>%


This probably works most of the time, but it is dangerous, as sce is not defined in this notebook.

I also switched to use the metadata() function, as I think this is a bit more standard than directly accessing the slot name.

Suggested change

processing_info <- sce@metadata %>%

processing_info <- metadata(params$sce) %>%

jashapiro · 2021-08-05T18:38:47Z

inst/rmd/qc_report.rmd

+```{r echo=FALSE}
+# extract sce metadata containing processing information as table
+processing_info <- sce@metadata %>%
+  as.data.frame() %>%


This works for the current structure of the metadata, but I am a bit worried about it, as it is entirely possible that we could end up with a metadata element that is not a single value. We also might have some metadata values we don't want to display.

allyhawkins · 2021-08-05T22:23:30Z

I think I would generally favor a more explicit approach, where we decide exactly which values we want to display, give them good names (probably more than removing underscores), and then create the table for display. This could also make it a bit easier to pass in other data (via params, or other sources) that might be worth reporting but isn't captured in the sce metadata.

I can definitely make this more explicit based on what we agree needs to go into this table. I am going to operate on the assumption that this is only being used for samples being processed with Alevin-fry right now and based on looking at both the QC reports from Alevin and Cellranger it looks like we might want a table that gives some version and parameter information. This would include the following:

Salmon Version
Alevin-fry Version
Index
Filtering Method (knee vs. unfiltered-pl)
Resolution
Transcripts Counted (spliced vs. unspliced or do we want to refer to this as a different name?)

It also looks like there might be usefulness in having a sample specific table with information about each specific sample that doesn't quite fall under the processing category including the following:

Sample ID
10X Kit
Number of Reads
Number of Mapped Reads
Number of Cells Reported by Alevin-fry

If you could let me know what categories you like or don't like/ if you have any additional ideas that would be great and then I can go ahead and create the explicit tables that show these.

jashapiro · 2021-08-16T20:37:45Z

If you could let me know what categories you like or don't like/ if you have any additional ideas that would be great and then I can go ahead and create the explicit tables that show these.

These look good to me. As I said, nothing we can't add or subtract later!

A question I have is about the read counts that you listed in the second section: where are the read counts coming from? Are they as reported by alevin? If so, I think having something about the mapping rate might be worthwhile as well. Some of this is bleeding into QC, but if we are just using the data that comes out of Alevin and clearly state that, I think it should be fine. We can report the more limited data.

I do think "transcripts counted" is a bit of an odd phrase, but I'm having trouble with a better compact phrase. Maybe just something boolean, like "Unspliced transcripts included" with a value of true or false/yes or no? I don't necessarily like that better.

allyhawkins · 2021-08-17T23:04:41Z

@jashapiro, I adjusted these to make things more explicit so we are labeling them how we want and grabbing the fields that we want. I also added in the addition of a sample summary table that has information on the number of reads and mapped reads as I said in my previous comment. To get these values, I am pulling them directly from the Alevin output. In the aux_info/meta_info.json file it stores the number of reads processed and the number of reads mapped. This file is present for both alevin and alevin-fry. I adjusted the read_alevin_metadata function accordingly to grab that metadata that is being used to build the table. I didn't do much in terms of formatting or making things pretty but figured that could come later.

jashapiro

This looks good. Most of my suggestions are pretty minor, but some should streamline things a bit.

One thing that is not in my line comments is that (for the second review in a row) I am going to suggest sentence case for the table row labels; I find it a bit easier to read. But I can be overruled as long as we are consistent.

jashapiro · 2021-08-18T19:08:14Z

R/read_alevin.R

@@ -199,6 +208,8 @@ read_alevin_metadata <- function(quant_dir){
  meta$af_resolution <- quant_info[['resolution_strategy']]
  meta$af_tx2gene <- cmd_info[['tgMap']]
  meta$usa_mode <- quant_info[['usa_mode']]
+  meta$num_cells <- quant_info[['num_quantified_cells']]
+  meta$version_10X <- names(cmd_info)[[3]]


I don't think we want to trust the order here to stay stable. Is there some other way we can get this information? I think this also would miss 10Xv3 vs. 10Xv3.1, which I think we will want to report. I don't really want to make it an argument to this function in particular, but adding to the SCE metadata later should not be difficult.

So this is something I was struggling with. I forgot to mention this in my previous comment, but I agree that this isn't the most stable method of getting the 10X version moving forward. This was the only way I could get it using the alevin output which is all we have currently as input to the read_alevin function where the metadata is added to the sce. Moving forward, I do think the best place to get it will actually be from the scpca-library-metadata.tsv sheet that we have that becomes the metadata map in the nextflow workflow. Because we plan on creating the QC report from the workflow, another option is to actually use the 10X version as input to the function along with the sample ID? Unless you have other ideas on how to grab this information I am open to suggestions.

We could make it an optional argument to read_alevin... I think there are a number of places that it could go, but that probably makes the most sense, so it ends up in the SCE metadata if desired.

jashapiro · 2021-08-18T19:10:21Z

R/read_alevin.R

@@ -199,6 +208,8 @@ read_alevin_metadata <- function(quant_dir){
  meta$af_resolution <- quant_info[['resolution_strategy']]
  meta$af_tx2gene <- cmd_info[['tgMap']]
  meta$usa_mode <- quant_info[['usa_mode']]
+  meta$num_cells <- quant_info[['num_quantified_cells']]


We should probably add a prefix here to specify that these are the alevin fry counts, as we will probably have other cell counts later that might end up added in.

Suggested change

meta$num_cells <- quant_info[['num_quantified_cells']]

meta$af_num_cells <- quant_info[['num_quantified_cells']]

inst/rmd/qc_report.rmd

jashapiro · 2021-08-18T19:38:35Z

inst/rmd/qc_report.rmd

+  # remove underscores from rownames for table 
+  rownames(processing_info) = str_replace_all(rownames(processing_info), "\\.", " ")
+


If we use a tibble, we should be able to remove this.

jashapiro · 2021-08-18T19:39:16Z

inst/rmd/qc_report.rmd

+
+
+```{r echo=FALSE}
+sample_information <- data.frame(


Same suggestion to use a tibble here.

Suggested change

sample_information <- data.frame(

sample_information <- data.frame(

inst/rmd/qc_report.rmd

jashapiro · 2021-08-18T19:47:45Z

inst/rmd/qc_report.rmd

+  # remove any empty slots before making table
+  na.omit()


I think we should not remove empty slots. Seeing them as NA or NULL is a good sanity check which we would otherwise potentially miss.

inst/rmd/qc_report.rmd

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

jashapiro · 2021-08-19T13:20:22Z

R/read_alevin.R

@@ -39,7 +40,8 @@
 read_alevin <- function(quant_dir,
                        intron_mode = FALSE,
                        usa_mode = FALSE,
-                        which_counts = c("spliced", "unspliced")){
+                        which_counts = c("spliced", "unspliced"),
+                        version_10x = NULL){


I know everything has been 10X so far, but let's make this variable name more generic? Maybe something like kit_version, or tech_version? The values we are using do encode that it is 10X, so I don't think it should be ambiguous at all.

That makes sense. I updated it to be tech_version. Do we want the name in the table to also be Tech version or would we prefer something like Technology, Technology version, orTechnology used?

jashapiro · 2021-08-19T13:22:05Z

inst/rmd/qc_report.rmd

-  t() %>%
-  # remove any empty slots before making table
-  na.omit()
+  t()

 if(!is.null(sample_information)){


After removing the na.omit step, I think this if is no longer needed.

DESCRIPTION

jashapiro

This looks good to me. I have a number of minor comments, but the overall content seems good to me. We will certainly be iterating on the report a bunch, so I don't think we need to get everything perfect at this stage, but getting the extra info into the metadata is going to be important for other steps later.

jashapiro · 2021-09-02T13:43:27Z

R/import_quant_data.R

@@ -59,6 +60,7 @@ import_quant_data <- function(quant_dir,
                              usa_mode = FALSE,
                              filter = FALSE,
                              fdr_cutoff = 0.01,
+                              tech_version = NULL,


I'll admit, I don't love tech_version but I can't come up with anything better, so let's go with it.

R/read_alevin.R

inst/rmd/qc_report.rmd

jashapiro · 2021-09-02T13:59:59Z

inst/rmd/qc_report.rmd

+
+
+```{r echo=FALSE}
+sample_information <- tibble::tibble(


We probably want a subheading of some kind here? I think we probably also want to switch the order of these two tables. I feel like people will want the read count info before they care about the details of our processing.

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

allyhawkins added 3 commits August 4, 2021 18:17

generate table with processing information in qc report

d6619d8

update headers

26e5d41

allow for empty processing information

c95c39c

allyhawkins requested a review from jashapiro August 5, 2021 18:28

jashapiro reviewed Aug 5, 2021

View reviewed changes

allyhawkins added 2 commits August 17, 2021 18:00

add sample summary table

741e64f

add total reads and mapped reads to sce metadata

38d27b5

allyhawkins requested a review from jashapiro August 17, 2021 23:05

jashapiro reviewed Aug 18, 2021

View reviewed changes

allyhawkins and others added 3 commits August 18, 2021 15:41

Apply suggestions from code review

6372202

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

update docs with version_10X argument

e57e834

add optional version_10X argument

65cda77

jashapiro reviewed Aug 19, 2021

View reviewed changes

allyhawkins and others added 3 commits August 19, 2021 13:17

remove check for null processing info

c19d23b

change parameter name to tech_version

f442f97

Merge branch 'main' into allyhawkins/add-processing-info

23b9642

jashapiro reviewed Sep 2, 2021

View reviewed changes

DESCRIPTION Outdated Show resolved Hide resolved

jashapiro approved these changes Sep 2, 2021

View reviewed changes

allyhawkins and others added 2 commits September 2, 2021 14:11

Apply suggestions from code review

bcc6a72

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

switch order of processing information tables

41ea5e3

allyhawkins merged commit fc23948 into main Sep 2, 2021

allyhawkins deleted the allyhawkins/add-processing-info branch September 2, 2021 20:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add processing information to QC report #29

Add processing information to QC report #29

allyhawkins commented Aug 5, 2021

jashapiro left a comment

jashapiro Aug 5, 2021

jashapiro Aug 5, 2021

allyhawkins commented Aug 5, 2021

jashapiro commented Aug 16, 2021

allyhawkins commented Aug 17, 2021

jashapiro left a comment

jashapiro Aug 18, 2021

allyhawkins Aug 18, 2021

jashapiro Aug 18, 2021

jashapiro Aug 18, 2021

jashapiro Aug 18, 2021

jashapiro Aug 18, 2021

jashapiro Aug 18, 2021

jashapiro Aug 19, 2021

allyhawkins Aug 19, 2021

jashapiro Aug 19, 2021

jashapiro left a comment

jashapiro Sep 2, 2021

jashapiro Sep 2, 2021

	processing_info <- sce@metadata %>%
	processing_info <- metadata(params$sce) %>%

	meta$num_cells <- quant_info[['num_quantified_cells']]
	meta$af_num_cells <- quant_info[['num_quantified_cells']]

		# remove underscores from rownames for table
		rownames(processing_info) = str_replace_all(rownames(processing_info), "\\.", " ")

	sample_information <- data.frame(
	sample_information <- data.frame(

Add processing information to QC report #29

Add processing information to QC report #29

Conversation

allyhawkins commented Aug 5, 2021

jashapiro left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

allyhawkins commented Aug 5, 2021

jashapiro commented Aug 16, 2021

allyhawkins commented Aug 17, 2021

jashapiro left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jashapiro left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment