Add RNA Seq normalization methods #136

Merged: 7 commits, May 14, 2024

Conversation

Contributor

@ScheidTo ScheidTo commented May 6, 2024

Thank you for contributing to BioFSharp. Please take the time to tell us a bit more about your PR.

Please list the changes introduced in this PR

  • added RPKM Normalization
  • added TPM Normalization

Description
This contribution adds RPKM and TPM normalization, as well as unit tests and documentation. RPKM and TPM are metrics for normalized RNA-sequencing data.

[Required] please make sure you checked that

  • The project builds without problems on your machine

[Optional]

  • Added unit tests regarding the added features


codecov bot commented May 6, 2024

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

"\n",
"#### RPKM:\n",
"\n",
"RPKM (reads per kilobase per million mapped reads) normalization first determines a scaling factor by dividing the sum of all reads in a sample by 1,000,000. Dividing each gene's read count by this scaling factor yields RPM (reads per million), which normalizes for sequencing depth. Dividing the RPM values by gene length in kilobases then normalizes for gene length and yields RPKM. RPKM is applied by using the `RNASeq.rpkms` function.\n"
Member

Here it would be beneficial to add the formula for RPKM.
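A sketch of the formula implied by the quoted description, not taken from the PR itself. The symbols are assumptions: $c_g$ is the raw read count of gene $g$, $L_g$ its length in base pairs, and $N = \sum_j c_j$ the total number of reads in the sample.

```latex
\mathrm{RPM}_g = \frac{c_g}{N / 10^{6}},
\qquad
\mathrm{RPKM}_g = \frac{\mathrm{RPM}_g}{L_g / 10^{3}}
                = \frac{10^{9}\, c_g}{N \, L_g}
```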

"\n",
"#### TPM:\n",
"\n",
"TPM (transcripts per million) differs from RPKM in the order of operations. To calculate TPM values, the data is normalized for gene length first: dividing each gene's read count by its length in kilobases yields RPK (reads per kilobase). The sum of all RPK values in a sample is then divided by 1,000,000 to obtain a scaling factor. Finally, dividing each RPK value by this scaling factor normalizes for sequencing depth and yields TPM.\n",
Member

Here it would be beneficial to add the formula for TPM.
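A sketch of the formula implied by the quoted description, not taken from the PR itself. The symbols are assumptions: $c_g$ is the raw read count of gene $g$ and $L_g$ its length in base pairs; the sum runs over all genes $j$ in the sample.

```latex
\mathrm{RPK}_g = \frac{c_g}{L_g / 10^{3}},
\qquad
\mathrm{TPM}_g = \frac{\mathrm{RPK}_g}{\sum_j \mathrm{RPK}_j / 10^{6}}
               = 10^{6} \cdot \frac{c_g / L_g}{\sum_j c_j / L_j}
```

By construction, the TPM values of a sample sum to $10^{6}$, which is not generally true for RPKM.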

"cell_type": "markdown",
"metadata": {},
"source": [
"The effects of both normalizations become apparent when comparing the relation of the samples "
Member

Not really, at least not at first glance. I would suggest setting the y axis of the RPKM and TPM plots to the same range, which would show more clearly that TPM values are lower than RPKM values. As the plot stands now, the values look identical until one looks at the axes.

Member

Additionally, this plot would really benefit from a 4th chart showing the gene length of each gene.

"metadata": {},
"source": [
"## RPKM & TPM\n",
"RNA-Seq is a transcriptomics technique that quantifies RNA molecules in a biological sample. RNA-Seq data must be normalized to correct for technical biases. RPKM and TPM are two metrics that normalize for both gene length and sequencing depth. Normalizing for gene length is necessary because longer genes yield more reads than shorter genes expressed at the same level; normalizing for sequencing depth is necessary because deeper sequencing yields more reads per gene.\n",
Member

I think this introduction understates the complexity of these datasets a little. I would suggest adding a few more sentences about the method, e.g. that it is high-throughput and can quantify the full transcriptome.

open System.Collections.Generic


module RNASeq =
Member

At least the public functions and types should have XML documentation, so that readers get context about what they do without having to browse the documentation page.
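A minimal sketch of what such XML documentation could look like, combined with the two normalizations as described in the notebook text. The module name `RNASeqSketch`, the tuple-based input, and both signatures are assumptions made for illustration; they are not the API added in this PR.

```fsharp
/// Hypothetical sketch module; names and signatures are assumptions, not the PR's API.
module RNASeqSketch =

    /// <summary>Computes RPKM values for the genes of one sample.</summary>
    /// <param name="genes">Tuples of (gene id, gene length in base pairs, raw read count).</param>
    /// <returns>Tuples of (gene id, RPKM value) in input order.</returns>
    let rpkms (genes: seq<string * float * float>) : seq<string * float> =
        let cached = Seq.cache genes
        // Scaling factor: total reads in the sample, per million.
        let perMillion = (cached |> Seq.sumBy (fun (_, _, count) -> count)) / 1e6
        cached
        |> Seq.map (fun (geneId, length, count) ->
            // Depth normalization (RPM) first, then division by gene length in kilobases.
            (geneId, (count / perMillion) / (length / 1000.0)))

    /// <summary>Computes TPM values for the genes of one sample.</summary>
    /// <param name="genes">Tuples of (gene id, gene length in base pairs, raw read count).</param>
    /// <returns>Tuples of (gene id, TPM value) in input order.</returns>
    let tpms (genes: seq<string * float * float>) : seq<string * float> =
        // Length normalization (RPK) first.
        let rpk =
            genes
            |> Seq.map (fun (geneId, length, count) -> (geneId, count / (length / 1000.0)))
            |> Seq.cache
        // Scaling factor: sum of all RPK values, per million.
        let perMillion = (rpk |> Seq.sumBy snd) / 1e6
        rpk |> Seq.map (fun (geneId, value) -> (geneId, value / perMillion))

// Made-up example data: (gene id, length in bp, read count).
let sample = [ ("geneA", 1000.0, 10.0); ("geneB", 2000.0, 30.0) ]
RNASeqSketch.tpms sample |> Seq.iter (fun (geneId, tpm) -> printfn "%s %f" geneId tpm)
```

With `///` comments in this form, the summary and parameter descriptions show up in editor tooltips, which is the context the comment asks for.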

testCase "RPKM" (fun _ ->
Expect.equal
(RNASeq.rpkms testInSeq
|> Array.ofSeq)
Member

You can use `Expect.sequenceEqual` instead of converting to arrays here.

Member

@kMutagene kMutagene left a comment

Looks good so far. The documentation page could use some more information, see the individual comments. Here are some relevant sources:

@ScheidTo
Contributor Author

ScheidTo commented May 6, 2024

@kMutagene

Member

@kMutagene kMutagene left a comment

Nice!

One small nitpick and we can merge this:

The y axis titles on the plot should differ from each other, as the axes do not all show 'Read Counts'. For that, you can set the axis title on the individual charts before creating the grid. Also, a little more space to improve readability and keep the axes from overlapping would be nice (see `Chart.withSize`).

@kMutagene
Member

🥳

@kMutagene kMutagene changed the title from "RNA Seq" to "Add RNA Seq normalization methods" May 13, 2024
@kMutagene kMutagene merged commit 32d20c3 into CSBiology:developer May 14, 2024
4 checks passed