Question about squab normalize command and outputs – potential bug #4

Open
abearab opened this issue Feb 3, 2025 · 3 comments

abearab commented Feb 3, 2025

Hi @zaeleus, I'm having trouble with the outputs of the squab normalize command. First, redirecting the output with > also writes the warning message to the output file. I think an -o / --output argument would help here.

Image

More importantly, some marker genes in my experiment change direction across treatment conditions depending on whether I use raw counts or TPM-normalized counts. This makes me worried that the result tables are somehow corrupted. Any thoughts?

Image

FYI, I'm loading my squab outputs using this Python code I wrote:
https://github.com/abearab/RNAMultiOmics/blob/main/src/multiomics/expression/__init__.py

abearab (Author) commented Feb 3, 2025

Here is a scatter plot of raw counts vs. TPM for a single sample from my data (I expected a linear correlation rather than this, right?):

Image

zaeleus (Owner) commented Feb 4, 2025

Thank you for reporting. I rechecked the formula and output and can confirm that the normalized values are correct.

For the TPM calculation, we use

$$\text{TPM}_{i} = 10^{6} \frac{\frac{q_{i}}{l_{i}}}{\sum_{j} \frac{q_{j}}{l_{j}}}$$

where $q_{i}$ is the raw count of feature $i$ and $l_{i}$ is its length. [1]

Raw counts are not linearly correlated with their TPM values because counts are dependent on their feature lengths.

Take, for example, three features, where their raw counts are 10, 1, and 100; and the feature lengths are, respectively, 1, 1000, and 100. Just looking at the proportions, 10/1 is more highly expressed than 1/1000 and 100/100.

  • TPM_0 = 10^6 * (10 / 1) / (10 / 1 + 1 / 1000 + 100 / 100) ≈ 909008
  • TPM_1 = 10^6 * (1 / 1000) / (10 / 1 + 1 / 1000 + 100 / 100) ≈ 90.9008
  • TPM_2 = 10^6 * (100 / 100) / (10 / 1 + 1 / 1000 + 100 / 100) ≈ 90900.8

We see that feature 0 (count 10) indeed has a higher TPM value than feature 2 (count 100). So this shows that raw counts don't necessarily correlate with their TPM values.
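The arithmetic above can be reproduced with a few lines of Python. This is only a minimal sketch of the TPM formula as stated in this comment, not squab's actual implementation; the function name `tpm` is made up for illustration.

```python
def tpm(counts, lengths):
    """Compute TPM from raw counts and feature lengths.

    Illustrative helper only (hypothetical name, not part of squab).
    """
    # Length-normalize each count: q_i / l_i
    rates = [q / l for q, l in zip(counts, lengths)]
    # Denominator: sum of all length-normalized counts
    total = sum(rates)
    # Scale so that the values sum to one million
    return [1e6 * r / total for r in rates]


# The three features from the example above
counts = [10, 1, 100]
lengths = [1, 1000, 100]

values = tpm(counts, lengths)
for i, v in enumerate(values):
    print(f"TPM_{i} = {v:.6g}")
```

Feature 0 ends up with the largest TPM even though feature 2 has ten times its raw count, and the three values sum to 10^6.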

The advantage of dividing by the sum of the length-normalized counts is that, in principle, it allows cross-sample comparisons, unlike, e.g., FPKM.
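Because TPM values in each sample sum to 10^6, they are relative to everything else measured in that sample. The sketch below uses made-up numbers (two hypothetical features, two hypothetical samples, none of it from the data in this issue) to show one way a feature's raw count can rise between samples while its TPM falls. The helper `tpm` is an illustrative name, not a squab function.

```python
def tpm(counts, lengths):
    """Compute TPM from raw counts and feature lengths (illustrative helper)."""
    rates = [q / l for q, l in zip(counts, lengths)]
    total = sum(rates)
    return [1e6 * r / total for r in rates]


# Hypothetical feature lengths, shared by both samples
lengths = [1000, 2000]

# Hypothetical raw counts: feature 0 goes from 100 to 150,
# but feature 1 grows much more (100 to 900)
sample_a = [100, 100]
sample_b = [150, 900]

tpm_a = tpm(sample_a, lengths)
tpm_b = tpm(sample_b, lengths)

print("feature 0 raw:", sample_a[0], "->", sample_b[0])
print("feature 0 TPM:", round(tpm_a[0]), "->", round(tpm_b[0]))
```

Here feature 0's raw count increases, yet its share of the length-normalized total shrinks, so its TPM decreases. Whether this explains any particular dataset depends on that data.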

Footnotes

  1. Wagner, G.P., Kin, K. & Lynch, V.J. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 131, 281–285 (2012). https://doi.org/10.1007/s12064-012-0162-3

abearab (Author) commented Feb 4, 2025

@zaeleus Thanks for your response.

In my initial example, I'm looking at the MDM2 gene, so I'm extracting counts for the same feature across multiple samples (i.e., treatments, replicates, and time points). However, the pattern of expression changes differs before and after normalization. What do you think? See this:

Image

It's possible that I'm messing something up while loading normalized counts. Let me double check my code and troubleshoot – I'll get back to you shortly.
