Question about squab normalize command and outputs – potential bug #4

abearab · 2025-02-03T20:41:54Z

Hi @zaeleus – I'm having trouble with the outputs of the squab normalize command. First, redirecting the output with `>` also writes the warning message to the output file. I think an -o / --output argument would help here.

More importantly, some marker genes in my experiment change direction across treatment conditions when I compare raw counts with TPM-normalized counts. This makes me worry that the result tables are somehow corrupted. Any thoughts?

FYI, I'm loading my squab outputs using this Python code I wrote:
https://github.com/abearab/RNAMultiOmics/blob/main/src/multiomics/expression/__init__.py


abearab · 2025-02-03T20:49:56Z

Here is a scatter plot of raw counts vs. TPM for a single sample in my hands (I expected a linear correlation rather than this, right?):

zaeleus · 2025-02-04T20:04:05Z

Thank you for reporting. I rechecked the formula and output and can confirm that the normalized values are correct.

For the TPM calculation, we use

$$\text{TPM}_{i} = 10^{6} \frac{\frac{q_{i}}{l_{i}}}{\sum_{j} \frac{q_{j}}{l_{j}}}$$

where $q$ are the raw counts and $l$ are the feature lengths.¹

Raw counts are not linearly correlated with their TPM values because counts are dependent on their feature lengths.

Take, for example, three features, where their raw counts are 10, 1, and 100; and the feature lengths are, respectively, 1, 1000, and 100. Just looking at the proportions, 10/1 is more highly expressed than 1/1000 and 100/100.

TPM₀ = 10⁶ * (10 / 1) / (10 / 1 + 1 / 1000 + 100 / 100) = 909008
TPM₁ = 10⁶ * (1 / 1000) / (10 / 1 + 1 / 1000 + 100 / 100) = 90.9008
TPM₂ = 10⁶ * (100 / 100) / (10 / 1 + 1 / 1000 + 100 / 100) = 90900.8

We see that feature 0 (count 10) indeed has a higher TPM value than feature 2 (count 100). So this shows that raw counts don't necessarily correlate with their TPM values.
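The same arithmetic as a minimal Python sketch. The function name `tpm` is only illustrative and is not squab's API or implementation; it just applies the formula above to the three example features.

```python
def tpm(counts, lengths):
    """Return TPM values for raw counts and feature lengths (same order)."""
    # Length-normalize each count: q_i / l_i.
    rates = [q / l for q, l in zip(counts, lengths)]
    # Sum of the length-normalized counts over all features.
    total = sum(rates)
    # Scale so that the values of one sample sum to one million.
    return [1e6 * r / total for r in rates]


counts = [10, 1, 100]      # raw counts from the example
lengths = [1, 1000, 100]   # feature lengths from the example

values = tpm(counts, lengths)
for i, v in enumerate(values):
    print(f"TPM_{i} = {v:.6g}")
```

Feature 0 ends up with the largest value even though feature 2 has ten times its raw count.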

The advantage of dividing by the sum of the length-normalized counts is that, in principle, you can make cross-sample comparisons, unlike with, e.g., FPKM.
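To illustrate how a feature can move in opposite directions in raw counts and in TPM between samples, here is a small sketch with made-up numbers. The counts, lengths, and the `tpm` helper are hypothetical and are not taken from squab or from the data in this issue. Because TPM is relative to everything else in the sample, a raw count can go up while its TPM goes down if other features grow faster.

```python
def tpm(counts, lengths):
    """Apply the TPM formula to one sample (illustrative helper only)."""
    rates = [q / l for q, l in zip(counts, lengths)]
    total = sum(rates)
    return [1e6 * r / total for r in rates]


# Hypothetical feature lengths, shared by both samples.
lengths = [2000, 1000, 500]

# Hypothetical raw counts. Sample B is sequenced more deeply and has a
# different composition: features 1 and 2 increase far more than feature 0.
sample_a = [100, 500, 50]
sample_b = [150, 2000, 400]

tpm_a = tpm(sample_a, lengths)
tpm_b = tpm(sample_b, lengths)

# Feature 0: the raw count rises from A to B, but its TPM falls.
print("raw:", sample_a[0], "->", sample_b[0])
print("TPM:", round(tpm_a[0]), "->", round(tpm_b[0]))
```

So a change of direction between raw counts and TPM is not by itself a sign of corrupted output; raw counts also reflect sequencing depth, which TPM removes.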

¹ Wagner, G.P., Kin, K. & Lynch, V.J. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 131, 281–285 (2012). https://doi.org/10.1007/s12064-012-0162-3

abearab · 2025-02-04T20:25:03Z

@zaeleus Thanks for your response.

In my initial example, I'm looking at the MDM2 gene, so I'm extracting counts for the same feature across multiple samples (i.e. treatments, replicates, and time points). However, the pattern of expression changes differs before and after normalization. What do you think? See this:

It's possible that I'm messing something up while loading normalized counts. Let me double check my code and troubleshoot – I'll get back to you shortly.

zaeleus mentioned this issue Feb 3, 2025

Write log messages to stderr #5

Closed

abearab mentioned this issue Feb 4, 2025

Transcript assembly and quantification abearab/RNAMultiOmics#4

Open

zaeleus self-assigned this Feb 4, 2025
