Question about squab normalize command and outputs – potential bug #4
Comments
Thank you for reporting. I rechecked the formula and output and can confirm that the normalized values are correct. For the TPM calculation, we use

$$\mathrm{TPM}_i = \frac{q_i / l_i}{\sum_j q_j / l_j} \cdot 10^6$$

where $q$ are the raw counts and $l$ the feature lengths. Raw counts are not linearly correlated with their TPM values because counts depend on their feature lengths. Take, for example, three features whose raw counts are 10, 1, and 100 and whose feature lengths are, respectively, 1, 1000, and 100. Just looking at the count-to-length ratios, 10/1 indicates higher expression than 1/1000 and 100/100.
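The three-feature example can be checked numerically. The following is a minimal sketch, assuming the standard TPM definition (each count divided by its feature length, then scaled so the values sum to one million). The function name `tpm` is hypothetical and is not part of squab.

```python
def tpm(counts, lengths):
    """Return TPM values for one sample (illustrative helper, not squab's code)."""
    # Length-normalize each raw count.
    rates = [q / l for q, l in zip(counts, lengths)]
    # Scale by the sum of the length-normalized counts, then by one million.
    total = sum(rates)
    return [rate / total * 1e6 for rate in rates]


counts = [10, 1, 100]      # raw counts from the example
lengths = [1, 1000, 100]   # corresponding feature lengths

values = tpm(counts, lengths)
for i, (q, l, t) in enumerate(zip(counts, lengths, values)):
    print(f"feature {i}: count={q} length={l} TPM={t:.2f}")
```

Feature 0 comes out at roughly 909,008 TPM, feature 2 at roughly 90,901, and feature 1 at roughly 91, so the ordering follows the count-to-length ratios rather than the raw counts.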
We see that feature 0 (count 10) indeed has a higher TPM value than feature 2 (count 100). This shows that raw counts don't necessarily correlate with their TPM values. The advantage of normalizing by the sum of the length-normalized counts is that, in principle, you can make cross-sample comparisons, unlike with, e.g., FPKM.
@zaeleus Thanks for your response. In my initial example, I'm looking at the MDM2 gene, so I'm extracting the counts for the same feature across multiple samples (i.e., treatments, replicates, and time points). However, I'm seeing that the pattern of expression changes differs before and after normalization. What do you think? See this: [image] It's possible that I'm messing something up while loading the normalized counts. Let me double-check my code and troubleshoot; I'll get back to you shortly.
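One possible explanation, offered as a sketch rather than a diagnosis of the data in this issue: TPM is a within-sample proportion, so the same feature can have a higher raw count but a lower TPM in a second sample if the rest of that sample grew more. The two samples below are made-up illustrative numbers, and `tpm` is a hypothetical helper implementing the standard TPM definition, not squab's code.

```python
def tpm(counts, lengths):
    """Return TPM values for one sample (illustrative helper, not squab's code)."""
    rates = [q / l for q, l in zip(counts, lengths)]
    total = sum(rates)
    return [rate / total * 1e6 for rate in rates]


# Two features of equal length; feature 0 is the gene of interest.
lengths = [1000, 1000]

# Made-up raw counts for two samples. Sample B was sequenced more deeply,
# and most of the extra reads fall on feature 1.
sample_a = [100, 900]
sample_b = [150, 2850]

tpm_a = tpm(sample_a, lengths)
tpm_b = tpm(sample_b, lengths)

raw_change = sample_b[0] - sample_a[0]  # positive: raw count goes up
tpm_change = tpm_b[0] - tpm_a[0]        # negative: TPM goes down

print(f"raw count: {sample_a[0]} -> {sample_b[0]} (change {raw_change:+d})")
print(f"TPM:       {tpm_a[0]:.0f} -> {tpm_b[0]:.0f} (change {tpm_change:+.0f})")
```

Under these assumptions the raw count of feature 0 rises while its TPM falls, so a change of direction between raw and normalized counts does not by itself indicate a corrupted table.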
Hi @zaeleus, I'm having trouble with the outputs of the `squab normalize` command. First of all, the `>` approach also writes the warning message to the output file. I think something like an `-o`/`--output` argument would help to improve this.

More importantly, I'm seeing some marker genes in my experiment change their direction of change in different treatment conditions when comparing raw counts vs. TPM-normalized counts. It's making me worried that the result tables are somehow corrupted. Any thoughts?

FYI, I'm loading my `squab` outputs using these Python commands I wrote: https://github.com/abearab/RNAMultiOmics/blob/main/src/multiomics/expression/__init__.py