miseq report #92

Open
necrolyte2 opened this issue Mar 2, 2016 · 7 comments

@necrolyte2
Member

This issue is a bit retroactive at this point.
It started as a small side project, but it seems to have turned out quite nicely.

So essentially we need an easy way to visualize an entire MiSeq run.
That is, it should make it easy to spot samples that have issues such as:

  • Too many reads assigned
  • Too few reads assigned
  • Low quality
  • Forward/reverse read counts that don't match well

In addition, fastqc should be run on all samples for other info.

So my initial idea was to grab the following data for each sample:

  • Samplename
  • Total Reads (F+R)
  • F Reads
  • Avg F Qual
  • Avg F Length
  • F Bases
  • R Reads
  • Avg R Qual
  • Avg R Length
  • R Bases
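A minimal sketch of how these per-sample stats could be collected, not the code actually used here. It assumes plain 4-line FASTQ records with Phred+33 qualities, and it takes "Avg Qual" to mean the average over all bases in the file, which is only one possible definition. The names `fastq_stats` and `sample_row` are made up for illustration.

```python
import io


def fastq_stats(handle, phred_offset=33):
    """Collect read count, base count, average length and average base quality.

    Assumes 4-line FASTQ records and Phred+33 encoded qualities.
    """
    reads = bases = qual_sum = 0
    while True:
        header = handle.readline()
        if not header.strip():
            break  # end of input (or a blank line)
        seq = handle.readline().rstrip("\r\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\r\n")
        reads += 1
        bases += len(seq)
        qual_sum += sum(ord(ch) - phred_offset for ch in qual)
    return {
        "reads": reads,
        "bases": bases,
        "avg_length": bases / reads if reads else 0.0,
        "avg_qual": qual_sum / bases if bases else 0.0,
    }


def sample_row(name, fwd, rev):
    """Combine forward and reverse stats into one row of the report."""
    return {
        "Samplename": name,
        "Total Reads": fwd["reads"] + rev["reads"],
        "F Reads": fwd["reads"],
        "Avg F Qual": fwd["avg_qual"],
        "Avg F Length": fwd["avg_length"],
        "F Bases": fwd["bases"],
        "R Reads": rev["reads"],
        "Avg R Qual": rev["avg_qual"],
        "Avg R Length": rev["avg_length"],
        "R Bases": rev["bases"],
    }


# Tiny in-memory example: two reads, 'I' is Q40 and '!' is Q0 in Phred+33.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nAC\n+\n!!\n")
stats = fastq_stats(demo)
row = sample_row("s1", stats, stats)
```

For real MiSeq output the handle would typically come from `gzip.open(path, "rt")`, since the FASTQ files are usually gzipped.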

Once those stats were generated, I looked at them in Excel and noticed it would be really nice to color the cells in the matrix that fell one or more STDEV away from the column mean.

So I colored them based on 6 criteria:

  • +1, +2, +3 and -1, -2, -3 STDEV from the mean in each column
  • Each additional STDEV gets a slightly bolder color in the gradient (green for above, red for below)
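The bucketing could look roughly like the sketch below. It reads "+1, +2, +3" as "at least 1, 2 or 3 whole STDEV from the mean", capped at 3, which is an assumption about the intended thresholds. The hex colors and the function names are placeholders, not taken from the prototype.

```python
# Placeholder gradients: index 0 is the lightest shade, index 2 the boldest.
GREENS = ["#d9f2d9", "#8cd98c", "#40bf40"]
REDS = ["#f9d6d5", "#f08080", "#e03c31"]


def stdev_bucket(value, mu, sigma):
    """Return an int in -3..3: whole STDEVs between value and mean, capped at 3."""
    if sigma == 0:
        return 0  # no spread in the column, nothing to highlight
    z = (value - mu) / sigma
    steps = min(int(abs(z)), 3)
    return steps if z > 0 else -steps


def cell_color(bucket):
    """Map a bucket to a background color, or None for an uncolored cell."""
    if bucket > 0:
        return GREENS[bucket - 1]
    if bucket < 0:
        return REDS[-bucket - 1]
    return None
```

With a column mean of 10 and STDEV of 5, a value of 4 lands in bucket -1 (lightest red) and a value of 25 lands in bucket 3 (boldest green).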

The end result will be:

  • A single CSV file with the base stats listed above
  • A single HTML file that contains the colored matrix, as the prototype Excel file had
    • The HTML file would contain links to the fastqc output for the R1 and R2 reads
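A rough sketch of producing the two outputs with only the standard library. The function names, the shape of the `colors` and `fastqc_links` arguments, and the fastqc file names in the example are all assumptions made for illustration.

```python
import csv
import html
import io


def write_csv(rows, columns, handle):
    """Write one CSV row per sample, with a header line."""
    writer = csv.DictWriter(handle, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)


def render_html(rows, columns, colors, fastqc_links):
    """Render the stats matrix as an HTML table.

    colors: {(samplename, column): css color} for cells to highlight.
    fastqc_links: {samplename: (r1_url, r2_url)}.
    """
    header = "".join(f"<th>{html.escape(col)}</th>" for col in columns)
    out = ["<table>", f"<tr>{header}<th>FastQC</th></tr>"]
    for row in rows:
        name = row["Samplename"]
        cells = []
        for col in columns:
            color = colors.get((name, col))
            style = f' style="background-color:{color}"' if color else ""
            cells.append(f"<td{style}>{html.escape(str(row[col]))}</td>")
        r1, r2 = fastqc_links[name]
        cells.append(
            f'<td><a href="{html.escape(r1)}">R1</a> '
            f'<a href="{html.escape(r2)}">R2</a></td>'
        )
        out.append("<tr>" + "".join(cells) + "</tr>")
    out.append("</table>")
    return "\n".join(out)


# Example with one sample and one highlighted cell (all values are made up).
columns = ["Samplename", "Total Reads"]
rows = [{"Samplename": "s1", "Total Reads": 200}]
csv_out = io.StringIO()
write_csv(rows, columns, csv_out)
page = render_html(
    rows,
    columns,
    colors={("s1", "Total Reads"): "#40bf40"},
    fastqc_links={"s1": ("s1_R1_fastqc.html", "s1_R2_fastqc.html")},
)
```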
@necrolyte2 necrolyte2 self-assigned this Mar 2, 2016
@necrolyte2
Member Author

Improvements:

  • Use numpy/pandas more for faster computation
  • Use jquery/d3 to make the HTML report look even better and be interactive
  • Show mean/stddev at the top of each column for reference
  • The legend for the colors is missing
  • The case where all samples have 0 reads is not detected
  • Logging level is set to debug, which prints all debug output from the sh module
  • Very similar data yields a small stddev, so values get highlighted that probably should not be
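One possible way to handle the last two points, sketched under assumptions: put a floor on the stddev relative to the mean so that tightly clustered columns do not light up, and check for the all-zero case up front. The `min_cv` parameter and its default of 5% are invented for this example and would need tuning against real runs.

```python
from statistics import mean, stdev


def column_summary(values, min_cv=0.05):
    """Return (mean, stddev) with the stddev floored at min_cv * |mean|.

    min_cv is a hypothetical tuning knob: with 0.05, values within 5% of the
    mean can never be a full stddev away, so they are never highlighted.
    """
    mu = mean(values)
    sigma = stdev(values) if len(values) > 1 else 0.0
    return mu, max(sigma, abs(mu) * min_cv)


def all_zero_reads(read_counts):
    """True when there is at least one sample and every sample has 0 reads."""
    return len(read_counts) > 0 and all(count == 0 for count in read_counts)


# Tight column: the raw stddev is 1.0, the floor raises it to 5.0.
tight_mu, tight_sigma = column_summary([100, 101, 99])
# Spread-out column: the raw stddev of 40.0 is already above the floor.
wide_mu, wide_sigma = column_summary([10, 50, 90])
```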

@necrolyte2 necrolyte2 modified the milestone: MiSeq Report Mar 2, 2016
@averagehat
Contributor

If you do refactor, you might want to look into this: http://pandas.pydata.org/pandas-docs/version/0.17.1/whatsnew.html#conditional-html-formatting

@necrolyte2
Member Author

What do you think about not including undetermined reads when calculating stats?
I feel like they skew the mean/stddev.

@averagehat
Contributor

"[undetermined reads are] Reads that the miseq index did not match to anything. Essentially each sample is defined by 2 adapter indexes. If a read doesn't match any then it goes to undetermined"
Yes, I would just drop those.

@necrolyte2
Member Author

Just to be clear, I think they are good to have in the report, but they should not be part of the calculation that determines the mean/stddev.

Then we can color them the same way as the rest of the reads. The reasoning is that this way people can see if Undetermined ended up with an abnormal amount of reads (like 99% of reads, or something weird showing that the run failed).
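A small sketch of that idea: compute the mean/stddev from the real samples only, then score every row, including Undetermined, against that baseline. It assumes the undetermined reads show up as a row named "Undetermined"; the function names and the read counts are made up.

```python
from statistics import mean, stdev


def run_baseline(read_counts, exclude=("Undetermined",)):
    """Mean and stddev of read counts, ignoring the excluded sample names."""
    kept = [count for name, count in read_counts.items() if name not in exclude]
    return mean(kept), stdev(kept)


def z_scores(read_counts, exclude=("Undetermined",)):
    """Score every sample, excluded ones too, against the baseline."""
    mu, sigma = run_baseline(read_counts, exclude)
    return {
        name: (count - mu) / sigma if sigma else 0.0
        for name, count in read_counts.items()
    }


# Made-up run where most reads failed to demultiplex.
counts = {"s1": 900, "s2": 1000, "s3": 1100, "Undetermined": 50000}
mu, sigma = run_baseline(counts)
scores = z_scores(counts)
```

Here the baseline stays at a mean of 1000 and a stddev of 100, so Undetermined scores far above +3 and would get the boldest color, while the real samples are not distorted by it.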

@averagehat
Contributor

Any thoughts on what kind of interactivity you would want?

@necrolyte2
Member Author

I think sorting on columns is probably about it. Let's just leave it non-interactive at first, and users can request more later.

2 participants