vw experiments #13
So with a fairly simple quadratic model we have error rates under 2% on the open dataset with a window size of 16. Initial ngram experiments failed to learn, but the parameter space is reaching its limit, so it is worth rerunning those experiments with full data.
freebayes@ip-10-0-0-6:~$ time vw -d ~/NA12878_V2.5_Robot_1.open_w16.hhga.gz --binary --passes 20 -q ha -c -b 26 -f ~/nongram_robot1.model --compressed
on testing: loss of 0.013290
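For anyone reading along who is not used to the flags: `-q ha` asks vw to cross every feature in a namespace starting with `h` with every feature in a namespace starting with `a`, and `-b 26` sets the weight table to 2^26 hashed slots. Below is a rough Python sketch of that idea. The feature names are made up, and `zlib.crc32` is only a stand-in for whatever hash vw actually uses, so the slot numbers will not match a real run.

```python
import zlib
from itertools import product


def quadratic_features(ns_h, ns_a):
    """Cross every feature in namespace h with every feature in namespace a."""
    return [f"{h}^{a}" for h, a in product(ns_h, ns_a)]


def hashed_index(feature, bits):
    """Map a feature name to one of 2**bits weight slots.

    crc32 is a stand-in hash for illustration, not vw's own hash function.
    """
    return zlib.crc32(feature.encode("utf-8")) & ((1 << bits) - 1)


# Hypothetical feature names, not taken from a real hhga file.
haplotype = ["h1_A", "h1_C", "h2_G"]
allele = ["ref_A", "alt_T"]

pairs = quadratic_features(haplotype, allele)
table_size = 1 << 26  # number of weight slots with -b 26
slots = [hashed_index(p, 26) for p in pairs]

print(len(pairs), "interaction features hashed into", table_size, "slots")
```

The number of crossed features grows as the product of the two namespace sizes, which is why the interactions are expensive and why the table size matters once the parameter space fills up.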
freebayes@ip-10-0-0-6:~$ time vw -d ~/NA12878_V2.5_Robot_1.open_w16.hhga.gz --binary --passes 20 -q ha --ngram a5 -c -b 26 -f ~/ngram5_robot1.model --compressed
does well,
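As a rough picture of what `--ngram a5` adds, here is a small Python sketch that builds n-grams over the ordered features of one namespace. It assumes that grams of every length from 1 up to n are emitted, which is my understanding of vw's behaviour but should be checked against the vw docs. The token list is a made-up example, not real hhga content.

```python
def ngrams_up_to(tokens, n):
    """Return all contiguous n-grams of length 1..n, joined with '^'."""
    grams = []
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            grams.append("^".join(tokens[start:start + size]))
    return grams


# Hypothetical ordered features for namespace 'a'.
tokens = ["A", "C", "G", "T", "A", "C"]
grams = ngrams_up_to(tokens, 5)

print(len(grams), "features generated from", len(tokens), "tokens")
```

Six tokens already turn into 20 features here, so the ngram models put more pressure on the hashed weight table than the plain models do.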
Once we understood the theory, two-pass quadratic generalizes. [[edit: this was wrong, info was leaking from the future]]
freebayes@ip-10-0-0-6:~$ time vw -d ~/NA12878_V2.5_Robot_1.open_w16.hhga.gz --binary --passes 2 -c -q ah -b 18 -f ~/twopass_robot1.model --compressed
[[edit: this was wrong, info was leaking from the future]]
finished run
Trained models on the full robot1 set from Garvan. Tested against the second Garvan set (robot2).
And for xprize
The main blocking point right now is not knowing how close our current robot1-3 training data is to the FDA challenge test set. Being able to produce an hhga file for the FDA challenge test set is crucial at this point, as it guides which direction to push the models in. The quadratic interactions are crucial for inter-robot precision but make no difference (and cost us hugely in training time) for the xprize data, so getting a sense of their importance on the challenge is next.
Does your truth set include allele frequency found by orthogonal validation? It would be interesting to see which models perform best at estimating true allele frequencies, especially in regions that are not diploid or with many repeats.
We are not estimating allele frequencies. We estimating
On Wed, Apr 20, 2016, 01:50 Nicolás Della Penna notifications@github.com
We found a problem with the previous examples. I was failing to swap out the alignments for each sample, which resulted in overfitting. These are the first results after having cleaned up the issue. We're now training only using alignments made using bwa and a simple duplicate marking post-processing step on precisionFDA. We train on Garvan robots 1, 3, and 4. (2 is huge, and still in the alignment process after 48 hours on 16 cores.) Then we test on Garvan vial 1. These are the results:
I'll update when the models with quadratic features between the haplotypes and alleles are up. They don't appear to outperform the ngram models here.
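For comparing models on a held-out set like this, the number we care about is the plain 0/1 error rate, which as far as I understand is also what vw reports as loss when `--binary` is set. A minimal Python sketch of that calculation follows. The labels and predictions are toy values, not results from any of the runs above.

```python
def binary_error_rate(labels, predictions):
    """Fraction of examples where the predicted class differs from the label.

    Both inputs are sequences of -1 / 1 class labels of equal length.
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    wrong = sum(1 for y, p in zip(labels, predictions) if y != p)
    return wrong / len(labels)


# Toy values for illustration only.
labels = [1, -1, 1, 1, -1, -1, 1, -1]
predictions = [1, -1, -1, 1, -1, -1, 1, -1]

rate = binary_error_rate(labels, predictions)
print(f"error rate: {rate:.3f}")
```

Computing this on a sample that was never seen in training (different robot or vial, with its own alignments) is what keeps the comparison between the ngram and quadratic models honest.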
A log of things we try.
Robot_1 vs Robot_2 comparisons using different models.
Trying: