No data in VariantEval module #1158

Closed
olavurmortensen opened this issue Apr 16, 2020 · 11 comments
Labels
bug: core Bug in the main MultiQC code

Comments

@olavurmortensen

olavurmortensen commented Apr 16, 2020

Description of bug:

I'm using the GATK module to display data from VariantEval. The "Variant counts" section just says "Error - was not able to plot data." and the "Compare overlap" section displays an empty table.

MultiQC Error log:

[INFO   ]         multiqc : This is MultiQC v1.8
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching   : /home/olavur/Documents/sync/workstuff/ilegusavnid/linkseq_stuff/single_sample_reports/multiqc_varianteval
[INFO   ]     varianteval : Found 1 VariantEval reports
[WARNING]        bargraph : Tried to make bar plot, but had no data
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : multiqc_report.html
[INFO   ]         multiqc : Data        : multiqc_data
[INFO   ]         multiqc : MultiQC complete

File that triggers the error:

variant_eval.table.txt

MultiQC run details (please complete the following):

  • multiqc .
  • MultiQC v1.8
  • Ubuntu 18.04
  • Python 3.6.10
  • conda create -n multiqc_varianteval_test -c bioconda -c conda-forge multiqc=1.8

Additional context

I've noticed that there isn't any data in the varianteval part of the multiqc_data.json:

"multiqc_gatk_varianteval": {
    "variant_eval.table": {
        "titv_reference": "unknown"
    }
}
ewels added the "bug: core" label (Bug in the main MultiQC code) on Apr 16, 2020
@olavurmortensen
Author

@ewels it looks like there are plenty of issues to work on at the moment. Do you think you'll have time to work on this one anytime soon?

@ewels
Member

ewels commented May 20, 2020

I will prioritise everything labelled as bug before release. But I really need to get this next release out so if any are too big a job then I might push them back. Any help much appreciated!

Phil

@olavurmortensen
Author

@ewels I would be happy to (try to) help. Any idea where I can start? I don't exactly have an overview of the code-base 😛

@olavurmortensen
Author

I guess I can poke around in the varianteval.py module script. Most likely it's not parsing something correctly.

@olavurmortensen
Author

Well, I think I've discovered the parsing error. VariantEval produces a table (or several tables in a single file). Some of the column names, and strata in these columns, are hard coded into the varianteval.py script, and these do not match the ones in my VariantEval table.

See here: https://github.com/ewels/MultiQC/blob/master/multiqc/modules/gatk/varianteval.py#L108

This code is inside a try-except block, and a KeyError exception is caught (line 115). This error should probably have been handled differently, raising at least a warning.

I think this could probably be generalized to any keys in the table. The tables basically consist of three types of columns: the first column just says what kind of data it is (e.g. CompOverlap), the next three columns are strata (CompFeatureInput, EvalFeatureInput, Filter), and the rest are values to be displayed. Although I can't be 100% sure the strata columns are always the same. And it might not be useful to plot everything in the data section.
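The column layout described above can be sketched roughly as follows. This is only an illustration, not code from MultiQC: the function name `split_row` and the `n_strata` parameter are hypothetical, the example row is invented, and it assumes whitespace-separated columns with exactly three strata columns, which (as noted) may not always hold.

```python
def split_row(header, row, n_strata=3):
    """Split one table row into (table name, strata, values).

    Assumes: column 0 names the table, the next `n_strata` columns are
    strata, and every remaining column is a value to display.
    """
    cols = header.split()
    cells = row.split()
    if len(cols) != len(cells):
        raise ValueError("header and row have different numbers of columns")
    first = 1 + n_strata  # index of the first value column
    table = cells[0]
    strata = dict(zip(cols[1:first], cells[1:first]))
    values = dict(zip(cols[first:], cells[first:]))
    return table, strata, values


# Invented example row; real reports contain more columns and real counts.
header = "CompOverlap CompFeatureInput EvalFeatureInput Filter nEvalVariants novelSites"
row = "CompOverlap dbsnp eval called 1000 50"
table, strata, values = split_row(header, row)
print(table)
print(strata)
print(values)
```

Values are left as strings here; converting them to numbers would depend on the column.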

@ewels What do you think we should do about this?

@ewels
Member

ewels commented May 23, 2020

> This error should probably have been handled differently, raising at least a warning.

This isn't really an error as such - it's just the mechanism that the code uses to parse this block of the logs. It keeps trying to parse columns until there are not enough columns (because the table has ended). Then it throws a KeyError which indicates the end of that particular section of the log file. It just happens that in your case you are missing a bunch of expected columns, so it thinks that the table ends immediately.
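To illustrate the mechanism, here is a simplified sketch of a parser that uses KeyError as its end-of-table signal. It is not the actual `varianteval.py` code: the function `parse_rows`, the `required` tuple and the sample lines are all made up for this example.

```python
def parse_rows(lines, required=("CompRod", "nEvalVariants")):
    """Parse rows until one lacks a required column, then stop.

    A blank line after the table yields an empty row, so looking up a
    required key raises KeyError, which is treated as "table finished".
    """
    header = lines[0].split()
    parsed = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        try:
            parsed.append({key: row[key] for key in required})
        except KeyError:
            break  # interpreted as the end of the table
    return parsed


# Invented input: one table with the older column name, one with the newer.
old_style = ["CompRod nEvalVariants", "dbsnp 1000", ""]
new_style = ["CompFeatureInput nEvalVariants", "dbsnp 1000", ""]

print(parse_rows(old_style))  # one parsed row, then the blank line ends the table
print(parse_rows(new_style))  # empty: the very first row already lacks "CompRod"
```

With the newer header the first lookup fails, so the parser concludes that the table ended immediately and returns nothing, without any warning.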

Do you know why your log file looks so different? I see that the GATK version at the top is v1.1:4 versus v1.1:9 in the example data. I'm a bit hesitant to make the module "just work" when the fields look so different, as I'm not sure if the data is exactly equivalent?

One thing that we could / should probably do is skip the report sections if there is no data to find, and possibly raise a warning if we found a VariantEval file but were not able to parse any data from it.

Phil

ewels added a commit to MultiQC/test-data that referenced this issue May 23, 2020
@olavurmortensen
Author

> I see that the GATK version at the top is v1.1:4 versus v1.1:9 in the example data

The version is v1.1, the number after the colon changes depending on what the input parameters to VariantEval are.

> Do you know why your log file looks so different?

Primarily it's because I was not using the standard stratification or evaluation modules, so there were different tables in my file, and the tables had different fields. Using all the standard stratification and evaluation modules rectifies most of the differences. I think it's fair to not try to generalize this, simply because that would be too much work.

The one difference that remains is that the key called CompFeatureInput in my file is called CompRod in the MultiQC module. The CompRod key is deprecated; CompFeatureInput is used in newer versions (I'm guessing from GATK4).

So, if we can change line 108 and line 146, then I'm happy. Although this would break the module for GATK3 users (if they still exist).

Alternatively, instead of hard-coding everything, these keys could be entered in a MultiQC config, right?

@ewels
Member

ewels commented May 28, 2020

Thanks for the clarification @olavurmortensen!

I've updated the code so that it looks for CompFeatureInput but falls back to CompRod if it's not found. Hopefully that will work for everyone.
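A minimal sketch of that kind of fallback, assuming each parsed row is a dict keyed by column name. The helper name `comp_column` is hypothetical and this is not the code from the commit:

```python
def comp_column(row):
    """Return the comparison-track value, preferring the newer column name."""
    for key in ("CompFeatureInput", "CompRod"):
        if key in row:
            return row[key]
    raise KeyError("neither CompFeatureInput nor CompRod found")


print(comp_column({"CompFeatureInput": "dbsnp"}))  # newer column name
print(comp_column({"CompRod": "dbsnp"}))           # older column name
```

Raising KeyError when neither name is present keeps the behaviour compatible with a parser that treats KeyError as the end of a table.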

I've also added a sentence to the GATK module docs saying that it only works if you use standard stratification & evaluation.

Finally, I updated the check for data, which almost always passed because of the "unknown" default value for the TiTv reference. So now when you run on your samples MultiQC says that it didn't find anything, instead of producing an empty report.
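One way such a check could look, as a sketch only: it assumes the parsed data has the same shape as the `multiqc_gatk_varianteval` entry in `multiqc_data.json` shown earlier, and the names `DEFAULT_KEYS` and `samples_with_data` are invented for illustration.

```python
import logging

log = logging.getLogger("varianteval_sketch")

# Keys that are always filled in with a default and so prove nothing on their own.
DEFAULT_KEYS = {"titv_reference"}


def samples_with_data(data):
    """Keep only samples that have at least one key besides the defaults."""
    kept = {}
    for sample, fields in data.items():
        if set(fields) - DEFAULT_KEYS:
            kept[sample] = fields
        else:
            log.warning("Found a VariantEval file for %s but parsed no data", sample)
    return kept


# Same shape as the JSON from the original report.
parsed = {"variant_eval.table": {"titv_reference": "unknown"}}
print(samples_with_data(parsed))  # the sample is dropped, with a warning
```

If the returned dict is empty, the module can report that nothing was found rather than rendering empty sections.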

I hope this is all ok! Let me know if you spot any more problems. Sorry we couldn't get the module to print your data - feel free to open a new issue to generalise this module more in the future if you'd like support.

Phil

ewels closed this as completed on May 28, 2020
ewels added a commit that referenced this issue May 28, 2020
Now works with output from @olavurmortensen which was run in a different way.

See #1158
@ewels
Member

ewels commented May 28, 2020

Ok, I took another look at your file and felt bad that MultiQC was just ignoring it when it really did look pretty similar.

I have refactored the code to be a lot more generous and essentially assume that Novelty = all is the same as Filter = raw. I've then parsed columns only if they're there and added new keys for the TiTv ratios for called and filtered variants.
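The two ideas in that refactor can be sketched like this. It is a rough illustration under the assumption that rows are dicts keyed by column name; the helper names `is_summary_row` and `pick_present` are hypothetical and do not come from the commit.

```python
def is_summary_row(row):
    """Treat Novelty = all as equivalent to Filter = raw."""
    return row.get("Novelty") == "all" or row.get("Filter") == "raw"


def pick_present(row, wanted):
    """Copy only the wanted columns that actually exist in this row."""
    return {key: row[key] for key in wanted if key in row}


# Invented rows in the two styles.
rows = [
    {"Novelty": "all", "nEvalVariants": "1000"},
    {"Filter": "raw", "nEvalVariants": "900", "novelSites": "40"},
    {"Filter": "called", "nEvalVariants": "800"},
]
wanted = ["nEvalVariants", "novelSites"]
summary = [pick_present(r, wanted) for r in rows if is_summary_row(r)]
print(summary)
```

Because missing columns are skipped instead of raising, a table with a different set of fields still contributes whatever it has.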

Please take a look and sanity check all of the numbers and labels if you can 👍 Here's a report using all of the test data, so a mixture of yours and the previously supported formats so that you can see how the numbers line up: multiqc_report.html

Shout if there's anything I've missed or messed up..

Phil

@olavurmortensen
Author

Hey @ewels, sorry for not getting back to you. I tried it out and it works just fine :)

@ewels
Member

ewels commented Jun 15, 2020

Great! 👍
