serialise fails to load blast dbase .. can't find entries ... dictionary value error issue #392

Closed
adamfreedman opened this issue Mar 23, 2021 · 13 comments · Fixed by #393

@adamfreedman

I am running the latest Mikado with commands similar to those I used successfully in 2020.

The command:
mikado serialise --json-conf configuration.yaml --xml blastx/mikado.blastx.xml.cocnat_2021.03.23.xml.gz --orfs transdecoder/mikado_prepared.fasta.transdecoder.bed --blast_targets xtrop_xlaevis_nparkeri_lcatesbeianus_protein.faa

stderr:
Mikado crashed, cause:
ref|XP_018411542.1| not found (Accession: {'_id': None, '_id_alt': [], '_query_id': None, '_description': 'PREDICTED: ras association domain-containing protein 7 [Nanorana parkeri]', '_description_alt': [], '_query_description': '', 'attributes': {}, 'dbxrefs': [], '_items': [HSP(hit_id='ref|XP_018411542.1|', query_id='scallop_TU12746', 1 fragments)], 'blast_id': 'ref|XP_018411542.1|', 'accession': 'XP_018411542', 'seq_len': 431})
Traceback (most recent call last):
  File "/n/home_rc/afreedman/.conda/envs/mikado2021/lib/python3.7/site-packages/Mikado/main.py", line 68, in main
    args.func(args)
  File "/n/home_rc/afreedman/.conda/envs/mikado2021/lib/python3.7/site-packages/Mikado/subprograms/serialise.py", line 378, in serialise
    load_blast(args, logger)
  File "/n/home_rc/afreedman/.conda/envs/mikado2021/lib/python3.7/site-packages/Mikado/subprograms/serialise.py", line 125, in load_blast
    part_launcher(filenames)
  File "/n/home_rc/afreedman/.conda/envs/mikado2021/lib/python3.7/site-packages/Mikado/subprograms/serialise.py", line 53, in xml_launcher
    xml_serializer()
  File "/n/home_rc/afreedman/.conda/envs/mikado2021/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/blast_serialiser.py", line 360, in __call__
    self.serialize()
  File "/n/home_rc/afreedman/.conda/envs/mikado2021/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/blast_serialiser.py", line 342, in serialize
    self.__serialise_xmls()
  File "/n/home_rc/afreedman/.conda/envs/mikado2021/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/blast_serialiser.py", line 351, in __serialise_xmls
    _serialise_xmls(self)
  File "/n/home_rc/afreedman/.conda/envs/mikado2021/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/xml_serialiser.py", line 124, in _serialise_xmls
    max_target_seqs=self._max_target_seqs, logger=self.logger, off_by_one=off_by_one)
  File "/n/home_rc/afreedman/.conda/envs/mikado2021/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/xml_serialiser.py", line 224, in objectify_record
    current_target, cache["target"] = _get_target_for_blast(alignment, cache["target"])
  File "/n/home_rc/afreedman/.conda/envs/mikado2021/lib/python3.7/site-packages/Mikado/serializers/blast_serializer/xml_utils.py", line 89, in _get_target_for_blast
    raise ValueError("{} not found (Accession: {})".format(alignment.id, alignment.__dict__))
ValueError: ref|XP_018411542.1| not found (Accession: {'_id': None, '_id_alt': [], '_query_id': None, '_description': 'PREDICTED: ras association domain-containing protein 7 [Nanorana parkeri]', '_description_alt': [], '_query_description': '', 'attributes': {}, 'dbxrefs': [], '_items': [HSP(hit_id='ref|XP_018411542.1|', query_id='scallop_TU12746', 1 fragments)], 'blast_id': 'ref|XP_018411542.1|', 'accession': 'XP_018411542', 'seq_len': 431})

@wyim-pgl

Hi!
It may be helpful for you to explain how you ran BLAST.

@adamfreedman
Author

adamfreedman commented Mar 23, 2021 via email

@wyim-pgl

How did you run makeblastdb? It might need the -parse_seqids option.
By the way, you can run cat *.gz >> output.gz instead of using zcat and gzip.
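
The reason the cat approach works is that a gzip stream may contain several members back to back. A minimal Python sketch of that property follows; the two byte strings are made-up stand-ins for real BLAST output, and the check is at the byte level only.

```python
import gzip

# Two stand-in chunks; in practice these would be separate BLAST output files.
part_a = b"<first/>\n"
part_b = b"<second/>\n"

# Compress each chunk on its own, then join the compressed bytes.
# This mirrors what `cat a.gz b.gz` produces on disk.
joined = gzip.compress(part_a) + gzip.compress(part_b)

# gzip.decompress reads every member of the stream, not only the first one.
restored = gzip.decompress(joined)
assert restored == part_a + part_b
```

Note that this only shows the compressed data stays readable. Whether several complete XML documents joined this way are accepted by a given XML parser is a separate question.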

@adamfreedman
Author

adamfreedman commented Mar 23, 2021 via email

@lucventurini
Collaborator

Dear @adamfreedman

Thank you for reporting this, and thank you to @wyim-pgl for helping out!

I fear @ljyanesm and I might have introduced a bug in the parsing of the reference sequence in the latest release; I know we touched the relevant regular expression. Could you please send us a minimal example here (e.g. some ten sequences from the BLAST database and ten from the mikado_prepared.fasta file that you know align to each other) so that we can test this?

As another note, @ljyanesm and I have recently moved Mikado away from using XML files as the default for BLAST; please see the documentation here: https://mikado.readthedocs.io/en/stable/Usage/Serialise/?highlight=tabular#blast-files

I am in the process of revising the documentation and I will make sure to update the tutorial if it is out of sync with this change.

It may well be that the bug you encountered affects the tabular format as well. Regardless, we would appreciate it if you could send us a test file so that we can diagnose and solve the issue as soon as possible.

Kind regards,

@adamfreedman
Copy link
Author

Here are FASTA files of queries and targets; the queries hit the targets with blastx.
testqueries.fasta.gz
testtargets.fasta.gz

@lucventurini
Collaborator

Dear @adamfreedman

@ljyanesm and I identified the cause; it was indeed linked to the regular expression. Briefly, Mikado was malfunctioning when the -parse_seqids option had been used during database construction with NCBI BLAST+.
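
To illustrate the kind of mismatch involved, here is a minimal sketch. It is not Mikado's actual regular expression: the pattern, the `normalise_id` helper and the `targets` dictionary are all hypothetical. It only assumes what the error message above shows, namely that BLAST reported the hit as `ref|XP_018411542.1|`, and it supposes the targets were indexed by the bare accession and version.

```python
import re

# Hypothetical pattern (not Mikado's real one): strips an optional lowercase
# NCBI database tag such as "ref|" and an optional trailing "|" from an ID.
NCBI_ID = re.compile(r"^(?:[a-z]+\|)?([^|\s]+)\|?$")

def normalise_id(raw_id: str) -> str:
    """Return the bare accession.version if the pattern matches, else the input."""
    match = NCBI_ID.match(raw_id)
    return match.group(1) if match else raw_id

# Assumed index: targets keyed by the first word of each FASTA header,
# with the sequence length (431, from the error message) as the value.
targets = {"XP_018411542.1": 431}

blast_id = "ref|XP_018411542.1|"
print(blast_id in targets)                # False: the raw BLAST ID is not a key
print(normalise_id(blast_id) in targets)  # True: the normalised ID is found
```

The point of the sketch is that the same normalisation has to be applied wherever IDs are read, otherwise a lookup by one form of the ID fails against an index built from another.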

We have fixed the code and I am currently implementing the tests. We hope to release a new version (2.2.3) later today, UK time.

Kind regards,

lucventurini added a commit that referenced this issue Mar 24, 2021
…pplying at all stages (query, target, XML loading, tabular loading)
lucventurini mentioned this issue Mar 24, 2021
lucventurini added a commit that referenced this issue Mar 24, 2021
* Fix tests on osx

* Changing the GHA to use the cache for PIP and Conda

* Disabling the full daijin_assemble run on the OSX tests as Portcullis is not (yet) available for it on Conda.

* Properly fix #392, with attending tests

* Fixed the log crash detected on OSX by @ljyanesm

Co-authored-by: ljyanesm <yanes.luis@gmail.com>
@lucventurini
Collaborator

Dear @adamfreedman

We have fixed this in 69e45a4. I am about to release to PyPI and Conda.

Kind regards,

@adamfreedman
Author

adamfreedman commented Mar 25, 2021 via email

@adamfreedman
Author

adamfreedman commented Mar 25, 2021 via email

@lucventurini
Collaborator

Dear @adamfreedman ,

Thank you for the update. May I suggest inspecting the XML files passed to serialise though? I strongly suspect that one or more might be truncated.

I am asking because the traceback indicates that the error was raised in the BioPython code for parsing XML files, apparently because the document is unexpectedly truncated at line 53663.
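
One way to check a file for this kind of truncation before running serialise is sketched below. This is an assumption-laden helper using only the Python standard library, not part of Mikado or BioPython; the function name `xml_gz_is_complete` is made up for illustration.

```python
import gzip
import xml.etree.ElementTree as ET
import zlib

def xml_gz_is_complete(path: str) -> bool:
    """Return True if the gzipped XML at `path` decompresses and parses to the end.

    Any failure to open or read the file is also reported as False.
    """
    try:
        with gzip.open(path, "rb") as handle:
            # iterparse streams the document, so large files are not held in memory.
            for _event, element in ET.iterparse(handle):
                element.clear()
    except (ET.ParseError, EOFError, OSError, zlib.error):
        # ParseError: the XML ends mid-document or is otherwise malformed.
        # EOFError: the gzip stream was cut short.
        # OSError / zlib.error: unreadable file or corrupt compressed data.
        return False
    return True
```

Calling `xml_gz_is_complete("some_file.xml.gz")` on each input (the file name here is a placeholder) would flag the broken ones. Because a strict parser expects a single root element, a file made by joining several complete XML documents is also reported as not complete, so this check is best run on the individual files before any concatenation.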

Admittedly, the Mikado code could handle this more gracefully and tell the user what happened and in which file. This is something we can try to improve.

If you do need to regenerate the BLAST files, I would again like to point out that the new Mikado versions can load data faster from the tabular format (with custom fields) than from XML.

Many thanks for your patience and feedback.

@adamfreedman
Author

adamfreedman commented Mar 25, 2021 via email

@lucventurini
Collaborator

Dear @adamfreedman

Thank you again for the update. I hope that this time Mikado will run more smoothly. Please let us know if you encounter any other issue.

Many thanks,
Luca Venturini
