Problem during correction #1959
Comments
There are several related issues (#1370, #1365, #1061) which are all caused by non-ACGT characters in the input. Can you check your read files for non-ACGT or special characters? If that doesn't address it, are you able to share your data (we'd need the seqStore and the corStore folders)? The FAQ has info on how to send us data. |
Thanks for the quick response. I have attempted to remove non-ATGC characters using the following command on raw data files: sed -e '/^[^>]/s/[^ATGCNatgcn]/N/g' longest.168.fa > longest.168.clean.fa If that was not good, I can run another command that could correct the data files. Just let me know. I'll read the FAQs for sending you data but it's 29 gigs for seqStore and 8.5 gigs for corStore.... Thanks! |
Let it run with the updated data and see if it gives the same error before uploading the data. If you tar.gz the folders it should compress it a bit. |
Thanks. I get the error with the cleaned fasta files. This is what I copied above... Will tarball the directories and send to you. Thank you so much for your help!!! ... I need to wait for one process to finish... |
Unfortunately, I could not create a file... (base) [laforestm@biocluster correction]$ ftp ftp.cbcb.umd.edu I have tried blank for password as well as my email address (as it used to be in the good old times!) Please let me know what I'm doing wrong. Thank you again. |
You're missing the |
Duh!
Thanks
|
Hi Sergey,
Files have been uploaded now. Thanks again for your help. It is greatly
appreciated.
Martin
|
I took a quick look, it does look like there are non-ACGT characters in the store you uploaded, specifically in this read:
Can you check your files and your cleaned files to see if that special character is there? I'll work on a fix so Canu refuses to load such reads in the meantime. |
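One way to run the check suggested above is to scan the FASTA as raw bytes, so that locale or encoding settings cannot hide a stray character. The following is a minimal sketch, not part of Canu; the function name `find_bad_reads` and the allowed alphabet (ACGTN in either case) are assumptions chosen for illustration.

```python
# Sketch only: report every byte in a FASTA sequence line that is not ACGTN.
# The allowed alphabet is an assumption; adjust it to match your data.
ALLOWED = frozenset(b"ACGTNacgtn")


def find_bad_reads(lines):
    """Yield (read_name, offset_in_read, byte_value) for each unexpected byte.

    `lines` is any iterable of bytes objects, e.g. a file opened in "rb" mode.
    """
    name = None
    offset = 0  # number of sequence bytes seen so far in the current read
    for raw in lines:
        line = raw.rstrip(b"\r\n")
        if line.startswith(b">"):
            # Header line: remember the read name and reset the offset.
            name = line[1:].decode("ascii", errors="replace")
            offset = 0
            continue
        for i, value in enumerate(line):
            if value not in ALLOWED:
                yield name, offset + i, value
        offset += len(line)
```

To use it on a file, open the file in binary mode and iterate, for example `for name, pos, value in find_bad_reads(open("reads.fa", "rb")): print(name, pos, hex(value))`, where `reads.fa` is a placeholder file name.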
So it was a cute bug in that our check for invalid bases didn't account for char being interpreted as a signed value, so your non-printable character got missed. Ideally, it should have segfaulted on the seqStore construction but it didn't. I checked your seqStore and found that there was only 1 read with non-printable characters. Any read which overlaps it will crash in correction; I know of 907450, which you hit above, but there may be others. You have two options:
|
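The signed-char pitfall described above can be illustrated in a few lines. This is a hypothetical sketch of the general class of bug, not Canu's actual code: it mimics an 8-bit signed `char` in Python and shows how a range check written for unsigned values lets a high byte through. The byte 0xD4 and all function names are assumptions for illustration.

```python
# Hypothetical illustration of a signed-char range check; not Canu's code.

def as_signed_char(byte):
    """Reinterpret an unsigned byte (0..255) as a signed 8-bit value."""
    return int.from_bytes(bytes([byte]), "little", signed=True)


def naive_rejects(byte):
    """A check that tries to reject 'high' bytes with `c > 127`.

    If c is a signed 8-bit value it can never exceed 127, so this
    check never fires and high bytes slip through.
    """
    c = as_signed_char(byte)
    return c > 127


def robust_rejects(byte):
    """Reject anything outside an explicit whitelist of base characters."""
    return byte not in b"ACGTNacgtn"


print(as_signed_char(0xD4))   # the high byte appears as a negative number
print(naive_rejects(0xD4))    # the naive check misses it
print(robust_rejects(0xD4))   # the whitelist check catches it
```

Checking membership in an explicit whitelist sidesteps the question of signedness entirely, which is why the sketch prefers it over a numeric range comparison.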
Thank you so much! I will remove bizarre characters from the file (how did they ever get there?). Strange that the sed command didn't work. Thank you again |
Hi, me again... I ran strings as discussed. Unfortunately, the file size is exactly the same before and after. I also ran the following command: cat At.longest.3.fa | grep CATTAGATGTCÔGGGGGCCCTTAATATGG I got no output. Could it be an i/o error while constructing seqStore? ... Forget about it ... it is a "|" in the original file... |
This command seems to have worked: perl -i.bak -pe 's/[^[:ascii:]]//g' At.longest.4.fa I am running it for the 3 input files and starting canu again... |
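For anyone without Perl at hand, roughly the same clean-up can be sketched in Python. This is an assumed equivalent rather than a tested drop-in replacement: it deletes every byte with a value of 0x80 or above, working on raw bytes so that no text decoding is involved. The names `strip_non_ascii` and `clean_file` are made up for this example.

```python
# Sketch: delete all non-ASCII bytes (0x80..0xFF) from a file, line by line.
HIGH_BYTES = bytes(range(0x80, 0x100))


def strip_non_ascii(data: bytes) -> bytes:
    """Return `data` with every byte >= 0x80 removed."""
    return data.translate(None, HIGH_BYTES)


def clean_file(src: str, dst: str) -> None:
    """Copy `src` to `dst`, dropping non-ASCII bytes along the way."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for line in fin:
            fout.write(strip_non_ascii(line))
```

Note that, like the Perl one-liner, this removes the offending bytes instead of replacing them with N, so an affected read becomes slightly shorter. Unlike `perl -i.bak`, `clean_file` writes to a separate output file and leaves the input untouched.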
I confirmed that you can correct all reads with the exception of |
I have run the perl command above and restarted the entire assembly. Runs beautifully. I now have my corrected reads and the rest is moving along. |
Hello,
I run into an issue repeatedly while running canu. Version is canu 2.1.1. I am using the longest reads from a PacBio run.
Here is the command:
Here is the content of one (correction/2-correction/results/0020.err) file:
[truncated]
Thank you so much in advance.