Problem during correction #1959

Closed
MartinLaforest opened this issue Jun 7, 2021 · 15 comments

Comments

@MartinLaforest

MartinLaforest commented Jun 7, 2021

Hello,

I repeatedly run into an issue while running Canu. The version is Canu 2.1.1. I am using the longest reads from a PacBio run.

Here is the command:

canu -p Ass1 \
-d /path \
genomesize=1800m \
gridEngineResourceOption="-pe smp THREADS -l mem_free=MEMORY" \
correctedErrorRate=0.085 corMhapSensitivity=normal minReadLength=15000 \
corMhapFilterThreshold=0.0000000002 corMhapOptions="--threshold 0.80 --num-hashes 256 --num-min-matches 3 --ordered-sketch-size 1000 --ordered-kmer-size 14 --min-olap-length 2000 --repeat-idf-scale 50" \
gridOptions="-S /bin/bash" \
-pacbio At.longest.3.clean.fa \
-pacbio At.longest.4.clean.fa \
-pacbio At.longest.168.clean.fa

Here is the content of one (correction/2-correction/results/0020.err) file:
[truncated]

  907401   16319       81      0-16210 ( 16318) memory act          0 est  694514996 act/est 0.00
  907403   18608       81      0-18438 ( 18607) memory act          0 est  718424918 act/est 0.00
  907404   15775        6      0-11455 ( 11829) memory act          0 est  667081544 act/est 0.00
  907405   18294       94   1064-17644 ( 18281) memory act          0 est  701879690 act/est 0.00
  907406   19936       12    337-16804 ( 19790) memory act          0 est  705145946 act/est 0.00
  907408   19008       23    191-18753 ( 18791) memory act          0 est  699528008 act/est 0.00
  907409   20462       17   6968-20303 ( 15919) memory act          0 est  711738620 act/est 0.00
  907410   15959        8    166-15006 ( 15354) memoryfalconsense: utility/src/utility/edlib.C:436: void edlibAlignmentToStrings(const unsigned char*, int, int, int, int, int, const char*, const char*, char*, char*): Assertion `strlen(qry_aln_str) == alignmentLength && strlen(tgt_aln_str) == alignmentLength' failed.
 act          0 est  669112010 act/est 0.00
  907411   43465       13   2347-16873   22977-40000 ( 17891) memory act          0 est  917930210 act/est 0.00
  907412   21694       15   2127-21419 ( 21669) memory act          0 est  722874344 act/est 0.00
  907413   21226        9  12184-18861 ( 18098) memory act          0 est  716742698 act/est 0.00
  907414   17100       29  14257-16048 (  4692) memory act          0 est  679434788 act/est 0.00
  907415   21471      181    109-16692 ( 21366) memory act          0 est  747943148 act/est 0.00
  907416   22951        8   7979-21890 ( 17729) memory act          0 est  732062966 act/est 0.00
  907417   19001       42      0-0     ( 15297) memory act          0 est  703713338 act/est 0.00
  907418   26810       96      0-26081 ( 26809) memory act          0 est  803694248 act/est 0.00
  907419   25853       13      0-0     ( 11309) memory act          0 est  758096144 act/est 0.00
  907420   18681       31    548-18024 ( 18596) memory act          0 est  697724894 act/est 0.00
  907421   21514        9   1611-20334 ( 19873) memory act          0 est  719937860 act/est 0.00
  907422   29147        6  11324-26793 ( 16290) memory act          0 est  787735250 act/est 0.00
  907423   17137       27      0-15660 ( 17124) memory act          0 est  681263078 act/est 0.00
  907425   17984       13    493-14703 ( 14939) memory act          0 est  687552668 act/est 0.00
  907426   22124        6   4531-5136  (  4934) memory act          0 est  723675680 act/est 0.00
  907427   20408       22   1078-14108 ( 19651) memory act          0 est  712663244 act/est 0.00
  907428   20657        7   8866-15533 (  7087) memory act          0 est  710990738 act/est 0.00
  907429   25778        4  21874-24997 (  3220) memory act          0 est  756402290 act/est 0.00
  907431   18883       13    186-17007 ( 17790) memory act          0 est  696875738 act/est 0.00
  907432  129487      123      0-0     (129474) memory act          0 est 1873312304 act/est 0.00
  907433   16804       11   1259-2781     5631-6385    10946-15719 (  8721) memory act          0 est  676489730 act/est 0.00
  907435   22215       17   4618-21053 ( 17674) memory act          0 est  726161906 act/est 0.00
  907436   26216        8      0-0     ( 17033) memory act          0 est  761381558 act/est 0.00
  907437   24974       26      0-22058 ( 24393) memory act          0 est  752259644 act/est 0.00
  907438   18454       27    586-16192 ( 18019) memory act          0 est  695486768 act/est 0.00
  907439   20234       19    550-19387 ( 19673) memory act          0 est  709441424 act/est 0.00
  907440   17911        7    488-16993 ( 17898) memory act          0 est  687222734 act/est 0.00
  907441   20014       15      0-19208 ( 19977) memory act          0 est  706585730 act/est 0.00
  907442   25663       12   7953-24034 ( 16418) memory act          0 est  756936302 act/est 0.00
  907443   91304       94      0-0     ( 91291) memory act          0 est 1475719226 act/est 0.00
  907444   17894       93   2024-17743 ( 17121) memory act          0 est  698292638 act/est 0.00
  907445   26559       14      0-19173 ( 19735) memory act          0 est  765387236 act/est 0.00
  907446   20856       73     53-19932 ( 20797) memory act          0 est  722655362 act/est 0.00
  907448   26450       93      0-17486 ( 26437) memory act          0 est  799872632 act/est 0.00
  907449   23296       23     77-18699 ( 21089) memory act          0 est  738287564 act/est 0.00
  907450   18284        9
Failed with 'Aborted'; backtrace (libbacktrace):
utility/src/utility/system-stackTrace.C::83 in _Z17AS_UTL_catchCrashiP7siginfoPv()
(null)::0 in (null)()
(null)::0 in (null)()
(null)::0 in (null)()
(null)::0 in (null)()
(null)::0 in (null)()
utility/src/utility/edlib.C::436 in _Z23edlibAlignmentToStringsPKhiiiiiPKcS2_PcS3_()
correction/falconConsensus-alignTag.C::239 in _Z20alignReadsToTemplateP11falconInputjdjb._omp_fn.0()
(null)::0 in (null)()
(null)::0 in (null)()
(null)::0 in (null)()
(null)::0 in (null)()

Thank you so much in advance.

@skoren
Member

skoren commented Jun 7, 2021

There are several related issues (#1370, #1365, #1061), all of which are caused by non-ACGT characters in the input. Can you check your read files for non-ACGT or special characters? If that doesn't address it, are you able to share your data (we'd need the seqStore and the corStore folders)? The FAQ has info on how to send us data.
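
One way to run the suggested check is to scan the FASTA files byte by byte rather than as text, so that non-printable or non-ASCII bytes cannot be hidden by the terminal or the locale. The following is a minimal sketch, not part of Canu; the function name `find_bad_bytes` and the allowed set (ACGTN in either case, mirroring the sed pattern used later in this thread) are assumptions.

```python
# Bytes accepted on sequence lines. This set is an assumption that mirrors
# the sed pattern in the thread (ACGTN, upper and lower case).
ALLOWED = frozenset(b"ACGTNacgtn")


def find_bad_bytes(lines):
    """Yield (line_number, header, column, byte_value) for each unexpected byte.

    `lines` is any iterable of bytes objects, for example a file opened
    with mode "rb". Header lines (starting with ">") are not checked.
    """
    header = b""
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip(b"\r\n")
        if line.startswith(b">"):
            header = line
            continue
        for column, value in enumerate(line, start=1):
            if value not in ALLOWED:
                yield line_number, header.decode("ascii", "replace"), column, value
```

To use it on a real file, open the file in binary mode (`open(path, "rb")`), pass the handle to `find_bad_bytes`, and print each hit; the byte value is best shown in hex, since the offending byte may not be printable.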

@MartinLaforest
Author

Thanks for the quick response. I have attempted to remove non-ATGC characters using the following command on the raw data files:

sed -e '/^[^>]/s/[^ATGCNatgcn]/N/g' longest.168.fa > longest.168.clean.fa

If that was not good, I can run another command that could correct the data files. Just let me know. I'll read the FAQs for sending you data but it's 29 gigs for seqStore and 8.5 gigs for corStore....

Thanks!

@skoren
Member

skoren commented Jun 7, 2021

Let it run with the updated data and see if it gives the same error before uploading the data. If you tar.gz the folders it should compress it a bit.

@MartinLaforest
Author

MartinLaforest commented Jun 7, 2021

Thanks. I get the error with the cleaned fasta files. This is what I copied above... Will tarball the directories and send to you.

Thank you so much for your help!!!

... I need to wait for one process to finish...

@MartinLaforest
Author

Unfortunately, I could not create a file...

(base) [laforestm@biocluster correction]$ ftp ftp.cbcb.umd.edu
Trying 128.8.132.70...
Connected to ftp.cbcb.umd.edu (128.8.132.70).
220-
220-Welcome to the CBCB FTP Server
220-
220-Please visit, http://www.cbcb.umd.edu
220-for more information.
220-
220
Name (ftp.cbcb.umd.edu:laforestm): anonymous
331 Please specify the password.
Password:
230 Login successful.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> put TRIFIDA_Ass1.corStore.tar.gz
local: TRIFIDA_Ass1.corStore.tar.gz remote: TRIFIDA_Ass1.corStore.tar.gz
227 Entering Passive Mode (128,8,132,70,31,93)
553 Could not create file.
ftp> quit
221 Goodbye.

I have tried blank for password as well as my email address (as it used to be in the good old times!)

Please let me know what I'm doing wrong.

Thank you again.

@skoren
Member

skoren commented Jun 8, 2021

You're missing the cd incoming/sergek part after logging in to change to the correct incoming folder.
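
The same sequence (log in, change to the incoming folder, then upload) can also be scripted. This is a rough sketch assuming an already connected `ftplib.FTP` object from the Python standard library; the helper name `upload_to_incoming` is hypothetical, and the host and folder in the commented usage are simply the ones mentioned in this thread.

```python
def upload_to_incoming(ftp, handle, remote_dir, remote_name):
    """Change to remote_dir on the server, then upload handle in binary mode.

    `ftp` is expected to behave like a logged-in ftplib.FTP instance and
    `handle` like a file opened with mode "rb".
    """
    ftp.cwd(remote_dir)  # the step that was missing in the session above
    ftp.storbinary(f"STOR {remote_name}", handle)


# Example usage (commented out because it needs network access):
# from ftplib import FTP
# with FTP("ftp.cbcb.umd.edu") as ftp:
#     ftp.login()  # anonymous login
#     with open("TRIFIDA_Ass1.corStore.tar.gz", "rb") as handle:
#         upload_to_incoming(ftp, handle, "incoming/sergek",
#                            "TRIFIDA_Ass1.corStore.tar.gz")
```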

@MartinLaforest
Author

MartinLaforest commented Jun 8, 2021 via email

@MartinLaforest
Author

MartinLaforest commented Jun 8, 2021 via email

@skoren
Member

skoren commented Jun 9, 2021

I took a quick look, it does look like there are non-ACGT characters in the store you uploaded, specifically in this read:

>m64128_210306_095608/75432255/178865_205774 id=323026
...CATTAGATGTCÔGGGGGCCCTTAATATGG...

Can you check your files and your cleaned files to see if that special character is there? I'll work on a fix so Canu refuses to load such reads in the meantime.

@skoren
Member

skoren commented Jun 9, 2021

So it was a cute bug: our check for invalid bases didn't account for char being interpreted as a signed value, so your non-printable character got missed. Ideally, it should have segfaulted during seqStore construction, but it didn't. I checked your seqStore and found only 1 read with non-printable characters. Any read which overlaps it will crash in correction; I know of 907450, which you hit above, but there may be others.
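
To illustrate the kind of signedness problem described here: in C and C++, whether plain `char` is signed is implementation-defined, and where it is signed, a byte of 0x80 or above is read as a negative number. The sketch below is only an illustration and is not Canu's actual check; `naive_is_invalid` is a hypothetical test, and 0xD4 (Ô in Latin-1) is used purely as an example byte, since the thread does not state the exact byte in the read.

```python
def as_signed_char(byte_value):
    """Return the value a C `signed char` would hold for a raw byte (0..255)."""
    return int.from_bytes(bytes([byte_value]), "big", signed=True)


def naive_is_invalid(c):
    """Hypothetical check that only rejects values above the ASCII range."""
    return c > 127


raw = 0xD4                      # example byte only; 212 when unsigned
signed = as_signed_char(raw)    # the same bits read as a signed char

print(raw, naive_is_invalid(raw))        # unsigned view: caught by the check
print(signed, naive_is_invalid(signed))  # signed view: slips past the check
```

A check written against the unsigned range can therefore pass a negative value through unnoticed, which is consistent with the non-printable character being missed at load time.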

You have two options:

  1. You can run strings <your file> and it should remove these characters so it works with the unpatched Canu (you could also patch your Canu code with the fix above). I confirmed the sed command does not. You'd then have to remove your run folder completely and re-start.
  2. You can try removing 907450 from the *.readsToCorrect file and see if the correction jobs finish. If it crashes again, you'd have to identify/remove the failing read again (last one in the failed partition) and keep removing reads until your correction jobs succeed.
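
Option 2 can be scripted as a simple filter. This is a sketch under a stated assumption: it treats the first whitespace-separated field of each line in the `*.readsToCorrect` file as the read ID, which the thread does not confirm, so check the file's actual layout first. The function name `drop_reads` is hypothetical.

```python
def drop_reads(lines, read_ids):
    """Return the lines whose first whitespace-separated field is not in read_ids.

    Assumes (unverified) that the read ID is the first field on each line.
    Blank lines are kept unchanged.
    """
    unwanted = {str(read_id) for read_id in read_ids}
    kept = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] in unwanted:
            continue
        kept.append(line)
    return kept
```

A cautious way to apply it is to read the original file into a list of lines, write the filtered lines to a new file, and keep the original as a backup until the correction jobs succeed.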

@MartinLaforest
Author

Thank you so much! I will remove the bizarre characters from the file (how did they ever get there?). Strange that the sed command didn't work.

Thank you again

@MartinLaforest
Author

MartinLaforest commented Jun 11, 2021

Hi, me again...

I ran strings as discussed. Unfortunately, the file size is exactly the same before and after. I also ran the following command:

cat At.longest.3.fa | grep CATTAGATGTCÔGGGGGCCCTTAATATGG

I got no output. Could it be an I/O error while constructing seqStore?

...

Forget about it ... it is a "|" in the original file...

@MartinLaforest
Author

This command seems to have worked:

perl -i.bak -pe 's/[^[:ascii:]]//g' At.longest.4.fa

I am running it for the 3 input files and starting canu again...
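
For anyone without perl at hand, a roughly equivalent byte-level cleanup can be sketched in Python. It assumes that the perl one-liner, run without any Unicode switches, works on raw bytes and so deletes every byte of 0x80 or above; the function names `strip_non_ascii` and `clean_file` are hypothetical. Unlike `perl -i.bak`, this sketch writes to a separate output file instead of editing in place.

```python
# All byte values outside 7-bit ASCII (0x80..0xFF).
NON_ASCII = bytes(range(128, 256))


def strip_non_ascii(data):
    """Return `data` (a bytes object) with every byte >= 0x80 removed."""
    return data.translate(None, NON_ASCII)


def clean_file(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path without non-ASCII bytes; return the count removed."""
    removed = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            cleaned = strip_non_ascii(chunk)
            removed += len(chunk) - len(cleaned)
            dst.write(cleaned)
    return removed
```

Note that deleting the byte shortens the affected read by one base; replacing it with N instead would preserve read length, at the cost of an ambiguous base.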

@skoren
Member

skoren commented Jun 14, 2021

I confirmed that you can correct all reads with the exception of 907450 and 2082645 (remove those from *.readsToCorrect) with the original results you shared. I can share the full set of corrected reads from my run if you'd like or you can wait for your re-started run to complete.

@MartinLaforest
Author

I have run the perl command above and restarted the entire assembly. Runs beautifully. I now have my corrected reads and the rest is moving along.
Thank you so much!
