Mikado serialize running time #280
Dear @srividya22,
Even with the newer version, a key way in which we speed up the reading of BLAST results is to split the FASTA file into chunks, execute BLAST on each chunk, and then have `mikado serialise` read the different XML files in parallel. This is because reading XML files is inherently a very slow process. From what I am seeing, I think you have probably run BLAST on the file as a whole, without chunking. In that case `mikado serialise` would indeed be very slow; with chunking (say into 20 or 30 chunks) I would expect it to be considerably faster.
This would be a problem, in general, yes, given that you have ~half a million transcripts. We do have a Snakemake-based pipeline, Daijin, which automates these steps and is included in Mikado (documentation and code are in the repository). Hopefully it can provide some guidance on how to execute Mikado for your project.
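As a rough, generic illustration of the chunking approach described above (this is not Daijin's implementation; the input file name and chunk count are just examples), the sketch below splits a FASTA file into roughly equal chunks with Biopython, so that BLAST or DIAMOND can be run on each chunk and the resulting files read in parallel by `mikado serialise`:

```python
# A rough sketch of the chunking approach described above, NOT Daijin's
# implementation. "mikado_prepared.fasta" and the chunk count are examples.
# Each chunk can then be given to BLAST/DIAMOND separately, and the resulting
# outputs fed to `mikado serialise` together.
from Bio import SeqIO

def split_fasta(fasta_in, n_chunks=20, prefix="chunk"):
    records = list(SeqIO.parse(fasta_in, "fasta"))
    per_chunk = -(-len(records) // n_chunks)  # ceiling division
    for idx in range(n_chunks):
        batch = records[idx * per_chunk:(idx + 1) * per_chunk]
        if not batch:
            break
        SeqIO.write(batch, f"{prefix}_{idx + 1:02d}.fasta", "fasta")

if __name__ == "__main__":
    split_fasta("mikado_prepared.fasta", n_chunks=20)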
Update: the main problem is in the function which performs the nucleotide-to-protein translation; it is by far the bottleneck in loading ORFs. I am looking for ways to improve its speed (e.g. with a drop-in C extension), which should massively increase the speed of `mikado serialise`.
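For context on why such a function is expensive, here is a minimal generic sketch of codon-by-codon translation in pure Python (not Mikado's actual routine; the codon table is deliberately abridged): every transcript costs one dictionary lookup per codon, which adds up quickly over hundreds of thousands of ORFs and is exactly the kind of tight loop a C extension can accelerate.

```python
# Generic illustration of why codon-by-codon translation in pure Python is a
# bottleneck: one dictionary lookup per codon, repeated for every ORF.
# NOT Mikado's actual routine; the codon table below is deliberately abridged.
CODON_TABLE = {
    "TTT": "F", "TTC": "F", "TTA": "L", "TTG": "L",
    "ATG": "M", "TGG": "W", "TAA": "*", "TAG": "*", "TGA": "*",
    # ... the remaining codons of the standard genetic code would go here
}

def translate(cds: str) -> str:
    cds = cds.upper()
    protein = []
    # The per-codon loop is the hot path that a C extension would replace.
    for pos in range(0, len(cds) - len(cds) % 3, 3):
        protein.append(CODON_TABLE.get(cds[pos:pos + 3], "X"))  # "X" = unknown codon
    return "".join(protein)

if __name__ == "__main__":
    print(translate("ATGTTTTGA"))  # -> "MF*"
```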
…ORFs was that the reading process was **blocking**, due to using a simple queue with a bare `put` instead of `put_nowait`. This single change makes the code finally parallel and therefore provides a massive speed-up.
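A minimal, generic sketch of the difference (not Mikado's actual code; the function names and demo data are made up): with a bounded queue, a bare `put()` stalls the reading process whenever the consumers lag behind, whereas `put_nowait()` raises `queue.Full` immediately, so the reader can back off or keep doing useful work and retry.

```python
# A minimal, generic sketch of the blocking vs non-blocking enqueue.
# NOT Mikado's actual code; the function names and demo data are made up.
import multiprocessing as mp
import queue
import time

def read_orfs_blocking(q, records):
    for rec in records:
        q.put(rec)                     # blocks while the bounded queue is full

def read_orfs_nonblocking(q, records):
    pending = list(records)
    while pending:
        try:
            q.put_nowait(pending[0])   # never blocks; raises queue.Full instead
            pending.pop(0)
        except queue.Full:
            time.sleep(0.01)           # back off briefly (or do other useful work)

if __name__ == "__main__":
    q = mp.Queue(maxsize=1000)
    read_orfs_nonblocking(q, list(range(10)))
    print([q.get() for _ in range(10)])
```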
I have created a new branch, issue-280, with this fix. By speed-up I mean a ~30% increase in performance when single-threaded, and an almost linear increase in performance with multiple processes, as the current code effectively blocks on that queue.
Thanks @lucventurini. Do you recommend installing the code from that issue-280 branch for the speed-up?
I would, yes. If you could also report whether the improvement is present on your system as well, we would be very grateful. Kind regards
Sure, I will do that. Also, as you suggested, I have split the blastx XML files (using a Java script) for speed-up. However, I am not sure whether I should also split the mikado_prepared.fasta file and the ORFs BED file from TransDecoder for speed-up; kindly comment on that. I have been running TransDecoder v5.5.0, but TransDecoder.Predict seems to fail with mikado_prepared.fasta at the step below, in `gff3_file_to_proteins.pl` line 81. The corresponding sequence ID in TransDecoder's longest_orfs.pep is "CLASS2_SRR7066585_Super-Scaffold_1.0.0.p1"; the two IDs are different, so it looks like an error in parsing the IDs from the GFF3.
No, mikado will take care of parallelising that (as BED files are much easier to parse and process than XML files).
Unfortunately I am not an expert on the code of TransDecoder. I know, though, that its internal pipeline manager can get confused if a run gets cancelled and restarted afterwards. If that has happened, I would recommend deleting the whole working subfolder and restarting from LongOrfs.
As suggested, I have built the version from the issue-280 branch of the repository. I ran Prodigal instead of TransDecoder for this test. The ORFs got loaded, but when loading the BLAST XML I see errors in the log like `File "blast_serializer/utils.py", line 171, in prepare_hit`.
That error comes from the parsing of the BLAST files. However, there is something strange: that error is NOT at line 171 but rather at line 185. Apologies for asking, but would you please be able to double-check the installation of Mikado? It looks like a different, older version might be installed.
Edit: I checked, and it looks like you have installed and run a version that is quite old and still contains bug #150. Would you please be able to check whether that's the case? Kind regards
…her than simple lists. Also, now using properties to avoid calculating the invalidity of a BED12 object over and over again. This should lead to some more speed improvements.
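As a generic sketch of the caching idea in the commit above (not the actual BED12 implementation; the class and checks are simplified placeholders), the validity test is computed once per object and only recomputed when a coordinate changes. Note that, as the merge notes later in this thread record, this caching was subsequently partly reverted because stale values caused bugs, which is the classic risk of the pattern.

```python
# Generic sketch of caching an expensive derived check, NOT the real BED12
# class: "invalid" is computed once and reused until a coordinate changes.
class Bed12Like:
    def __init__(self, start, end, thick_start, thick_end):
        self._start, self._end = start, end
        self._thick_start, self._thick_end = thick_start, thick_end
        self._invalid = None                    # cached result of the check

    @property
    def invalid(self):
        if self._invalid is None:
            # The (expensive) consistency checks run only on first access.
            self._invalid = not (self._start <= self._thick_start
                                 <= self._thick_end <= self._end)
        return self._invalid

    @property
    def start(self):
        return self._start

    @start.setter
    def start(self, value):
        self._start = value
        self._invalid = None                    # coordinates changed: drop the cache

if __name__ == "__main__":
    bed = Bed12Like(0, 300, 30, 270)
    print(bed.invalid)   # False, computed once
    bed.start = 100      # cache is invalidated
    print(bed.invalid)   # recomputed: now True (thick_start < start)
```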
I am also working on making the BLAST XML serialisation faster. I will track progress on this here.
…unctions using NumPy. Some improvements but currently broken, and it could be better
…ame that the new parser from bioPython is so slow
I have created another branch, issue-280-dev, with the improvements for the XML serialisation. In this new branch I have more than halved the time needed to process a single HSP. Unfortunately, moving to the new version of Biopython brings a parser that is quite slow.
…er, thanks to Stack Overflow
… Some improvements but currently broken, and it could be better
I see you have made some changes since yesterday. Let me know when it is a good time to rebuild and test the new changes in the issue-280-dev branch.
…ole of Mikado. Lots of amendments to allow it to function properly. For #280.
Now Mikado uses … Commit f074581 is the latest, passes all tests, and should be used to re-test the branch. I forecast a memory usage of about 20GB for your massive dataset, and probably less than 52 minutes for the analysis when taking in the whole file. Please let me know. This should function and behave well with SLURM.
The pre-chunked job is running, but I can see an error reported from the ORF loading and 0 ORFs loaded; perhaps this is not due to the changes introduced here?
The f65bbb1 run with no pre-chunking needed over 600G of memory.
Good news: the pre-chunked run with the latest f074581 version loaded the BLAST in 25 mins. However, the job finished at 22:38 according to the log, yet as of Thu 2 Apr 22:46:39 BST 2020 it is still running.
Hi @swarbred, thanks for testing this. Regarding the run: it's great to hear that it finished quickly, though.
…ecause they start too soon). EI-CoreBioinformatics#280
6fcad94 completed successfully on the pre-chunked tab file and took 54 mins for the BLAST loading (so longer than the previous run on the same data). When not pre-chunking the file, the memory required for serialise is a substantial 700G, which seems excessive just to split the file. While it's simple for someone to either run DIAMOND on chunks or split the file post-DIAMOND, it would seem helpful to do this internally in Mikado.
Thank you for testing and confirming that the hanging does not happen any longer. I would expect that the extra time is most likely due to the loading of the ORFs.
This is indeed excessive. I do expect somewhat high memory usage, as each sub-process holds part of the data in memory while it works, but 700G is really excessive. How big, in GB, is the tabular BLAST file? Knowing this might help me understand the order of magnitude better. Also, it might be worthwhile to explicitly set `multiprocessing_method: spawn` in the configuration file, and to relaunch with the latest commit.
To be fair, @ljyanesm insists that we could do something different, i.e. quickly scan the file, divide it by lines, then tell each subprocess to analyse from line A to line B.
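A rough sketch of that idea (an assumption of how it could look, not the implemented code): one pass over the file records the byte offset at which every line starts, and each worker is then handed an independent (start, end) byte range to parse on its own.

```python
# Rough sketch of the idea above, NOT the implemented code: one pass records
# the byte offset of every line start, then each worker parses an independent
# (start, end) byte range of the tabular BLAST file.
def chunk_offsets(path, n_chunks):
    offsets = [0]
    with open(path, "rb") as handle:
        while handle.readline():
            offsets.append(handle.tell())
    n_lines = len(offsets) - 1
    per_chunk = max(1, -(-n_lines // n_chunks))          # ceiling division
    return [(offsets[i], offsets[min(i + per_chunk, n_lines)])
            for i in range(0, n_lines, per_chunk)]

def parse_byte_range(path, start, end):
    # Each worker opens the file on its own and only reads its slice.
    with open(path, "rb") as handle:
        handle.seek(start)
        for line in handle.read(end - start).splitlines():
            yield line.decode().split("\t")
```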
Related to the question above about how large the file is: would you please be able to run the data using only one processor? I would expect the memory for a single-process run to be between 2 and 3 times lower.
The 54 mins was just for the homology loading; the ORF loading was an additional 7 mins (so similar to earlier runs). The 700G run was with the previous 6fcad94 version; due to the ORF loading issue it hung, so there is no `time -v` output, only the figures from SLURM. I will start a run on the non-chunked file with the latest version and restrict it to one processor.
OK, thank you! If (as I suspect) memory is acceptable for the single-processor run, I will focus on finding ways to restrict memory usage for the ancillary processes.
…e data from the temporary msgpack files. This should reduce the memory usage. Also, reducing the maximum number of objects per-process (EI-CoreBioinformatics#280).
@lucventurini
Cool, then as I thought the main problem with memory is indeed the sub-processes. Hopefully the new change will improve matters. If I may insist, please make sure that `multiprocessing_method` is set to `spawn` in the configuration.
Hi @swarbred
I have rerun on my data of *H. sapiens*, which should be comparable to yours, using 08d6cbc:

```
$ du -csh blast.tsv
1.7G    blast.tsv
1.7G    total
```

Command line:

Running time and memory:

```
$ sacct -j 10482 -o JobID,JobName,Partition,AllocCPUs,State,Elapsed,ExitCode,AveCPU,MaxRSS,MaxVM,ReqMem --units=GMKTP
       JobID    JobName  Partition  AllocCPUS      State    Elapsed ExitCode     AveCPU     MaxRSS  MaxVMSize     ReqMem
------------ ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- ---------- ----------
10482         serialise     medium         20  COMPLETED   00:06:29      0:0   00:05:58     11.84G     26.61G    48.83Gn
```

So this version should be reasonable both in terms of time and memory. A greater number of processors means greater memory usage, as each process will hold some of the data in RAM while performing operations; a greater number of objects held in memory before flushing to the database will also increase usage. From my observation of the process, loading the data into memory with …
PS: again with 08d6cbc, I am not able to replicate the problem of multi-processing being as slow as single-threaded. The single-threaded run took ~6.5 times longer than the multi-process one above (which used 20 processes):

```
$ sacct -j 10484 -o JobID,JobName,Partition,AllocCPUs,State,Elapsed,ExitCode,AveCPU,MaxRSS,MaxVM,ReqMem --units=GMKTP
       JobID    JobName  Partition  AllocCPUS      State    Elapsed ExitCode     AveCPU     MaxRSS  MaxVMSize     ReqMem
------------ ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- ---------- ----------
10484         serialise     medium          1  COMPLETED   00:38:13      0:0   00:37:59      9.82G     10.04G    48.83Gn
```

For a bit less memory usage, the same data took ~38 minutes to analyse, versus ~6.5 minutes with 20 processes.
Results with 08d6cbc:

* no prechunking, -p32: Elapsed (wall clock) time (h:mm:ss or m:ss): 46:40.63 (32 mins for the BLAST loading)
* prechunking (32 files), -p32: Elapsed (wall clock) time (h:mm:ss or m:ss): 1:23:05 (58 mins for the BLAST loading)
* no prechunking, -p1: Elapsed (wall clock) time (h:mm:ss or m:ss): 1:48:28

Memory use looks fine (multiprocessing_method: spawn).

Just to query: is it correct that the number of processors should match the number of tab chunks, i.e. as indicated below? Below are the BLAST loading lines from the log for the prechunking run above; the reading and finished-reading timestamps are sequential for each chunk, is that as it should be? In my prechunked run I'm pointing at a directory of 32 tsv files.

And from the run with no prechunking and -p32:
Excellent news, thank you! Now the job running statistics finally make sense.

To answer your query: when analysing tabular data, the number of processors does not need to match the number of files, so e.g. Mikado will be able to exploit all threads when analysing 4 files with 32 processors. The reason analysing multiple files takes longer is that the initial stage (loading the BLAST file and computing additional values, such as the query ID in the database or the minimum evalue per hit) happens in a single-threaded way. There is therefore a one-off cost per file that makes loading multiple files more time-expensive (at the gain of lower memory usage).

Memory usage, as I was saying earlier, can be decreased by tweaking the maximum number of objects held in memory (see the sketch below). The trade-off is that a lower number means flushing to disk more often, a single-threaded and quite expensive bottleneck (this is down to SQLite and I really can't do much about it directly).

Do you think we can finally merge to master?
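As a generic sketch of the buffer-and-flush trade-off described above (this is not Mikado's serialiser; the table schema, row format and default threshold are invented for illustration), rows are accumulated in memory and written to SQLite whenever the buffer reaches the threshold; a smaller threshold means less RAM but more frequent, serial flushes.

```python
# Generic sketch of the buffer-and-flush trade-off, NOT Mikado's serialiser.
# The table schema, row format and default threshold are invented for
# illustration: smaller max_objects -> less RAM, more frequent serial flushes.
import sqlite3

def serialise_rows(rows, db_path="hits.db", max_objects=100_000):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS hit (query TEXT, target TEXT, evalue REAL)")
    buffer = []
    for row in rows:                  # row = (query, target, evalue)
        buffer.append(row)
        if len(buffer) >= max_objects:
            conn.executemany("INSERT INTO hit VALUES (?, ?, ?)", buffer)
            conn.commit()             # the expensive, single-threaded step
            buffer.clear()
    if buffer:                        # flush the remainder
        conn.executemany("INSERT INTO hit VALUES (?, ?, ?)", buffer)
        conn.commit()
    conn.close()

if __name__ == "__main__":
    serialise_rows([("tx1", "protA", 1e-30), ("tx2", "protB", 2e-10)], max_objects=1)
```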
@lucventurini
* For EI-CoreBioinformatics#280: first tentative attempt to improve the speed of serialise.
* For EI-CoreBioinformatics#280: the biggest, by far, problem in `mikado serialise` for the ORFs was that the reading process was **blocking**, due to using a simple queue with a bare `put` instead of `put_nowait`. This single change makes the code finally parallel and therefore provides a massive speed-up.
* Correct commit 4b88696: removing the evaluation of invalidity causes bugs. It's better to recalculate constantly in this case.
* Correct previous commit.
* Correct .travis.yml: edited Travis to reflect the changes in master.
* Correct travis.
* Removing another simple queue from ORF, which could have again considerably slowed down `serialise`.
* For EI-CoreBioinformatics#280: now BED12 objects use numpy arrays rather than simple lists. Also, now using properties to avoid calculating the invalidity of a BED12 object over and over again. This should lead to some more speed improvements.
* BROKEN; for EI-CoreBioinformatics#280: trying to implement the slow functions using NumPy. Some improvements but currently broken, and it could be better.
* For EI-CoreBioinformatics#280: fixed the previous error.
* For EI-CoreBioinformatics#280: fastest version of serialise yet. A shame that the new parser from bioPython is so slow.
* For EI-CoreBioinformatics#280: making the merging algorithm much faster, thanks to Stack Overflow.
* For EI-CoreBioinformatics#280: starting to reorganise code to allow using BLAST tabular, as opposed to only XMLs.
* For EI-CoreBioinformatics#280: progress on defining the functions for parsing tabular BLAST.
* Starting to implement the functions necessary to subdivide the work in groups.
* Stub for the prepare_tab_hit function.
…ioinformatics#305)

* Issue EI-CoreBioinformatics#280: `mikado serialise` has been refactored so that:
  * loading of BLAST XMLs should be faster thanks to using Cython on the most time-expensive function;
  * Mikado now also accepts *tabular* BLAST data (custom format, we need the `ppos` and `btop` extra fields);
  * `daijin` now automatically generates *tabular* rather than XML BLAST results;
  * `mikado` will now use `spawn` as the default multiprocessing method (see the sketch after this list). This avoids memory-accounting problems in e.g. SLURM (sometimes `fork` leads the HPC management system to think that the shared memory is duplicated, massively and falsely inflating the accounting of memory usage).
* Issue EI-CoreBioinformatics#270: now Mikado will remove redundancy based on intron chains.
* For EI-CoreBioinformatics#270: now `mikado prepare` will remove redundant transcripts based on their *intron chains* / *monoexonic span overlap*, rather than start/end. Exact CDS match still applies.
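As a generic illustration of the `spawn` start method mentioned in the list above (not Mikado's code; the worker function and data are placeholders), workers start as fresh interpreters rather than forked copies of the parent, which avoids schedulers such as SLURM attributing the parent's shared memory to every child.

```python
# Generic illustration of selecting the "spawn" start method, NOT Mikado's
# code: workers start as fresh interpreters instead of forked copies of the
# parent, so schedulers such as SLURM do not attribute the parent's shared
# memory to every child. The worker function and data are placeholders.
import multiprocessing as mp

def work(chunk):
    return sum(chunk)          # stand-in for real per-chunk processing

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=4) as pool:
        print(pool.map(work, [[1, 2], [3, 4], [5, 6]]))
```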
Hi,
I have been running Mikado on the new tomato genome (size: 1.01 Gbp). It finished the `mikado prepare` step, but it has now been running `mikado serialise` for quite a long time. I have not provided any ORFs for this run. I would like to know the estimated time for completion of this step, and also whether not providing ORFs could cause any problem.
```
2020-02-18 22:31:08,078 - serialiser - serialise.py:273 - INFO - setup - MainProcess - Command line: /sonas-hs/schatz/hpc/home/sramakri/miniconda2/envs/mikado/bin/mikado serialise --json-conf groundcherry_conf.yaml --xml mikado.blast.xml.gz --blast_targets ../all_proteins_for_mikado_uniq.fasta -p 50 --transcripts mikado_prepared.fasta
2020-02-18 22:31:08,079 - serialiser - serialise.py:288 - INFO - setup - MainProcess - Requested 50 threads, forcing single thread: False
2020-02-18 22:31:08,079 - serialiser - serialise.py:79 - INFO - load_junctions - MainProcess - Starting to load junctions: ['/seq/schatz/sramakri/genome_annotations/groundcherry/RNASeq/STAR/portcullis/3-filt/portcullis_filtered.pass.junctions.bed']
2020-02-18 22:31:40,761 - serialiser - serialise.py:89 - INFO - load_junctions - MainProcess - Loaded junctions
2020-02-18 22:31:40,761 - serialiser - serialise.py:147 - INFO - load_orfs - MainProcess - No ORF data provided, skipping
2020-02-18 22:31:40,761 - serialiser - serialise.py:104 - INFO - load_blast - MainProcess - Starting to load BLAST data
2020-02-18 22:32:17,543 - serialiser - xml_serialiser.py:364 - INFO - __serialize_targets - MainProcess - Started to serialise the targets
2020-02-18 22:32:30,714 - serialiser - xml_serialiser.py:404 - INFO - __serialize_targets - MainProcess - Loaded 642296 objects into the "target" table
2020-02-18 22:32:48,264 - serialiser - xml_serialiser.py:305 - INFO - __serialize_queries - MainProcess - Started to serialise the queries
2020-02-18 22:33:07,826 - serialiser - xml_serialiser.py:351 - INFO - __serialize_queries - MainProcess - 491431 in queries
```