Mikado serialize running time #280
Dear @srividya22,
Even with the newer version, a key way in which we speed up the reading of BLAST results is to split the FASTA file into chunks, execute BLAST on each chunk, and then have `mikado serialise` read the different XML files in parallel. This is because reading XML files is inherently a very slow process. From what I am seeing, I think you have probably run BLAST on the file as a whole, without chunking. In that case `mikado serialise` would indeed be very slow; with chunking (say into 20 or 30 chunks) I would expect it to be considerably faster.
This would be a problem, in general, yes, given that you have ~half a million transcripts. We do have a Snakemake-based pipeline, Daijin, which automates these steps and is included in Mikado (documentation and code are in the repository). Hopefully it can provide some guidance on how to execute Mikado for your project.
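As a rough, generic illustration of the chunking approach described above (this is not Daijin's implementation; the input file name and chunk count are just examples), the sketch below splits a FASTA file into roughly equal chunks with Biopython, so that BLAST or DIAMOND can be run on each chunk and the resulting files read in parallel by `mikado serialise`:

```python
# A rough sketch of the chunking approach described above, NOT Daijin's
# implementation. "mikado_prepared.fasta" and the chunk count are examples.
# Each chunk can then be given to BLAST/DIAMOND separately, and the resulting
# outputs fed to `mikado serialise` together.
from Bio import SeqIO

def split_fasta(fasta_in, n_chunks=20, prefix="chunk"):
    records = list(SeqIO.parse(fasta_in, "fasta"))
    per_chunk = -(-len(records) // n_chunks)  # ceiling division
    for idx in range(n_chunks):
        batch = records[idx * per_chunk:(idx + 1) * per_chunk]
        if not batch:
            break
        SeqIO.write(batch, f"{prefix}_{idx + 1:02d}.fasta", "fasta")

if __name__ == "__main__":
    split_fasta("mikado_prepared.fasta", n_chunks=20)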
Update: the main problem is in the function which performs the nucleotide-to-protein translation; it is by far the bottleneck in loading ORFs. I am looking for ways to improve its speed (e.g. with a drop-in C extension), which should massively increase the speed of `mikado serialise`.
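For context on why such a function is expensive, here is a minimal generic sketch of codon-by-codon translation in pure Python (not Mikado's actual routine; the codon table is deliberately abridged): every transcript costs one dictionary lookup per codon, which adds up quickly over hundreds of thousands of ORFs and is exactly the kind of tight loop a C extension can accelerate.

```python
# Generic illustration of why codon-by-codon translation in pure Python is a
# bottleneck: one dictionary lookup per codon, repeated for every ORF.
# NOT Mikado's actual routine; the codon table below is deliberately abridged.
CODON_TABLE = {
    "TTT": "F", "TTC": "F", "TTA": "L", "TTG": "L",
    "ATG": "M", "TGG": "W", "TAA": "*", "TAG": "*", "TGA": "*",
    # ... the remaining codons of the standard genetic code would go here
}

def translate(cds: str) -> str:
    cds = cds.upper()
    protein = []
    # The per-codon loop is the hot path that a C extension would replace.
    for pos in range(0, len(cds) - len(cds) % 3, 3):
        protein.append(CODON_TABLE.get(cds[pos:pos + 3], "X"))  # "X" = unknown codon
    return "".join(protein)

if __name__ == "__main__":
    print(translate("ATGTTTTGA"))  # -> "MF*"
```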
…ORFs was that the reading process was **blocking**, due to using a simple queue with a bare `put` instead of `put_nowait`. This single change makes the code finally parallel and therefore provides a massive speed-up.
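A minimal, generic sketch of the difference (not Mikado's actual code; the function names and demo data are made up): with a bounded queue, a bare `put()` stalls the reading process whenever the consumers lag behind, whereas `put_nowait()` raises `queue.Full` immediately, so the reader can back off or keep doing useful work and retry.

```python
# A minimal, generic sketch of the blocking vs non-blocking enqueue.
# NOT Mikado's actual code; the function names and demo data are made up.
import multiprocessing as mp
import queue
import time

def read_orfs_blocking(q, records):
    for rec in records:
        q.put(rec)                     # blocks while the bounded queue is full

def read_orfs_nonblocking(q, records):
    pending = list(records)
    while pending:
        try:
            q.put_nowait(pending[0])   # never blocks; raises queue.Full instead
            pending.pop(0)
        except queue.Full:
            time.sleep(0.01)           # back off briefly (or do other useful work)

if __name__ == "__main__":
    q = mp.Queue(maxsize=1000)
    read_orfs_nonblocking(q, list(range(10)))
    print([q.get() for _ in range(10)])
```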
I have created a new branch, issue-280, with this fix. By speed-up I mean a ~30% increase in performance when single-threaded, and an almost linear increase in performance with multiple processes, as the current code effectively blocks on that queue.
Thanks @lucventurini. Do you recommend installing the code from that issue-280 branch for the speed-up?
I would, yes. If you could also report whether the improvement is present on your system as well, we would be very grateful. Kind regards
Sure, I will do that. Also, as you suggested, I have split the blastx XML files (using a Java script) for speed-up. However, I am not sure whether I should also split the mikado_prepared.fasta file and the ORFs BED file from TransDecoder for speed-up; kindly comment on that. I have been running TransDecoder v5.5.0, but TransDecoder.Predict seems to fail with mikado_prepared.fasta at the step below, in `gff3_file_to_proteins.pl` line 81. The corresponding sequence ID in TransDecoder's longest_orfs.pep is "CLASS2_SRR7066585_Super-Scaffold_1.0.0.p1"; the two IDs are different, so it looks like an error in parsing the IDs from the GFF3.
No, mikado will take care of parallelising that (as BED files are much easier to parse and process than XML files).
Unfortunately I am not an expert on the code of TransDecoder. I know, though, that its internal pipeline manager can get confused if a run gets cancelled and restarted afterwards. If that has happened, I would recommend deleting the whole working subfolder and restarting from LongOrfs.
As suggested, I have built the version from the issue-280 branch of the repository. I ran Prodigal instead of TransDecoder for this test. The ORFs got loaded, but when loading the BLAST XML I see errors in the log like `File "blast_serializer/utils.py", line 171, in prepare_hit`.
That error comes from the parsing of the BLAST files. However, there is something strange: that error is NOT at line 171 but rather at line 185. Apologies for asking, but would you please be able to double-check the installation of Mikado? It looks like a different, older version might be installed.
Edit: I checked, and it looks like you have installed and run a version that is quite old and still contains bug #150. Would you please be able to check whether that's the case? Kind regards
…her than simple lists. Also, now using properties to avoid calculating the invalidity of a BED12 object over and over again. This should lead to some more speed improvements.
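As a generic sketch of the caching idea in the commit above (not the actual BED12 implementation; the class and checks are simplified placeholders), the validity test is computed once per object and only recomputed when a coordinate changes. Note that, as the merge notes later in this thread record, this caching was subsequently partly reverted because stale values caused bugs, which is the classic risk of the pattern.

```python
# Generic sketch of caching an expensive derived check, NOT the real BED12
# class: "invalid" is computed once and reused until a coordinate changes.
class Bed12Like:
    def __init__(self, start, end, thick_start, thick_end):
        self._start, self._end = start, end
        self._thick_start, self._thick_end = thick_start, thick_end
        self._invalid = None                    # cached result of the check

    @property
    def invalid(self):
        if self._invalid is None:
            # The (expensive) consistency checks run only on first access.
            self._invalid = not (self._start <= self._thick_start
                                 <= self._thick_end <= self._end)
        return self._invalid

    @property
    def start(self):
        return self._start

    @start.setter
    def start(self, value):
        self._start = value
        self._invalid = None                    # coordinates changed: drop the cache

if __name__ == "__main__":
    bed = Bed12Like(0, 300, 30, 270)
    print(bed.invalid)   # False, computed once
    bed.start = 100      # cache is invalidated
    print(bed.invalid)   # recomputed: now True (thick_start < start)
```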
I am also working on making the BLAST XML serialisation faster. I will track progress on this here.
…unctions using NumPy. Some improvements but currently broken, and it could be better
…ame that the new parser from bioPython is so slow
I have created another branch, issue-280-dev, with the improvements for the XML serialisation. In this new branch I have more than halved the time needed to process a single HSP. Unfortunately, moving to the new version of Biopython brings a parser that is quite slow.
…er, thanks to Stack Overflow
… Some improvements but currently broken, and it could be better
I see you have made some changes since yesterday. Let me know when it is a good time to rebuild and test the new changes in the issue-280-dev branch.
…ole of Mikado. Lots of amendments to allow it to function properly. For #280.
Now Mikado uses … Commit f074581 is the latest, passes all tests, and should be used to re-test the branch. I forecast a memory usage of about 20GB for your massive dataset, and probably less than 52 minutes for the analysis when taking in the whole file. Please let me know. This should function and behave well with SLURM.
The pre-chunked job is running, but I can see an error reported from the ORF loading and 0 ORFs loaded; perhaps this is not due to the changes introduced here?
The f65bbb1 run with no pre-chunking needed over 600G of memory.
Good news: the pre-chunked run with the latest f074581 version loaded the BLAST in 25 mins. However, the job finished at 22:38 according to the log, yet as of Thu 2 Apr 22:46:39 BST 2020 it is still running.
Hi @swarbred, thanks for testing this. Regarding the run: it's great to hear that it finished quickly, though.
…ecause they start too soon). EI-CoreBioinformatics#280
6fcad94 completed successfully on the pre-chunked tab file and took 54 mins for the BLAST loading (so longer than the previous run on the same data). When not pre-chunking the file, the memory required for serialise is a substantial 700G, which seems excessive just to split the file. While it's simple for someone to either run DIAMOND on chunks or split the file post-DIAMOND, it would seem helpful to do this internally in Mikado.
Thank you for testing and confirming that the hanging does not happen any longer. I would expect that the extra time is most likely due to the loading of the ORFs.
This is indeed excessive. I do expect somewhat high memory usage, as each sub-process holds part of the data in memory while it works, but 700G is really excessive. How big, in GB, is the tabular BLAST file? Knowing this might help me understand the order of magnitude better. Also, it might be worthwhile to explicitly set `multiprocessing_method: spawn` in the configuration file, and to relaunch with the latest commit.
To be fair, @ljyanesm insists that we could do something different, i.e. quickly scan the file, divide it by lines, then tell each subprocess to analyse from line A to line B.
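A rough sketch of that idea (an assumption of how it could look, not the implemented code): one pass over the file records the byte offset at which every line starts, and each worker is then handed an independent (start, end) byte range to parse on its own.

```python
# Rough sketch of the idea above, NOT the implemented code: one pass records
# the byte offset of every line start, then each worker parses an independent
# (start, end) byte range of the tabular BLAST file.
def chunk_offsets(path, n_chunks):
    offsets = [0]
    with open(path, "rb") as handle:
        while handle.readline():
            offsets.append(handle.tell())
    n_lines = len(offsets) - 1
    per_chunk = max(1, -(-n_lines // n_chunks))          # ceiling division
    return [(offsets[i], offsets[min(i + per_chunk, n_lines)])
            for i in range(0, n_lines, per_chunk)]

def parse_byte_range(path, start, end):
    # Each worker opens the file on its own and only reads its slice.
    with open(path, "rb") as handle:
        handle.seek(start)
        for line in handle.read(end - start).splitlines():
            yield line.decode().split("\t")
```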
Related to the question above about how large the file is: would you please be able to run the data using only one processor? I would expect the memory for a single-process run to be between 2 and 3 times lower.
The 54 mins was just for the homology loading; the ORF loading was an additional 7 mins (so similar to earlier runs). The 700G run was with the previous 6fcad94 version; due to the ORF loading issue it hung, so there is no `time -v` output, only the figures from SLURM. I will start a run on the non-chunked file with the latest version and restrict it to one processor.
OK, thank you! If (as I suspect) memory is acceptable for the single-processor run, I will focus on finding ways to restrict memory usage for the ancillary processes.
…e data from the temporary msgpack files. This should reduce the memory usage. Also, reducing the maximum number of objects per-process (EI-CoreBioinformatics#280).
@lucventurini
Cool, then as I thought the main problem with memory is indeed the sub-processes. Hopefully the new change will improve matters. If I may insist, please make sure that `multiprocessing_method` is set to `spawn` in the configuration.
Hi @swarbred
I have rerun on my data of *H. sapiens*, which should be comparable to yours, using 08d6cbc:

```
$ du -csh blast.tsv
1.7G    blast.tsv
1.7G    total
```

Command line:

Running time and memory:

```
$ sacct -j 10482 -o JobID,JobName,Partition,AllocCPUs,State,Elapsed,ExitCode,AveCPU,MaxRSS,MaxVM,ReqMem --units=GMKTP
       JobID    JobName  Partition  AllocCPUS      State    Elapsed ExitCode     AveCPU     MaxRSS  MaxVMSize     ReqMem
------------ ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- ---------- ----------
10482         serialise     medium         20  COMPLETED   00:06:29      0:0   00:05:58     11.84G     26.61G    48.83Gn
```

So this version should be reasonable both in terms of time and memory. A greater number of processors means greater memory usage, as each process will hold some of the data in RAM while performing operations; a greater number of objects held in memory before flushing to the database will also increase usage. From my observation of the process, loading the data into memory with …
PS: again with 08d6cbc, I am not able to replicate the problem of multi-processing being as slow as single-threaded. The single-threaded run took ~6.5 times longer than the multi-process one above (which used 20 processes):

```
$ sacct -j 10484 -o JobID,JobName,Partition,AllocCPUs,State,Elapsed,ExitCode,AveCPU,MaxRSS,MaxVM,ReqMem --units=GMKTP
       JobID    JobName  Partition  AllocCPUS      State    Elapsed ExitCode     AveCPU     MaxRSS  MaxVMSize     ReqMem
------------ ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- ---------- ----------
10484         serialise     medium          1  COMPLETED   00:38:13      0:0   00:37:59      9.82G     10.04G    48.83Gn
```

For a bit less memory usage, the same data took ~38 minutes to analyse, versus ~6.5 minutes with 20 processes.
Results with 08d6cbc:

* no prechunking, -p32: Elapsed (wall clock) time (h:mm:ss or m:ss): 46:40.63 (32 mins for the BLAST loading)
* prechunking (32 files), -p32: Elapsed (wall clock) time (h:mm:ss or m:ss): 1:23:05 (58 mins for the BLAST loading)
* no prechunking, -p1: Elapsed (wall clock) time (h:mm:ss or m:ss): 1:48:28

Memory use looks fine (multiprocessing_method: spawn).

Just to query: is it correct that the number of processors should match the number of tab chunks, i.e. as indicated below? Below are the BLAST loading lines from the log for the prechunking run above; the reading and finished-reading timestamps are sequential for each chunk, is that as it should be? In my prechunked run I'm pointing at a directory of 32 tsv files.

And from the run with no prechunking and -p32:
Excellent news, thank you! Now the job running statistics finally make sense.

To answer your query: when analysing tabular data, the number of processors does not need to match the number of files, so e.g. Mikado will be able to exploit all threads when analysing 4 files with 32 processors. The reason analysing multiple files takes longer is that the initial stage (loading the BLAST file and computing additional values, such as the query ID in the database or the minimum evalue per hit) happens in a single-threaded way. There is therefore a one-off cost per file that makes loading multiple files more time-expensive (at the gain of lower memory usage).

Memory usage, as I was saying earlier, can be decreased by tweaking the maximum number of objects held in memory (see the sketch below). The trade-off is that a lower number means flushing to disk more often, a single-threaded and quite expensive bottleneck (this is down to SQLite and I really can't do much about it directly).

Do you think we can finally merge to master?
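As a generic sketch of the buffer-and-flush trade-off described above (this is not Mikado's serialiser; the table schema, row format and default threshold are invented for illustration), rows are accumulated in memory and written to SQLite whenever the buffer reaches the threshold; a smaller threshold means less RAM but more frequent, serial flushes.

```python
# Generic sketch of the buffer-and-flush trade-off, NOT Mikado's serialiser.
# The table schema, row format and default threshold are invented for
# illustration: smaller max_objects -> less RAM, more frequent serial flushes.
import sqlite3

def serialise_rows(rows, db_path="hits.db", max_objects=100_000):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS hit (query TEXT, target TEXT, evalue REAL)")
    buffer = []
    for row in rows:                  # row = (query, target, evalue)
        buffer.append(row)
        if len(buffer) >= max_objects:
            conn.executemany("INSERT INTO hit VALUES (?, ?, ?)", buffer)
            conn.commit()             # the expensive, single-threaded step
            buffer.clear()
    if buffer:                        # flush the remainder
        conn.executemany("INSERT INTO hit VALUES (?, ?, ?)", buffer)
        conn.commit()
    conn.close()

if __name__ == "__main__":
    serialise_rows([("tx1", "protA", 1e-30), ("tx2", "protB", 2e-10)], max_objects=1)
```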
@lucventurini
* For EI-CoreBioinformatics#280: first tentative attempt to improve the speed of serialise.
* For EI-CoreBioinformatics#280: the biggest, by far, problem in `mikado serialise` for the ORFs was that the reading process was **blocking**, due to using a simple queue with a bare `put` instead of `put_nowait`. This single change makes the code finally parallel and therefore provides a massive speed-up.
* Correct commit 4b88696: removing the evaluation of invalidity causes bugs. It's better to recalculate constantly in this case.
* Correct previous commit.
* Correct .travis.yml: edited Travis to reflect the changes in master.
* Correct travis.
* Removing another simple queue from ORF, which could have again considerably slowed down `serialise`.
* For EI-CoreBioinformatics#280: now BED12 objects use numpy arrays rather than simple lists. Also, now using properties to avoid calculating the invalidity of a BED12 object over and over again. This should lead to some more speed improvements.
* BROKEN; for EI-CoreBioinformatics#280: trying to implement the slow functions using NumPy. Some improvements but currently broken, and it could be better.
* For EI-CoreBioinformatics#280: fixed the previous error.
* For EI-CoreBioinformatics#280: fastest version of serialise yet. A shame that the new parser from bioPython is so slow.
* For EI-CoreBioinformatics#280: making the merging algorithm much faster, thanks to Stack Overflow.
* For EI-CoreBioinformatics#280: starting to reorganise code to allow using BLAST tabular, as opposed to only XMLs.
* For EI-CoreBioinformatics#280: progress on defining the functions for parsing tabular BLAST.
* Starting to implement the functions necessary to subdivide the work in groups.
* Stub for the prepare_tab_hit function.
…ioinformatics#305)

* Issue EI-CoreBioinformatics#280: `mikado serialise` has been refactored so that:
  * loading of BLAST XMLs should be faster thanks to using Cython on the most time-expensive function;
  * Mikado now also accepts *tabular* BLAST data (custom format, we need the `ppos` and `btop` extra fields);
  * `daijin` now automatically generates *tabular* rather than XML BLAST results;
  * `mikado` will now use `spawn` as the default multiprocessing method (see the sketch after this list). This avoids memory-accounting problems in e.g. SLURM (sometimes `fork` leads the HPC management system to think that the shared memory is duplicated, massively and falsely inflating the accounting of memory usage).
* Issue EI-CoreBioinformatics#270: now Mikado will remove redundancy based on intron chains.
* For EI-CoreBioinformatics#270: now `mikado prepare` will remove redundant transcripts based on their *intron chains* / *monoexonic span overlap*, rather than start/end. Exact CDS match still applies.
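As a generic illustration of the `spawn` start method mentioned in the list above (not Mikado's code; the worker function and data are placeholders), workers start as fresh interpreters rather than forked copies of the parent, which avoids schedulers such as SLURM attributing the parent's shared memory to every child.

```python
# Generic illustration of selecting the "spawn" start method, NOT Mikado's
# code: workers start as fresh interpreters instead of forked copies of the
# parent, so schedulers such as SLURM do not attribute the parent's shared
# memory to every child. The worker function and data are placeholders.
import multiprocessing as mp

def work(chunk):
    return sum(chunk)          # stand-in for real per-chunk processing

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=4) as pool:
        print(pool.map(work, [[1, 2], [3, 4], [5, 6]]))
```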
Hi,
I have been running Mikado on the new tomato genome (size: 1.01 Gbp). It finished the `mikado prepare` step, but it has now been running `mikado serialise` for quite a long time. I have not provided any ORFs for this run. I would like to know the estimated time for completion of this step, and also whether not providing ORFs could cause any problem.
```
2020-02-18 22:31:08,078 - serialiser - serialise.py:273 - INFO - setup - MainProcess - Command line: /sonas-hs/schatz/hpc/home/sramakri/miniconda2/envs/mikado/bin/mikado serialise --json-conf groundcherry_conf.yaml --xml mikado.blast.xml.gz --blast_targets ../all_proteins_for_mikado_uniq.fasta -p 50 --transcripts mikado_prepared.fasta
2020-02-18 22:31:08,079 - serialiser - serialise.py:288 - INFO - setup - MainProcess - Requested 50 threads, forcing single thread: False
2020-02-18 22:31:08,079 - serialiser - serialise.py:79 - INFO - load_junctions - MainProcess - Starting to load junctions: ['/seq/schatz/sramakri/genome_annotations/groundcherry/RNASeq/STAR/portcullis/3-filt/portcullis_filtered.pass.junctions.bed']
2020-02-18 22:31:40,761 - serialiser - serialise.py:89 - INFO - load_junctions - MainProcess - Loaded junctions
2020-02-18 22:31:40,761 - serialiser - serialise.py:147 - INFO - load_orfs - MainProcess - No ORF data provided, skipping
2020-02-18 22:31:40,761 - serialiser - serialise.py:104 - INFO - load_blast - MainProcess - Starting to load BLAST data
2020-02-18 22:32:17,543 - serialiser - xml_serialiser.py:364 - INFO - __serialize_targets - MainProcess - Started to serialise the targets
2020-02-18 22:32:30,714 - serialiser - xml_serialiser.py:404 - INFO - __serialize_targets - MainProcess - Loaded 642296 objects into the "target" table
2020-02-18 22:32:48,264 - serialiser - xml_serialiser.py:305 - INFO - __serialize_queries - MainProcess - Started to serialise the queries
2020-02-18 22:33:07,826 - serialiser - xml_serialiser.py:351 - INFO - __serialize_queries - MainProcess - 491431 in queries
```