
Mikado serialize running time #280

Closed
srividya22 opened this issue Feb 27, 2020 · 75 comments

@srividya22

Hi,

I have been running Mikado on the new tomato genome (size: 1.01 Gbp). It finished the mikado prepare step, but it has now been running mikado serialise for quite a long time. Note that I have not provided any ORFs for this run. What would be an estimated time for the completion of this step? Also, please let me know whether not providing ORFs could be a problem.

2020-02-18 22:31:08,078 - serialiser - serialise.py:273 - INFO - setup - MainProcess - Command line: /sonas-hs/schatz/hpc/home/sramakri/miniconda2/envs/mikado/bin/mikado serialise --json-conf groundcherry_conf.yaml --xml mikado.blast.xml.gz --blast_targets ../all_proteins_for_mikado_uniq.fasta -p 50 --transcripts mikado_prepared.fasta
2020-02-18 22:31:08,079 - serialiser - serialise.py:288 - INFO - setup - MainProcess - Requested 50 threads, forcing single thread: False
2020-02-18 22:31:08,079 - serialiser - serialise.py:79 - INFO - load_junctions - MainProcess - Starting to load junctions: ['/seq/schatz/sramakri/genome_annotations/groundcherry/RNASeq/STAR/portcullis/3-filt/portcullis_filtered.pass.junctions.bed']
2020-02-18 22:31:40,761 - serialiser - serialise.py:89 - INFO - load_junctions - MainProcess - Loaded junctions
2020-02-18 22:31:40,761 - serialiser - serialise.py:147 - INFO - load_orfs - MainProcess - No ORF data provided, skipping
2020-02-18 22:31:40,761 - serialiser - serialise.py:104 - INFO - load_blast - MainProcess - Starting to load BLAST data
2020-02-18 22:32:17,543 - serialiser - xml_serialiser.py:364 - INFO - __serialize_targets - MainProcess - Started to serialise the targets
2020-02-18 22:32:30,714 - serialiser - xml_serialiser.py:404 - INFO - __serialize_targets - MainProcess - Loaded 642296 objects into the "target" table
2020-02-18 22:32:48,264 - serialiser - xml_serialiser.py:305 - INFO - __serialize_queries - MainProcess - Started to serialise the queries
2020-02-18 22:33:07,826 - serialiser - xml_serialiser.py:351 - INFO - __serialize_queries - MainProcess - 491431 in queries

@lucventurini
Collaborator

Dear @srividya22 ,
Thank you for reporting this. May I ask which version of Mikado you are running? If you are using the latest official release (1.2.4), note that we have since made quite a few improvements, so installing from the master branch here on GitHub might be a better option.

Even with the newer version, though, a key way to speed up the reading of BLAST results is to split the FASTA file into chunks, execute BLAST on each chunk, and then have mikado serialise read the resulting XML files in parallel; reading XML files is inherently a very slow process. From what I am seeing, however, I think you have probably run BLAST on the file as a whole, without chunking. In that case mikado serialise would indeed be very slow.

With chunking (say into 20 or 30 pieces), in my experience mikado serialise should finish in two or three hours at most.
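For illustration, here is a minimal sketch of such pre-chunking using Biopython (the chunk count and file names are examples, not Mikado commands):

    # Split mikado_prepared.fasta into N chunks for independent BLAST runs.
    from Bio import SeqIO

    N_CHUNKS = 20
    handles = [open(f"mikado_prepared.chunk_{i:03d}.fasta", "w") for i in range(N_CHUNKS)]

    # Round-robin the transcripts so every chunk holds a similar number of sequences.
    for idx, record in enumerate(SeqIO.parse("mikado_prepared.fasta", "fasta")):
        SeqIO.write(record, handles[idx % N_CHUNKS], "fasta")

    for handle in handles:
        handle.close()

Each chunk can then be BLASTed on its own (e.g. blastx -query mikado_prepared.chunk_000.fasta -outfmt 5 ...), and mikado serialise pointed at all of the resulting XML files at once.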

Also, please let me know whether not providing ORFs could be a problem.

Yes, in general this would be a problem. As you have ~half a million transcripts, I would say that both TransDecoder and Prodigal would be appropriate.

We do have a Snakemake-based pipeline, Daijin, which automates these steps and is included in Mikado (documentation here; code here). Hopefully it can provide some guidance on how to execute Mikado for your project.

@lucventurini
Collaborator

Update: the main problem is the function which performs the nucleotide-to-protein translation; it is by far the bottleneck in loading ORFs.

I am looking for ways to improve its speed (e.g. with a drop-in C extension), which should massively increase the speed of serialise.
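For context, this is the kind of work that function has to perform for every ORF; a sketch, not Mikado's actual implementation:

    # Translating a CDS to protein; Bio.Seq delegates to an optimised table
    # lookup, whereas a hand-rolled per-codon Python loop is far slower when
    # repeated over hundreds of thousands of transcripts.
    from Bio.Seq import Seq

    def translate_orf(cds: str) -> str:
        return str(Seq(cds).translate())

    print(translate_orf("ATGGCCTAA"))  # -> "MA*"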

lucventurini added a commit that referenced this issue Mar 2, 2020
…ORFs was that the reading process was **blocking**, due to using a simple queue with a bare put instead of put_nowait. This single change makes the code finally parallel and therefore provides a massive speed-up.
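The queue change described in the commit above is easy to illustrate in plain Python (a sketch, not Mikado's actual code):

    # On a bounded multiprocessing queue, a bare put() blocks the producer
    # until a consumer frees a slot, effectively serialising the pipeline;
    # put_nowait() raises queue.Full immediately, so the reading process can
    # carry on instead of stalling.
    import multiprocessing as mp
    import queue

    def produce(q, item) -> bool:
        try:
            q.put_nowait(item)   # returns immediately, never blocks the reader
            return True
        except queue.Full:
            return False         # the caller may retry later rather than stall

    if __name__ == "__main__":
        q = mp.Queue(maxsize=4)
        for i in range(8):
            print(i, "queued" if produce(q, i) else "queue full, skipped")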
@lucventurini
Collaborator

lucventurini commented Mar 2, 2020

@srividya22
@swarbred

I have created a new branch, issue-280, where I am making the necessary improvements. The first two commits there should already provide a great speed-up compared to the current version of the code.

By speed-up I mean a ~30% increase in single-threaded performance and an almost linear scaling across processes; the current master version is actually slower in multiprocessing than in single-threaded mode.

@srividya22
Author

Thanks @lucventurini. Do you recommend installing the code from the issue-280 branch for the speed-up?

@lucventurini
Collaborator

@srividya22

I would, yes. If you could also report whether the improvement is present on your system, we would be very grateful.

Kind regards

@srividya22
Author

srividya22 commented Mar 2, 2020

Sure, I will do that. As you suggested, I have also split the blastx XML files (using a Java script) for the speed-up. However, I am not sure whether I should also split the mikado_prepared.fasta file and the ORFs BED file from TransDecoder; kindly comment on that.

I have been running TransDecoder v5.5.0, but TransDecoder.Predict seems to fail with mikado_prepared.fasta at the step below.
Error, no sequence for CLASS2_SRR7066585_Super-Scaffold_1.0.0 at gff3_file_to_proteins.pl line 81.
Error, cmd: gff3_file_to_proteins.pl --gff3 longest_orfs.pep.transdecoder.gff3 --fasta transdecoder_out/longest_orfs.pep --genetic_code Universal > longest_orfs.pep.transdecoder.pep died with ret 3072 No such file or directory at transdecoder/PerlLib/Pipeliner.pm line 185.
Pipeliner::run(Pipeliner=HASH(0x2b2dc41a9dd0)) called at /sonas-hs/schatz/hpc/home/sramakri/miniconda2/envs/mikado/bin/TransDecoder.Predict line 379

The corresponding sequence name in longest_orfs.pep is "CLASS2_SRR7066585_Super-Scaffold_1.0.0.p1", so the two IDs are different; it looks like an error in parsing the IDs from the GFF3.

@lucventurini
Collaborator

@srividya22

However, I am not sure whether I should also split the mikado_prepared.fasta file and the ORFs BED file from TransDecoder; kindly comment on that.

No, Mikado will take care of parallelising that (BED files are much easier to parse and process than XML files).

I have been running TransDecoder v5.5.0, but TransDecoder.Predict seems to fail with mikado_prepared.fasta at the step below.

Unfortunately I am not an expert on the TransDecoder code. I do know, though, that its internal pipeline manager can get confused if a run is cancelled and restarted afterwards. If that is what happened, I would recommend deleting the whole output subfolder and restarting from TransDecoder.LongOrfs.

@srividya22
Author

srividya22 commented Mar 3, 2020

Hi @lucventurini

As suggested, I have built the version from the issue-280 branch of the repository. I ran Prodigal instead of TransDecoder for this test. The ORFs got loaded, but when loading the BLAST XML I am seeing errors in the log like:

File "blast_serializer/utils.py", line 171, in prepare_hit
len(positives) * qmulti, q_aligned))
ValueError: Number of identical positions (2451) greater than number of aligned positions (2423)!

@lucventurini
Collaborator

lucventurini commented Mar 3, 2020

@srividya22

That error comes from the parsing of the BLAST files. However, there is something strange: that error is raised NOT at line 171 but rather at line 185. Apologies for asking, but would you please double-check the installation of Mikado? It looks like a different, older version might be installed.

Edit: I checked, and it looks like you have installed and run a version that is quite old and still contains bug #150. Would you please check whether that is the case?

Kind regards

lucventurini added a commit to lucventurini/mikado that referenced this issue Mar 3, 2020
…her than simple lists. Also, now using properties to avoid calculating the invalidity of a BED12 object over and over again. This should lead to some more speed improvements.
@lucventurini
Collaborator

@srividya22
@swarbred

I am also working on making the BLAST XML serialisation faster. I will track progress on this here.
There are two bottlenecks:

  • (potential) I am using the old BioPython utilities to parse the BLAST results, not the new Bio.SearchIO.parse. Switching to it might yield better performance (see the sketch below).
  • Over half of the time is spent in a single function that looks at the alignment lines. Any improvement there is bound to have high leverage on performance.
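For reference, the two Biopython entry points in question are shown below; both exist in Biopython, and the file name is only illustrative (an uncompressed BLAST XML output):

    from Bio.Blast import NCBIXML   # the "old" parser
    from Bio import SearchIO        # the newer, unified parser

    with open("mikado.blast.xml") as handle:
        for record in NCBIXML.parse(handle):                 # one record per query
            pass

    with open("mikado.blast.xml") as handle:
        for qresult in SearchIO.parse(handle, "blast-xml"):  # one QueryResult per query
            pass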

lucventurini added a commit to lucventurini/mikado that referenced this issue Mar 3, 2020
…unctions using NumPy. Some improvements but currently broken, and it could be better
lucventurini added a commit to lucventurini/mikado that referenced this issue Mar 3, 2020
lucventurini added a commit to lucventurini/mikado that referenced this issue Mar 3, 2020
…ame that the new parser from bioPython is so slow
@lucventurini
Collaborator

@swarbred

I have created another branch with the improvements for the XML serialisation, issue-280-dev. This branch can be tested tomorrow, after we confirm that the ORFs are now serialised in a reasonable time.

In this new branch I have more than halved the time needed to process a single HSP. Unfortunately, moving to the new version of BioPython XML parsing actually slows down operations. The result should still be a significant speed-up compared to the current version.

lucventurini added a commit to lucventurini/mikado that referenced this issue Mar 4, 2020
lucventurini added a commit that referenced this issue Mar 4, 2020
… Some improvements but currently broken, and it could be better
lucventurini added a commit that referenced this issue Mar 4, 2020
lucventurini added a commit that referenced this issue Mar 4, 2020
@srividya22
Author

Hi @lucventurini

I see you have made some changes since yesterday. Let me know when would be a good time to rebuild and test the new changes in the issue-280-dev branch.

lucventurini added a commit that referenced this issue Mar 30, 2020
…ole of Mikado*. Lots of amendments to allow it to function properly. For \#280.
@lucventurini
Collaborator

lucventurini commented Mar 30, 2020

Hi @swarbred @ljyanesm

Now Mikado uses spawn as the default multiprocessing method. This change required a lot of refactoring, hence the big commit (8bf584b) and the subsequent commits fixing minor bugs (last: f074581).

Commit f074581 is the latest, passes all tests, and should be used to re-test the branch. I forecast a memory usage of about 20GB for your massive dataset, and probably less than 52 minutes for the analysis when taking in the whole file.

Please let me know.

This should function and behave well with SLURM.
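For context, the difference between the two start methods in standard Python (a sketch, not Mikado's code): fork copies the parent's address space, which SLURM can mis-account as duplicated memory, whereas spawn starts each child as a fresh interpreter:

    import multiprocessing as mp

    def work() -> None:
        print("hello from a spawned child")

    if __name__ == "__main__":
        ctx = mp.get_context("spawn")   # the start method Mikado now defaults to
        proc = ctx.Process(target=work)
        proc.start()
        proc.join()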

lucventurini added a commit to lucventurini/mikado that referenced this issue Mar 30, 2020
lucventurini added a commit to lucventurini/mikado that referenced this issue Mar 30, 2020
@swarbred
Collaborator

swarbred commented Apr 2, 2020

Hi @lucventurini

The pre-chunked job is running, but I can see an error reported from the ORF loading, with 0 ORFs loaded; perhaps this is not due to the changes introduced here?

2020-04-02 21:04:50,160 - serialiser - orf.py:477 - INFO - __serialize_multiple_threads - MainProcess - Finished loading 0 ORFs into the database
2020-04-02 21:04:50,375 - serialiser - serialise.py:152 - INFO - load_orfs - MainProcess - Finished loading ORF data
2020-04-02 21:04:50,645 - serialiser - serialise.py:107 - INFO - load_blast - MainProcess - Starting to load BLAST data
2020-04-02 21:04:51,808 - serialiser - blast_serialiser.py:78 - INFO - __init__ - MainProcess - Number of dedicated workers: 32
Process Bed12ParseWrapper-2:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.7/site-packages/Mikado/parsers/bed12.py", line 1711, in run
    raise ValueError
ValueError
[the identical traceback is repeated for Bed12ParseWrapper-3 through Bed12ParseWrapper-33]

@swarbred
Collaborator

swarbred commented Apr 2, 2020

@lucventurini

I have also re-programmed it so that Mikado will chunk the BLAST file internally

The f65bbb1 run with no pre-chunking needed over 600G of memory.

@swarbred
Collaborator

swarbred commented Apr 2, 2020

@lucventurini

Good news: the pre-chunked run with the latest version, f074581, loaded the BLAST data in 25 minutes.

2020-04-02 21:04:50,645 - serialiser - serialise.py:107 - INFO - load_blast - MainProcess - Starting to load BLAST data
2020-04-02 21:29:40,232 - serialiser - serialise.py:125 - INFO - load_blast - MainProcess - Finished to load BLAST data

However, the job finished at 22:38 according to the log

2020-04-02 21:38:32,965 - serialiser - serialise.py:383 - INFO - serialise - MainProcess - Finished

but as of Thu 2 Apr 22:46:39 BST 2020 it is still running:

26283420 swarbred           ei-cb      1    32       120G    RUNNING  2020-04-02T22:02:51  2020-04-02T22:02:52           43:35              t384n16 swarbred.mikado-2.0r

@lucventurini
Collaborator

Hi @swarbred

Thanks for testing this. Regarding the run:

  • Mikado is hanging because the ORF serialisation crashed.
  • That crash, in turn, was caused by the changes related to multiprocessing. It is a synchronisation problem: when the child processes start, they expect the main process to have already dispatched data to analyse. Because of some internal delay they actually started too soon, hence the crash. It is a very silly mistake of mine that should be easily fixable.

It is great to hear, though, that it finished quickly.

lucventurini added a commit to lucventurini/mikado that referenced this issue Apr 3, 2020
@lucventurini
Collaborator

Hi @swarbred

Commit 6fcad94 corrects the bug that caused the hanging.

@swarbred
Collaborator

swarbred commented Apr 5, 2020

@lucventurini

6fcad94 completed successfully on the pre-chunked tab file; the BLAST loading took 54 minutes (so longer than the previous run on the same data).

When not pre-chunking the file, the memory required for serialise is substantial (700G), which seems excessive just to split the file. While it is simple for someone to either run DIAMOND on chunks or split the file after DIAMOND, it would seem helpful to do this internally in Mikado.

@lucventurini
Collaborator

lucventurini commented Apr 5, 2020

@swarbred

6fcad94 completed successfully on the pre-chunked tab file; the BLAST loading took 54 minutes (so longer than the previous run on the same data).

Thank you for testing and confirming that the hanging no longer happens. I would expect that the extra time is most likely due to the loading of the ORFs.

When not pre-chunking the file, the memory required for serialise is substantial (700G), which seems excessive just to split the file. While it is simple for someone to either run DIAMOND on chunks or split the file after DIAMOND, it would seem helpful to do this internally in Mikado.

This is indeed excessive. I do expect somewhat high memory usage, as I am:

  • loading the whole dataset into memory using pandas
  • quickly calculating the necessary per-hit and per-query data (reordering of HSPs, best e-value, etc.)
  • splitting the data amongst the sub-processes (see the sketch below)

Now, I do expect this to require quite a bit of memory, but 700G is really excessive. How big, in GB, is the tabular BLAST file? Knowing this might help me understand the order of magnitude better.
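A rough sketch of the pre-processing steps listed above, with assumed column names (the real table has more fields):

    import numpy as np
    import pandas as pd

    cols = ["qseqid", "sseqid", "evalue", "bitscore"]   # illustrative subset
    df = pd.read_csv("mikado_prepared_matches.tsv", sep="\t", names=cols)

    # Per-query derived values, e.g. the best e-value of each query...
    df["min_evalue"] = df.groupby("qseqid")["evalue"].transform("min")
    # ...and a stable ordering of the HSPs within each hit.
    df = df.sort_values(["qseqid", "sseqid", "evalue"])

    # Finally, the queries are split amongst the worker processes.
    query_chunks = np.array_split(df["qseqid"].unique(), 32)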

Also, it might be worthwhile to explicitly set multiprocessing_method=spawn in the configuration file (at the same level as threads). My reasoning is that I am somewhat disbelieving of the memory figures, and I am wondering whether they are still an artefact of SLURM accounting.

Also, relaunching with time -v might help to get a better idea.

@lucventurini
Collaborator

@swarbred

To be fair, @ljyanesm insists that we could do something different, i.e. quickly scan the file, divide it by lines, then tell each subprocess to analyse from line A to line B.
This would be much lighter on memory.
The problem is that it would rely on the file being pre-sorted appropriately. At a minimum, it would also require parsing the file twice, which risks being very slow, especially for the large files we are talking about.
I will have a think on how to improve the behaviour without consuming so much memory.
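A sketch of that line-offset idea (file name and chunk count are illustrative): one pass records where each line starts, and each worker then seeks to its own slice:

    def chunk_starts(path: str, n_chunks: int) -> list:
        """Single pass over the file, returning ~n_chunks starting byte offsets."""
        offsets, pos = [], 0
        with open(path, "rb") as handle:
            for line in handle:
                offsets.append(pos)
                pos += len(line)
        step = max(1, len(offsets) // n_chunks)
        return offsets[::step]

    starts = chunk_starts("mikado_prepared_matches.tsv", 32)
    # Each worker would seek() to its start and read up to the next offset.
    # Caveat, as noted above: all hits of a query must fall in one slice, so
    # the file has to be pre-sorted by query for this to be correct.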

@lucventurini
Collaborator

@swarbred

Related to the question above about how large the file is: would you please be able to run the data using only one processor?
The reason I am asking is that I would expect memory usage to follow this pattern:

  • three high bumps when loading the query, target, and final BLAST data into memory (single thread)
  • slightly higher memory usage while cleaning/processing the data (single thread)
  • approximately a doubling of memory when dispatching data to the processes (I can intervene here by deleting data from the main process, which should hopefully halve memory usage)
  • a slight further increase in memory while the child processes munch through the data

So I would expect memory for a single-process run to be two to three times lower.
If this is not the case, then the problem is in how I dispatch data to the processes.
If it is the case, then the problem is inherent in how I am using pandas.

@swarbred
Collaborator

swarbred commented Apr 6, 2020

6fcad94 completed successfully on the pre-chunked tab file; the BLAST loading took 54 minutes (so longer than the previous run on the same data).

Thank you for testing and confirming that the hanging no longer happens. I would expect that the extra time is most likely due to the loading of the ORFs.

The 54 minutes was just for the homology loading; the ORF loading was an additional 7 minutes (so similar to earlier runs).

Now, I do expect this to require quite a bit of memory, but 700G is really excessive. How big, in GB, is the tabular BLAST file?

du -h ../../diamond-0.9.24/diamond_tsv/mikado_prepared_matches.tsv
4.5G

The 700G run was with the version previous to 6fcad94; because of the ORF loading issue it hung, so there is no time -v output, but from SLURM:

       JobID      User    Account    Elapsed  Timelimit     MaxRSS     ReqMem  MaxVMSize      NCPUS ExitCode 
------------ --------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- -------- 
26283593      swarbred       tgac   16:00:56 90-00:00:+                 800Gn                    32      0:0 
26283593.ba+                 tgac   16:00:57            693241824K      800Gn 738333412K         32     0:15 
26283593.ex+                 tgac   16:00:56                 3384K      800Gn    180108K         32      0:0 

I will start a run on the non-chunked file with the latest version and restrict to one processor.

@lucventurini
Collaborator

I will start a run on the non-chunked file with the latest version and restrict to one processor.

OK, thank you! If (as I suspect) memory is acceptable for the single-processor run, I will focus on finding ways to restrict the memory usage of the ancillary processes.

lucventurini added a commit to lucventurini/mikado that referenced this issue Apr 6, 2020
…e data from the temporary msgpack files. This should reduce the memory usage. Also, reducing the maximum number of objects per-process (EI-CoreBioinformatics#280).
@lucventurini
Collaborator

@swarbred

With 08d6cbc I should have significantly reduced the amount of memory used by the sub-processes.
I will wait for you to let me know whether the single-process run uses a small enough amount of memory; if so, then 08d6cbc should have a good chance of actually fixing the memory usage.

@swarbred
Collaborator

swarbred commented Apr 6, 2020

@lucventurini
OK, the run completed. The homology loading alone took 1 hour with 1 process (so little difference from the pre-chunked run), using 20G of memory.

	Command being timed: "mikado serialise -p 1 -od mikado-2.0rc6_6fcad94_CBG_altrun --json-conf configuration.test2.yaml"
	User time (seconds): 3721.56
	System time (seconds): 206.08
	Percent of CPU this job got: 83%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:18:44
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 21724976
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 338
	Minor (reclaiming a frame) page faults: 52947180
	Voluntary context switches: 756872
	Involuntary context switches: 14044
	Swaps: 0
	File system inputs: 88397308
	File system outputs: 19900136
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
       JobID      User    Account    Elapsed  Timelimit     MaxRSS     ReqMem  MaxVMSize      NCPUS ExitCode 
------------ --------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- -------- 
26309509      swarbred       tgac   01:18:50 90-00:00:+                 300Gn                     1      0:0 
26309509.ba+                 tgac   01:18:50             20189156K      300Gn  20637780K          1      0:0 
26309509.ex+                 tgac   01:18:51                 1260K      300Gn    178244K          1      0:0 

@lucventurini
Collaborator

@swarbred

@lucventurini
OK, the run completed. The homology loading alone took 1 hour with 1 process (so little difference from the pre-chunked run), using 20G of memory.

Cool; then, as I thought, the main memory problem is indeed the sub-processes. Hopefully the new change will improve matters.

If I may insist, please make sure that multiprocessing_method is either unset or set to spawn in the configuration file. Otherwise, whatever we do, SLURM will not account for the memory correctly.

@lucventurini
Collaborator

lucventurini commented Apr 6, 2020

Hi @swarbred

I have rerun on my H. sapiens data, which should be comparable to yours, using 08d6cbc:

$ du -csh blast.tsv
1.7G    blast.tsv
1.7G    total

Command line:
mikado serialise -p 20 --json-conf configuration.toml --blast-targets blast/index/blastdb-proteins.fa --tsv blast.tsv.gz --transcripts mikado_prepared.fasta --log tsv.log --procs 20 -mo 10000000 --force -- mikado.tsv.db

Running time and memory:

$ sacct -j 10482 -o JobID,JobName,Partition,AllocCPUs,State,Elapsed,ExitCode,AveCPU,MaxRSS,MaxVM,ReqMem --units=GMKTP
       JobID    JobName  Partition  AllocCPUS      State    Elapsed ExitCode     AveCPU     MaxRSS  MaxVMSize     ReqMem
------------ ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- ---------- ----------
10482         serialise     medium         20  COMPLETED   00:06:29      0:0   00:05:58     11.84G     26.61G    48.83Gn

So this version should be reasonable in terms of both time and memory. A greater number of processors means greater memory usage, as each process will hold some of the data in RAM while performing its operations.

A greater number of objects in memory (parameter -mo, here 10M) will of course also require more memory.

From my observation of the process, loading the data into memory with pandas required ~6GB of RAM, while an additional 5GB came from the multiprocessing. The latter can be controlled through the number of processes and the -mo parameter.

@lucventurini
Collaborator

@swarbred

PS: again with 08d6cbc, I am not able to replicate the problem of multiprocessing being as slow as single-threaded. The single-threaded run took ~6.5 times longer than the multi-process one (above, with 20 processes):

$ sacct -j 10484 -o JobID,JobName,Partition,AllocCPUs,State,Elapsed,ExitCode,AveCPU,MaxRSS,MaxVM,ReqMem --units=GMKTP
       JobID    JobName  Partition  AllocCPUS      State    Elapsed ExitCode     AveCPU     MaxRSS  MaxVMSize     ReqMem
------------ ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- ---------- ----------
10484         serialise     medium          1  COMPLETED   00:38:13      0:0   00:37:59      9.82G     10.04G    48.83Gn

For a bit less memory usage, the same data took ~38 minutes to analyse, versus ~6.

@swarbred
Collaborator

swarbred commented Apr 6, 2020

Hi @lucventurini

Results with 08d6cbc

no prechunking -p32: Elapsed (wall clock) time (h:mm:ss or m:ss): 46:40.63 (32 mins for the blast loading)

       JobID      User    Account    Elapsed  Timelimit     MaxRSS     ReqMem  MaxVMSize      NCPUS ExitCode 
------------ --------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- -------- 
26313018      swarbred       tgac   00:46:47 90-00:00:+                 300Gn                    32      0:0 
26313018.ba+                 tgac   00:46:47             24655292K      300Gn  63829288K         32      0:0 
26313018.ex+                 tgac   00:46:47                 1448K      300Gn    178244K         32      0:0 

prechunking (32 files) -p32: Elapsed (wall clock) time (h:mm:ss or m:ss): 1:23:05 (58 mins for the blast loading)

       JobID      User    Account    Elapsed  Timelimit     MaxRSS     ReqMem  MaxVMSize      NCPUS ExitCode 
------------ --------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- -------- 
26313017      swarbred       tgac   01:23:14 90-00:00:+                 120Gn                    32      0:0 
26313017.ba+                 tgac   01:23:14             18677664K      120Gn  63842080K         32      0:0 
26313017.ex+                 tgac   01:23:14                 1420K      120Gn    178204K         32      0:0 

no prechunking -p1: Elapsed (wall clock) time (h:mm:ss or m:ss): 1:48:28

       JobID      User    Account    Elapsed  Timelimit     MaxRSS     ReqMem  MaxVMSize      NCPUS ExitCode 
------------ --------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- -------- 
26313021      swarbred       tgac   01:48:35 90-00:00:+                 300Gn                     1      0:0 
26313021.ba+                 tgac   01:48:35             20351316K      300Gn  20981652K          1      0:0 
26313021.ex+                 tgac   01:48:35                 1264K      300Gn    178212K          1      0:0 

memory use looks fine (multiprocessing_method: spawn)

Just to check: is it correct that the number of processors should match the number of tab chunks, i.e. as indicated below?

 -p PROCS, --procs PROCS
                        Number of threads to use for analysing the BLAST
                        files. This number should not be higher than the total
                        number of XML files.

Below are the BLAST loading lines from the log for the pre-chunking run above. The "Reading" and "Finished reading" timestamps are sequential for each chunk; is that as it should be? In my pre-chunked run I am pointing at a directory of 32 TSV files.

2020-04-06 13:44:03,826 - serialiser - serialise.py:107 - INFO - load_blast - MainProcess - Starting to load BLAST data
2020-04-06 13:44:03,830 - serialiser - blast_serialiser.py:78 - INFO - __init__ - MainProcess - Number of dedicated workers: 32
2020-04-06 13:44:35,765 - serialiser - blast_serialiser.py:244 - INFO - __serialize_targets - MainProcess - Started to serialise the targets
2020-04-06 13:44:40,418 - serialiser - blast_serialiser.py:279 - INFO - __serialize_targets - MainProcess - Loaded 330060 objects into the "target" table
2020-04-06 13:44:40,422 - serialiser - blast_serialiser.py:173 - INFO - __serialize_queries - MainProcess - Started to serialise the queries
2020-04-06 13:44:40,818 - serialiser - blast_serialiser.py:221 - INFO - __serialize_queries - MainProcess - Loaded 0 objects into the "query" table
2020-04-06 13:44:40,907 - serialiser - tab_serialiser.py:30 - INFO - _serialise_tabular - MainProcess - Creating a pool with 32 workers for analysing BLAST results
2020-04-06 13:44:46,501 - serialiser - tabular_utils.py:363 - INFO - parse_tab_blast - MainProcess - Reading mikado_prepared_matches_chunked_tsv/mikado_prepared.chunk--023_matches.tsv data
2020-04-06 13:45:02,795 - serialiser - tabular_utils.py:385 - INFO - parse_tab_blast - MainProcess - Finished reading mikado_prepared_matches_chunked_tsv/mikado_prepared.chunk--023_matches.tsv data, starting serialisation with 32 processors
2020-04-06 13:46:16,611 - serialiser - tabular_utils.py:363 - INFO - parse_tab_blast - MainProcess - Reading mikado_prepared_matches_chunked_tsv/mikado_prepared.chunk--001_matches.tsv data
2020-04-06 13:46:34,295 - serialiser - tabular_utils.py:385 - INFO - parse_tab_blast - MainProcess - Finished reading mikado_prepared_matches_chunked_tsv/mikado_prepared.chunk--001_matches.tsv data, starting serialisation with 32 processors
[the same sequential "Reading" / "Finished reading" pair repeats for the remaining 29 chunks, roughly 90 seconds apart, ending with:]
2020-04-06 14:32:00,191 - serialiser - tabular_utils.py:363 - INFO - parse_tab_blast - MainProcess - Reading mikado_prepared_matches_chunked_tsv/mikado_prepared.chunk--007_matches.tsv data
2020-04-06 14:32:17,249 - serialiser - tabular_utils.py:385 - INFO - parse_tab_blast - MainProcess - Finished reading mikado_prepared_matches_chunked_tsv/mikado_prepared.chunk--007_matches.tsv data, starting serialisation with 32 processors
2020-04-06 14:33:39,435 - serialiser - tab_serialiser.py:43 - INFO - _serialise_tabular - MainProcess - Finished loading blast hits
2020-04-06 14:46:35,976 - serialiser - serialise.py:125 - INFO - load_blast - MainProcess - Finished to load BLAST data

And from the run with no pre-chunking and -p32:

2020-04-06 13:41:03,700 - serialiser - serialise.py:107 - INFO - load_blast - MainProcess - Starting to load BLAST data
2020-04-06 13:41:03,707 - serialiser - blast_serialiser.py:78 - INFO - __init__ - MainProcess - Number of dedicated workers: 32
2020-04-06 13:41:15,657 - serialiser - blast_serialiser.py:244 - INFO - __serialize_targets - MainProcess - Started to serialise the targets
2020-04-06 13:41:17,844 - serialiser - blast_serialiser.py:279 - INFO - __serialize_targets - MainProcess - Loaded 330060 objects into the "target" table
2020-04-06 13:41:17,847 - serialiser - blast_serialiser.py:173 - INFO - __serialize_queries - MainProcess - Started to serialise the queries
2020-04-06 13:41:18,069 - serialiser - blast_serialiser.py:221 - INFO - __serialize_queries - MainProcess - Loaded 0 objects into the "query" table
2020-04-06 13:41:18,123 - serialiser - tab_serialiser.py:30 - INFO - _serialise_tabular - MainProcess - Creating a pool with 32 workers for analysing BLAST results
2020-04-06 13:41:20,681 - serialiser - tabular_utils.py:363 - INFO - parse_tab_blast - MainProcess - Reading mikado_prepared_matches.tsv data
2020-04-06 13:45:51,808 - serialiser - tabular_utils.py:385 - INFO - parse_tab_blast - MainProcess - Finished reading mikado_prepared_matches.tsv data, starting serialisation with 32 processors
2020-04-06 14:09:32,636 - serialiser - tab_serialiser.py:43 - INFO - _serialise_tabular - MainProcess - Finished loading blast hits
2020-04-06 14:12:45,957 - serialiser - serialise.py:125 - INFO - load_blast - MainProcess - Finished to load BLAST data

@lucventurini
Collaborator

@swarbred
@ljyanesm

Excellent news, thank you! Now the job running statistics finally make sense.

To answer your query: when analysing tabular data, the number of processors does not need to match the number of files. So, e.g., Mikado will be able to exploit all threads when analysing 4 files with 32 processors.

The reason analysing multiple files takes longer is that the initial stage (loading the BLAST file and computing additional values, such as the query ID in the database or the minimum e-value per hit) happens in a single-threaded way. There is therefore a one-off cost per file that makes loading multiple files more time-expensive (at the gain of lower memory usage).

Memory usage, as I was saying earlier, can be decreased by lowering the maximum number of objects in memory. The trade-off is that a lower number means flushing to disk more often, a single-threaded and quite expensive bottleneck (this is down to SQLite, and I cannot do much about it directly). A sketch of this trade-off follows.
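To make the trade-off concrete, here is a minimal sketch of batched flushing to SQLite (hypothetical schema, not Mikado's actual code):

    import sqlite3

    def load_hits(rows, db_path: str, max_objects: int = 10_000_000) -> None:
        # Buffer up to max_objects rows before each single-threaded flush:
        # a smaller buffer saves RAM at the cost of more frequent commits.
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS hit (query TEXT, target TEXT, evalue REAL)")
        buffer = []
        for row in rows:                      # row = (query, target, evalue)
            buffer.append(row)
            if len(buffer) >= max_objects:    # buffer full: flush to disk
                conn.executemany("INSERT INTO hit VALUES (?, ?, ?)", buffer)
                conn.commit()
                buffer.clear()
        if buffer:                            # final partial flush
            conn.executemany("INSERT INTO hit VALUES (?, ?, ?)", buffer)
            conn.commit()
        conn.close()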

Do you think we can finally merge to master?

@swarbred
Collaborator

swarbred commented Apr 6, 2020

@lucventurini
Memory usage and run time look good to me, so yes, let's merge this. Thanks!

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
* For EI-CoreBioinformatics#280: first tentative to improve the speed of serialise.

* For EI-CoreBioinformatics#280: the biggest, by far, problem in `mikado serialise` for the ORFs was that the reading process was **blocking**, due to using a simple queue with a bare put instead of put_nowait. This single change makes the code finally parallel and therefore provides a massive speed-up.

* Correct commit 4b88696

Removing the evaluation of invalidity causes bugs. It's better to recalculate constantly in this case.

* Correct previous commit

* Correct .travis.yml

Edited Travis to reflect the changes in master

* Correct travis

* Removing another simple queue from ORF, which could have again considerably slowed down `serialise`.

* For EI-CoreBioinformatics#280: now BED12 objects use numpy arrays rather than simple lists. Also, now using properties to avoid calculating the invalidity of a BED12 object over and over again. This should lead to some more speed improvements.

* BROKEN; for EI-CoreBioinformatics#280: trying to implement the slow functions using NumPy. Some improvements but currently broken, and it could be better

* For EI-CoreBioinformatics#280: fixed the previous error.

* For EI-CoreBioinformatics#280: fastest version of serialise yet. A shame that the new parser from bioPython is so slow

* For EI-CoreBioinformatics#280: making the merging algorithm much faster, thanks to Stack Overflow

* For EI-CoreBioinformatics#280: starting to reorganise code to allow using BLAST tabular - as opposed to only XMLs.

* For EI-CoreBioinformatics#280: progress on defining the functions for parsing tabular BLAST.

* Starting to implement the functions necessary to subdivide the work in groups.

* Stub for the prepare_tab_hit function.
lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
…ioinformatics#305)

* Issue EI-CoreBioinformatics#280: now mikado serialise has been refactored so that:
   * loading of BLAST XMLs should be faster thanks to using Cython on the most time-expensive function
   * mikado now accepts also *tabular* BLAST data (custom format, we need the `ppos` and `btop` extra fields)
   * `daijin` now automatically generates *tabular* rather than XML BLAST results
   * `mikado` will now use `spawn` as the default multiprocessing method. This avoids memory accounting problems in eg. SLURM (sometimes `fork` results in the HPC management system to think that the shared memory is duplicated, massively and falsely inflating the accounting of memory usage).
* Issue EI-CoreBioinformatics#270: now Mikado will remove redundancy based on intron chains
  * For EI-CoreBioinformatics#270: now `mikado prepare` will remove redundant transcripts based on their *intron chains* / *monoexonic span overlap*, rather than start/end. Exact CDS match still applies.