Appears to be a tmp folder persisting? #401

Closed
Desperate-Dan opened this issue Apr 4, 2022 · 12 comments

Comments

@Desperate-Dan

Hello! I've run pangolin a couple of times since the update and there appears to be a tmp folder that persists after completion that contains a lot of what look to me like intermediate files:

image

It's always called tmpXXXXXXXX, which changes each time (e.g. above it is tmprk5ptk5p), so potentially just a missing / somewhere?

@AngieHinrichs
Member

@Desperate-Dan can you send the pangolin command line that you're running? Where is the tmp folder created? And can you send the last ~10 lines of pangolin output? Thanks.

@aineniamh
Member

So that looks like everything that would be output from the preprocessing.smk pipeline. I can't seem to replicate this, as the directory is cleared as expected on my system. If you try running in verbose mode and see where it says your tempdir lives, it might give a clue?

@rgerhards

I have the same issue and worked around it by clearing the directory at the end of my pipeline. This happens for all types of runs, including successful ones. I can confirm it fills the disk if no manual cleanup is done. I run pangolin with --tempdir <directory>.
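A workaround of this kind could look roughly like the sketch below. It is only an illustration, not the script used in this thread: the helper name `remove_leftover_tmp_dirs` is made up, and it assumes the directory passed to --tempdir is used only by pangolin, so that anything named `tmp` plus eight characters (the default naming of Python's `tempfile` module) is safe to delete.

```python
import shutil
from pathlib import Path


def remove_leftover_tmp_dirs(parent):
    """Delete sub-directories of `parent` named 'tmp' + 8 characters.

    Hypothetical helper: assumes `parent` is dedicated to pangolin runs,
    so every directory matching the pattern is a stale scratch directory.
    """
    removed = []
    # "tmp????????" matches names such as tmpiqmluibm (tmp + 8 characters).
    for entry in Path(parent).glob("tmp????????"):
        if entry.is_dir():
            shutil.rmtree(entry)
            removed.append(entry.name)
    return sorted(removed)
```

Calling `remove_leftover_tmp_dirs("/path/to/tempdir")` after each run returns the names of the directories it removed and leaves everything else in place.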

@aineniamh
Member

That's very interesting @rgerhards. Everything should get written to a tempdir that is cleared automatically upon completion (it uses Python's tempfile module). Are you still seeing all the intermediate files?

@rgerhards

I need to re-run as soon as the current run is through (as I said, the pipeline clears temp files out). Will happily do that, but will take me at least until tomorrow morning to provide more info.

@rgerhards

rgerhards commented Apr 6, 2022

I can confirm the issue. File system after the run:

$ du -h $TEMPDIR/
4.0K	/data/rger/corona/pangolin/tmpiqmluibm/.snakemake/conda-archive
4.0K	/data/rger/corona/pangolin/tmpiqmluibm/.snakemake/incomplete
8.0K	/data/rger/corona/pangolin/tmpiqmluibm/.snakemake/metadata
4.0K	/data/rger/corona/pangolin/tmpiqmluibm/.snakemake/singularity
4.0K	/data/rger/corona/pangolin/tmpiqmluibm/.snakemake/locks
4.0K	/data/rger/corona/pangolin/tmpiqmluibm/.snakemake/shadow
4.0K	/data/rger/corona/pangolin/tmpiqmluibm/.snakemake/conda
8.0K	/data/rger/corona/pangolin/tmpiqmluibm/.snakemake/log
4.0K	/data/rger/corona/pangolin/tmpiqmluibm/.snakemake/auxiliary
48K	/data/rger/corona/pangolin/tmpiqmluibm/.snakemake
8.0K	/data/rger/corona/pangolin/tmpiqmluibm/logs
64G	/data/rger/corona/pangolin/tmpiqmluibm
64G	/data/rger/corona/pangolin/
$ ls -lh /data/rger/corona/pangolin/tmpiqmluibm
total 64G
-rw-rw-r-- 1 rger rger  22G Apr  6 08:01 alignment.fasta
drwxrwxr-x 2 rger rger 4.0K Apr  6 08:57 logs
-rw-rw-r-- 1 rger rger  22G Apr  6 07:51 mapped.sam
-rw-rw-r-- 1 rger rger  22G Apr  6 07:21 stdin_query.fasta

One log file:

[M::mm_idx_gen::0.011*1.37] collected minimizers
[M::mm_idx_gen::0.016*1.88] sorted minimizers
[M::main::0.016*1.88] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.017*1.83] mid_occ = 50
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.018*1.79] distinct minimizers: 5370 (100.00% are singletons); average occurrences: 1.000; average spacing: 5.569; total length: 29903
[M::worker_pipeline::54.591*2.89] mapped 16732 sequences
[M::worker_pipeline::93.303*3.33] mapped 16734 sequences

...

[M::worker_pipeline::1613.625*3.90] mapped 16767 sequences
[M::worker_pipeline::1652.378*3.90] mapped 16760 sequences
[M::worker_pipeline::1690.681*3.90] mapped 16776 sequences
[M::worker_pipeline::1729.323*3.90] mapped 16768 sequences
[M::worker_pipeline::1766.802*3.90] mapped 16784 sequences
[M::worker_pipeline::1773.885*3.90] mapped 3336 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -a -x asm20 --sam-hit-only --secondary=no --score-N=0 -t 4 -o /data/rger/corona/pangolin/tmpiqmluibm/mapped.sam /data/rger/miniconda3/envs/pangolin/lib/python3.8/site-packages/pangolin/data/reference.fasta -
[M::main] Real time: 1774.045 sec; CPU: 6922.025 sec; Peak RSS: 1.877 GB

pangolin call:

time unxz -k - < $WORKDIR/SARS-CoV-2-Sequenzdaten_Deutschland.fasta.xz |pangolin  --analysis-mode fast --tempdir "$TEMPDIR" -t$PANGOLIN_THREADS -

Versions:

$ pangolin --all-versions
****
Pangolin running in usher mode.
****
pangolin: 4.0
pangolin-data: 1.2.133
constellations: v0.1.4
scorpio: 0.3.16

I'll update to 4.0.x soon.

@rgerhards

Update: same with 4.0.2. I now also have a scorpio log file (with no helpful info as far as I can see).

@Desperate-Dan
Author

Apologies for the delay. I've done some tweaking, and I believe my issue was caused by running pangolin in the /tmp directory within a Docker container. If I run pangolin from another directory in my container, the pangolin /tmpXXXXX dir is created in the container's /tmp dir, so it isn't kept when the container shuts down. If I keep the container running, though, I can go to the /tmp dir in the container and still see the pangolin /tmpXXXXX dir, so it does persist after the command finishes. This isn't really an issue for me, as my container is continually restarted, but it is something that has changed since the update. Perhaps this is now a permissions issue within my container? I'll investigate further.

I've also attached the full output from the pangolin command @AngieHinrichs.

Thanks!

Docker_container_pangolin_tmp_issue

@rgerhards

For me, it is a regular VM install, so no container involved.

I was a bit lazy and just did a grep -r tempfile over the pangolin source tree, and what came up were calls to tempfile.mkdtemp(). Not sure if I missed other relevant bits by not doing a real code review. If not, the tempfile documentation says:

"The user of mkdtemp() is responsible for deleting the temporary directory and its contents when done with it."

Source: https://docs.python.org/3/library/tempfile.html#tempfile.mkdtemp

Might this be the root issue? My apologies if I did look at the wrong spots (a full code review is unfortunately out of scope for me). Thanks again for all your great work!

@rgerhards

rgerhards commented Apr 6, 2022

Oh, indeed. See commit: d30284b

As far as I can tell, it changed TemporaryDirectory(), which cleans up when done (https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryDirectory), to mkdtemp(), which does NOT clean up.
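The difference between the two calls can be seen in a minimal standard-library sketch. This is not pangolin's code; the file name `intermediate.txt` is invented for the example.

```python
import os
import shutil
import tempfile

# TemporaryDirectory, used as a context manager, removes the directory
# and everything in it when the with-block exits.
with tempfile.TemporaryDirectory() as auto_dir:
    with open(os.path.join(auto_dir, "intermediate.txt"), "w") as fh:
        fh.write("scratch data\n")
    auto_exists_inside = os.path.isdir(auto_dir)
auto_exists_after = os.path.isdir(auto_dir)

# mkdtemp only creates the directory; nothing removes it automatically.
manual_dir = tempfile.mkdtemp()
with open(os.path.join(manual_dir, "intermediate.txt"), "w") as fh:
    fh.write("scratch data\n")
manual_exists_after = os.path.isdir(manual_dir)

# The caller has to delete it explicitly.
shutil.rmtree(manual_dir)
manual_exists_after_cleanup = os.path.isdir(manual_dir)

print(auto_exists_inside, auto_exists_after)             # True False
print(manual_exists_after, manual_exists_after_cleanup)  # True False
```

Without the final `shutil.rmtree` call, the `mkdtemp()` directory would remain on disk after the program ends, which matches the leftover tmpXXXXXXXX folders reported here.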

Edit info: used wrong sample, but commit was OK, so collapsing this to relevant piece of info.

@aineniamh
Member

I think that is the source of the issue, you're right. I've added a cleanup step to the current dev branch and will just see if it passes tests!
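For readers wondering what such a cleanup step might look like in general, here is one possible shape. It is a sketch and not the change made in pangolin: `run_with_tempdir`, `work`, `parent`, `keep`, and `demo_work` are all hypothetical names for illustration.

```python
import os
import shutil
import tempfile


def run_with_tempdir(work, parent=None, keep=False):
    """Create a scratch directory with mkdtemp, run work(path), then clean up.

    All names here are hypothetical; this is not pangolin's implementation.
    """
    tempdir = tempfile.mkdtemp(dir=parent)
    try:
        return work(tempdir)
    finally:
        # Runs after success and after failure alike, so the directory
        # only survives when the caller explicitly asks to keep it.
        if not keep:
            shutil.rmtree(tempdir, ignore_errors=True)


def demo_work(path):
    # Stand-in for the real pipeline: write one intermediate file.
    with open(os.path.join(path, "alignment.fasta"), "w") as fh:
        fh.write(">seq\nACGT\n")
    return path


used_path = run_with_tempdir(demo_work)
print(os.path.exists(used_path))  # prints False: the directory was removed
```

Putting the removal in a `finally` block means a crashed run does not leave tens of gigabytes of intermediate files behind, while a `keep` switch still allows inspecting them when debugging.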

@aineniamh
Member

Resolved in pangolin v4.0.3!
