Number of Threads Limit? #512
When using
same machine, same dataset.
Just an example log file:
Another bit of info: building the reference with
Hi @sklages, here are a few things I would recommend trying. To resolve the seg-faults: try the dynamically linked binary, and also compile from scratch with make. Cheers
I always build STAR from source, and I tried both the statically and the dynamically linked binary. Both failed. The output of
Hi @sklages, thanks a lot for helping me to debug this problem. Cheers
I am afraid I can't, as I clean up quite a lot and have removed these dirs. But I have run a new job with 40 threads which also failed (using STAR statically linked, for GNU/Linux 4.4.34): Log.out.txt. Hope this helps ...
I tried to build the binary
I also had to use “our” libhts, as I couldn’t build the sources in
Unfortunately, the stack traces are not easily usable.
Maybe you have more luck getting something out of them.
Strangely, taking the message below as an example from the original post, there is no file with the name 43 (
So, how does
Any idea how files, like Before calling
Sometime in
Ignore my last post, as I missed
Running
Hi Paul, thanks for your help! Cheers
What may cause such potential write errors / problems? It does not appear to happen randomly. I routinely use STAR with 20 cores in our RNAseq workflow. It only seems to crash when using more cores, no matter whether I work on a heavily loaded file server or on my own workstation with local SSDs ...
On 03.11.18 at 17:11, alexdobin wrote:
thanks for your help!
No problem. Thank you for providing the software.
Were you able to reproduce the error in your run?
Yes, I can consistently reproduce it each time I use more than 20
threads. Additionally, I traced the regression down to the mentioned commit.
Do you have access to a system with more than 20 threads to reproduce
the issue?
You do not seem to have 0-sized bin files. For some reason in
@sklages' run at least one bin (the 45th in the latest example) is
empty.
Using `ls -lR`, I often also have (exactly?) one 0-sized bin file. (I
have to check that tomorrow, so please take it with a grain of salt.)
See the file 49 in my example. The error is about a different bin number
though.
I think the error happens when the files are written. I am actually
not checking if the write operation completes successfully - I will
add these checks and release a new patch shortly.
I am not sure about that, as the error is about another bin number.
Printing the value of `s1` in `BAMbinSortByCoordinate()`, I found
strange behavior: `s1` is often the same, but the actual file sizes
differ when checked with `ls -l`. The number of read bytes is then
strangely 0. C++ fstream operations should be thread safe for reading,
and `s1` is a local(?) variable, so I am quite puzzled.
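As an aside, regarding the write checks mentioned above ("I am actually not checking if the write operation completes successfully"), a minimal sketch of what such a check could look like is below. This is not STAR's actual code; the function name, arguments, and buffer handling are made up for illustration.

```cpp
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical helper: write one sorted bin buffer to a temp file and
// verify that both opening the file and writing to it succeeded.
bool writeBinChecked(const std::string &path, const std::vector<char> &buf) {
    std::ofstream out(path, std::ios::binary);
    if (!out) {
        // Opening can fail, e.g. when the per-process open-file limit is exhausted.
        std::cerr << "could not create output file " << path << '\n';
        return false;
    }
    out.write(buf.data(), static_cast<std::streamsize>(buf.size()));
    out.flush();
    if (out.fail()) {
        // Catches short or failed writes instead of noticing the mismatch later.
        std::cerr << "write to " << path << " failed\n";
        return false;
    }
    return true;
}
```

Checking the stream state right after the write would surface the problem at write time, instead of only later when the recorded size and the on-disk file size disagree.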
Hi Paul, @sklages, thanks a lot for your help, I think I figured out what the problem was. TL;DR: Cheers
On 11/05/18 16:59, alexdobin wrote:
thanks a lot for your help, I think I figured out what the problem
was.
It looks like you did. Thank you.
It's caused by the value of ulimit -n being too small, =1024 for most
systems by default.
On my standard systems, this value is increased to 10000, that's why
I could not reproduce the problem. When I tried it on another system,
I got the error with >20 threads.
The default is 1024 on our system, and the hard limit is 4096.
$ more /proc/sys/fs/file-max
105587815
$ ulimit -Sn
1024
$ ulimit -Hn
4096
After increasing the limit to the allowed maximum for the user (4096),
the issue was gone with the test data I had here.
The release that Paul found actually increases the number of bins to
50, which caused the total number of temp files (21*50) to go above
1000. When opening the temp files, I did not check for errors, so
STAR could not write into them and then complained that the file size
did not match the recorded values.
I fixed that - if you try the latest patch from GitHub master, you
will get the error before the mapping starts that the temp file
cannot be opened.
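For illustration, a hedged sketch of what such an early check could look like, assuming a helper that compares the roughly threads * bins temp files a run will open (e.g. 21 * 50 = 1050) against the soft open-file limit. This is not the actual STAR patch; `checkOpenFileLimit`, `nThreads`, `nBins`, and the slack constant are invented for this example.

```cpp
#include <sys/resource.h>
#include <cstdio>

// Hypothetical pre-flight check; nThreads and nBins are assumed parameters,
// not STAR's real variable names.
bool checkOpenFileLimit(unsigned nThreads, unsigned nBins) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return true;  // cannot query the limit, so do not block the run

    // Sorting temp files scale with threads * bins (e.g. 21 * 50 = 1050),
    // plus some slack for input, log, and output files.
    unsigned long needed = (unsigned long)nThreads * nBins + 64;
    if (needed > (unsigned long)rl.rlim_cur) {
        std::fprintf(stderr,
            "EXITING: this run may need ~%lu open files, but the soft limit "
            "(ulimit -n) is %lu.\n"
            "SOLUTION: raise the limit, e.g. 'ulimit -n %lu', or reduce the "
            "number of threads.\n",
            needed, (unsigned long)rl.rlim_cur, needed);
        return false;
    }
    return true;
}
```

With 21 threads and 50 bins the run exceeds the default soft limit of 1024, while 20 threads (1000 temp files) stays just under it, which is consistent with 20 threads passing and 21 threads failing.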
By the way, mentioning the issue number in a commit to link both has
proven very valuable.
Fixes: #512
Anyway, with your changes I get the error below as intended.
BAMoutput.cpp:27:BAMoutput: exiting because of *OUTPUT FILE* error: could not create output file temp.STAR.21//BAMsort/20/16
SOLUTION: check that the path exists and you have write permission for this file. Also check ulimit -n and increase it to allow more open files.
So, the issue is analyzed and fixed.
The limit on the maximum number of open file descriptors could also
explain the strange `s1` values in my debug output.
`bamInStream.tellg()` might just return garbage. [On error `-1` is
returned][1], which does not work well with the unsigned integer `s1`.
uint s1=bamInStream.tellg();
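For illustration only, a small self-contained sketch (not STAR's code; the file name is hypothetical) of why assigning `tellg()` to an unsigned integer is risky: on failure `tellg()` returns `pos_type(-1)`, which silently turns into a huge positive value.

```cpp
#include <fstream>
#include <iostream>

int main() {
    // Hypothetical file name; opening it fails, so the stream is in a fail state.
    std::ifstream in("does_not_exist.bam", std::ios::binary);
    in.seekg(0, std::ios::end);

    // Unsafe: the -1 from a failed tellg() wraps around to a huge unsigned value.
    unsigned long long s1 = in.tellg();
    std::cout << "unsigned s1 = " << s1 << '\n';

    // Safer: check the stream state (or compare against pos_type(-1)) first.
    std::streampos pos = in.tellg();
    if (in.fail() || pos == std::streampos(-1))
        std::cerr << "tellg() failed, do not use the value\n";
    return 0;
}
```

A failed `tellg()` does not throw; it just hands back a sentinel value that no longer corresponds to any real file size, which would explain the odd `s1` values.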
I created the separate issue [#516][2] for that.
Thank you again for fixing the issue.
[1]: http://www.cplusplus.com/reference/istream/istream/tellg/
[2]: #516
Hi @alexdobin, thanks for solving the issue, and @paulmenzel for your efforts hunting down the problem :-).
Check soft limit and hard limit
Set a higher soft limit
System/Data
This is a standard human dataset; the reference has been created from GENCODE:
Passed Mapping (20 threads)
Failed Mapping (21 threads)
Before reducing the threads parameter I was playing around (at threads=40) with gzipped/unzipped input and different versions of STAR, including dynamically and statically linked binaries. All jobs failed with the same error.
Why is (my version of) STAR crashing when using more than 20 threads?
It is not really needed, because STAR is quite fast ... but I don't like to see any segfaults :-)
(log says: