realloc(18446744039349813248) failed. #147
Comments
If it's relevant, I only really need the median and IQR.
Spot-on re 32-bit ints:
I definitely intend to handle things > 4 GB, but I've tested little; clearly I missed a call site.
Thanks for looking into it, John!
@sjackman I found one spot, at least. Fix committed to head & passing regression tests. Now to validate ...
`seq 5000000000 | mlr stats1 -a sum,count,min,max,p50 -f 1` runs me out of RAM entirely on my laptop (not surprising, since p50 isn't streaming).
Thanks for the quick fix, John! I appreciate it. I'll test it in August when I'm back from travels.
To test without the non-streaming p50: `seq 5000000000 | mlr stats1 -a sum,count,min,max -f 1`
@sjackman thanks! FWIW I ran out of RAM on my larger host too. (Your 2.5T hardware is impressive indeed.) Let me know how it works for you.
Do you have an estimate of how much RAM you expect it to use?
It looks like this command will take about 80 GB of RAM to run. It's using 8 GB at the 10% mark.

```
❯❯❯ seq 5000000000 | pv -pls 5000000000 | mlr stats1 -a sum,count,min,max,p50 -f 1
[======>                                                              ]  10%
❯❯❯ top -p 167197
    PID USER      PR  NI    VIRT    RES  %CPU %MEM     TIME+ S COMMAND
 167197 sjackman  20   0  9.773g 7.975g 100.0  0.3  10:22.94 R mlr
```
Still running at 75% now and 55 GB of memory usage. Looks promising.
That's one impressive machine - 2.5 TB of RAM! If Miller works out, I think it deserves a little write-up of how you're using it.
Memory usage has levelled off at 91 GB. Now it's thinking hard.
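For what it's worth, a quick back-of-envelope in C of what those figures suggest; the per-value overhead below is inferred only from the numbers reported in this thread, not from Miller's actual internals:

```c
#include <stdio.h>

int main(void) {
    long long rows = 5000000000LL;     /* seq 5000000000 */
    double raw_gb = rows * 8.0 / 1e9;  /* 8 bytes per double, no overhead */
    double observed_gb = 91.0;         /* peak reported above */
    printf("raw doubles:        ~%.0f GB\n", raw_gb);                      /* ~40 GB */
    printf("observed per value: ~%.1f bytes\n", observed_gb * 1e9 / rows); /* ~18.2  */
    return 0;
}
```

So the non-streaming p50 appears to cost a bit over twice the raw 8 bytes per retained value, which is in the same ballpark as the 80 GB estimate made at the 10% mark.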
It worked! It took 2 hours of elapsed time. Would there be any speed gains from multithreading parts of Miller?
2 hours is fast for that data size, I think -- given single-threaded execution. Miller is single-threaded by design; it's a little command-line tool for those times when you don't want to bring out the big guns (Hadoop or whatever). My experience with this kind of processing over the years is that disk reads and data parsing take up the lion's share of the time, and in-core computations are relatively small. So multi-threading helps a little, but the disk is still single-threaded, as it were. :^/ So I kept the code single-threaded and simple. If disk files can be split up across machines then there is some parallelism to be had, even for single-threaded programs like Miller (i.e. run multiple instances of simple programs over files on multiple hosts). Mean, sum, count, min, and max are easily distributable. Percentiles, not so much. :^/
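To make the distributability point concrete, here is a small C sketch (an illustration, not Miller code): per-shard sum, count, min, and max can be merged pairwise, whereas an exact median cannot in general be recovered from per-shard medians.

```c
#include <stdio.h>

/* Partial aggregates from one shard of the data. These merge pairwise,
   so each host can reduce its own files independently. */
typedef struct {
    double sum;
    long long count;
    double min;
    double max;
} partial_t;

static partial_t merge(partial_t a, partial_t b) {
    partial_t m;
    m.sum   = a.sum + b.sum;
    m.count = a.count + b.count;
    m.min   = a.min < b.min ? a.min : b.min;
    m.max   = a.max > b.max ? a.max : b.max;
    return m;
}

int main(void) {
    /* Two shards processed independently, e.g. on two hosts. */
    partial_t shard1 = {   6.0, 3, 1.0,   3.0 };  /* values 1, 2, 3   */
    partial_t shard2 = { 104.0, 2, 4.0, 100.0 };  /* values 4, 100    */
    partial_t total  = merge(shard1, shard2);
    printf("sum=%g count=%lld min=%g max=%g mean=%g\n",
           total.sum, total.count, total.min, total.max,
           total.sum / total.count);
    /* No such merge exists for an exact p50: the median of the per-shard
       medians (2 and 52) is 27, while the true median of the combined
       data {1, 2, 3, 4, 100} is 3. */
    return 0;
}
```

Splitting files across hosts therefore parallelizes the streaming accumulators cheaply, while an exact p50 still needs either all the data in one place or a more elaborate multi-pass scheme.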
Makes sense to me. Thanks again for the quick fix, John!
Is a stable release with this fix imminent? I'll update the Homebrew/Linuxbrew formula for Miller.
yeah now that you've verified it i'll cut a bugfix release, next few days. i usually update homebrew as part of the process; no need for you to duplicate that. thanks @sjackman!!!
Great. Thanks, John! |
Thanks, John! |
Hi, John. I'm running `mlr` to calculate `count,p25,p50,p75,mean,stddev` of one integer column with three billion rows, one row per nucleotide of the human genome. It fails with the error message `realloc(18446744039349813248) failed.` The machine in question has 2.5 terabytes of RAM, so it should have enough RAM to hold the column in memory, about 24 GB at 8 bytes per row. Is the bug possibly caused by holding the number of rows in a 32-bit `int` rather than a 64-bit `size_t`?
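For what it's worth, the reported size is exactly what falls out of 32-bit wraparound arithmetic. The sketch below is hypothetical (made-up variable names, not Miller's actual call site), but it reproduces the number in the title: a capacity counter that has wrapped to `INT_MIN`, is doubled, then multiplied by `sizeof(double)` and converted to `size_t`.

```c
#include <limits.h>
#include <stdio.h>

int main(void) {
    /* Pretend a doubling 32-bit capacity counter has wrapped past INT_MAX
       (signed overflow; commonly observed as INT_MIN). */
    int capacity = INT_MIN;                                              /* -2147483648 */
    long long new_capacity = 2LL * capacity;                             /* -2^32       */
    size_t nbytes = (size_t)(new_capacity * (long long)sizeof(double));  /* -2^35       */
    printf("%zu\n", nbytes);  /* prints 18446744039349813248 on a 64-bit system */
    /* Passing nbytes to realloc asks for ~16 exabytes, producing exactly the
       error message in this issue. Keeping the count in a 64-bit size_t (or
       long long) avoids the wrap. */
    return 0;
}
```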