
Advise the kernel to preload the mapped memory #740

Closed · wants to merge 1 commit

Conversation

@prusnak (Collaborator) commented Apr 3, 2023

Hopefully this helps with the loading times when using mmap() on Windows and Unix (Linux/macOS).

I tested only on macOS, where the load time of the 7B model decreased from 7 seconds to 2 seconds, with no change in inference performance.

This needs further testing, so I am opening this as a draft.

One possible improvement is to call VirtualLock(addr, length) on Windows to lock the mapped region of the process's virtual address space into physical memory, but I need someone to test whether this is needed and helpful.
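
For reference, here is a minimal sketch of the idea (not the exact patch in this PR; the helper name advise_preload is made up and error handling is reduced to a warning). After the file has been mapped, it hints the kernel that the whole mapping will be needed soon: madvise(MADV_WILLNEED) on Unix, PrefetchVirtualMemory on Windows 8 and later.

    #ifdef _WIN32
    #include <windows.h>     // PrefetchVirtualMemory, GetCurrentProcess (Windows 8+)
    #else
    #include <sys/mman.h>    // madvise
    #endif
    #include <stdio.h>

    static void advise_preload(void *addr, size_t length) {
    #ifdef _WIN32
        WIN32_MEMORY_RANGE_ENTRY range = { addr, length };
        // Ask the memory manager to prefetch the whole mapped region.
        if (!PrefetchVirtualMemory(GetCurrentProcess(), 1, &range, 0)) {
            fprintf(stderr, "warning: PrefetchVirtualMemory failed (%lu)\n", GetLastError());
        }
    #else
        // Tell the kernel the whole mapping will be needed soon, so it can read
        // it in aggressively instead of faulting it in page by page.
        if (madvise(addr, length, MADV_WILLNEED) != 0) {
            perror("warning: madvise(MADV_WILLNEED)");
        }
    #endif
    }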

@prusnak (Collaborator, Author) commented Apr 3, 2023

cc @comex @danielzgtg for testing, since this is essentially your idea.

@diimdeep commented Apr 3, 2023

For a clean experiment you need to use different files on disk between runs, because the previously used file still lingers in the page cache and affects subsequent runs.

No change for me. macOS Catalina, LLVM 16, Haswell, 2 cores.

no patch
% bash -c "~/Downloads/rusage ./build16/bin/main -m ./models/ggml-model-q4_0.bin --ignore-eos --keep -1 -n 32 -p \"I love you so much that I would rather die than live without you.\" --mlock -s 7 -t 2"

llama_print_timings:        load time = 58458.70 ms
llama_print_timings:      sample time =    44.36 ms /    32 runs   (    1.39 ms per run)
llama_print_timings: prompt eval time =  5917.82 ms /    16 tokens (  369.86 ms per token)
llama_print_timings:        eval time = 14649.04 ms /    31 runs   (  472.55 ms per run)
llama_print_timings:       total time = 75936.25 ms
RL: took 76,915,464µs wall time
RL: ballooned to 4,321,284kb in size
RL: needed 55,957,104µs cpu (26% kernel)
RL: caused 1,773,952 page faults (45% memcpy)
RL: 236,491 context switches (0% consensual)

second run

llama_print_timings:        load time =  5078.12 ms
llama_print_timings:      sample time =    43.43 ms /    32 runs   (    1.36 ms per run)
llama_print_timings: prompt eval time =  5759.01 ms /    16 tokens (  359.94 ms per token)
llama_print_timings:        eval time = 13144.22 ms /    31 runs   (  424.01 ms per run)
llama_print_timings:       total time = 21058.44 ms
RL: took 22,119,822µs wall time
RL: ballooned to 4,439,024kb in size
RL: needed 40,761,705µs cpu (6% kernel)
RL: caused 1,561,818 page faults (99% memcpy)
RL: 18,908 context switches (0% consensual)

with patch
% bash -c "~/Downloads/rusage ./build_madvise/bin/main -m ./models/ggml-model-q4_0_dub.bin --ignore-eos --keep -1 -n 32 -p \"I love you so much that I would rather die than live without you.\" --mlock -s 7 -t 2"

llama_print_timings:        load time = 62299.05 ms
llama_print_timings:      sample time =    43.28 ms /    32 runs   (    1.35 ms per run)
llama_print_timings: prompt eval time =  6342.92 ms /    16 tokens (  396.43 ms per token)
llama_print_timings:        eval time = 13020.24 ms /    31 runs   (  420.01 ms per run)
llama_print_timings:       total time = 78148.81 ms
RL: took 79,165,368µs wall time
RL: ballooned to 4,523,928kb in size
RL: needed 52,584,490µs cpu (25% kernel)
RL: caused 1,793,321 page faults (45% memcpy)
RL: 254,742 context switches (0% consensual)

second run

llama_print_timings:        load time =  5132.96 ms
llama_print_timings:      sample time =    44.25 ms /    32 runs   (    1.38 ms per run)
llama_print_timings: prompt eval time =  5786.85 ms /    16 tokens (  361.68 ms per token)
llama_print_timings:        eval time = 12903.39 ms /    31 runs   (  416.24 ms per run)
llama_print_timings:       total time = 20866.49 ms
RL: took 21,957,904µs wall time
RL: ballooned to 4,516,600kb in size
RL: needed 40,377,956µs cpu (6% kernel)
RL: caused 1,561,398 page faults (99% memcpy)
RL: 18,235 context switches (0% consensual)

@danielzgtg commented

Testing with ./main -m ./models/7B/ggml-model-q4_0.bin -n 1 after echo 3 > /proc/sys/vm/drop_caches. Before is 437e778 and after is eaec9b6:

Linux HDD Before

llama_print_timings:        load time = 56575.00 ms
llama_print_timings:        load time = 57326.13 ms
llama_print_timings:        load time = 56969.07 ms

Linux SSD Before

llama_print_timings:        load time =  9710.73 ms
llama_print_timings:        load time =  9784.26 ms
llama_print_timings:        load time =  9526.75 ms

Linux HDD After

llama_print_timings:        load time = 58036.51 ms
llama_print_timings:        load time = 57874.68 ms
llama_print_timings:        load time = 57408.99 ms

Linux SSD After

llama_print_timings:        load time =  9664.59 ms
llama_print_timings:        load time =  9489.85 ms
llama_print_timings:        load time =  9618.77 ms

Yes, I had this idea 13 hours ago in #693 (comment). No, I could not measure the improvement I predicted on Linux in #734 (comment). The data I gathered shows the difference is not statistically significant, and even setting that aside, the trends for HDD and SSD go in opposite directions for some reason.

I'm certain that this fix will help and is necessary for Windows users. I could test on Windows but it would take me a long time to set things up.

@wtarreau (Contributor) commented Apr 3, 2023

You may also want to try MADV_SEQUENTIAL, which can sometimes make read-ahead more aggressive.

It is also possible that the low performance on some OS is actually caused by too much read-ahead when the data are used in a random order. In this case, experimenting with MADV_RANDOM could help as it will instruct the OS to avoid reading too much ahead.
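
For experimentation, swapping the advice is a one-line change; below is a sketch where the flag is chosen at compile time (LLAMA_MADV_FLAG is a hypothetical macro for testing, not an option of this PR).

    #include <sys/mman.h>
    #include <stdio.h>

    // Build with e.g. -DLLAMA_MADV_FLAG=MADV_SEQUENTIAL or -DLLAMA_MADV_FLAG=MADV_RANDOM
    #ifndef LLAMA_MADV_FLAG
    #define LLAMA_MADV_FLAG MADV_WILLNEED
    #endif

    static void advise_mapping(void *addr, size_t length) {
        if (madvise(addr, length, LLAMA_MADV_FLAG) != 0) {
            perror("madvise");
        }
    }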

@diimdeep commented Apr 3, 2023

I observe that macOS Catalina's default advice, MADV_NORMAL, works best (Intel, Haswell).

MADV_NORMAL

llama_print_timings:        load time = 54875.36 ms
llama_print_timings:      sample time =    44.53 ms /    32 runs   (    1.39 ms per run)
llama_print_timings: prompt eval time =  6328.93 ms /    16 tokens (  395.56 ms per token)
llama_print_timings:        eval time = 14420.26 ms /    31 runs   (  465.17 ms per run)
llama_print_timings:       total time = 72214.52 ms
RL: took 73,331,544µs wall time
RL: ballooned to 4,613,060kb in size
RL: needed 52,990,727µs cpu (22% kernel)
RL: caused 1,772,061 page faults (45% memcpy)
RL: 239,466 context switches (0% consensual)

llama_print_timings:        load time = 57754.30 ms
llama_print_timings:      sample time =    43.03 ms /    32 runs   (    1.34 ms per run)
llama_print_timings: prompt eval time =  5777.97 ms /    16 tokens (  361.12 ms per token)
llama_print_timings:        eval time = 12956.06 ms /    31 runs   (  417.94 ms per run)
llama_print_timings:       total time = 73558.99 ms
RL: took 77,934,852µs wall time
RL: ballooned to 4,409,088kb in size
RL: needed 53,599,758µs cpu (29% kernel)
RL: caused 1,795,243 page faults (45% memcpy)
RL: 256,893 context switches (0% consensual)

MADV_SEQUENTIAL

llama_print_timings:        load time = 119314.96 ms
llama_print_timings:      sample time =    44.42 ms /    32 runs   (    1.39 ms per run)
llama_print_timings: prompt eval time =  7977.75 ms /    16 tokens (  498.61 ms per token)
llama_print_timings:        eval time = 13274.42 ms /    31 runs   (  428.21 ms per run)
llama_print_timings:       total time = 135789.80 ms
RL: took 136,111,828µs wall time
RL: ballooned to 4,540,508kb in size
RL: needed 62,201,584µs cpu (32% kernel)
RL: caused 2,127,817 page faults (54% memcpy)
RL: 605,273 context switches (0% consensual)

llama_print_timings:        load time = 104080.26 ms
llama_print_timings:      sample time =    48.20 ms /    32 runs   (    1.51 ms per run)
llama_print_timings: prompt eval time =  7899.04 ms /    16 tokens (  493.69 ms per token)
llama_print_timings:        eval time = 16277.43 ms /    31 runs   (  525.08 ms per run)
llama_print_timings:       total time = 124140.87 ms
RL: took 124,484,947µs wall time
RL: ballooned to 4,487,740kb in size
RL: needed 62,966,409µs cpu (24% kernel)
RL: caused 2,120,778 page faults (54% memcpy)
RL: 609,990 context switches (0% consensual)

MADV_RANDOM 

llama_print_timings:        load time = 69565.61 ms
llama_print_timings:      sample time =    44.47 ms /    32 runs   (    1.39 ms per run)
llama_print_timings: prompt eval time =  5965.53 ms /    16 tokens (  372.85 ms per token)
llama_print_timings:        eval time = 13802.51 ms /    31 runs   (  445.24 ms per run)
llama_print_timings:       total time = 86211.98 ms
RL: took 86,585,114µs wall time
RL: ballooned to 4,276,708kb in size
RL: needed 56,930,888µs cpu (30% kernel)
RL: caused 2,532,876 page faults (61% memcpy)
RL: 1,000,187 context switches (0% consensual)

llama_print_timings:        load time = 94295.20 ms
llama_print_timings:      sample time =    47.66 ms /    32 runs   (    1.49 ms per run)
llama_print_timings: prompt eval time =  6679.52 ms /    16 tokens (  417.47 ms per token)
llama_print_timings:        eval time = 15375.85 ms /    31 runs   (  496.00 ms per run)
llama_print_timings:       total time = 113062.93 ms
RL: took 113,387,992µs wall time
RL: ballooned to 4,285,588kb in size
RL: needed 61,583,699µs cpu (30% kernel)
RL: caused 2,530,971 page faults (61% memcpy)
RL: 1,024,535 context switches (0% consensual)

@CoderRC commented Apr 3, 2023

Try:
MADV_WILLNEED

@prusnak (Collaborator, Author) commented Apr 3, 2023

Try:
MADV_WILLNEED

It was tried in the post above: #740 (comment)

@comex (Contributor) commented Apr 3, 2023

Looks good. One thing I’d change: instead of running this at load time, run it before every eval. This way, if, say, you’re running an interactive session, and the kernel decides to page out the model while it’s waiting for input, it’ll get paged back in efficiently. (That won’t help if the kernel decides to page out the model in the middle of evaluation, but there’s no way to help that without mlock.)
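
A sketch of that suggestion, with made-up names (model_mapping and prefetch_before_eval are not actual llama.cpp structures): keep the mapping around and repeat the hint at the start of every eval, so pages evicted while the process was idle get prefetched again.

    #include <sys/mman.h>
    #include <stddef.h>

    // Hypothetical bookkeeping; the real llama.cpp structures differ.
    struct model_mapping {
        void  *addr;
        size_t length;
    };

    static void prefetch_before_eval(const struct model_mapping *m) {
        // Cheap to repeat: if the pages are already resident this costs
        // little more than the syscall itself.
        madvise(m->addr, m->length, MADV_WILLNEED);
    }

    // ...call prefetch_before_eval(&mapping) at the top of the eval function.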

@danielzgtg commented

I remembered a better option on Linux. We can use MAP_POPULATE with mmap instead of mmap + madvise. This gets the kernel to do all of the loading in just one syscall.

We might still want madvise for the suspend/resume case @comex mentioned, but nobody has complained about that problem on Linux. We also need to keep madvise for macOS, which does not support this flag.

@wtarreau (Contributor) commented Apr 4, 2023

I remembered a better option on Linux. We can use MAP_POPULATE with mmap instead of mmap + madvise. This gets the kernel to do all of the loading in just one syscall.

You're totally right, I almost forgot about it! Just tried it here on my ARM 4GB board and the flash read speed rose from 65-89 MB/s to 92-120 MB/s! On my PC, however, it was the opposite: the load time doubled, from 9.9 s to 20 s. About 1.7 s of those 10 extra seconds were recovered in eval time, likely because the data were already where they were needed, but it's quite strange.

@prusnak (Collaborator, Author) commented Apr 4, 2023

Updated the commit to use MAP_POPULATE on Linux:

     int64_t length = lseek(fd, 0, SEEK_END);
+#ifdef __linux__
+    void *addr = mmap(NULL, length, PROT_READ, MAP_SHARED | MAP_POPULATE, fd, 0);
+#else // MAP_POPULATE is only supported on Linux
     void *addr = mmap(NULL, length, PROT_READ, MAP_SHARED, fd, 0);
+#endif
     close(fd);

@prusnak (Collaborator, Author) commented Apr 4, 2023

We can use MAP_POPULATE with mmap instead of mmap + madvise.

Does it hurt to keep both MAP_POPULATE and madvise on Linux? That is what this PR currently does, but it's trivial to add another ifdef guard around madvise, as sketched below.
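
For illustration, the guarded variant might look like this (a sketch, not the PR's actual diff; map_model is a made-up helper): MAP_POPULATE pre-faults the mapping on Linux, and the extra madvise(MADV_WILLNEED) is only issued on platforms without that flag, such as macOS.

    #include <sys/mman.h>
    #include <stddef.h>

    static void *map_model(int fd, size_t length) {
    #ifdef __linux__
        // MAP_POPULATE makes the kernel read the whole file in during mmap().
        void *addr = mmap(NULL, length, PROT_READ, MAP_SHARED | MAP_POPULATE, fd, 0);
    #else
        void *addr = mmap(NULL, length, PROT_READ, MAP_SHARED, fd, 0);
        if (addr != MAP_FAILED) {
            madvise(addr, length, MADV_WILLNEED);
        }
    #endif
        return addr;
    }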

comex added a commit to comex/llama.cpp that referenced this pull request Apr 6, 2023
Features:

- Support all three formats (ggml, ggmf, ggjt).  (However, I didn't
  include the hack needed to support GPT4All files without conversion.
  Those can still be used after converting them with convert.py from my
  other PR.)

- Support both mmap and read (mmap is used by default, but can be
  disabled with `--no-mmap`, and is automatically disabled for pre-ggjt
  files or on platforms where mmap is not supported).

- Support multi-file models like before, but automatically determine the
  number of parts rather than requiring `--n_parts`.

- Improve validation and error checking.

- Stop using the per-file type field (f16) entirely in favor of just
  relying on the per-tensor type/size fields (which improves
  flexibility, and will make it easier to support the new GPTQ-for-LLaMa
  models in the future).

- Support VirtualLock on Windows (using the same `--mlock` option as on
  Unix).

- madvise/PrefetchVirtualMemory support (based on ggerganov#740)

- Quantization now produces a single-file output even with multi-file
  inputs (not really a feature as much as 'it was easier this way').

Issues:

- I switched from ifstream to fopen/fread, both to avoid the need to open
  the same file again to mmap it, and because I thought it would be
  optimized to skip the buffer for large reads... XXX

- VirtualLock does not work at all on the one Windows VM I tested it on
  (it complains about quota).  Todo: figure out why.

- Need to verify that fread actually is fast.  However, it doesn't work
  when I test it on my VM?  Todo: figure out why.

Implementation notes:

I tried to factor the code across several functions to make it easier to
modify/refactor in the future.

Regarding code style: I tried to follow the code style, but I'm naughty and
used a few advanced C++ features repeatedly:

- Destructors to make it easier to ensure everything gets cleaned up.

- Exceptions.  I don't even usually use exceptions when writing C++, and
  I can remove them if desired... but here they make the loading code
  much more succinct while still properly handling a variety of errors,
  ranging from API calls failing to integer overflow and allocation
  failure.

Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from ggerganov#740)
comex added a commit to comex/llama.cpp that referenced this pull request Apr 6, 2023
Features:

- Support all three formats (ggml, ggmf, ggjt).  (However, I didn't
  include the hack needed to support GPT4All files without conversion.
  Those can still be used after converting them with convert.py from my
  other PR.)

- Support both mmap and read (mmap is used by default, but can be
  disabled with `--no-mmap`, and is automatically disabled for pre-ggjt
  files or on platforms where mmap is not supported).

- Support multi-file models like before, but automatically determine the
  number of parts rather than requiring `--n_parts`.

- Improve validation and error checking.

- Stop using the per-file type field (f16) entirely in favor of just
  relying on the per-tensor type/size fields.  This has no immediate
  benefit, but makes it easier to experiment with different formats, and
  should make it easier to support the new GPTQ-for-LLaMa models in the
  future (I have some work in progress on that front).

- Support VirtualLock on Windows (using the same `--mlock` option as on
  Unix).

- madvise/PrefetchVirtualMemory support (based on ggerganov#740)

- Switch from ifstream to the `fopen` family of functions to avoid
  unnecessary copying and, when mmap is enabled, allow reusing the same
  file descriptor for both metadata reads and mmap (whereas the existing
  implementation opens the file a second time to mmap).

- Quantization now produces a single-file output even with multi-file
  inputs (not really a feature as much as 'it was easier this way').

Todo:

- **VirtualLock does not work at all** on the one Windows machine I tested
  it on (it complains about quota).  Figure out why.

- Verify that using the `fopen` family of functions actually does what I
  think it does, performance-wise.

- More testing.

Implementation notes:

I tried to factor the code into more discrete pieces than before.

Regarding code style: I tried to follow the code style, but I'm naughty and
used a few advanced C++ features repeatedly:

- Destructors to make it easier to ensure everything gets cleaned up.

- Exceptions.  I don't even usually use exceptions when writing C++, and
  I can remove them if desired... but here they make the loading code
  much more succinct while still properly handling a variety of errors,
  ranging from API calls failing to integer overflow and allocation
  failure.

Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from ggerganov#740)
comex added a commit to comex/llama.cpp that referenced this pull request Apr 6, 2023
Features:

- Support all three formats (ggml, ggmf, ggjt).  (However, I didn't
  include the hack needed to support GPT4All files without conversion.
  Those can still be used after converting them with convert.py from my
  other PR.)

- Support both mmap and read (mmap is used by default, but can be
  disabled with `--no-mmap`, and is automatically disabled for pre-ggjt
  files or on platforms where mmap is not supported).

- Support multi-file models like before, but automatically determine the
  number of parts rather than requiring `--n_parts`.

- Improve validation and error checking.

- Stop using the per-file type field (f16) entirely in favor of just
  relying on the per-tensor type/size fields.  This has no immediate
  benefit, but makes it easier to experiment with different formats, and
  should make it easier to support the new GPTQ-for-LLaMa models in the
  future (I have some work in progress on that front).

- Support VirtualLock on Windows (using the same `--mlock` option as on
  Unix).

- Indicate loading progress when using mmap + mlock.  (Which led me to
  the interesting observation that on my Linux machine, with a warm file
  cache, mlock actually takes some time, whereas mmap without mlock
  starts almost instantly...)

  - To help implement this, move mlock support from ggml to the loading
    code.

- madvise/PrefetchVirtualMemory support (based on ggerganov#740)

- Switch from ifstream to the `fopen` family of functions to avoid
  unnecessary copying and, when mmap is enabled, allow reusing the same
  file descriptor for both metadata reads and mmap (whereas the existing
  implementation opens the file a second time to mmap).

- Quantization now produces a single-file output even with multi-file
  inputs (not really a feature as much as 'it was easier this way').

Todo:

- **VirtualLock does not work at all** on the one Windows machine I tested
  it on (it complains about quota).  Figure out why.

- Verify that using the `fopen` family of functions actually does what I
  think it does, performance-wise.

- More testing.

Implementation notes:

I tried to factor the code into more discrete pieces than before.

Regarding code style: I tried to follow the code style, but I'm naughty and
used a few advanced C++ features repeatedly:

- Destructors to make it easier to ensure everything gets cleaned up.

- Exceptions.  I don't even usually use exceptions when writing C++, and
  I can remove them if desired... but here they make the loading code
  much more succinct while still properly handling a variety of errors,
  ranging from API calls failing to integer overflow and allocation
  failure.

Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from ggerganov#740)
comex added a commit to comex/llama.cpp that referenced this pull request Apr 8, 2023
- Support all three formats (ggml, ggmf, ggjt).  (However, I didn't
  include the hack needed to support GPT4All files without conversion.
  Those can still be used after converting them with convert.py from my
  other PR.)

- Support both mmap and read (mmap is used by default, but can be
  disabled with `--no-mmap`, and is automatically disabled for pre-ggjt
  files or on platforms where mmap is not supported).

- Support multi-file models like before, but automatically determine the
  number of parts rather than requiring `--n_parts`.

- Improve validation and error checking.

- Stop using the per-file type field (f16) entirely in favor of just
  relying on the per-tensor type/size fields.  This has no immediate
  benefit, but makes it easier to experiment with different formats, and
  should make it easier to support the new GPTQ-for-LLaMa models in the
  future (I have some work in progress on that front).

- Support VirtualLock on Windows (using the same `--mlock` option as on
  Unix).

    - Indicate loading progress when using mmap + mlock.  (Which led me
      to the interesting observation that on my Linux machine, with a
      warm file cache, mlock actually takes some time, whereas mmap
      without mlock starts almost instantly...)

      - To help implement this, move mlock support from ggml to the
        loading code.

- madvise/PrefetchVirtualMemory support (based on ggerganov#740)

- Switch from ifstream to the `fopen` family of functions to avoid
  unnecessary copying and, when mmap is enabled, allow reusing the same
  file descriptor for both metadata reads and mmap (whereas the existing
  implementation opens the file a second time to mmap).

- Quantization now produces a single-file output even with multi-file
  inputs (not really a feature as much as 'it was easier this way').

Implementation notes:

I tried to factor the code into more discrete pieces than before.

Regarding code style: I tried to follow the code style, but I'm naughty
and used a few advanced C++ features repeatedly:

- Destructors to make it easier to ensure everything gets cleaned up.

- Exceptions.  I don't even usually use exceptions when writing C++, and
  I can remove them if desired... but here they make the loading code
  much more succinct while still properly handling a variety of errors,
  ranging from API calls failing to integer overflow and allocation
  failure.  (The exceptions are converted to error codes at the
  API boundary.)

Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from ggerganov#740)
blackhole89 pushed a commit that referenced this pull request Apr 9, 2023
- Support all three formats (ggml, ggmf, ggjt).  (However, I didn't
  include the hack needed to support GPT4All files without conversion.
  Those can still be used after converting them with convert.py from my
  other PR.)

- Support both mmap and read (mmap is used by default, but can be
  disabled with `--no-mmap`, and is automatically disabled for pre-ggjt
  files or on platforms where mmap is not supported).

- Support multi-file models like before, but automatically determine the
  number of parts rather than requiring `--n_parts`.

- Improve validation and error checking.

- Stop using the per-file type field (f16) entirely in favor of just
  relying on the per-tensor type/size fields.  This has no immediate
  benefit, but makes it easier to experiment with different formats, and
  should make it easier to support the new GPTQ-for-LLaMa models in the
  future (I have some work in progress on that front).

- Support VirtualLock on Windows (using the same `--mlock` option as on
  Unix).

    - Indicate loading progress when using mmap + mlock.  (Which led me
      to the interesting observation that on my Linux machine, with a
      warm file cache, mlock actually takes some time, whereas mmap
      without mlock starts almost instantly...)

      - To help implement this, move mlock support from ggml to the
        loading code.

- madvise/PrefetchVirtualMemory support (based on #740)

- Switch from ifstream to the `fopen` family of functions to avoid
  unnecessary copying and, when mmap is enabled, allow reusing the same
  file descriptor for both metadata reads and mmap (whereas the existing
  implementation opens the file a second time to mmap).

- Quantization now produces a single-file output even with multi-file
  inputs (not really a feature as much as 'it was easier this way').

Implementation notes:

I tried to factor the code into more discrete pieces than before.

Regarding code style: I tried to follow the code style, but I'm naughty
and used a few advanced C++ features repeatedly:

- Destructors to make it easier to ensure everything gets cleaned up.

- Exceptions.  I don't even usually use exceptions when writing C++, and
  I can remove them if desired... but here they make the loading code
  much more succinct while still properly handling a variety of errors,
  ranging from API calls failing to integer overflow and allocation
  failure.  (The exceptions are converted to error codes at the
  API boundary.)

Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from #740)
@prusnak (Collaborator, Author) commented Apr 10, 2023

This has been merged as part of #801.

@prusnak closed this Apr 10, 2023
@prusnak deleted the mmap-preload branch April 10, 2023 09:54
Qeeweew pushed a commit to Qeeweew/llama.cpp that referenced this pull request May 17, 2024
- In the launcher, if an existing value is set for a file value (e.g.
  Model), use that file's directory as the initial directory when the
  file dialog is opened with 'Browse'.
- In the launcher, always set the initial directory for 'Load' to cwd.