
Advise the kernel to preload the mapped memory #740

Closed · wants to merge 1 commit

Conversation

@prusnak (Collaborator) commented Apr 3, 2023

Hopefully this helps with the loading times when using mmap() on Windows and Unix (Linux/macOS).

I tested only on macOS, where the load time of the 7B model decreased from 7 seconds to 2 seconds, with no change in inference performance.

This needs further testing, so I am opening this as a draft.

One possible improvement is to call VirtualLock(addr, length) on Windows to lock the mapped region of the process's virtual address space into physical memory, but I need someone to test whether this is needed and helpful.
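
For reference, here is a minimal sketch of the idea (not the exact patch in this PR; the helper name advise_preload is made up and error handling is reduced to a warning). After the file has been mapped, it hints the kernel that the whole mapping will be needed soon: madvise(MADV_WILLNEED) on Unix, PrefetchVirtualMemory on Windows 8 and later.

    #ifdef _WIN32
    #include <windows.h>     // PrefetchVirtualMemory, GetCurrentProcess (Windows 8+)
    #else
    #include <sys/mman.h>    // madvise
    #endif
    #include <stdio.h>

    static void advise_preload(void *addr, size_t length) {
    #ifdef _WIN32
        WIN32_MEMORY_RANGE_ENTRY range = { addr, length };
        // Ask the memory manager to prefetch the whole mapped region.
        if (!PrefetchVirtualMemory(GetCurrentProcess(), 1, &range, 0)) {
            fprintf(stderr, "warning: PrefetchVirtualMemory failed (%lu)\n", GetLastError());
        }
    #else
        // Tell the kernel the whole mapping will be needed soon, so it can read
        // it in aggressively instead of faulting it in page by page.
        if (madvise(addr, length, MADV_WILLNEED) != 0) {
            perror("warning: madvise(MADV_WILLNEED)");
        }
    #endif
    }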

@prusnak (Collaborator, Author) commented Apr 3, 2023

cc @comex @danielzgtg for testing, since this is essentially your idea.

@diimdeep commented Apr 3, 2023

For a clean experiment you need to use different files on disk between runs, because the previously used file still lingers in the page cache and affects subsequent runs.

No change for me. macOS Catalina, LLVM 16, Haswell, 2 cores.

no patch
% bash -c "~/Downloads/rusage ./build16/bin/main -m ./models/ggml-model-q4_0.bin --ignore-eos --keep -1 -n 32 -p \"I love you so much that I would rather die than live without you.\" --mlock -s 7 -t 2"

llama_print_timings:        load time = 58458.70 ms
llama_print_timings:      sample time =    44.36 ms /    32 runs   (    1.39 ms per run)
llama_print_timings: prompt eval time =  5917.82 ms /    16 tokens (  369.86 ms per token)
llama_print_timings:        eval time = 14649.04 ms /    31 runs   (  472.55 ms per run)
llama_print_timings:       total time = 75936.25 ms
RL: took 76,915,464µs wall time
RL: ballooned to 4,321,284kb in size
RL: needed 55,957,104µs cpu (26% kernel)
RL: caused 1,773,952 page faults (45% memcpy)
RL: 236,491 context switches (0% consensual)

second run

llama_print_timings:        load time =  5078.12 ms
llama_print_timings:      sample time =    43.43 ms /    32 runs   (    1.36 ms per run)
llama_print_timings: prompt eval time =  5759.01 ms /    16 tokens (  359.94 ms per token)
llama_print_timings:        eval time = 13144.22 ms /    31 runs   (  424.01 ms per run)
llama_print_timings:       total time = 21058.44 ms
RL: took 22,119,822µs wall time
RL: ballooned to 4,439,024kb in size
RL: needed 40,761,705µs cpu (6% kernel)
RL: caused 1,561,818 page faults (99% memcpy)
RL: 18,908 context switches (0% consensual)

with patch
% bash -c "~/Downloads/rusage ./build_madvise/bin/main -m ./models/ggml-model-q4_0_dub.bin --ignore-eos --keep -1 -n 32 -p \"I love you so much that I would rather die than live without you.\" --mlock -s 7 -t 2"

llama_print_timings:        load time = 62299.05 ms
llama_print_timings:      sample time =    43.28 ms /    32 runs   (    1.35 ms per run)
llama_print_timings: prompt eval time =  6342.92 ms /    16 tokens (  396.43 ms per token)
llama_print_timings:        eval time = 13020.24 ms /    31 runs   (  420.01 ms per run)
llama_print_timings:       total time = 78148.81 ms
RL: took 79,165,368µs wall time
RL: ballooned to 4,523,928kb in size
RL: needed 52,584,490µs cpu (25% kernel)
RL: caused 1,793,321 page faults (45% memcpy)
RL: 254,742 context switches (0% consensual)

second run

llama_print_timings:        load time =  5132.96 ms
llama_print_timings:      sample time =    44.25 ms /    32 runs   (    1.38 ms per run)
llama_print_timings: prompt eval time =  5786.85 ms /    16 tokens (  361.68 ms per token)
llama_print_timings:        eval time = 12903.39 ms /    31 runs   (  416.24 ms per run)
llama_print_timings:       total time = 20866.49 ms
RL: took 21,957,904µs wall time
RL: ballooned to 4,516,600kb in size
RL: needed 40,377,956µs cpu (6% kernel)
RL: caused 1,561,398 page faults (99% memcpy)
RL: 18,235 context switches (0% consensual)

@danielzgtg commented

Testing with ./main -m ./models/7B/ggml-model-q4_0.bin -n 1 after echo 3 > /proc/sys/vm/drop_caches. Before is 437e778 and after is eaec9b6:

Linux HDD Before

llama_print_timings:        load time = 56575.00 ms
llama_print_timings:        load time = 57326.13 ms
llama_print_timings:        load time = 56969.07 ms

Linux SSD Before

llama_print_timings:        load time =  9710.73 ms
llama_print_timings:        load time =  9784.26 ms
llama_print_timings:        load time =  9526.75 ms

Linux HDD After

llama_print_timings:        load time = 58036.51 ms
llama_print_timings:        load time = 57874.68 ms
llama_print_timings:        load time = 57408.99 ms

Linux SSD After

llama_print_timings:        load time =  9664.59 ms
llama_print_timings:        load time =  9489.85 ms
llama_print_timings:        load time =  9618.77 ms

Yes, I had this idea 13 hours ago in #693 (comment). No, I could not measure the improvement I predicted on Linux in #734 (comment). The data I gathered shows the difference is not statistically significant, and even setting that aside, the trends for HDD and SSD go in opposite directions for some reason.

I'm certain that this fix will help and is necessary for Windows users. I could test on Windows but it would take me a long time to set things up.

@wtarreau (Contributor) commented Apr 3, 2023

You may also want to try MADV_SEQUENTIAL, which can sometimes make read-ahead more aggressive.

It is also possible that the low performance on some OS is actually caused by too much read-ahead when the data are used in a random order. In this case, experimenting with MADV_RANDOM could help as it will instruct the OS to avoid reading too much ahead.
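
For experimentation, swapping the advice is a one-line change; below is a sketch where the flag is chosen at compile time (LLAMA_MADV_FLAG is a hypothetical macro for testing, not an option of this PR).

    #include <sys/mman.h>
    #include <stdio.h>

    // Build with e.g. -DLLAMA_MADV_FLAG=MADV_SEQUENTIAL or -DLLAMA_MADV_FLAG=MADV_RANDOM
    #ifndef LLAMA_MADV_FLAG
    #define LLAMA_MADV_FLAG MADV_WILLNEED
    #endif

    static void advise_mapping(void *addr, size_t length) {
        if (madvise(addr, length, LLAMA_MADV_FLAG) != 0) {
            perror("madvise");
        }
    }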

@diimdeep commented Apr 3, 2023

I observe that macOS Catalina's default advice, MADV_NORMAL, works best (Intel, Haswell).

MADV_NORMAL

llama_print_timings:        load time = 54875.36 ms
llama_print_timings:      sample time =    44.53 ms /    32 runs   (    1.39 ms per run)
llama_print_timings: prompt eval time =  6328.93 ms /    16 tokens (  395.56 ms per token)
llama_print_timings:        eval time = 14420.26 ms /    31 runs   (  465.17 ms per run)
llama_print_timings:       total time = 72214.52 ms
RL: took 73,331,544µs wall time
RL: ballooned to 4,613,060kb in size
RL: needed 52,990,727µs cpu (22% kernel)
RL: caused 1,772,061 page faults (45% memcpy)
RL: 239,466 context switches (0% consensual)

llama_print_timings:        load time = 57754.30 ms
llama_print_timings:      sample time =    43.03 ms /    32 runs   (    1.34 ms per run)
llama_print_timings: prompt eval time =  5777.97 ms /    16 tokens (  361.12 ms per token)
llama_print_timings:        eval time = 12956.06 ms /    31 runs   (  417.94 ms per run)
llama_print_timings:       total time = 73558.99 ms
RL: took 77,934,852µs wall time
RL: ballooned to 4,409,088kb in size
RL: needed 53,599,758µs cpu (29% kernel)
RL: caused 1,795,243 page faults (45% memcpy)
RL: 256,893 context switches (0% consensual)

MADV_SEQUENTIAL

llama_print_timings:        load time = 119314.96 ms
llama_print_timings:      sample time =    44.42 ms /    32 runs   (    1.39 ms per run)
llama_print_timings: prompt eval time =  7977.75 ms /    16 tokens (  498.61 ms per token)
llama_print_timings:        eval time = 13274.42 ms /    31 runs   (  428.21 ms per run)
llama_print_timings:       total time = 135789.80 ms
RL: took 136,111,828µs wall time
RL: ballooned to 4,540,508kb in size
RL: needed 62,201,584µs cpu (32% kernel)
RL: caused 2,127,817 page faults (54% memcpy)
RL: 605,273 context switches (0% consensual)

llama_print_timings:        load time = 104080.26 ms
llama_print_timings:      sample time =    48.20 ms /    32 runs   (    1.51 ms per run)
llama_print_timings: prompt eval time =  7899.04 ms /    16 tokens (  493.69 ms per token)
llama_print_timings:        eval time = 16277.43 ms /    31 runs   (  525.08 ms per run)
llama_print_timings:       total time = 124140.87 ms
RL: took 124,484,947µs wall time
RL: ballooned to 4,487,740kb in size
RL: needed 62,966,409µs cpu (24% kernel)
RL: caused 2,120,778 page faults (54% memcpy)
RL: 609,990 context switches (0% consensual)

MADV_RANDOM 

llama_print_timings:        load time = 69565.61 ms
llama_print_timings:      sample time =    44.47 ms /    32 runs   (    1.39 ms per run)
llama_print_timings: prompt eval time =  5965.53 ms /    16 tokens (  372.85 ms per token)
llama_print_timings:        eval time = 13802.51 ms /    31 runs   (  445.24 ms per run)
llama_print_timings:       total time = 86211.98 ms
RL: took 86,585,114µs wall time
RL: ballooned to 4,276,708kb in size
RL: needed 56,930,888µs cpu (30% kernel)
RL: caused 2,532,876 page faults (61% memcpy)
RL: 1,000,187 context switches (0% consensual)

llama_print_timings:        load time = 94295.20 ms
llama_print_timings:      sample time =    47.66 ms /    32 runs   (    1.49 ms per run)
llama_print_timings: prompt eval time =  6679.52 ms /    16 tokens (  417.47 ms per token)
llama_print_timings:        eval time = 15375.85 ms /    31 runs   (  496.00 ms per run)
llama_print_timings:       total time = 113062.93 ms
RL: took 113,387,992µs wall time
RL: ballooned to 4,285,588kb in size
RL: needed 61,583,699µs cpu (30% kernel)
RL: caused 2,530,971 page faults (61% memcpy)
RL: 1,024,535 context switches (0% consensual)

@CoderRC commented Apr 3, 2023

Try:
MADV_WILLNEED

@prusnak (Collaborator, Author) commented Apr 3, 2023

Try:
MADV_WILLNEED

It was tried in the post above: #740 (comment)

@comex (Contributor) commented Apr 3, 2023

Looks good. One thing I’d change: instead of running this at load time, run it before every eval. This way, if, say, you’re running an interactive session, and the kernel decides to page out the model while it’s waiting for input, it’ll get paged back in efficiently. (That won’t help if the kernel decides to page out the model in the middle of evaluation, but there’s no way to help that without mlock.)
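
A sketch of that suggestion, with made-up names (model_mapping and prefetch_before_eval are not actual llama.cpp structures): keep the mapping around and repeat the hint at the start of every eval, so pages evicted while the process was idle get prefetched again.

    #include <sys/mman.h>
    #include <stddef.h>

    // Hypothetical bookkeeping; the real llama.cpp structures differ.
    struct model_mapping {
        void  *addr;
        size_t length;
    };

    static void prefetch_before_eval(const struct model_mapping *m) {
        // Cheap to repeat: if the pages are already resident this costs
        // little more than the syscall itself.
        madvise(m->addr, m->length, MADV_WILLNEED);
    }

    // ...call prefetch_before_eval(&mapping) at the top of the eval function.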

@danielzgtg commented

I remembered a better option on Linux. We can use MAP_POPULATE with mmap instead of mmap + madvise. This gets the kernel to do all of the loading in just one syscall.

We might still want madvise for the suspend/resume case @comex mentioned, but nobody has complained about that problem on Linux. We also need to keep madvise for macOS, which does not support this flag.

@wtarreau (Contributor) commented Apr 4, 2023

I remembered a better option on Linux. We can use MAP_POPULATE with mmap instead of mmap + madvise. This gets the kernel to do all of the loading in just one syscall.

You're totally right, I almost forgot about it! Just tried it here on my ARM 4GB board and the flash read speed rose from 65-89 MB/s to 92-120 MB/s! On my PC, however, it was the opposite: the load time doubled, from 9.9 s to 20 s. About 1.7 s of those 10 extra seconds were recovered in eval time, likely because the data were already where they were needed, but it's quite strange.

@prusnak (Collaborator, Author) commented Apr 4, 2023

Updated the commit to use MAP_POPULATE on Linux:

     int64_t length = lseek(fd, 0, SEEK_END);
+#ifdef __linux__
+    void *addr = mmap(NULL, length, PROT_READ, MAP_SHARED | MAP_POPULATE, fd, 0);
+#else // MAP_POPULATE is only supported on Linux
     void *addr = mmap(NULL, length, PROT_READ, MAP_SHARED, fd, 0);
+#endif
     close(fd);

@prusnak (Collaborator, Author) commented Apr 4, 2023

We can use MAP_POPULATE with mmap instead of mmap + madvise.

Does it hurt to keep both MAP_POPULATE and madvise on Linux? That is what this PR currently does, but it's trivial to add another ifdef guard around madvise, as sketched below.
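
For illustration, the guarded variant might look like this (a sketch, not the PR's actual diff; map_model is a made-up helper): MAP_POPULATE pre-faults the mapping on Linux, and the extra madvise(MADV_WILLNEED) is only issued on platforms without that flag, such as macOS.

    #include <sys/mman.h>
    #include <stddef.h>

    static void *map_model(int fd, size_t length) {
    #ifdef __linux__
        // MAP_POPULATE makes the kernel read the whole file in during mmap().
        void *addr = mmap(NULL, length, PROT_READ, MAP_SHARED | MAP_POPULATE, fd, 0);
    #else
        void *addr = mmap(NULL, length, PROT_READ, MAP_SHARED, fd, 0);
        if (addr != MAP_FAILED) {
            madvise(addr, length, MADV_WILLNEED);
        }
    #endif
        return addr;
    }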

comex added a commit to comex/llama.cpp that referenced this pull request Apr 6, 2023
Features:

- Support all three formats (ggml, ggmf, ggjt).  (However, I didn't
  include the hack needed to support GPT4All files without conversion.
  Those can still be used after converting them with convert.py from my
  other PR.)

- Support both mmap and read (mmap is used by default, but can be
  disabled with `--no-mmap`, and is automatically disabled for pre-ggjt
  files or on platforms where mmap is not supported).

- Support multi-file models like before, but automatically determine the
  number of parts rather than requiring `--n_parts`.

- Improve validation and error checking.

- Stop using the per-file type field (f16) entirely in favor of just
  relying on the per-tensor type/size fields (which improves
  flexibility, and will make it easier to support the new GPTQ-for-LLaMa
  models in the future).

- Support VirtualLock on Windows (using the same `--mlock` option as on
  Unix).

- madvise/PrefetchVirtualMemory support (based on ggerganov#740)

- Quantization now produces a single-file output even with multi-file
  inputs (not really a feature as much as 'it was easier this way').

Issues:

- I switched from ifstream to fopen/fread, both to avoid the need to open
  the same file again to mmap it, and because I thought it would be
  optimized to skip the buffer for large reads... XXX

- VirtualLock does not work at all on the one Windows VM I tested it on
  (it complains about quota).  Todo: figure out why.

- Need to verify that fread actually is fast.  However, it doesn't work
  when I test it on my VM?  Todo: figure out why.

Implementation notes:

I tried to factor the code across several functions to make it easier to
modify/refactor in the future.

Regarding code style: I tried to follow the code style, but I'm naughty and
used a few advanced C++ features repeatedly:

- Destructors to make it easier to ensure everything gets cleaned up.

- Exceptions.  I don't even usually use exceptions when writing C++, and
  I can remove them if desired... but here they make the loading code
  much more succinct while still properly handling a variety of errors,
  ranging from API calls failing to integer overflow and allocation
  failure.

Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from ggerganov#740)
comex added a commit to comex/llama.cpp that referenced this pull request Apr 6, 2023
Features:

- Support all three formats (ggml, ggmf, ggjt).  (However, I didn't
  include the hack needed to support GPT4All files without conversion.
  Those can still be used after converting them with convert.py from my
  other PR.)

- Support both mmap and read (mmap is used by default, but can be
  disabled with `--no-mmap`, and is automatically disabled for pre-ggjt
  files or on platforms where mmap is not supported).

- Support multi-file models like before, but automatically determine the
  number of parts rather than requiring `--n_parts`.

- Improve validation and error checking.

- Stop using the per-file type field (f16) entirely in favor of just
  relying on the per-tensor type/size fields.  This has no immediate
  benefit, but makes it easier to experiment with different formats, and
  should make it easier to support the new GPTQ-for-LLaMa models in the
  future (I have some work in progress on that front).

- Support VirtualLock on Windows (using the same `--mlock` option as on
  Unix).

- madvise/PrefetchVirtualMemory support (based on ggerganov#740)

- Switch from ifstream to the `fopen` family of functions to avoid
  unnecessary copying and, when mmap is enabled, allow reusing the same
  file descriptor for both metadata reads and mmap (whereas the existing
  implementation opens the file a second time to mmap).

- Quantization now produces a single-file output even with multi-file
  inputs (not really a feature as much as 'it was easier this way').

Todo:

- **VirtualLock does not work at all** on the one Windows machine I tested
  it on (it complains about quota).  Figure out why.

- Verify that using the `fopen` family of functions actually does what I
  think it does, performance-wise.

- More testing.

Implementation notes:

I tried to factor the code into more discrete pieces than before.

Regarding code style: I tried to follow the code style, but I'm naughty and
used a few advanced C++ features repeatedly:

- Destructors to make it easier to ensure everything gets cleaned up.

- Exceptions.  I don't even usually use exceptions when writing C++, and
  I can remove them if desired... but here they make the loading code
  much more succinct while still properly handling a variety of errors,
  ranging from API calls failing to integer overflow and allocation
  failure.

Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from ggerganov#740)
comex added a commit to comex/llama.cpp that referenced this pull request Apr 6, 2023
Features:

- Support all three formats (ggml, ggmf, ggjt).  (However, I didn't
  include the hack needed to support GPT4All files without conversion.
  Those can still be used after converting them with convert.py from my
  other PR.)

- Support both mmap and read (mmap is used by default, but can be
  disabled with `--no-mmap`, and is automatically disabled for pre-ggjt
  files or on platforms where mmap is not supported).

- Support multi-file models like before, but automatically determine the
  number of parts rather than requiring `--n_parts`.

- Improve validation and error checking.

- Stop using the per-file type field (f16) entirely in favor of just
  relying on the per-tensor type/size fields.  This has no immediate
  benefit, but makes it easier to experiment with different formats, and
  should make it easier to support the new GPTQ-for-LLaMa models in the
  future (I have some work in progress on that front).

- Support VirtualLock on Windows (using the same `--mlock` option as on
  Unix).

- Indicate loading progress when using mmap + mlock.  (Which led me to
  the interesting observation that on my Linux machine, with a warm file
  cache, mlock actually takes some time, whereas mmap without mlock
  starts almost instantly...)

  - To help implement this, move mlock support from ggml to the loading
    code.

- madvise/PrefetchVirtualMemory support (based on ggerganov#740)

- Switch from ifstream to the `fopen` family of functions to avoid
  unnecessary copying and, when mmap is enabled, allow reusing the same
  file descriptor for both metadata reads and mmap (whereas the existing
  implementation opens the file a second time to mmap).

- Quantization now produces a single-file output even with multi-file
  inputs (not really a feature as much as 'it was easier this way').

Todo:

- **VirtualLock does not work at all** on the one Windows machine I tested
  it on (it complains about quota).  Figure out why.

- Verify that using the `fopen` family of functions actually does what I
  think it does, performance-wise.

- More testing.

Implementation notes:

I tried to factor the code into more discrete pieces than before.

Regarding code style: I tried to follow the code style, but I'm naughty and
used a few advanced C++ features repeatedly:

- Destructors to make it easier to ensure everything gets cleaned up.

- Exceptions.  I don't even usually use exceptions when writing C++, and
  I can remove them if desired... but here they make the loading code
  much more succinct while still properly handling a variety of errors,
  ranging from API calls failing to integer overflow and allocation
  failure.

Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from ggerganov#740)
comex added a commit to comex/llama.cpp that referenced this pull request Apr 8, 2023
- Support all three formats (ggml, ggmf, ggjt).  (However, I didn't
  include the hack needed to support GPT4All files without conversion.
  Those can still be used after converting them with convert.py from my
  other PR.)

- Support both mmap and read (mmap is used by default, but can be
  disabled with `--no-mmap`, and is automatically disabled for pre-ggjt
  files or on platforms where mmap is not supported).

- Support multi-file models like before, but automatically determine the
  number of parts rather than requiring `--n_parts`.

- Improve validation and error checking.

- Stop using the per-file type field (f16) entirely in favor of just
  relying on the per-tensor type/size fields.  This has no immediate
  benefit, but makes it easier to experiment with different formats, and
  should make it easier to support the new GPTQ-for-LLaMa models in the
  future (I have some work in progress on that front).

- Support VirtualLock on Windows (using the same `--mlock` option as on
  Unix).

    - Indicate loading progress when using mmap + mlock.  (Which led me
      to the interesting observation that on my Linux machine, with a
      warm file cache, mlock actually takes some time, whereas mmap
      without mlock starts almost instantly...)

      - To help implement this, move mlock support from ggml to the
        loading code.

- madvise/PrefetchVirtualMemory support (based on ggerganov#740)

- Switch from ifstream to the `fopen` family of functions to avoid
  unnecessary copying and, when mmap is enabled, allow reusing the same
  file descriptor for both metadata reads and mmap (whereas the existing
  implementation opens the file a second time to mmap).

- Quantization now produces a single-file output even with multi-file
  inputs (not really a feature as much as 'it was easier this way').

Implementation notes:

I tried to factor the code into more discrete pieces than before.

Regarding code style: I tried to follow the code style, but I'm naughty
and used a few advanced C++ features repeatedly:

- Destructors to make it easier to ensure everything gets cleaned up.

- Exceptions.  I don't even usually use exceptions when writing C++, and
  I can remove them if desired... but here they make the loading code
  much more succinct while still properly handling a variety of errors,
  ranging from API calls failing to integer overflow and allocation
  failure.  (The exceptions are converted to error codes at the
  API boundary.)

Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from ggerganov#740)
blackhole89 pushed a commit that referenced this pull request Apr 9, 2023
- Support all three formats (ggml, ggmf, ggjt).  (However, I didn't
  include the hack needed to support GPT4All files without conversion.
  Those can still be used after converting them with convert.py from my
  other PR.)

- Support both mmap and read (mmap is used by default, but can be
  disabled with `--no-mmap`, and is automatically disabled for pre-ggjt
  files or on platforms where mmap is not supported).

- Support multi-file models like before, but automatically determine the
  number of parts rather than requiring `--n_parts`.

- Improve validation and error checking.

- Stop using the per-file type field (f16) entirely in favor of just
  relying on the per-tensor type/size fields.  This has no immediate
  benefit, but makes it easier to experiment with different formats, and
  should make it easier to support the new GPTQ-for-LLaMa models in the
  future (I have some work in progress on that front).

- Support VirtualLock on Windows (using the same `--mlock` option as on
  Unix).

    - Indicate loading progress when using mmap + mlock.  (Which led me
      to the interesting observation that on my Linux machine, with a
      warm file cache, mlock actually takes some time, whereas mmap
      without mlock starts almost instantly...)

      - To help implement this, move mlock support from ggml to the
        loading code.

- madvise/PrefetchVirtualMemory support (based on #740)

- Switch from ifstream to the `fopen` family of functions to avoid
  unnecessary copying and, when mmap is enabled, allow reusing the same
  file descriptor for both metadata reads and mmap (whereas the existing
  implementation opens the file a second time to mmap).

- Quantization now produces a single-file output even with multi-file
  inputs (not really a feature as much as 'it was easier this way').

Implementation notes:

I tried to factor the code into more discrete pieces than before.

Regarding code style: I tried to follow the code style, but I'm naughty
and used a few advanced C++ features repeatedly:

- Destructors to make it easier to ensure everything gets cleaned up.

- Exceptions.  I don't even usually use exceptions when writing C++, and
  I can remove them if desired... but here they make the loading code
  much more succinct while still properly handling a variety of errors,
  ranging from API calls failing to integer overflow and allocation
  failure.  (The exceptions are converted to error codes at the
  API boundary.)

Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from #740)
@prusnak (Collaborator, Author) commented Apr 10, 2023

This has been merged as part of #801.

@prusnak closed this Apr 10, 2023
@prusnak deleted the mmap-preload branch April 10, 2023 09:54
Qeeweew pushed a commit to Qeeweew/llama.cpp that referenced this pull request May 17, 2024
- In the launcher, if an existing value is set for a file value (e.g.
  Model), use that file's directory as the initial directory when the
  file dialog is opened with 'Browse'.
- In the launcher, always set the initial directory for 'Load' to cwd.