Fix OLMo HF to GGUF conversion #6910

Merged (5 commits) on May 7, 2024

Conversation

nopperl (Contributor) commented Apr 25, 2024

Fix the HF to GGUF conversion of OLMo models:
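This PR was originally titled "Properly set clamp_qkv value in OLMo conversion" (see the title-change event below). A minimal sketch of what passing that value through the converter could look like, assuming the HF config exposes it as clip_qkv and that gguf-py's GGUFWriter provides add_clamp_kqv; neither detail is taken from this PR's actual diff:

# Sketch only (not this PR's diff): a set_gguf_parameters() override for an OLMo model
# class in convert-hf-to-gguf.py. Assumes the HF config field is "clip_qkv" and that
# gguf-py's GGUFWriter exposes add_clamp_kqv() for the "{arch}.attention.clamp_kqv" key.
def set_gguf_parameters(self):
    super().set_gguf_parameters()
    clip_qkv = self.hparams.get("clip_qkv")
    if clip_qkv is not None:
        # OLMo clamps the Q/K/V projections to [-clip_qkv, clip_qkv]
        self.gguf_writer.add_clamp_kqv(clip_qkv)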

josharian commented

I found this PR via #6712, which I am also experiencing. I patched this PR in and got a new failure. llama.cpp version b8c1476 (head as of right now).

$ python convert-hf-to-gguf.py OLMo-7B-hf --outfile olmo-7b
Loading model: OLMo-7B-hf
gguf: This GGUF file is for Little Endian only
Set model parameters
gguf: context length = 2048
gguf: embedding length = 4096
gguf: feed forward length = 11008
gguf: head count = 32
gguf: key-value head count = 32
gguf: rope theta = 10000.0
gguf: file type = 1
Set model tokenizer
chktok: [586, 1744, 33525, 186, 209, 623, 28910, 187, 50276, 187, 50275, 187, 50274, 187, 50273, 187, 14931, 237, 211, 313, 6320, 10, 49042, 116, 325, 224, 14931, 223, 106, 171, 118, 226, 313, 34263, 802, 13511, 261, 32147, 456, 10, 3384, 239, 216, 22692, 101, 236, 14931, 101, 236, 495, 5922, 30057, 495, 20084, 495, 26409, 30057, 20084, 495, 26409, 1610, 495, 26409, 20084, 495, 15, 20, 495, 537, 20, 495, 1051, 20, 209, 18081, 211, 18081, 116, 18081, 230, 39936, 222, 18081, 226, 39936, 213, 18081, 233, 18081, 117, 18081, 242, 39936, 212, 18081, 242, 18081, 97, 18081, 116, 18081, 216, 14931, 235, 212, 3736, 15367, 41197, 13610, 19934, 41869, 21275, 1012, 1047, 18795, 40120, 20422, 241, 16081, 6877, 12880, 11514, 1068, 8713, 38177, 13396, 3415, 9925, 12559, 10453, 1389, 42011, 35033, 34842, 11202, 9739, 9739, 33021, 18963, 4672, 25561, 8220, 309, 1849, 644, 686, 42618, 344, 434, 627, 13, 686, 1848, 368, 2119, 32, 686, 46, 417, 2119, 309, 1833, 1056, 352, 13, 686, 37, 368, 751, 690, 10331, 32, 844, 8, 31516, 247, 8, 77, 45, 50279]
chkhsh: 252ad757e225d729882d4763e69f762dc6311bb819eb2c0288817e7bbe9b99d9


**************************************************************************************
** WARNING: The BPE pre-tokenizer was not recognized!
**          This means that it was not added yet or you are using an older version.
**          Check convert-hf-to-gguf-update.py and update it accordingly.
**
** chkhsh:  252ad757e225d729882d4763e69f762dc6311bb819eb2c0288817e7bbe9b99d9
**************************************************************************************


Traceback (most recent call last):
  File "/Users/josh/x/llama.cpp/convert-hf-to-gguf.py", line 3569, in <module>
    main()
  File "/Users/josh/x/llama.cpp/convert-hf-to-gguf.py", line 3556, in main
    model_instance.set_vocab()
  File "/Users/josh/x/llama.cpp/convert-hf-to-gguf.py", line 103, in set_vocab
    self._set_vocab_gpt2()
  File "/Users/josh/x/llama.cpp/convert-hf-to-gguf.py", line 418, in _set_vocab_gpt2
    tokens, toktypes, tokpre = self.get_vocab_base()
                               ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/josh/x/llama.cpp/convert-hf-to-gguf.py", line 321, in get_vocab_base
    tokpre = self.get_vocab_base_pre(tokenizer)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/josh/x/llama.cpp/convert-hf-to-gguf.py", line 408, in get_vocab_base_pre
    raise NotImplementedError(
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

nopperl (Contributor, Author) commented Apr 29, 2024

It seems like the error comes from the BPE pre-tokenization merged in #6920.
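For reference, this class of error goes away once the new tokenizer hash is registered in the converter. A minimal sketch of the kind of branch added to get_vocab_base_pre() in convert-hf-to-gguf.py, using the chkhsh printed in the log above; the pre-tokenizer name "olmo" is an assumption and must also be handled by llama.cpp itself (see ggerganov's comment further down):

# Sketch only: mirrors the structure of Model.get_vocab_base_pre() in convert-hf-to-gguf.py.
# The hash is the chkhsh from the log above; the name "olmo" is assumed, not confirmed here.
def get_vocab_base_pre_sketch(chkhsh: str) -> str:
    res = None
    if chkhsh == "252ad757e225d729882d4763e69f762dc6311bb819eb2c0288817e7bbe9b99d9":
        # ref: https://huggingface.co/allenai/OLMo-7B-hf
        res = "olmo"
    if res is None:
        raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
    return res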

nopperl force-pushed the fix-olmo-conversion branch 2 times, most recently from d49e252 to 00f3fb6, on May 5, 2024 19:07
nopperl changed the title from "Properly set clamp_qkv value in OLMo conversion" to "Fix OLMo HF to GGUF conversion" on May 5, 2024
nopperl (Contributor, Author) commented May 5, 2024

@josharian I have fixed the conversion issue, it should work now.

github-actions bot commented May 5, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 563 iterations 🚀

Details (performance-related PRs only)
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=8261.22ms p(95)=19563.9ms fails=, finish reason: stop=494 truncated=69
  • Prompt processing (pp): avg=88.17tk/s p(95)=354.58tk/s
  • Token generation (tg): avg=33.34tk/s p(95)=48.04tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=fix-olmo-conversion commit=25be8f5cd5c9ea500d10588ae90c1a51816ad066

[Benchmark charts omitted: llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing for bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 563 iterations.]

lsetiawan commented
@nopperl Thank you for providing this fix. I can confirm that it works. Could someone please merge this PR? I would like to share this capability at the 2024 SciPy Conference.

ggerganov (Owner) commented
Hm, does it really work? llama.cpp does not know how to handle the "olmo" pre-tokenizer. It would crash here:

llama.cpp/llama.cpp, lines 4392 to 4394 in 3af34c1:

} else {
    throw std::runtime_error(format("unknown pre-tokenizer type: '%s'", tokenizer_pre.c_str()));
}

nopperl (Contributor, Author) commented May 7, 2024

@ggerganov You're right; I tested it with an older binary. I'll try to fix it.

Galunid (Collaborator) commented May 7, 2024

It looks alright, I'm downloading the model to test it. If it works I'll merge it, unless there's something more you want to add here?

nopperl (Contributor, Author) commented May 7, 2024

> It looks alright, I'm downloading the model to test it. If it works I'll merge it, unless there's something more you want to add here?

Nice, I don't think there's anything else to add if it works.

Galunid (Collaborator) left a review:

Looks good

Galunid merged commit b6aa670 into ggerganov:master on May 7, 2024. 56 of 61 checks passed.
lsetiawan commented

Wow! Thank you all for your super quick response and @nopperl for working to fix this. I really appreciate everyone's input 😄 This is really exciting!

nopperl (Contributor, Author) commented May 7, 2024

@lsetiawan no problem!

> I would like to share this capability at the 2024 SciPy Conference.

I'm interested in that, could you send me more info on what you're planning to do?

lsetiawan commented

For sure! For anyone interested, my team at the University of Washington Scientific Software Engineering Center has been working on a tutorial on a RAG-based approach using OLMo as the LLM. Since the regular OLMo-7B-Instruct model has very slow inference speed, we've been looking into ways to quantize it and speed things up, especially on CPU; that's how we came across llama.cpp and the progress made integrating OLMo so that it can be converted to GGUF and quantized. Thanks to everyone's contributions, we have successfully converted OLMo-7B-Instruct to GGUF format and quantized it to 4-bit with the Q4_K_M method: https://huggingface.co/ssec-uw/OLMo-7B-Instruct-GGUF

nopperl (Contributor, Author) commented May 8, 2024

@lsetiawan very interesting, nice to see that this contribution is useful to others.

Also great that you were able to convert the instruct model to HF format, which should be more useful for most users. However, I don't think the conversion works properly because it's missing tokenizer.json and tokenizer_config.json. You should be able to use these from allenai/OLMo-7B-hf. I also recommend setting the chat_template in tokenizer_config.json to the one from allenai/OLMo-7B-Instruct, so it can be used automatically.
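A rough sketch of how that could be done with the standard transformers API; the template string and output directory below are placeholders, and the real chat_template should be copied from allenai/OLMo-7B-Instruct's tokenizer_config.json:

# Sketch: reuse the tokenizer files from allenai/OLMo-7B-hf and attach a chat template
# before converting the HF model to GGUF. Placeholder template and output path; not a
# tested recipe for the ssec-uw upload.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-hf")
tokenizer.chat_template = "<paste the chat_template from allenai/OLMo-7B-Instruct here>"
# save_pretrained writes tokenizer.json and tokenizer_config.json (including chat_template)
tokenizer.save_pretrained("OLMo-7B-Instruct-hf")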

Labels: none yet
Projects: none yet

Successfully merging this pull request may close these issues: truly opensource model called olmo

6 participants