How to use a .safetensors model? #688

Closed
lambda-science opened this issue Apr 1, 2023 · 5 comments

@lambda-science

I downloaded the model alpaca-30b-lora-int4 from https://huggingface.co/elinas/alpaca-30b-lora-int4/tree/main
The model is a .safetensors file, in GPTQ format I think.
I need to convert it to a GGML .bin, so I used the script provided in llama.cpp with the command:

python convert-gptq-to-ggml.py models/30B/alpaca-30b-4bit.safetensors models/30B/tokenizer.model models/30B/alpaca-30b-4bit.bin

But I get the following error:

Traceback (most recent call last):
  File "/big/meyer/expe/llama.cpp/convert-gptq-to-ggml.py", line 21, in <module>
    model = torch.load(fname_model, map_location="cpu")
  File "/big/meyer/expe/llama.cpp/.venv/lib/python3.10/site-packages/torch/serialization.py", line 815, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/big/meyer/expe/llama.cpp/.venv/lib/python3.10/site-packages/torch/serialization.py", line 1035, in _legacy_load
    raise RuntimeError("Invalid magic number; corrupt file?")
RuntimeError: Invalid magic number; corrupt file?

How do I use .safetensors models with llama.cpp?
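
For context on the traceback: torch.load only understands pickle-based PyTorch checkpoints, which is why it rejects a .safetensors file with "Invalid magic number; corrupt file?". Below is a minimal sketch of opening the same file with the safetensors package instead; it only reads the raw GPTQ tensors and is not the GGML conversion itself, and it assumes safetensors is installed (pip install safetensors), with the path taken from the command above.

# Minimal sketch: read a .safetensors checkpoint directly instead of using torch.load.
from safetensors.torch import load_file

# load_file returns a plain dict of tensor name -> torch.Tensor;
# torch.load fails on this file because .safetensors is not a pickle.
tensors = load_file("models/30B/alpaca-30b-4bit.safetensors", device="cpu")
for name, t in tensors.items():
    print(name, tuple(t.shape), t.dtype)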

@comex
Contributor

comex commented Apr 1, 2023

My conversion script (#545) will support this soon.

@ghost

ghost commented Apr 6, 2023

I thought I'd give it a spin on some safetensors models:

$ python llama.cpp.convert-script/convert.py --outtype q4_1 --outfile llama.cpp/models/LLaMA/7B/story-llama7b-4bit-32g.bin --vocab-dir llama.cpp/models llama.cpp/models/LLaMA/7B/story-llama7b-4bit-32g.safetensors
Loading model file llama.cpp/models/LLaMA/7B/story-llama7b-4bit-32g.safetensors
Loading vocab file llama.cpp/models/tokenizer.model
Error: Input uses the newer GPTQ-for-LLaMa format (using g_idx), which is not yet natively supported by GGML.  For now you can still convert this model by passing `--outtype f16` to dequantize, but that will result in a much larger output file for no quality benefit.
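
The error above points at the dequantization fallback: with --outtype f16 the 4-bit GPTQ weights are expanded back to full floats using their per-group scales and zero points. Below is a schematic sketch of that group-wise dequantization, w ≈ scale * (q - zero); the names q, scales, zeros and group_size are illustrative only, and the real GPTQ-for-LLaMa files additionally pack the 4-bit values into int32 words and carry the g_idx permutation mentioned in the message.

import numpy as np

# Schematic group-wise dequantization: each group of `group_size` input rows
# shares one scale and one zero point per output column (illustrative layout).
def dequantize(q, scales, zeros, group_size=32):
    out = np.empty(q.shape, dtype=np.float32)
    for g in range(q.shape[0] // group_size):
        rows = slice(g * group_size, (g + 1) * group_size)
        out[rows] = scales[g] * (q[rows].astype(np.float32) - zeros[g])
    return out

# Tiny example: a 64 x 8 weight matrix quantized in two groups of 32 rows.
q = np.random.randint(0, 16, size=(64, 8), dtype=np.uint8)  # unpacked int4 values
scales = (np.random.rand(2, 8) * 0.02).astype(np.float32)   # per group and column
zeros = np.full((2, 8), 8.0, dtype=np.float32)               # per group and column
print(dequantize(q, scales, zeros).shape)                    # (64, 8), float32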

$ python llama.cpp.convert-script/convert.py --outtype f16 --outfile llama.cpp/models/LLaMA/7B/story-llama7b-4bit-32g-f16.bin --vocab-dir llama.cpp/models llama.cpp/models/LLaMA/7B/story-llama7b-4bit-32g.safetensors
Loading model file llama.cpp/models/LLaMA/7B/story-llama7b-4bit-32g.safetensors
Loading vocab file llama.cpp/models/tokenizer.model
Writing vocab...
[1/291] Writing tensor tok_embeddings.weight, size 32000 x 4096...
[2/291] Writing tensor norm.weight, size 4096...
[3/291] Writing tensor output.weight, size 32000 x 4096...
Traceback (most recent call last):
  File "llama.cpp.convert-script/convert.py", line 1053, in <module>
    main()
  File "llama.cpp.convert-script/convert.py", line 1049, in main
    OutputFile.write_all(outfile, params, model, vocab)
  File "llama.cpp.convert-script/convert.py", line 870, in write_all
    for i, ((name, lazy_tensor), ndarray) in enumerate(zip(model.items(), ndarrays)):
  File "llama.cpp.convert-script/convert.py", line 794, in bounded_parallel_map
    result = futures.pop(0).result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "llama.cpp.convert-script/convert.py", line 867, in do_item
    return lazy_tensor.load().to_ggml().ndarray
  File "llama.cpp.convert-script/convert.py", line 439, in load
    ret = self._load()
  File "llama.cpp.convert-script/convert.py", line 446, in load
    return self.load().astype(data_type)
  File "llama.cpp.convert-script/convert.py", line 439, in load
    ret = self._load()
  File "llama.cpp.convert-script/convert.py", line 525, in load
    return lazy_tensor.load().permute(n_head)
  File "llama.cpp.convert-script/convert.py", line 439, in load
    ret = self._load()
  File "llama.cpp.convert-script/convert.py", line 576, in load
    return GPTQForLLaMaQuantizedTensor(model, namebase)
  File "llama.cpp.convert-script/convert.py", line 316, in __init__
    scales = load_unquantized(model[f"{namebase}.scales"], np.float32)
  File "llama.cpp.convert-script/convert.py", line 261, in load_unquantized
    assert tensor.ndarray.dtype == expected_dtype, (tensor.ndarray.dtype, expected_dtype)
AssertionError: (dtype('float16'), <class 'numpy.float32'>)

If I didn't think I'd probably cause even more trouble with clumsy efforts, I'd have a stab at fixing it.
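
For reference, the assertion at the bottom of that traceback compares the dtype of the scales tensor, which this safetensors checkpoint stores as float16, against the float32 the converter expects. Below is a minimal sketch of the kind of fix involved, upcasting instead of asserting; load_unquantized and lazy_tensor here are simplified stand-ins, not the actual convert.py code.

import numpy as np

# Simplified stand-in for the converter's load_unquantized: instead of
# asserting that the array already has the expected dtype, upcast it
# (e.g. float16 scales from a safetensors GPTQ checkpoint -> float32).
def load_unquantized(lazy_tensor, expected_dtype=np.float32):
    arr = lazy_tensor.load().ndarray
    if arr.dtype != expected_dtype:
        arr = arr.astype(expected_dtype)
    return arr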

@comex
Contributor

comex commented Apr 6, 2023

I’ll take a look.

@hughobrien

#545 worked great for this, thanks @comex

@prusnak
Collaborator

prusnak commented Apr 14, 2023

Try the new convert.py script that is now in master.
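
A hypothetical invocation mirroring the commands earlier in this thread; the file names are placeholders and the exact flags available in the current master may differ:

$ python convert.py --outtype f16 --outfile models/7B/ggml-model-f16.bin models/7B/model.safetensors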

prusnak closed this as not planned on Apr 14, 2023
Deadsg pushed a commit to Deadsg/llama.cpp that referenced this issue on Dec 19, 2023 (ggerganov#688):

* Examples from ggml to gguf
* Use gguf file extension

Update examples to use filenames with gguf extension (e.g. llama-model.gguf).

Co-authored-by: Andrei <abetlen@gmail.com>