support more dtypes #256

Closed
cztomsik opened this issue May 24, 2023 · 14 comments

@cztomsik

It looks like the format is tightly coupled to what can be returned as TensorView.

This is problematic if you want to introduce new dtypes, like Q8_0 from ggml, which is not strictly just a flat &[u8] of one element type (it also stores one f16 scale for every 32 bytes of quants). Other QX_X formats would be even more problematic.
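For illustration, here is a rough sketch of the Q8_0 layout (a sketch of my understanding of the block format, not ggml's actual C code):

```python
import numpy as np

QK8_0 = 32  # values per Q8_0 block

def dequantize_q8_0(buf: bytes, n_values: int) -> np.ndarray:
    """Sketch of Q8_0 dequantization: each block is one f16 scale
    followed by 32 int8 quants, i.e. 34 bytes per 32 values."""
    assert n_values % QK8_0 == 0
    out = np.empty(n_values, dtype=np.float32)
    block_nbytes = 2 + QK8_0  # 2-byte f16 scale + 32 one-byte quants
    for i in range(n_values // QK8_0):
        block = buf[i * block_nbytes:(i + 1) * block_nbytes]
        scale = np.frombuffer(block[:2], dtype=np.float16)[0]
        quants = np.frombuffer(block[2:], dtype=np.int8)
        out[i * QK8_0:(i + 1) * QK8_0] = np.float32(scale) * quants
    return out
```

So the buffer length is not shape × element size for any single element type, which is exactly what the current TensorView model assumes.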

If safetensors is supposed to be the one, shared format, then it needs to support unknown dtypes.

This is a little bit related to #254 but the discussion there went in a very different direction, so I think it's more appropriate to open a new issue.

@Narsil
Collaborator

Narsil commented May 25, 2023

Hey thanks,

First of all yes, I think support for QX_Y_Z would be nice. There's even a PR open for it #197.

I think safetensors needs AI-specific dtypes; one of the reasons to justify its existence was the absence of bfloat16 support in traditional formats.

So adding dtypes is definitely on the table.

Why it isn't already merged:

  1. GGML has already changed its layout twice, breaking compatibility each time, which is something we can hopefully avoid altogether.

  2. GPTQ, which is also 4-bit quantization, already works out of the box with safetensors: https://github.com/qwopqwop200/GPTQ-for-LLaMa. It uses only native types (int32 instead of an actual q4 dtype, that's the only trickery; zeros and scales are fully represented in their native type, meaning they can be either f16 or f32). I think this approach has merits compared to Qx_y_z. First of all, it really uses correct dtypes (int32 for q4 is arguable, but since torch doesn't know q4, you wouldn't be able to use a q4 tensor with torch anyway). That makes everything like alignment and offset jumps work out of the box, which I think is a nice property.
    Furthermore, it allows doing TP (tensor-parallel) sharding, which is pretty cool on GPU: [WIP] Adding GPTQ support for llama text-generation-inference#267 (we're working on a cleaner integration). This sharding lets us get the same latency as f16 inference with 4 times less RAM used; overall it cut our inference costs in half. Here is specifically the sharding code: https://github.com/huggingface/text-generation-inference/pull/267/files#diff-dfc5ec09e8b044b6547f6be43fce7be9bf8af736997fdae0c8e18fa0ac87445bR197-R246
    Working out sharding on qx_y_z seems much harder.

  3. qx_y_z doesn't exist in any of the Python frameworks (PT, TF and so on). This is not a real blocker by itself, just something to be aware of: we would effectively be sacrificing usability on some platforms for some models.

  4. qx_y_z is not 1 dtype, it's a family of dtypes: the bit width can be pretty much any number (2, 3, 4 or 5 bits), the variant is 0 or 1, and worse, z (the group size) can be pretty much anything (mostly 32 or 128, but anything goes; in GPTQ you can even specify nothing, which means a full row, whose length is defined by the shape). The problem is that this makes reasoning about such types much more complex, and not very resilient imo. We could be 3 weeks away from some new method doing 2D packing where we would need 2 sizes (32, 128) to represent the rectangles of packed numbers. None of this is a blocker by itself, just something that makes me think a format aimed at being sustainable should wait at least a bit (think a few months) before implementing something. I know that means years in the AI world, but it should help us avoid having to deal in the future with 200 dtypes nobody uses anymore.

The fact that 2/ exists (and yields much better qualitative results than qx_y_z in our internal tests) and that 4/ is true suggests to me that maybe qx_y_z in general is a bad idea. If we could lay things out like GPTQ and only have to deal with INT{2,3,4,5}, it would make everything much simpler. (I haven't had time to implement and benchmark it yet, but I'm curious to see whether packing zeros and scales along with the numbers has performance benefits in ggml.)
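To make the layout difference concrete, here is a minimal sketch of the GPTQ-style packing idea (a hypothetical helper, not the actual GPTQ-for-LLaMa code): eight 4-bit values go into one int32 word, while scales and zero-points stay ordinary tensors in a native dtype.

```python
import numpy as np

def pack_uint4_rows(q: np.ndarray) -> np.ndarray:
    """Pack unsigned 4-bit values (stored as uint8 in 0..15) into int32 words,
    eight values per word along the row axis. Scales and zero-points are kept
    as separate tensors in a native dtype (f16 or f32), so every stored tensor
    has a dtype safetensors already understands."""
    assert q.dtype == np.uint8 and q.ndim == 2 and q.shape[0] % 8 == 0
    packed = np.zeros((q.shape[0] // 8, q.shape[1]), dtype=np.int32)
    for i in range(8):
        packed |= q[i::8].astype(np.int32) << (4 * i)
    return packed

q = np.random.randint(0, 16, size=(16, 4), dtype=np.uint8)  # toy 4-bit weights
packed = pack_uint4_rows(q)  # shape (2, 4), a plain int32 tensor
```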

That's my current personal reasoning around delaying full support for QX_Y_Z.

Now, arbitrary dtypes.

Allowing arbitrary dtypes means:

  • Breaking the check that shape x dtype = buffer length (see the sketch after this list). Since an unknown type takes an arbitrary number of bytes, there's no way to know whether a buffer is valid for a given shape. This is pretty big IMHO, since it means you can have files which are super hard to "understand" and reason about: you could have a single-tensor file of shape (2, 2) that is several GB. This opens up a can of worms imo. I would much rather deal with the combinatorics of concrete types with proper checking than open that door. I could be wrong, but this is my first impression.
  • Having unknown as a dtype in the file means the file is not usable without external knowledge. This is true of every file, since if you don't know how to sequence the operations of a given ML model you can't use it effectively. However, here it would be worse: you could know what the model is and how the compute is supposed to go, but still not know the actual dtype (qx_y_z), because it's a model you found randomly and you have no way of knowing whether it's q4_0_128 or q5_1_32. You might be able to figure it out from the shapes and whatnot, but you could end up in an ambiguity that is undecidable. To resolve that ambiguity we would need some way of specifying an arbitrary 'real' dtype along with the 'unknown' one. But if a user invents a custom dtype f3_128_254, then what? This is also something I feel very uneasy about maintaining, since I see lots of opportunities for abuse.
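As a minimal illustration of the first point (a sketch, not the actual safetensors validation code), the whole check boils down to a lookup table that an unknown dtype simply cannot appear in:

```python
from math import prod

# Subset of dtype sizes for illustration; the real format covers more dtypes.
DTYPE_SIZE = {"F64": 8, "F32": 4, "F16": 2, "BF16": 2, "I32": 4, "U8": 1}

def expected_nbytes(shape, dtype):
    # The invariant: the buffer length is fully determined by shape and dtype.
    # An "unknown" dtype has no entry here, so any byte length would have to
    # be accepted on faith.
    return prod(shape) * DTYPE_SIZE[dtype]

assert expected_nbytes((2, 2), "F32") == 16
```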

Overall I think supporting qx_y_z makes more sense than arbitrary dtypes.

TL;DR

GPTQ does quantization with a different layout than GGML, one that already works out of the box with safetensors. Unless there's a proven, sizable benefit to doing qx_y_z instead of the GPTQ way (like performance), I would much rather suggest everyone do it like GPTQ, since it would actually be much simpler for everything.

NB: GPTQ lies about having int32 tensors, which are in fact packed uint4; adding int4 would be OK in my book.

NBB: Also, for adding dtypes, I think having several distinct, widely used libraries relying on a dtype helps justify adding it. Currently safetensors doesn't support any sparse tensors; there are also a lot of different layouts in different libraries, with none (afaik) having super widely used models on top.

@cztomsik
Author

cztomsik commented May 25, 2023

First, thanks for your time and for a very detailed answer (lots of interesting information I was not aware of). I see where you are heading and I sort of agree, but what I'm requesting should be compatible with everything you said.

Let me explain briefly (and honestly).

  • I am using GGML, and that is not going to change, no matter what quantization format is better or more popular
  • I need a way to load the data into GGML somehow, preferably without introducing a custom format
  • initially, I almost re-invented safetensors, because I thought: let's just slap a JSON metadata header at the start of the file and read/write all the tensor data as bytes after it, which is basically what this library does (for Python and for Rust)
  • I am not using Python or Rust for loading, so I had to implement custom parsing anyway
  • Therefore, I am not using the safetensors library, I am only using the format/specification
  • I wanted to follow the spec, because I think it's better than introducing another format
  • The "spec" (which is currently defined by the Rust code) does not support GGML dtypes, so if I create such a file, it will not be a valid safetensors file.
  • I just wanted to ask if it's possible to extend the spec

To be a bit clearer: I don't request any changes in the Rust implementation except extending the enum, and that should be it. You can even fail if someone tries to read that data; I'm not requesting full support.

EDIT: I have a feeling that people in the other thread might be fine with this solution as well. I believe they also never manipulate the tensors and only do the pass-through because GGML does everything.

@cztomsik
Author

cztomsik commented May 31, 2023

@Narsil do you have any comments on this? GGML just got support for its own format in the core C lib, so the choice is currently:

  • write quantized .safetensors file with custom dtype, which is invalid by the spec
  • use .ggml, which supports quantization

I would love to use safetensors format, but I cannot just sit and wait...

@cztomsik
Author

cztomsik commented Jun 1, 2023

FYI ggerganov/ggml#220

The only thing supported right now is the export/import of the computational graph. Everything else (versions, hparams, vocabs) is being discussed in the thread (but it will probably happen).

I think we can close this issue.

@KerfuffleV2

write quantized .safetensors file with custom dtype, which is invalid by the spec

You can do this without breaking the spec, although tools that deal with the data would have to be aware of the approach I'm about to describe.

You can just put your data in u8 tensors and then add extra information (like describing the dtype) in the string key/value metadata the SafeTensors format allows. You could even use that approach to add stuff like vocabulary.
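Something like this, as a rough sketch with the Python bindings (the metadata key naming convention here is made up, not part of the spec):

```python
import numpy as np
from safetensors import safe_open
from safetensors.numpy import save_file

# Stand-in for real Q8_0 data: two 34-byte blocks of zeros.
q8_0_bytes = bytes(34 * 2)

# Store the raw quantized payload as a plain U8 tensor; the "real" dtype and
# logical shape go into the free-form string metadata in the header.
raw = np.frombuffer(q8_0_bytes, dtype=np.uint8)
save_file(
    {"tok_embeddings.weight": raw},
    "model.safetensors",
    metadata={
        "tok_embeddings.weight.dtype": "ggml.q8_0",
        "tok_embeddings.weight.shape": "2,32",
    },
)

with safe_open("model.safetensors", framework="numpy") as f:
    meta = f.metadata()                           # custom dtype/shape strings
    blob = f.get_tensor("tok_embeddings.weight")  # raw bytes back as uint8
```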

@cztomsik
Author

cztomsik commented Jun 2, 2023

Yes, that should work but it's a hack, not a solution :)

@KerfuffleV2

Yes, that should work but it's a hack, not a solution :)

True... I'd say adding dtypes for stuff like GGML would probably have to involve something similar though. I mean, formats get added/removed relatively often. Also Q5_1 today doesn't necessarily mean the same tensor data as it did a month ago.

Seems like it would be impractical for container formats to try to manage something like that. I feel like the only practical way would be to do something like make the "hack" official and allow the dtype to be a string and the data to be whatever.

You might skip having to rely on the K/V metadata stuff that way, but generally it wouldn't really be a much different concept.

@cztomsik
Author

cztomsik commented Jun 2, 2023

make the "hack" official and allow the dtype to be a string and the data to be whatever.

Yes, I believe this is the best approach. I think whoever is using GGML is simply passing that pointer somewhere else, and because of that, I don't think it's necessary to be able to obtain "valid" views.

I know rust people care a lot about safety but I think this is inevitable in the long run.

@KerfuffleV2

I know rust people care a lot about safety but I think this is inevitable in the long run.

If I'm understanding correctly, I don't really think this makes a difference from a Rust safety perspective.

Whether there's an official dtype DType::Ggml_Q5_1, an official string dtype like DType::String("ggml.q5_1") or something like { "tensorname": { "dtype": "ggml.q5_1" } } in the metadata section, if that metadata in any form is lying, you're going to have a bad time no matter what. If the format is mangled, you're going to have problems, etc.

So that's not a Rust safety thing.

@cztomsik
Author

cztomsik commented Jun 2, 2023

Oh, I've just checked the code and you are probably right. I thought the dtype was bound to the TensorView, but looking at it now, it only returns &[u8] and not the actual type, so that should be fine.

Sorry for the confusion.

@akx
Contributor

akx commented Aug 1, 2023

Support for torch's q* dtypes would be neat too (torch.qint32, torch.qint8, torch.quint8, torch.quint4x2, torch.quint2x4). These are characterized by additional quantizer configuration ("q_per_channel_axis", "q_per_channel_scales", "q_per_channel_zero_points", "q_scale", "q_zero_point", "qscheme"; possibly more in the future), so that would probably belong in the tensor data header.
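For reference, a small sketch of where that quantizer configuration currently lives in torch (standard torch quantization APIs; the scales and zero-points below are just toy values):

```python
import torch

x = torch.randn(4, 8)

# Per-tensor quantization: scale, zero_point and qscheme live on the tensor itself.
qt = torch.quantize_per_tensor(x, scale=0.05, zero_point=0, dtype=torch.quint8)
print(qt.qscheme(), qt.q_scale(), qt.q_zero_point())
print(qt.int_repr().dtype)  # the underlying integer storage (uint8 here)

# Per-channel quantization additionally carries per-channel scales/zero-points and an axis.
scales = torch.full((4,), 0.05)
zero_points = torch.zeros(4, dtype=torch.int64)
qc = torch.quantize_per_channel(x, scales, zero_points, axis=0, dtype=torch.qint8)
print(qc.q_per_channel_axis(), qc.q_per_channel_scales(), qc.q_per_channel_zero_points())
```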

@Narsil
Collaborator

Narsil commented Aug 2, 2023

Very interesting, I wasn't aware of those!

Seems like quantization is in beta: https://pytorch.org/docs/stable/quantization.html.
I cannot find any proper docs for those dtypes.
Still, it's a good signal that quantization is becoming more mature (mostly q4).

I see a need to figure out where all those items are currently stored, and things like that.

@akx
Contributor

akx commented Aug 2, 2023

AFAIU they're currently just metadata on the tensors that you can access with the above-mentioned functions.

I started some preliminary work on supporting them before realizing how hard-coded the current dtypes are and that you can't easily attach metadata to each tensor.


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Dec 13, 2023
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 19, 2023