
SHA256 checksums correctness #374

Closed
anzz1 opened this issue Mar 21, 2023 · 12 comments
Labels: bug (Something isn't working), model (Model specific)

Comments

@anzz1
Contributor

anzz1 commented Mar 21, 2023

Not all of these checksums seem to be correct. Were they calculated with the new "v2" model format after the tokenizer change? PR: #252 Issue: #324

For example, "models/alpaca-7B/ggml-model-q4_0.bin"

v1: 1f582babc2bd56bb63b33141898748657d369fd110c4358b2bc280907882bf13
v2: 8d5562ec1d8a7cfdcf8985a9ddf353339d942c7cf52855a92c9ff59f03b541bc

The SHA256SUMS file has the old v1 hash.
Maybe using a naming scheme like "ggml2-model-q4_0.bin" would be good to differentiate between the versions and avoid confusion.

Originally posted by @anzz1 in #338 (comment)

edit: After converting the models to the new format, I found out that the "v2" hash above is also incorrect.
The sha256 for ./models/alpaca-7B-ggml/ggml-model-q4_0.bin is supposed to be 2fe0cd21df9c235c0d917c14e1b18d2d7320ed5d8abe48545518e96bb4227524
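
For anyone who wants to double-check their own files, a minimal Python sketch for hashing a large model file (the path below is just an example):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    # Hash in 1 MiB chunks; the quantized models are several GiB,
    # so reading the whole file at once is not an option.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_of("models/alpaca-7B-ggml/ggml-model-q4_0.bin"))
```

If the file was converted with the new tokenizer format, this should print the 2fe0cd21... hash above.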

@gjmulder
Collaborator

I'm still in the process of finding/converting the 7B and 13B alpaca models to ggml2.

I'll then recompute all the hashes with the latest build, and also provide a file with the magic numbers and versions for each.

@gjmulder added the bug (Something isn't working) and model (Model specific) labels Mar 22, 2023
@Green-Sky
Collaborator

The new ggml file format has the version number 1, so calling it ggml2 or "v2" is going to cause confusion. The new file format switched the file magic from "ggml" to "ggmf"; maybe we should lean into that.
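
To make that concrete, here is a minimal sketch that reports the format of a file, assuming the magic is stored as a little-endian uint32 at the start of the file (as llama.cpp writes it) and that "ggmf" files follow it with a uint32 version:

```python
import struct
import sys

GGML_MAGIC = 0x67676D6C  # "ggml": old, unversioned format
GGMF_MAGIC = 0x67676D66  # "ggmf": new format, followed by a uint32 version

def describe(path: str) -> str:
    with open(path, "rb") as f:
        (magic,) = struct.unpack("<I", f.read(4))
        if magic == GGML_MAGIC:
            return "ggml (old tokenizer format, no version field)"
        if magic == GGMF_MAGIC:
            (version,) = struct.unpack("<I", f.read(4))
            return f"ggmf version {version} (new tokenizer format)"
        return f"unknown magic 0x{magic:08x}"

print(describe(sys.argv[1]))
```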

@anzz1
Contributor Author

anzz1 commented Mar 23, 2023

Some checksums (q4_0 and gptq-4b quantizations, new tokenizer format)

ggml-q4-checksums.zip

edit: added more checksums

anzz1 added a commit that referenced this issue Mar 23, 2023
Delete this for now to avoid confusion since it contains some wrong checksums from the old tokenizer format
Re-add after #374 is resolved
sw pushed a commit that referenced this issue Mar 23, 2023
Delete this for now to avoid confusion since it contains some wrong checksums from the old tokenizer format
Re-add after #374 is resolved
@gjmulder
Collaborator

Some checksums (q4_0 quantization, new tokenizer format)

ggml-q4_0-checksums.zip

I'd trust your checksums for the alpaca models over mine.

$ sha256sum -c SHA256SUMS.gary
alpaca-13B-ggml/ggml-model-q4_0.bin: FAILED
alpaca-13B-ggml/params.json: FAILED open or read
alpaca-13B-ggml/tokenizer.model: FAILED open or read
alpaca-30B-ggml/ggml-model-q4_0.bin: OK
alpaca-30B-ggml/params.json: OK
alpaca-30B-ggml/tokenizer.model: FAILED open or read
alpaca-7B-ggml/ggml-model-q4_0.bin: FAILED
alpaca-7B-ggml/params.json: FAILED open or read
alpaca-7B-ggml/tokenizer.model: FAILED open or read
llama-13B-ggml/ggml-model-q4_0.bin: OK
llama-13B-ggml/ggml-model-q4_0.bin.1: OK
llama-13B-ggml/params.json: OK
llama-13B-ggml/tokenizer.model: FAILED open or read
llama-30B-ggml/ggml-model-q4_0.bin: OK
llama-30B-ggml/ggml-model-q4_0.bin.1: OK
llama-30B-ggml/ggml-model-q4_0.bin.2: OK
llama-30B-ggml/ggml-model-q4_0.bin.3: OK
llama-30B-ggml/params.json: OK
llama-30B-ggml/tokenizer.model: FAILED open or read
llama-65B-ggml/ggml-model-q4_0.bin: OK
llama-65B-ggml/ggml-model-q4_0.bin.1: OK
llama-65B-ggml/ggml-model-q4_0.bin.2: OK
llama-65B-ggml/ggml-model-q4_0.bin.3: OK
llama-65B-ggml/ggml-model-q4_0.bin.4: OK
llama-65B-ggml/ggml-model-q4_0.bin.5: OK
llama-65B-ggml/ggml-model-q4_0.bin.6: OK
llama-65B-ggml/ggml-model-q4_0.bin.7: OK
llama-65B-ggml/params.json: OK
llama-65B-ggml/tokenizer.model: FAILED open or read
llama-7B-ggml/ggml-model-q4_0.bin: OK
llama-7B-ggml/params.json: OK
llama-7B-ggml/tokenizer.model: FAILED open or read

@Green-Sky
Collaborator

The problem with the alpaca models is that there are a lot of different ones, by different people.

@gjmulder
Collaborator

The problem with the alpaca models is that there are a lot of different ones, by different people.

Yes. However, we're supporting them, so we need to decide what we can support.

@gjmulder
Collaborator

Upvote for @anzz1's new naming convention for the various model subdirs.

@Green-Sky
Collaborator

@anzz1 Why is the tokenizer.model duplicated everywhere? AFAIK there is only one.

@anzz1
Contributor Author

anzz1 commented Mar 23, 2023

@Green-Sky Yeah, there is only one; I might be thinking ahead too much. 😄

Also added some more checksums for gptq-4b models above: #374 (comment)

@Green-Sky
Collaborator

IMHO, we should move the alpaca checksums to a discussion, with a thread for each individual model, with source, credits, and converted checksums.
I don't think we can tame the diverse 🦙 herd otherwise.

@gjmulder
Collaborator

How about an individual SHA256SUMS.model_type file per model type?

That way we have some granularity and it is self-documenting for new users who don't know a llama from an alpaca.

@anzz1
Contributor Author

anzz1 commented Mar 23, 2023

Yes, it might be good to differentiate them, as some have short fur and some long, and some are more friendly than others.
But llamas will always be llamas, and alpacas will be many. Llamas are stable, but alpacas are wild cards. I don't see much value in documenting a million different alpaca variations; there should be a standard set to test against, but otherwise there's no point in trying to document every grain of sand on the beach.

One "standard" sum per model type seems to make the most sense. I can't see why they would need to be their own files though, as I'm not a big fan of littering the repo with dozens of files when the same thing can be achieved with dozens of lines in a single file.

I agree this should be moved to discussions, as it will be an ongoing thing.

Repository owner locked and limited conversation to collaborators Mar 23, 2023
@anzz1 anzz1 converted this issue into discussion #433 Mar 23, 2023

