Merge with llamacpp master #2

calvintwr · 2024-01-20T03:29:49Z

No description provided.

* winogrande: simple implementation

  It doesn't look like it is working - why? For Mistral-7B it is barely better than random chance (score ~60% for 1267 tasks), while I see Mistral-7B scoring 78.4% on the HF leaderboard. The 1-sigma statistical uncertainty for 1267 tasks is ~1.4%, so there is no way the difference is due to statistics.

* winogrande: somewhat better

  Score for Mistral-7B is now 68.9 on the validation set of winogrande_debiased. Still far from the reported 78.4, but better than what I had before.

* winogrande: improving

  Mistral-7B score is now 73.56. Still not quite 78.4, but getting there. We are also getting a lower score on HellaSwag compared to the HF leaderboard, so I'm not expecting we will get up to 78.4 anyway.

  It looks like it is better to skip the choice word(s) when evaluating the average log-likelihood. This kind of makes sense because a more common word (in Winogrande this is often a name) will have a higher probability without knowing about the follow-up context, and this will skew the log-likelihood towards the more common word. We can only do this if the choice words are not last in the sentence.

  It also looks like it is better to skip the punctuation at the end of the sentence, provided the choice words are not last.

* winogrande: add dataset instructions

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
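The two ideas in the Winogrande notes above (the size of the statistical uncertainty, and scoring only the tokens that follow the choice word) can be sketched in a few lines. This is a minimal illustration, not the code from the `perplexity` tool: the function names, the token-index convention, and the fallback to scoring all tokens when the choice word is last are assumptions made for the example.

```python
import math
from typing import Sequence


def binomial_sigma_percent(accuracy: float, n_tasks: int) -> float:
    """1-sigma uncertainty of a measured accuracy, in percentage points."""
    return 100.0 * math.sqrt(accuracy * (1.0 - accuracy) / n_tasks)


def tail_score(token_logprobs: Sequence[float], choice_end: int,
               skip_last: bool = True) -> float:
    """Average log-probability of the tokens that follow the choice word(s).

    token_logprobs[i] is the log-probability of token i given all previous
    tokens. choice_end is the index of the first token after the choice
    word(s). Both conventions are assumptions of this sketch.
    """
    tail = list(token_logprobs[choice_end:])
    if skip_last and len(tail) > 1:
        tail = tail[:-1]  # drop the final token, assumed to be punctuation
    if not tail:
        # The choice word is last in the sentence, so there is nothing to
        # skip to. Falling back to all tokens is an assumption of this sketch.
        tail = list(token_logprobs)
    return sum(tail) / len(tail)


def pick_choice(logprobs_1: Sequence[float], choice_end_1: int,
                logprobs_2: Sequence[float], choice_end_2: int) -> int:
    """Return 1 or 2, whichever candidate sentence has the higher tail score."""
    score_1 = tail_score(logprobs_1, choice_end_1)
    score_2 = tail_score(logprobs_2, choice_end_2)
    return 1 if score_1 >= score_2 else 2
```

With an accuracy of 0.6 on 1267 tasks, `binomial_sigma_percent` gives a little under 1.4 percentage points, which matches the figure quoted in the first note and is far smaller than the gap to 78.4.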

* perplexity : faster HellaSwag

  ggml-ci

* perplexity : clean-up

  ggml-ci

* perplexity : no need for decode_helper

  ggml-ci

* perplexity : add comments
* perplexity : option to specify max batched tasks via `n_parallel`
* perplexity : remove HellaSwag restriction for n_batch

PR #4818 (merged last week) reintroduced a config check for vocab_size that was addressed in PR #4258 (merged 2023-11-30). Without the fix, llama2 models can't be converted. The error is: `ValueError: The model's vocab size is set to -1 in params.json. Please update it manually. Maybe 32000?`
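The kind of fallback that avoids this error can be sketched as follows. This is a hedged illustration only: `resolve_vocab_size` is a hypothetical helper, not a function in the conversion script, and it assumes that the first dimension of the token-embedding tensor equals the vocabulary size.

```python
from typing import Mapping, Tuple


def resolve_vocab_size(config: Mapping[str, int],
                       embedding_shape: Tuple[int, int]) -> int:
    """Return a usable vocab size even when params.json stores -1.

    config stands in for the parsed params.json; embedding_shape is the
    (n_vocab, n_embd) shape of the token-embedding tensor. Both are
    assumptions of this sketch.
    """
    n_vocab = config.get("vocab_size", -1)
    if n_vocab == -1:
        # Placeholder value: infer the size from the tensor instead of
        # raising a ValueError.
        n_vocab = embedding_shape[0]
    return n_vocab
```

For a config with `vocab_size` set to -1 and an embedding of shape `(32000, 4096)`, the helper returns 32000, the value the error message suggests setting by hand.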

* falcon arch fix for tied output embeddings
* Update llama.cpp

  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update llama.cpp
* Update llama.cpp

  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

ptsochantaris and others added 21 commits January 18, 2024 10:47

metal : fix memory leak, dangling pointer and unused autorel (#5007)

1e605f4

* Metal memory: Small memory leak on init, dangling pointer, and unused autorelease pool in graph compute
* SPM header potential fix
* Reverting symlinks

scripts : add helper script to get hellaswag data in txt format

dcad445

HellaSwag: speed up by parallelizing log-prob evaluation (#5020)

3e945cc

For Mistral-7B and fp16, time on my system goes down from 536 seconds to 423 seconds for the full evaluation dataset (10042 tasks).

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
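For a sense of scale, the timings quoted above translate into speedup and throughput as follows. The snippet uses only the three numbers from the commit message; the variable names are illustrative.

```python
# Figures quoted in the commit message above.
N_TASKS = 10042
T_BEFORE_S = 536.0
T_AFTER_S = 423.0

speedup = T_BEFORE_S / T_AFTER_S
time_saved_percent = 100.0 * (T_BEFORE_S - T_AFTER_S) / T_BEFORE_S
tasks_per_s_before = N_TASKS / T_BEFORE_S
tasks_per_s_after = N_TASKS / T_AFTER_S

print(f"speedup: {speedup:.2f}x")
print(f"time saved: {time_saved_percent:.1f}%")
print(f"throughput: {tasks_per_s_before:.1f} -> {tasks_per_s_after:.1f} tasks/s")
```

That is roughly a 1.27x speedup, or about 21% less wall-clock time for the full dataset.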

scripts : add get-winogrande.sh

e9240cd

perplexity : fix winogrande N tasks option

d391ae9

imatrix : fix assert for src0 non-cont check

2d5419d

llama : fix mlock with no-mmap with Metal (#5025)

96d7f56

server : defer tasks when "slot unavailable" (#5018)

821f0a2

* server: defer task when no slot is available
* remove unnecessary log

---------

Co-authored-by: Xuan Son Nguyen <xuanson.nguyen@snowpack.eu>
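The general idea of deferring work when no slot is free, instead of rejecting it, can be sketched as below. This is not the server's implementation (which is C++); `SlotScheduler` and its methods are hypothetical names used only to illustrate the pattern.

```python
from collections import deque
from typing import Any, Deque, Dict, List, Optional


class SlotScheduler:
    """Toy scheduler: a fixed number of slots plus a FIFO of deferred tasks."""

    def __init__(self, n_slots: int) -> None:
        self.free_slots: List[int] = list(range(n_slots))
        self.running: Dict[int, Any] = {}      # slot id -> task
        self.deferred: Deque[Any] = deque()    # tasks waiting for a slot

    def submit(self, task: Any) -> Optional[int]:
        """Start the task if a slot is free, otherwise defer it."""
        if self.free_slots:
            slot = self.free_slots.pop(0)
            self.running[slot] = task
            return slot
        self.deferred.append(task)  # no slot available: queue instead of failing
        return None

    def release(self, slot: int) -> None:
        """Finish the task in a slot and hand the slot to the oldest deferred task."""
        del self.running[slot]
        if self.deferred:
            self.running[slot] = self.deferred.popleft()
        else:
            self.free_slots.append(slot)
```

With a single slot, a second `submit` returns `None` and the task waits in `deferred`; calling `release` on the busy slot then starts it.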

cmake : add ggml public headers (#5011)

9b6ea42

perplexity : faster Winogrande via batching (#5024)

8b20858

* perplexity : faster Winogrande via batching

  ggml-ci

* perplexity : remove unused function
* perplexity : only tokenize selected tasks for Winogrande

perplexity: avoid unnecessary allocations and logit copies (#5035)

993fba8

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

llama : add CodeShell support (#5016)

2b3b999

* llama: add codeshell support
* llama.cpp: fix codeshell with NeoX rope

  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

winogrande: evaluate log-probs in parallel (#5036)

7051aac

This is a relatively minor performance tweak resulting in ~10% speedup on my system.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

py : fix flake8 lint

de9a147

llama : support upcoming Qwen2 (#5037)

9b75cb2

imatrix : add README.md

a5cacb2

finetune : fix ggml_allocr lifetimes (tmp workaround) (#5033)

381ee19

* Fix issue with alloc causing max_compute_size to be calculated
* remove ggml_allocr_free as suggested in issue #4791

calvintwr merged commit c3bc00c into Pints-AI:master Jan 20, 2024
