Add llama compatibility with new ggml quantization #642

Merged: 8 commits, May 21, 2023. Changes from all commits.
10 changes: 8 additions & 2 deletions .gitmodules
@@ -1,3 +1,9 @@
[submodule "llama.cpp"]
path = gpt4all-backend/llama.cpp
[submodule "llama.cpp-230519"]
path = gpt4all-backend/llama.cpp-230519
url = https://github.com/ggerganov/llama.cpp.git
[submodule "llama.cpp-230511"]
path = gpt4all-backend/llama.cpp-230511
url = https://github.com/manyoso/llama.cpp.git
[submodule "llama.cpp-mainline"]
path = gpt4all-backend/llama.cpp-mainline
url = https://github.com/ggerganov/llama.cpp.git
Collaborator:

Ok, ok, I get ya, but this isn't actually pinning them. Also, I think I still want all of them to use the 'manyoso' fork, as this gives us further control, right?

Contributor (Author):

Not sure what you mean; the manyoso fork hasn't been updated to the latest llama.cpp, it's 132 commits behind...

Contributor (Author):

Also, that fork only adds ALiBi, which is only needed for MPT.

Collaborator:

I mean we should update that fork and point to it, I believe. Lemme do that now.

28 changes: 22 additions & 6 deletions gpt4all-backend/CMakeLists.txt
@@ -54,7 +54,9 @@ foreach(BUILD_VARIANT IN LISTS BUILD_VARIANTS)
set(LLAMA_FMA ${GPT4ALL_ALLOW_NON_AVX})

# Include GGML
include_ggml(llama.cpp -${BUILD_VARIANT} ON)
include_ggml(llama.cpp-mainline -mainline-${BUILD_VARIANT} ON)
include_ggml(llama.cpp-230511 -230511-${BUILD_VARIANT} ON)
include_ggml(llama.cpp-230519 -230519-${BUILD_VARIANT} ON)

# Function for preparing individual implementations
function(prepare_target TARGET_NAME BASE_LIB)
@@ -71,18 +73,32 @@ foreach(BUILD_VARIANT IN LISTS BUILD_VARIANTS)
PROPERTY INTERPROCEDURAL_OPTIMIZATION ${IPO_SUPPORTED})
endfunction()

# Add each individual implementation
add_library(llamamodel-${BUILD_VARIANT} SHARED
# Add each individual implementations
Collaborator:

Nitpick: you don't want the plural here.

Contributor (Author):

I noticed that as well, but decided to leave it as is since it's not worth a commit. Will batch this with further things that may come up.

add_library(llamamodel-mainline-${BUILD_VARIANT} SHARED
llamamodel.cpp)
prepare_target(llamamodel llama)
target_compile_definitions(llamamodel-mainline-${BUILD_VARIANT} PRIVATE
LLAMA_VERSIONS=>=3 LLAMA_DATE=999999)
Collaborator:

=>= oh man, CMake... you're killing me

Contributor (Author):

Haha, yup. Looks confusing, is confusing, but does what we need quite flexibly.

Contributor (@imaami), May 21, 2023:

That conditional should probably be changed to a slightly less cursed variant:

#if LLAMA_VERSION <= 123456
// ...
#elif LLAMA_VERSION >= 654321
// ...
#endif

At least then it would be a readily recognizable pattern of tragic stylistic compromise instead of a confusing, entirely new way to crush one's hopes and dreams. It would also shrink the CMake side a little.

Pardon the gallows humour, can't help it whenever pre-processor macros seem necessary. ;)
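
An aside for anyone puzzling over the trick: the value given to LLAMA_VERSIONS on the CMake side is pasted into the source verbatim by the preprocessor, so "version LLAMA_VERSIONS" turns into an ordinary comparison after expansion. A minimal C++ sketch of the mechanism, using the mainline value from the CMakeLists.txt above (the version_supported helper is invented for illustration):

// Sketch: how the CMake-injected comparison macro is consumed.
// Building with -DLLAMA_VERSIONS=>=3 pastes the tokens ">= 3" into place,
// so "version LLAMA_VERSIONS" compiles as "version >= 3".
#include <cstdint>

#ifndef LLAMA_VERSIONS
#define LLAMA_VERSIONS >= 3 // fallback so this sketch compiles standalone
#endif

static bool version_supported(uint32_t version) {
    return version LLAMA_VERSIONS; // mainline: expands to "version >= 3"
}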

prepare_target(llamamodel-mainline llama-mainline)

add_library(llamamodel-230519-${BUILD_VARIANT} SHARED
llamamodel.cpp)
target_compile_definitions(llamamodel-230519-${BUILD_VARIANT} PRIVATE
LLAMA_VERSIONS===2 LLAMA_DATE=230519)
prepare_target(llamamodel-230519 llama-230519)

add_library(llamamodel-230511-${BUILD_VARIANT} SHARED
llamamodel.cpp)
target_compile_definitions(llamamodel-230511-${BUILD_VARIANT} PRIVATE
LLAMA_VERSIONS=<=1 LLAMA_DATE=230511)
prepare_target(llamamodel-230511 llama-230511)

add_library(gptj-${BUILD_VARIANT} SHARED
gptj.cpp)
prepare_target(gptj ggml)
prepare_target(gptj ggml-230511)
Collaborator:

Wait, where are you tagging the actual ggml with this?

Contributor (Author):

llama.cpp.cmake adds the given suffix to ggml as well.


add_library(mpt-${BUILD_VARIANT} SHARED
mpt.cpp)
prepare_target(mpt ggml)
prepare_target(mpt ggml-230511)
endforeach()

add_library(llmodel
6 changes: 4 additions & 2 deletions gpt4all-backend/gptj.cpp
@@ -1,6 +1,5 @@
#define GPTJ_H_I_KNOW_WHAT_I_AM_DOING_WHEN_INCLUDING_THIS_FILE
#include "gptj_impl.h"
#include "llama.cpp/ggml.h"

#include "utils.h"

@@ -26,6 +25,7 @@
#endif
#include <sstream>
#include <unordered_set>
#include <ggml.h>


namespace {
@@ -1133,7 +1133,9 @@ const char *get_build_variant() {
return GGML_BUILD_VARIANT;
}

bool magic_match(uint32_t magic) {
bool magic_match(std::istream& f) {
uint32_t magic = 0;
f.read(reinterpret_cast<char*>(&magic), sizeof(magic));
return magic == 0x67676d6c;
}

1 change: 1 addition & 0 deletions gpt4all-backend/llama.cpp-230519
Submodule llama.cpp-230519 added at 5ea433
1 change: 1 addition & 0 deletions gpt4all-backend/llama.cpp-mainline
Submodule llama.cpp-mainline added at ea6000
8 changes: 7 additions & 1 deletion gpt4all-backend/llama.cpp.cmake
@@ -332,10 +332,16 @@ function(include_ggml DIRECTORY SUFFIX WITH_LLAMA)
endif()

if (WITH_LLAMA)
# Backwards compatibility with old llama.cpp versions
set(LLAMA_UTIL_SOURCE_FILE llama-util.h)
if (NOT EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${DIRECTORY}/${LLAMA_UTIL_SOURCE_FILE})
set(LLAMA_UTIL_SOURCE_FILE llama_util.h)
endif()

add_library(llama${SUFFIX}
${DIRECTORY}/llama.cpp
${DIRECTORY}/llama.h
${DIRECTORY}/llama_util.h)
${DIRECTORY}/${LLAMA_UTIL_SOURCE_FILE})
Collaborator:

This branch doesn't actually introduce this file, right? It exists upstream in one of the pinned submodules?

Contributor (Author):

The filename was changed.


target_include_directories(llama${SUFFIX} PUBLIC ${DIRECTORY})
target_compile_features(llama${SUFFIX} PUBLIC cxx_std_11) # don't bump
77 changes: 60 additions & 17 deletions gpt4all-backend/llamamodel.cpp
@@ -28,14 +28,23 @@
#include <llama.h>
#include <ggml.h>


namespace {
const char *modelType_ = "LLaMA";
}

struct gpt_params {
int32_t seed = -1; // RNG seed
int32_t n_parts = -1; // amount of model parts (-1 = determine from model dimensions)
int32_t n_keep = 0; // number of tokens to keep from initial prompt
#if LLAMA_DATE <= 230511
int32_t n_parts = -1; // amount of model parts (-1 = determine from model dimensions)
#endif
Collaborator:

The crux of it. We're going to use macros...

Contributor (Author):

Our other option would be to have an extensive collection of almost-identical llamamodel.cpp files for different llama.cpp versions.

Collaborator:

No, I think this is the right choice out of a bunch of bad choices.

Contributor:

There's also CRTP and C++ template magic, but I agree it's not the time to go there yet.
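
To make that aside concrete, a CRTP arrangement might look roughly like the sketch below. This is only an illustration of the alternative mentioned above, not code proposed in this PR; the class and member names are invented.

// Hypothetical CRTP sketch: one concrete type per pinned llama.cpp version,
// sharing common logic through a base template instead of #if blocks.
#include <string>

template <typename Impl>
struct LLamaModelBase {
    bool load(const std::string &path) {
        // shared setup, then delegate the version-specific part
        return static_cast<Impl *>(this)->initParams(path);
    }
};

// Would be compiled only into the 230511 backend library.
struct LLamaModel230511 : LLamaModelBase<LLamaModel230511> {
    bool initParams(const std::string &path) {
        return !path.empty(); // old API: n_parts still exists here
    }
};

// Would be compiled only into the mainline backend library.
struct LLamaModelMainline : LLamaModelBase<LLamaModelMainline> {
    bool initParams(const std::string &path) {
        return !path.empty(); // new API: tfs_z / typical_p exist here
    }
};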


#if LLAMA_DATE >= 230519
// sampling parameters
float tfs_z = 1.0f; // 1.0 = disabled
float typical_p = 1.0f; // 1.0 = disabled
#endif

std::string prompt = "";

@@ -45,25 +54,45 @@ struct gpt_params {
bool use_mlock = false; // use mlock to keep model in memory
};

#if LLAMA_DATE >= 230519
static int llama_sample_top_p_top_k(
llama_context *ctx,
const llama_token *last_n_tokens_data,
int last_n_tokens_size,
int top_k,
float top_p,
float temp,
float repeat_penalty) {
auto logits = llama_get_logits(ctx);
auto n_vocab = llama_n_vocab(ctx);
// Populate initial list of all candidates
std::vector<llama_token_data> candidates;
candidates.reserve(n_vocab);
for (int token_id = 0; token_id < n_vocab; token_id++) {
candidates.emplace_back(llama_token_data{token_id, logits[token_id], 0.0f});
}
llama_token_data_array candidates_p = {candidates.data(), candidates.size(), false};
// Sample repeat penalty
llama_sample_repetition_penalty(nullptr, &candidates_p, last_n_tokens_data, last_n_tokens_size, repeat_penalty);
// Temperature sampling
llama_sample_top_k(ctx, &candidates_p, top_k, 1);
llama_sample_tail_free(ctx, &candidates_p, 1.0f, 1);
llama_sample_typical(ctx, &candidates_p, 1.0f, 1);
llama_sample_top_p(ctx, &candidates_p, top_p, 1);
llama_sample_temperature(ctx, &candidates_p, temp);
return llama_sample_token(ctx, &candidates_p);
}
Collaborator:

Going to assume this is giving you sane results? Have you made sure to go through and test models with each of the pinned variants and file formats? Man, we almost want regression or unit tests here...

Contributor (Author):

Yup! I did. Man, was my hard drive full...

Contributor (Author):

This is also how it's done in the llama.cpp main example.

#endif
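
For context, keeping the old llama_sample_top_p_top_k signature means the call site further down in this file can stay version-agnostic. A rough sketch of such a call, assuming the headers llamamodel.cpp already includes; the helper name and parameter values are invented:

// Hypothetical call-site sketch: the shim hides whether the old or the new
// llama.cpp sampling API sits underneath.
static llama_token sample_next(llama_context *ctx,
                               const std::vector<llama_token> &recent) {
    return llama_sample_top_p_top_k(
        ctx,
        recent.data(), (int)recent.size(),
        /*top_k=*/40, /*top_p=*/0.95f,
        /*temp=*/0.7f, /*repeat_penalty=*/1.1f);
}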

struct LLamaPrivate {
const std::string modelPath;
bool modelLoaded;
llama_context *ctx = nullptr;
llama_context_params params;
int64_t n_threads = 0;
bool empty = true;
};


static std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos) {
// initialize to prompt numer of chars, since n_tokens <= n_prompt_chars
std::vector<llama_token> res(text.size() + (int)add_bos);
int n = llama_tokenize(ctx, text.c_str(), res.data(), res.size(), add_bos);
assert(n >= 0);
res.resize(n);

return res;
}

LLamaModel::LLamaModel()
: d_ptr(new LLamaPrivate) {
modelType = modelType_;
@@ -78,11 +107,13 @@ bool LLamaModel::loadModel(const std::string &modelPath)

gpt_params params;
d_ptr->params.n_ctx = 2048;
d_ptr->params.n_parts = params.n_parts;
d_ptr->params.seed = params.seed;
d_ptr->params.f16_kv = params.memory_f16;
d_ptr->params.use_mmap = params.use_mmap;
d_ptr->params.use_mlock = params.use_mlock;
#if LLAMA_DATE <= 230511
d_ptr->params.n_parts = params.n_parts;
#endif

d_ptr->ctx = llama_init_from_file(modelPath.c_str(), d_ptr->params);
if (!d_ptr->ctx) {
@@ -126,7 +157,8 @@ size_t LLamaModel::saveState(uint8_t *dest) const

size_t LLamaModel::restoreState(const uint8_t *src)
{
return llama_set_state_data(d_ptr->ctx, src);
// const_cast is required, see: https://github.com/ggerganov/llama.cpp/pull/1540
return llama_set_state_data(d_ptr->ctx, const_cast<uint8_t*>(src));
}

void LLamaModel::prompt(const std::string &prompt,
@@ -147,7 +179,11 @@ void LLamaModel::prompt(const std::string &prompt,
params.prompt.insert(0, 1, ' ');

// tokenize the prompt
auto embd_inp = ::llama_tokenize(d_ptr->ctx, params.prompt, false);
std::vector<llama_token> embd_inp(params.prompt.size() + 4);
int n = llama_tokenize(d_ptr->ctx, params.prompt.c_str(), embd_inp.data(), embd_inp.size(), d_ptr->empty);
assert(n >= 0);
embd_inp.resize(n);
d_ptr->empty = false;
niansa marked this conversation as resolved.

// save the context size
promptCtx.n_ctx = llama_n_ctx(d_ptr->ctx);
@@ -313,8 +349,15 @@ const char *get_build_variant() {
return GGML_BUILD_VARIANT;
}

bool magic_match(uint32_t magic) {
return magic == 0x67676a74;
bool magic_match(std::istream& f) {
// Check magic
uint32_t magic = 0;
f.read(reinterpret_cast<char*>(&magic), sizeof(magic));
if (magic != 0x67676a74) return false;
// Check version
uint32_t version = 0;
f.read(reinterpret_cast<char*>(&version), sizeof(version));
return version LLAMA_VERSIONS;
}

LLModel *construct() {
16 changes: 7 additions & 9 deletions gpt4all-backend/llmodel.cpp
@@ -9,7 +9,7 @@


static
Dlhandle *get_implementation(uint32_t magic, const std::string& buildVariant) {
Dlhandle *get_implementation(std::ifstream& f, const std::string& buildVariant) {
// Collect all model implementation libraries
static auto libs = [] () {
std::vector<Dlhandle> fres;
@@ -31,9 +31,10 @@ Dlhandle *get_implementation(uint32_t magic, const std::string& buildVariant) {
}();
// Iterate over all libraries
for (auto& dl : libs) {
f.seekg(0);
// Check that magic matches
auto magic_match = dl.get<bool(uint32_t)>("magic_match");
if (!magic_match || !magic_match(magic)) {
auto magic_match = dl.get<bool(std::ifstream&)>("magic_match");
if (!magic_match || !magic_match(f)) {
continue;
}
// Check that build variant is correct
@@ -55,14 +56,11 @@ LLModel *LLModel::construct(const std::string &modelPath, std::string buildVaria
}
// Read magic
std::ifstream f(modelPath, std::ios::binary);
uint32_t magic;
if (!f.read(reinterpret_cast<char*>(&magic), sizeof(magic))) {
return nullptr;
}
f.close();
if (!f) return nullptr;
// Get correct implementation
auto impl = get_implementation(magic, buildVariant);
auto impl = get_implementation(f, buildVariant);
if (!impl) return nullptr;
f.close();
// Get inference constructor
auto constructor = impl->get<LLModel *()>("construct");
if (!constructor) return nullptr;
6 changes: 4 additions & 2 deletions gpt4all-backend/mpt.cpp
@@ -1,6 +1,5 @@
#define MPT_H_I_KNOW_WHAT_I_AM_DOING_WHEN_INCLUDING_THIS_FILE
#include "mpt_impl.h"
#include "llama.cpp/ggml.h"

#include "utils.h"

@@ -29,6 +28,7 @@
#include <thread>
#include <unordered_set>
#include <regex>
#include <ggml.h>


namespace {
@@ -1062,7 +1062,9 @@ const char *get_build_variant() {
return GGML_BUILD_VARIANT;
}

bool magic_match(uint32_t magic) {
bool magic_match(std::istream& f) {
uint32_t magic = 0;
f.read(reinterpret_cast<char*>(&magic), sizeof(magic));
return magic == 0x67676d6d;
}
