docs: update README with info + instructions (#73)

philpax authored Mar 26, 2023 · 1 parent 08b875c · commit e7e7e8a

# LLaMA-rs

<!-- markdownlint-disable-file MD026 -->

> Do the LLaMA thing, but now in Rust 🦀🚀🦙

![A llama riding a crab, AI-generated](./doc/resources/logo2.png)

model on a CPU with good performance using full precision, f16 or 4-bit
quantized versions of the model.

Just like its C++ counterpart, it is powered by the
[`ggml`](https://github.com/ggerganov/ggml) tensor library, achieving the same
performance as the original code.

## Getting started

Make sure you have Rust 1.65.0 or above and a C toolchain[^1] set up, and get a
copy of the model's weights[^2].
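
If you are not sure what is already installed, a quick check along these lines
should confirm both (the `cc` command assumes a Unix-like system; on Windows
the C compiler invocation will differ):

```shell
# Print the installed Rust compiler version (1.65.0 or above is required)
rustc --version

# Print the C compiler version, to confirm a C toolchain is available for `ggml`
cc --version
```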

`llama-rs` is a Rust library, while `llama-cli` is a CLI application that wraps
`llama-rs` and offers basic inference capabilities.

The following instructions explain how to build `llama-cli`.

**NOTE**: For best results, make sure to build and run in release mode.
Debug builds are going to be very slow.

### Building using `cargo`

Run

```shell
cargo install --git https://github.com/rustformers/llama-rs llama-cli
```

to install `llama-cli` to your Cargo `bin` directory, which `rustup` is likely to
have added to your `PATH`.

It can then be run through `llama-cli`.
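
As a quick sanity check that the binary is installed and on your `PATH`, ask it
for its options:

```shell
# List the available options; this should work from any directory
llama-cli --help
```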

### Building from the repository

Clone the repository, and then build it through

```shell
cargo build --release
```
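
For reference, the full sequence might look something like this (assuming
`git` is installed, and using the same repository URL as above):

```shell
# Clone the repository and build `llama-cli` in release mode
git clone https://github.com/rustformers/llama-rs
cd llama-rs
cargo build --release
```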

The resulting binary will be at `target/release/llama-cli[.exe]`.

It can also be run directly through Cargo, using

```shell
cargo run --release -- <ARGS>
```

This is useful for development.
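
For example, `<ARGS>` takes the same flags that `llama-cli` accepts; the model
path below is a placeholder:

```shell
# Everything after `--` is passed through to `llama-cli`
cargo run --release -- -m <path>/ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"
```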

### Running

For example, try the following prompt:

```shell
llama-cli -m <path>/ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"
```

Some additional things to try:

- Use `--help` to see a list of available options.
- If you have the [alpaca-lora](https://github.com/tloen/alpaca-lora) weights,
  try `--repl` mode!

  ```shell
  llama-cli -m <path>/ggml-alpaca-7b-q4.bin -f examples/alpaca_prompt.txt --repl
  ```

![Gif showcasing alpaca repl mode](./doc/resources/alpaca_repl_screencap.gif)


![Gif showcasing prompt caching](./doc/resources/prompt_caching_screencap.gif)

(This GIF shows an older version of the flags, but the mechanics are still the same.)

[^1]:
    A modern-ish C toolchain is required to compile `ggml`. A C++ toolchain
    should not be necessary.

[^2]:
    The only legal source to get the weights at the time of writing is
    [this repository](https://github.com/facebookresearch/llama/blob/main/README.md#llama).
    The choice of words also may or may not hint at the existence of other
    kinds of sources.

## Q&A


### Why did you do this?

It was not my choice. Ferris appeared to me in my dreams and asked me
to rewrite this in the name of the Holy crab.

### Seriously now.

Come on! I don't want to get into a flame war. You know how it goes,
_something something_ memory _something something_ cargo is nice, don't make
me say it, everybody knows this already.

### I insist.

_Sheesh! Okaaay_. After seeing the huge potential for **llama.cpp**,
the first thing I did was to see how hard it would be to turn it into a
library to embed in my projects. I started digging into the code, and realized
the heavy lifting is done by `ggml` (a C library, easy to bind to Rust) and
the whole project was just around ~2k lines of C++ code (not so easy to bind).
After a couple of (failed) attempts to build an HTTP server into the tool, I
realized I'd be much more productive if I just ported the code to Rust, where
I'm more comfortable.

### Is this the real reason?

Haha. Of course _not_. I just like collecting imaginary internet
points, in the form of little stars, that people seem to give to me whenever I
embark on pointless quests for _rewriting X thing, but in Rust_.

### How is this different from `llama.cpp`?

This is a reimplementation of `llama.cpp` that does not share any code with it
outside of `ggml`. This was done for a variety of reasons:

- `llama.cpp` requires a C++ compiler, which can cause problems for
cross-compilation to more esoteric platforms. An example of such a platform
is WebAssembly, which can require a non-standard compiler SDK.
- Rust is easier to work with from a development and open-source perspective;
it offers better tooling for writing "code in the large" with many other
authors. Additionally, we can benefit from the larger Rust ecosystem with
ease.
- We would like to make `ggml` an optional backend
(see [this issue](https://github.com/rustformers/llama-rs/issues/31)).

In general, we hope to build a solution for model inferencing that is as easy
to use and deploy as any other Rust crate.
