docs: update README with info + instructions (#73)

philpax authored Mar 26, 2023 · 1 parent 08b875c · commit e7e7e8a

# LLaMA-rs

<!-- markdownlint-disable-file MD026 -->

> Do the LLaMA thing, but now in Rust 🦀🚀🦙

![A llama riding a crab, AI-generated](./doc/resources/logo2.png)

model on a CPU with good performance using full precision, f16 or 4-bit
quantized versions of the model.

Just like its C++ counterpart, it is powered by the
[`ggml`](https://github.com/ggerganov/ggml) tensor library, achieving the same
performance as the original code.

## Getting started

Make sure you have Rust 1.65.0 or above and a C toolchain[^1] set up, and get a
copy of the model's weights[^2].
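
If you are not sure what is already installed, a quick check along these lines
should confirm both (the `cc` command assumes a Unix-like system; on Windows
the C compiler invocation will differ):

```shell
# Print the installed Rust compiler version (1.65.0 or above is required)
rustc --version

# Print the C compiler version, to confirm a C toolchain is available for `ggml`
cc --version
```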

`llama-rs` is a Rust library, while `llama-cli` is a CLI application that wraps
`llama-rs` and offers basic inference capabilities.

The following instructions explain how to build `llama-cli`.

**NOTE**: For best results, make sure to build and run in release mode.
Debug builds are going to be very slow.

### Building using `cargo`

Run

```shell
cargo install --git https://github.com/rustformers/llama-rs llama-cli
```

to install `llama-cli` to your Cargo `bin` directory, which `rustup` is likely to
have added to your `PATH`.

It can then be run through `llama-cli`.
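
As a quick sanity check that the binary is installed and on your `PATH`, ask it
for its options:

```shell
# List the available options; this should work from any directory
llama-cli --help
```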

### Building from the repository

Clone the repository, and then build it through

```shell
cargo build --release
```
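
For reference, the full sequence might look something like this (assuming
`git` is installed, and using the same repository URL as above):

```shell
# Clone the repository and build `llama-cli` in release mode
git clone https://github.com/rustformers/llama-rs
cd llama-rs
cargo build --release
```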

The resulting binary will be at `target/release/llama-cli[.exe]`.

It can also be run directly through Cargo, using

```shell
cargo run --release -- <ARGS>
```

This is useful for development.
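
For example, `<ARGS>` takes the same flags that `llama-cli` accepts; the model
path below is a placeholder:

```shell
# Everything after `--` is passed through to `llama-cli`
cargo run --release -- -m <path>/ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"
```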

### Running

For example, try the following prompt:

```shell
llama-cli -m <path>/ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"
```

Some additional things to try:

- Use `--help` to see a list of available options.
- If you have the [alpaca-lora](https://github.com/tloen/alpaca-lora) weights,
  try `--repl` mode!

  ```shell
  llama-cli -m <path>/ggml-alpaca-7b-q4.bin -f examples/alpaca_prompt.txt --repl
  ```

![Gif showcasing alpaca repl mode](./doc/resources/alpaca_repl_screencap.gif)


![Gif showcasing prompt caching](./doc/resources/prompt_caching_screencap.gif)

(This GIF shows an older version of the flags, but the mechanics are still the same.)

[^1]:
    A modern-ish C toolchain is required to compile `ggml`. A C++ toolchain
    should not be necessary.

[^2]:
    The only legal source to get the weights at the time of writing is
    [this repository](https://github.com/facebookresearch/llama/blob/main/README.md#llama).
    The choice of words also may or may not hint at the existence of other
    kinds of sources.

## Q&A


### Why did you do this?

It was not my choice. Ferris appeared to me in my dreams and asked me
to rewrite this in the name of the Holy crab.

### Seriously now.

Come on! I don't want to get into a flame war. You know how it goes,
_something something_ memory _something something_ cargo is nice, don't make
me say it, everybody knows this already.

### I insist.

_Sheesh! Okaaay_. After seeing the huge potential for **llama.cpp**,
the first thing I did was to see how hard it would be to turn it into a
library to embed in my projects. I started digging into the code, and realized
the heavy lifting is done by `ggml` (a C library, easy to bind to Rust) and
the whole project was just around ~2k lines of C++ code (not so easy to bind).
After a couple of (failed) attempts to build an HTTP server into the tool, I
realized I'd be much more productive if I just ported the code to Rust, where
I'm more comfortable.

### Is this the real reason?

Haha. Of course _not_. I just like collecting imaginary internet
points, in the form of little stars, that people seem to give to me whenever I
embark on pointless quests for _rewriting X thing, but in Rust_.

### How is this different from `llama.cpp`?

This is a reimplementation of `llama.cpp` that does not share any code with it
outside of `ggml`. This was done for a variety of reasons:

- `llama.cpp` requires a C++ compiler, which can cause problems for
cross-compilation to more esoteric platforms. An example of such a platform
is WebAssembly, which can require a non-standard compiler SDK.
- Rust is easier to work with from a development and open-source perspective;
it offers better tooling for writing "code in the large" with many other
authors. Additionally, we can benefit from the larger Rust ecosystem with
ease.
- We would like to make `ggml` an optional backend
(see [this issue](https://github.com/rustformers/llama-rs/issues/31)).

In general, we hope to build a solution for model inferencing that is as easy
to use and deploy as any other Rust crate.
