[DOCS] Correct typos in FAQ and improve readability #1487

Merged: 5 commits, Mar 19, 2024
38 changes: 19 additions & 19 deletions docs/FAQ.md
@@ -4,22 +4,22 @@

Gradio 4.18.0+ fails to stream audio from the UI; no audio is generated. Waiting for a bug fix: https://github.com/gradio-app/gradio/issues/7497.

Work-around: Use gradio 4.17.0 or lower:
Workaround: Use gradio 4.17.0 or lower:
```bash
pip uninstall gradio gradio_client -y
pip install gradio==4.17.0
```
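
As a quick sanity check after the downgrade (a minimal sketch, assuming a standard Python environment), confirm which versions the interpreter actually imports:
```bash
# Print the gradio and gradio_client versions seen by the running interpreter
python -c "import gradio, gradio_client; print(gradio.__version__, gradio_client.__version__)"
```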

### nginx and k8 multi-pod support
### nginx and K8s multi-pod support

Gradio 4.x.y fails to support k8 multi-pod use. Basically gradio client on one pod can't reach gradio server on nearby pod. See: https://github.com/gradio-app/gradio/issues/6920 and https://github.com/gradio-app/gradio/issues/7317.
Gradio 4.x.y fails to support K8s multi-pod use. Specifically, the Gradio client on one pod can't reach a Gradio server on a nearby pod. For more information, see https://github.com/gradio-app/gradio/issues/6920 and https://github.com/gradio-app/gradio/issues/7317.

Work-around: Use gradio 3.50.2 and gradio_client 0.6.1 by commenting-in/out relevant lines in `requirements.txt`, `reqs_optional/reqs_constraints.txt`, and comment-out `gradio_pdf` in `reqs_optional/requirements_optional_langchain.txt`, i.e.
Workaround: Use gradio 3.50.2 and `gradio_client` 0.6.1 by commenting in or out the relevant lines in `requirements.txt` and `reqs_optional/reqs_constraints.txt`, and by commenting out `gradio_pdf` in `reqs_optional/requirements_optional_langchain.txt`, i.e.:
```bash
pip uninstall gradio gradio_client gradio_pdf -y
pip install gradio==3.50.2
```
If you see spontaneous crashes via OS killer, then use gradio 3.50.1 instead:
If you experience spontaneous crashes via OS killer, then use gradio 3.50.1 instead:
```bash
pip uninstall gradio gradio_client gradio_pdf -y
pip install gradio==3.50.1
@@ -33,13 +33,13 @@ CUDA error: an illegal memory access was encountered

With the upgrade to llama_cpp_python 0.2.56 for faster performance and other bug fixes, thread safety is worse, so audio streaming and GGUF streaming cannot be done at the same time. See: https://github.com/ggerganov/llama.cpp/issues/3960.

A temporary work-around is present in h2oGPT, whereby XTTS model (not microsoft TTS model) and llama.cpp models are not used at same time. Leads to more delays in streaming for text+audio, but not too bad result.
A temporary workaround is present in h2oGPT, whereby the XTTS model (not the Microsoft TTS model) and llama.cpp models are not used at the same time. This leads to more delays in streaming for text + audio, but not too bad a result.

Other work-arounds:
Other workarounds:

* Work-around 1: Use inference server like oLLaMa, vLLM, gradio inference server, etc. as described [below](FAQ.md#running-ollama-vs-h2ogpt-as-inference-server).
* Workaround 1: Use an inference server such as oLLaMa, vLLM, or the Gradio inference server, as described [below](FAQ.md#running-ollama-vs-h2ogpt-as-inference-server).

* Work-around 2: Follow normal directions for installation, but replace 0.2.56 with 0.2.26, e.g. for CUDA with Linux:
* Workaround 2: Follow the normal installation directions, but replace 0.2.56 with 0.2.26, e.g. for CUDA on Linux:
```bash
pip uninstall llama_cpp_python llama_cpp_python_cuda -y
export LLAMA_CUBLAS=1
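# The rest of this block is not shown in this truncated diff view. As an
# illustrative sketch only (build flags are assumptions; follow the normal
# install docs for the exact command), the pinned version would be installed like:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama_cpp_python==0.2.26 --no-cache-dir
```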
@@ -51,7 +51,7 @@ Other work-arounds:

## Frequently asked questions

### Running oLLaMa vs. h2oGPT as inference server.
### Running oLLaMa vs. h2oGPT as inference server

* Run oLLaMa as a server for the h2oGPT frontend; a minimal sketch follows below.
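
A minimal sketch of the oLLaMa side only (the model name is an arbitrary example; how h2oGPT then attaches to the server via `--inference_server` is covered in the rest of this section):
```bash
# Pull a model and start the oLLaMa server (serves its API on http://localhost:11434 by default)
ollama pull llama3
ollama serve
```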

@@ -271,7 +271,7 @@ ulimit -n 1048576

export H2OGPT_LLAVA_MODEL=http://xxx.xxx.xxx.144:7860/
```
Be careful with gradio and secret files. h2oGPT sets `allowed_paths` to include `.`, unless public instance when `GPT_H2O_AI=1` is set. So if you put your key file in `.` and didn't set to be public instance, it'll be possible to access your key file even if have a soft link to secret location.
Exercise caution with Gradio and secret files. h2oGPT sets `allowed_paths` to include `.`, unless the instance is public (i.e. `GPT_H2O_AI=1` is set). So if you put your key file in `.` and did not set the instance to be public, it will be possible to access your key file, even if it is only a soft link to a secret location.
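
One simple precaution (a hypothetical layout, not something h2oGPT requires) is to keep key files outside the directory h2oGPT is launched from:
```bash
# Hypothetical file name and location: store the key outside `.` entirely
mkdir -p ~/secrets
mv ./my_api_key.txt ~/secrets/my_api_key.txt
```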

Then running:
```
@@ -1008,7 +1008,7 @@ For Twitter, one can right-click on Twitter video, copy video address, then past

For fast performance, one can use `distil-whisper/distil-large-v2` as the model, which is about 10x faster for similar accuracy.

In addition, faster_whisper package can be used if using large v2 or v3, which is about 4x faster and 2x less memory for similar accuracy.
In addition, the `faster_whisper` package can be used with large v2 or v3; it is about 4x faster and uses about 2x less memory for similar accuracy.
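
For instance (a rough sketch: the PyPI package is `faster-whisper`, and the ASR model flag shown is an assumption to be checked against the speech-to-text docs):
```bash
# Install the faster-whisper backend and select a large Whisper model for ASR
pip install faster-whisper
python generate.py --base_model=HuggingFaceH4/zephyr-7b-beta --asr_model=openai/whisper-large-v3
```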

### Voice Cloning

@@ -1427,15 +1427,15 @@ We take care of this for distilgpt2, but other similar models might fail in same

### Adding Models

You can choose any Hugging Face model or quantized GGUF model file in h2oGPT. Hugging Face models are automatically downloaded to the Hugging Face .cache folder (in home folder).
You can choose any Hugging Face model or quantized GGUF model file in h2oGPT. Hugging Face models are automatically downloaded to the Hugging Face `.cache` folder (in home folder).
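
If the home folder is not a good place for large downloads, the cache can be relocated with the standard Hugging Face environment variable (the path below is only an example):
```bash
# Relocate the Hugging Face download cache (example path)
export HF_HOME=/data/hf_cache
```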

#### Hugging Face

Hugging Face models are passed via `--base_model` in all cases, with fine control available via `hf_model_dict`.
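
For example (a minimal sketch; the `hf_model_dict` keys are assumed to be passed through to transformers' `from_pretrained`):
```bash
# Any Hugging Face model id works with --base_model; hf_model_dict adds finer control
python generate.py --base_model=h2oai/h2ogpt-4096-llama2-7b-chat \
    --hf_model_dict="{'trust_remote_code': True}"
```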

#### TheBloke

For models by [TheBloke](https://huggingface.co/TheBloke), h2oGPT tries to automatically handle all types of models (AWQ, GGUF, GGML, GPTQ, with or without safetensors) automatically all passed with `--base_model` only (CLI or UI both). For example, these models all can be passed just with `--base_model` without any extra model options:
For models by [TheBloke](https://huggingface.co/TheBloke), h2oGPT tries to automatically handle all types of models (AWQ, GGUF, GGML, and GPTQ, with or without [safetensors](https://huggingface.co/docs/safetensors/index#safetensors)), all passed using only the `--base_model` option (in both the CLI and UI). For example, the following models can all be passed with just `--base_model` and no additional model options:
```text
python generate.py --base_model=h2oai/h2ogpt-oig-oasst1-512-6_9b
python generate.py --base_model=TheBloke/Xwin-LM-13B-V0.1-GPTQ
@@ -1446,15 +1446,15 @@ python generate.py --base_model=TheBloke/zephyr-7B-beta-AWQ
python generate.py --base_model=zephyr-7b-beta.Q5_K_M.gguf --prompt_type=zephyr
python generate.py --base_model=https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q6_K.gguf?download=true
```
Some are these are non-quantized models with links HF links, some specific files on local disk ending in `.gguf`. Given `TheBloke` HF names, if a quantized model, h2oGPT pulls the recommended model from his repository. You can also provide a resolved web link directly, or a file.
Some of these are non-quantized models referenced by their HF names, and some are specific files on local disk ending in `.gguf`. Given a `TheBloke` HF name for a quantized model, h2oGPT pulls the recommended file from that repository. You can also provide a resolved web link directly, or a file.

Watch out for typos. h2oGPT broadly detects if the URL is valid, but Hugging Face just returns a redirect for resolved links, leading to page containing `Entry not found` if one makes a mistake in the file name, e.g. `https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q6_K.gguffoo`.
Watch out for typos. h2oGPT broadly detects if the URL is valid, but Hugging Face just returns a redirect for resolved links, leading to a page containing `Entry not found` if there is a mistake in the file name, e.g. `https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q6_K.gguffoo`.

For AWQ, GPTQ, we try the required safe tensors or other options, and by default use transformers's GPTQ unless one specifies `--use_autogptq=True`.
For AWQ and GPTQ models, we try the required safetensors or other options, and by default use transformers' GPTQ unless one specifies `--use_autogptq=True`.

#### AWQ & GPTQ

For full control over AWQ, GPTQ models, one can use an extra `--load_gptq` and `gptq_dict` for GPTQ models or an extra `--load_awq` for AWQ models.
For full control over AWQ and GPTQ models, one can use an extra `--load_gptq` and `gptq_dict` for GPTQ models or an extra `--load_awq` for AWQ models.
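
Roughly (a sketch with assumed repository names and flag values; the subsections below give the exact forms):
```bash
# Assumed usage: point h2oGPT at the quantized weights explicitly
python generate.py --base_model=TheBloke/Llama-2-7b-Chat-GPTQ --load_gptq=model --use_safetensors=True --prompt_type=llama2
python generate.py --base_model=TheBloke/zephyr-7B-beta-AWQ --load_awq=model --use_safetensors=True --prompt_type=zephyr
```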

##### GPTQ

@@ -1489,7 +1489,7 @@ For full control (e.g. for non-TheBloke models), use `--base_model=llama` and sp

#### GGUF

GGUF models are supported (can run either CPU and GPU in same install), see installation instructions for installing the separate GPU and CPU packages.
GGUF (GPT-Generated Unified Format) models are supported (they can run on either CPU or GPU in the same install); see the installation instructions for installing the separate GPU and CPU packages.

GGUF using Mistral:
```bash
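# The original command is not shown in this truncated diff view. As an
# illustrative sketch only (file name and prompt_type are assumptions, following
# the zephyr GGUF example earlier in this FAQ):
python generate.py --base_model=https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q5_K_M.gguf --prompt_type=mistral
```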