Getting Started with Arctic

If you want to run Arctic with pure Hugging Face (HF) to explore the model, see the instructions below. For a more performant deployment, we have also provided instructions on using Arctic with vLLM.

Hugging Face

Dependencies

Install the following packages; they can also be found in requirements.txt.

deepspeed>=0.14.2
transformers>=4.39.0
huggingface_hub[hf_transfer]

We highly recommend using hf_transfer to download the Arctic weights; it greatly reduces the time spent waiting for the checkpoint shards to download.
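
For example, here is a minimal sketch of pre-downloading the checkpoint into the local HF cache with hf_transfer enabled (this step is optional; the loading script below will also download the weights on first use):

import os
# enable hf_transfer before any huggingface_hub download calls
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

# pre-fetch all checkpoint shards for the instruct model
snapshot_download(repo_id="Snowflake/snowflake-arctic-instruct")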

Run Arctic Example

Due to the model size, we recommend using a single 8xH100-80GB instance from your favorite cloud provider, such as AWS p5.48xlarge or Azure ND96isr_H100_v5. We have only tested this setup with 8xH100-80GB; however, 8xA100-80GB should also work.

In this example we are using FP8 quantization, provided by DeepSpeed in the backend; FP6 quantization can be used instead by specifying q_bits=6 in the QuantizationConfig. The "150GiB" setting for max_memory is required until DeepSpeed's FP quantization is supported natively as a HFQuantizer, which we are actively working on.

import os
# enable hf_transfer for faster ckpt download
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from deepspeed.linear.config import QuantizationConfig

tokenizer = AutoTokenizer.from_pretrained(
    "Snowflake/snowflake-arctic-instruct",
    trust_remote_code=True
)

quant_config = QuantizationConfig(q_bits=8)

# The 150GiB value is a workaround until we have HFQuantizer support; it must be ~1.9x the available per-GPU memory
model = AutoModelForCausalLM.from_pretrained(
    "Snowflake/snowflake-arctic-instruct",
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto",
    ds_quantization_config=quant_config,
    max_memory={i: "150GiB" for i in range(8)},
    torch_dtype=torch.bfloat16)

messages = [{"role": "user", "content": "What is 1 + 1?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")

outputs = model.generate(input_ids=input_ids, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
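
To try FP6 quantization instead, only the quantization config changes; a minimal sketch, assuming the rest of the loading script above stays the same:

from deepspeed.linear.config import QuantizationConfig

# FP6 quantization; pass this as ds_quantization_config exactly as in the FP8 example
quant_config = QuantizationConfig(q_bits=6)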