
[Model] Support Nemotron models (Nemotron-3, Nemotron-4, Minitron) #6611

Merged
13 commits merged into main on Jul 26, 2024

Conversation

@mgoin (Collaborator) commented Jul 20, 2024

FIX #5722

Based on huggingface/transformers#31699. Nemotron-3 loads and produces reasonable output. Nemotron-4 and the most recently released Minitron work, and their evals can be reproduced.

For CI, a Minitron-4B-Base GSM8k eval has been added to the lm-eval test suite.
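
For reference, a GSM8k run like that can also be reproduced locally through lm-evaluation-harness with its vLLM backend. This is a rough sketch rather than the exact CI invocation (task settings such as the 5-shot count are assumptions):

>>> import lm_eval
>>> results = lm_eval.simple_evaluate(
...     model="vllm",
...     model_args="pretrained=nvidia/Minitron-4B-Base",
...     tasks=["gsm8k"],
...     num_fewshot=5,
... )
>>> print(results["results"]["gsm8k"])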

The architecture is pretty similar to Llama, with these changes (a rough sketch follows the list):

  • There is no gate_proj, just up_proj
  • Normal LayerNorm (with a +1 to the weights) instead of RMSNorm
  • Squared ReLU instead of SwiGLU
  • Adds a rotary_percent to RoPE
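
A minimal PyTorch sketch of what those differences amount to (illustrative only; class and argument names are assumptions, not the code added in this PR):

import torch
import torch.nn as nn
import torch.nn.functional as F

class NemotronLayerNorm1p(nn.Module):
    """Standard LayerNorm that adds 1 to the stored weight at apply time."""
    def __init__(self, hidden_size: int, eps: float = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(hidden_size))
        self.bias = nn.Parameter(torch.zeros(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The +1 on the weight is the only difference from nn.LayerNorm.
        return F.layer_norm(x, x.shape[-1:], self.weight + 1, self.bias, self.eps)

class NemotronMLP(nn.Module):
    """No gate_proj: up_proj -> relu^2 -> down_proj replaces Llama's SwiGLU."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(torch.square(F.relu(self.up_proj(x))))

# rotary_percent is not shown here: it just restricts RoPE to a fraction of each head's dims.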

Collection of checkpoints (Nemotron-3, Nemotron-4, Minitron): https://huggingface.co/collections/mgoin/nemotron-in-vllm-66a151b4240bcd9c28735ec5

Loading nvidia/Minitron-4B-Base:

>>> from vllm import LLM
>>> model = LLM("nvidia/Minitron-4B-Base")
INFO 07-23 14:18:22 llm_engine.py:175] Initializing an LLM engine (v0.5.2) with config: model='nvidia/Minitron-4B-Base', speculative_config=None, tokenizer='nvidia/Minitron-4B-Base', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=nvidia/Minitron-4B-Base, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-23 14:18:26 weight_utils.py:219] Using model weights format ['*.bin']
pytorch_model.bin: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8.38G/8.38G [02:13<00:00, 62.9MB/s]
INFO 07-23 14:20:44 model_runner.py:563] Loading model weights took 7.8059 GB
INFO 07-23 14:20:44 gpu_executor.py:102] # GPU blocks: 31261, # CPU blocks: 2048
INFO 07-23 14:20:45 model_runner.py:851] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-23 14:20:45 model_runner.py:855] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-23 14:20:49 model_runner.py:1052] Graph capturing finished in 3 secs.
>>> model.generate("I love you!")
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  7.51it/s, est. speed input: 30.07 toks/s, output: 120.28 toks/s]
[RequestOutput(request_id=0, prompt='I love you!', prompt_token_ids=[1317, 2624, 1346, 252397], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=' I just wish things were about more than one of us going to be tragically injured', token_ids=(1317, 1722, 5962, 2644, 1671, 1632, 1602, 1742, 1610, 1299, 1512, 2433, 1294, 1354, 133186, 13956), cumulative_logprob=-56.41404923796654, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1721744492.323585, last_token_time=1721744492.323585, first_scheduled_time=1721744492.3276443, first_token_time=1721744492.3389344, time_in_queue=0.004059314727783203, finished_time=1721744492.4604568), lora_request=None)]

Loading nemotron3-8b-base:

>>> from vllm import LLM, SamplingParams
>>> model = LLM("thhaus/nemotron3-8b")
INFO 07-21 00:55:19 llm_engine.py:175] Initializing an LLM engine (v0.5.2) with config: model='thhaus/nemotron3-8b', speculative_config=None, tokenizer='thhaus/nemotron3-8b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=thhaus/nemotron3-8b, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-21 00:55:23 weight_utils.py:219] Using model weights format ['*.safetensors']
INFO 07-21 00:55:26 model_runner.py:563] Loading model weights took 15.9077 GB
INFO 07-21 00:55:27 gpu_executor.py:102] # GPU blocks: 6770, # CPU blocks: 512
INFO 07-21 00:55:28 model_runner.py:851] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-21 00:55:28 model_runner.py:855] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-21 00:55:32 model_runner.py:1052] Graph capturing finished in 4 secs.
>>> output = model.generate("1, 2, 3, ", SamplingParams(temperature=0, max_tokens=100))
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.26s/it, est. speed input: 8.75 toks/s, output: 79.50 toks/s]
>>> print(output)
[RequestOutput(request_id=0, prompt='1, 2, 3, ', prompt_token_ids=[2, 251490, 251525, 251514, 251490, 251527, 251514, 251490, 251556, 251514, 251490], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30', token_ids=(251564, 251514, 251490, 251557, 251514, 251490, 251574, 251514, 251490, 251581, 251514, 251490, 251577, 251514, 251490, 251561, 251514, 251490, 251525, 251521, 251514, 251490, 251525, 251525, 251514, 251490, 251525, 251527, 251514, 251490, 251525, 251556, 251514, 251490, 251525, 251564, 251514, 251490, 251525, 251557, 251514, 251490, 251525, 251574, 251514, 251490, 251525, 251581, 251514, 251490, 251525, 251577, 251514, 251490, 251525, 251561, 251514, 251490, 251527, 251521, 251514, 251490, 251527, 251525, 251514, 251490, 251527, 251527, 251514, 251490, 251527, 251556, 251514, 251490, 251527, 251564, 251514, 251490, 251527, 251557, 251514, 251490, 251527, 251574, 251514, 251490, 251527, 251581, 251514, 251490, 251527, 251577, 251514, 251490, 251527, 251561, 251514, 251490, 251556, 251521), cumulative_logprob=-4.905927623214666, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1721523332.4632323, last_token_time=1721523332.4632323, first_scheduled_time=1721523332.4670587, first_token_time=1721523332.4824336, time_in_queue=0.0038263797760009766, finished_time=1721523333.724725), lora_request=None)]

Loading mgoin/Nemotron-4-340B-Instruct-FP8-Dynamic on 8xA100:

>>> from vllm import LLM, SamplingParams
>>> model = LLM("mgoin/Nemotron-4-340B-Instruct-FP8-Dynamic", tensor_parallel_size=8, distributed_executor_backend="ray", enforce_eager=True)
2024-07-21 03:09:56,222 INFO worker.py:1749 -- Started a local Ray instance.
INFO 07-21 03:09:57 llm_engine.py:175] Initializing an LLM engine (v0.5.2) with config: model='mgoin/Nemotron-4-340B-Instruct-FP8-Dynamic', speculative_config=None, tokenizer='mgoin/Nemotron-4-340B-Instruct-FP8-Dynamic', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/home/mgoin/models/Nemotron-4-340B-Instruct-FP8, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-21 03:10:14 utils.py:779] Found nccl from library libnccl.so.2
INFO 07-21 03:10:14 pynccl.py:63] vLLM is using nccl==2.20.5
(RayWorkerWrapper pid=1233230) INFO 07-21 03:10:14 utils.py:779] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=1233230) INFO 07-21 03:10:14 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-21 03:10:17 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/mgoin/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
INFO 07-21 03:10:17 shm_broadcast.py:240] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fdf89ac73d0>, local_subscribe_port=56965, local_sync_port=40185, remote_subscribe_port=None, remote_sync_port=None)
(RayWorkerWrapper pid=1233230) INFO 07-21 03:10:17 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/mgoin/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
WARNING 07-21 03:10:17 fp8.py:38] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(RayWorkerWrapper pid=1233230) WARNING 07-21 03:10:17 fp8.py:38] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
WARNING 07-21 03:10:35 utils.py:569] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
INFO 07-21 03:10:36 model_runner.py:563] Loading model weights took 40.8975 GB
(RayWorkerWrapper pid=1233718) WARNING 07-21 03:10:45 utils.py:569] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(RayWorkerWrapper pid=1234002) INFO 07-21 03:10:14 utils.py:779] Found nccl from library libnccl.so.2 [repeated 6x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=1234002) INFO 07-21 03:10:14 pynccl.py:63] vLLM is using nccl==2.20.5 [repeated 6x across cluster]
(RayWorkerWrapper pid=1234002) INFO 07-21 03:10:17 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/mgoin/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json [repeated 6x across cluster]
(RayWorkerWrapper pid=1234002) WARNING 07-21 03:10:17 fp8.py:38] Detected fp8 checkpoint. Please note that the format is experimental and subject to change. [repeated 6x across cluster]
(RayWorkerWrapper pid=1233718) INFO 07-21 03:10:46 model_runner.py:563] Loading model weights took 40.8975 GB
INFO 07-21 03:10:54 distributed_gpu_executor.py:56] # GPU blocks: 22421, # CPU blocks: 3640
>>> model.generate("Hello!")
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.19s/it, est. speed input: 1.69 toks/s, output: 13.49 toks/s]
[RequestOutput(request_id=0, prompt='Hello!', prompt_token_ids=[14716, 252397], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='8493045638869919', token_ids=(252377, 252372, 252370, 252365, 252334, 252372, 252366, 252376, 252365, 252377, 252377, 252376, 252370, 252370, 252338, 252370), cumulative_logprob=-38.44283938407898, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1721531506.653993, last_token_time=1721531506.653993, first_scheduled_time=1721531506.6587515, first_token_time=1721531506.8336384, time_in_queue=0.004758596420288086, finished_time=1721531507.8438315), lora_request=None)]

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@mgoin (Collaborator, Author) commented Jul 21, 2024

Able to load Nemotron-4-340B-Instruct with CPU offloading (I had to make a lot of edits to this checkpoint and will upload it):

>>> from vllm import LLM, SamplingParams
>>> model = LLM("/home/mgoin/models/Nemotron-4-340B-Instruct", tensor_parallel_size=8, distributed_executor_backend="ray", cpu_offload_gb=20, enforce_eager=True)
2024-07-21 02:32:18,002 INFO worker.py:1749 -- Started a local Ray instance.
INFO 07-21 02:32:19 llm_engine.py:175] Initializing an LLM engine (v0.5.2) with config: model='/home/mgoin/models/Nemotron-4-340B-Instruct', speculative_config=None, tokenizer='/home/mgoin/models/Nemotron-4-340B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/home/mgoin/models/Nemotron-4-340B-Instruct, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-21 02:32:36 utils.py:779] Found nccl from library libnccl.so.2
INFO 07-21 02:32:36 pynccl.py:63] vLLM is using nccl==2.20.5
(RayWorkerWrapper pid=1139686) INFO 07-21 02:32:36 utils.py:779] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=1139686) INFO 07-21 02:32:36 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-21 02:32:39 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/mgoin/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
INFO 07-21 02:32:39 shm_broadcast.py:240] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fda1b3b33d0>, local_subscribe_port=55417, local_sync_port=53103, remote_subscribe_port=None, remote_sync_port=None)
(RayWorkerWrapper pid=1139686) INFO 07-21 02:32:39 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/mgoin/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
INFO 07-21 02:33:14 model_runner.py:563] Loading model weights took 59.3774 GB
(RayWorkerWrapper pid=1140131) INFO 07-21 02:33:34 model_runner.py:563] Loading model weights took 59.3774 GB
(RayWorkerWrapper pid=1140410) INFO 07-21 02:32:36 utils.py:779] Found nccl from library libnccl.so.2 [repeated 6x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=1140410) INFO 07-21 02:32:36 pynccl.py:63] vLLM is using nccl==2.20.5 [repeated 6x across cluster]
(RayWorkerWrapper pid=1140410) INFO 07-21 02:32:39 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/mgoin/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json [repeated 6x across cluster]
INFO 07-21 02:33:42 distributed_gpu_executor.py:56] # GPU blocks: 4867, # CPU blocks: 3640
>>> model.generate("Hello!")
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:16<00:00, 16.29s/it, est. speed input: 0.12 toks/s, output: 0.98 toks/s]
[RequestOutput(request_id=0, prompt='Hello!', prompt_token_ids=[14716, 252397], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='8493045638869919', token_ids=(252377, 252372, 252370, 252365, 252334, 252372, 252366, 252376, 252365, 252377, 252377, 252376, 252370, 252370, 252338, 252370), cumulative_logprob=-38.362853050231934, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1721529255.4465594, last_token_time=1721529255.4465594, first_scheduled_time=1721529255.4538574, first_token_time=1721529256.6029613, time_in_queue=0.007297992706298828, finished_time=1721529271.7377276), lora_request=None)]

@mgoin (Collaborator, Author) commented Jul 21, 2024

I made an FP8 W8 (weight-quantized) checkpoint based on the above, and it produces the same output. The tokenizer is not right for this model yet, so fixing that is the next step.

>>> from vllm import LLM, SamplingParams
>>> model = LLM("/home/mgoin/models/Nemotron-4-340B-Instruct-FP8", tensor_parallel_size=8, distributed_executor_backend="ray", enforce_eager=True)
2024-07-21 03:09:56,222 INFO worker.py:1749 -- Started a local Ray instance.
INFO 07-21 03:09:57 llm_engine.py:175] Initializing an LLM engine (v0.5.2) with config: model='/home/mgoin/models/Nemotron-4-340B-Instruct-FP8', speculative_config=None, tokenizer='/home/mgoin/models/Nemotron-4-340B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/home/mgoin/models/Nemotron-4-340B-Instruct-FP8, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-21 03:10:14 utils.py:779] Found nccl from library libnccl.so.2
INFO 07-21 03:10:14 pynccl.py:63] vLLM is using nccl==2.20.5
(RayWorkerWrapper pid=1233230) INFO 07-21 03:10:14 utils.py:779] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=1233230) INFO 07-21 03:10:14 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-21 03:10:17 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/mgoin/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
INFO 07-21 03:10:17 shm_broadcast.py:240] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fdf89ac73d0>, local_subscribe_port=56965, local_sync_port=40185, remote_subscribe_port=None, remote_sync_port=None)
(RayWorkerWrapper pid=1233230) INFO 07-21 03:10:17 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/mgoin/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
WARNING 07-21 03:10:17 fp8.py:38] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(RayWorkerWrapper pid=1233230) WARNING 07-21 03:10:17 fp8.py:38] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
WARNING 07-21 03:10:35 utils.py:569] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
INFO 07-21 03:10:36 model_runner.py:563] Loading model weights took 40.8975 GB
(RayWorkerWrapper pid=1233718) WARNING 07-21 03:10:45 utils.py:569] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(RayWorkerWrapper pid=1234002) INFO 07-21 03:10:14 utils.py:779] Found nccl from library libnccl.so.2 [repeated 6x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=1234002) INFO 07-21 03:10:14 pynccl.py:63] vLLM is using nccl==2.20.5 [repeated 6x across cluster]
(RayWorkerWrapper pid=1234002) INFO 07-21 03:10:17 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /home/mgoin/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json [repeated 6x across cluster]
(RayWorkerWrapper pid=1234002) WARNING 07-21 03:10:17 fp8.py:38] Detected fp8 checkpoint. Please note that the format is experimental and subject to change. [repeated 6x across cluster]
(RayWorkerWrapper pid=1233718) INFO 07-21 03:10:46 model_runner.py:563] Loading model weights took 40.8975 GB
INFO 07-21 03:10:54 distributed_gpu_executor.py:56] # GPU blocks: 22421, # CPU blocks: 3640
>>> model.generate("Hello!")
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.19s/it, est. speed input: 1.69 toks/s, output: 13.49 toks/s]
[RequestOutput(request_id=0, prompt='Hello!', prompt_token_ids=[14716, 252397], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='8493045638869919', token_ids=(252377, 252372, 252370, 252365, 252334, 252372, 252366, 252376, 252365, 252377, 252377, 252376, 252370, 252370, 252338, 252370), cumulative_logprob=-38.44283938407898, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1721531506.653993, last_token_time=1721531506.653993, first_scheduled_time=1721531506.6587515, first_token_time=1721531506.8336384, time_in_queue=0.004758596420288086, finished_time=1721531507.8438315), lora_request=None)]
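
The comment above doesn't say how the FP8 checkpoint was produced; a dynamic-FP8 checkpoint of similar shape could be made with llm-compressor roughly as follows (a sketch only, assuming the tool handles the Nemotron architecture; the actual checkpoint here may have been produced by other means):

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = "/home/mgoin/models/Nemotron-4-340B-Instruct"  # the edited checkpoint from above
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# FP8 weights are quantized offline; activations are quantized dynamically at
# runtime, so no calibration data is needed.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

save_dir = "Nemotron-4-340B-Instruct-FP8"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)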

Collaborator
Just curious: how is the architecture different from Llama?

@mgoin (Collaborator, Author) replied Jul 22, 2024

I'm fairly certain it could be implemented inside our Llama implementation. I'm not sure how to deal with the absence of gate_proj though.

The key differences are:

  1. There is no gate_proj, just up_proj
  2. Normal LayerNorm (with a +1 to the weights) instead of RMSNorm
  3. Squared ReLU instead of SwiGLU
  4. Adds a rotary_percent to RoPE (sketched below)

This is a good overview of the main changes: https://twitter.com/danielhanchen/status/1801671106266599770
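
On point 4, a rough sketch of what a rotary_percent does (illustrative only; cos/sin are assumed to be precomputed for the rotated slice, and the real vLLM rotary op is fused and looks different):

import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_partial_rope(q: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                       rotary_percent: float = 0.5) -> torch.Tensor:
    # Only the first rotary_percent of each head's dims receive rotary embeddings;
    # the remaining dims pass through unchanged.
    rot_dim = int(q.shape[-1] * rotary_percent)
    q_rot, q_pass = q[..., :rot_dim], q[..., rot_dim:]
    q_rot = q_rot * cos + rotate_half(q_rot) * sin
    return torch.cat((q_rot, q_pass), dim=-1)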

@mgoin mgoin changed the title [WIP][Model] Support Nemotron [Model] Support Nemotron Jul 23, 2024
@mgoin mgoin marked this pull request as ready for review July 23, 2024 14:39
@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 23, 2024
@mgoin mgoin changed the title [Model] Support Nemotron [Model] Support Nemotron models (Nemotron-3, Nemotron-4, Minitron) Jul 26, 2024
@simon-mo (Collaborator) left a comment

Update documentation as well?

Comment on lines +162 to +165
class ReLUSquaredActivation(CustomOp):
    """
    Applies the relu^2 activation introduced in https://arxiv.org/abs/2109.08668v2
    """
@ywang96 (Member) commented Jul 26, 2024

I think you might also need to add this op to CPU, similar to what I did here previously with QuickGELU.

Per offline discussion, no CPU op should need to be added since it's just calling the torch API. @bigPYJ1151 it would be great if you could confirm that, thanks!
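
For context, the op boils down to two torch calls, which is why the native path should already work on CPU; a sketch (not the exact code in the diff):

import torch
import torch.nn.functional as F

def relu_squared(x: torch.Tensor) -> torch.Tensor:
    # relu(x) ** 2 using plain torch ops, so no dedicated CPU kernel is needed.
    return torch.square(F.relu(x))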

@mgoin mgoin merged commit 07278c3 into main Jul 26, 2024
72 checks passed
cadedaniel pushed a commit to cadedaniel/vllm-public that referenced this pull request Jul 27, 2024
@mgoin mgoin deleted the nemotron-support branch July 27, 2024 15:55
@zhangzx-uiuc
This looks great! Just wondering, has anyone tried serving the bf16 version on two 8xA100 nodes with Ray?

Labels: ready (ONLY add when PR is ready to merge/full CI is needed)
Successfully merging this pull request may close these issues.

[New Model]: Support Nemotron-4-340B
6 participants