
Latest Docker image fails while initializing gemma2 #2275

Open
2 of 4 tasks
jorado opened this issue Jul 22, 2024 · 4 comments

Comments

jorado commented Jul 22, 2024

System Info

I tried the following setups; both fail with the same exception:

  • ghcr.io/huggingface/text-generation-inference:sha-6aebf44 locally with Docker on an NVIDIA RTX 3600
  • ghcr.io/huggingface/text-generation-inference:sha-6aebf44 on a Kubernetes cluster with an NVIDIA A40

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

docker run --gpus all --shm-size 1g -p 8080:80 ghcr.io/huggingface/text-generation-inference:sha-6aebf44 --model-id google/gemma-2-9b-it

2024-07-22T15:30:59.895904Z  INFO download: text_generation_launcher: Successfully downloaded weights for google/gemma-2-9b-it
2024-07-22T15:30:59.897225Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-07-22T15:31:09.917300Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-07-22T15:31:10.682538Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 951, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 653, in __getitem__
    raise KeyError(key)
KeyError: 'gemma2'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 749, in get_model
    return FlashCausalLM(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 878, in __init__
    config = config_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 953, in from_pretrained
    raise ValueError(
ValueError: The checkpoint you are trying to load has model type `gemma2` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
2024-07-22T15:31:11.520122Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

2024-07-22 15:31:01.561 | INFO     | text_generation_server.utils.import_utils:<module>:75 - Detected system cuda
/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 951, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 653, in __getitem__
    raise KeyError(key)
KeyError: 'gemma2'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 749, in get_model
    return FlashCausalLM(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 878, in __init__
    config = config_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 953, in from_pretrained
    raise ValueError(
ValueError: The checkpoint you are trying to load has model type `gemma2` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
 rank=0
Error: ShardCannotStart
2024-07-22T15:31:11.616687Z ERROR text_generation_launcher: Shard 0 failed to start
2024-07-22T15:31:11.616777Z  INFO text_generation_launcher: Shutting down shards

With text-generation-inference:2.1.1 the model loads correctly, even though both images ship the same transformers version.
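
For reference, the failure can be reproduced without the launcher by loading the config the same way TGI does. Below is a minimal sketch, not a verified diagnosis: it assumes the environment has a token with access to the gated google/gemma-2-9b-it repo, and that Gemma 2 support landed in transformers 4.42.0, so older versions reproduce the error:

```python
# Minimal sketch: print the installed transformers version and try to load the
# Gemma 2 config directly, roughly what flash_causal_lm.py does before weights.
# Assumptions: access to the gated google/gemma-2-9b-it repo, and Gemma 2
# support was added in transformers 4.42.0.
import transformers
from transformers import AutoConfig

print(transformers.__version__)

try:
    config = AutoConfig.from_pretrained("google/gemma-2-9b-it")
    print(config.model_type)  # expected: "gemma2"
except (KeyError, ValueError) as err:
    # On an image whose transformers predates Gemma 2, this reproduces the
    # "model type `gemma2` ... not recognize this architecture" error above.
    print(f"config load failed: {err}")
```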

Expected behavior

The model initializes correctly.

@ErikKaum
Member

Thanks for reporting this 👍

There were some issues with Gemma; I think this patch might address this one as well.

Could you confirm if this is the case?


jorado commented Aug 8, 2024

I now get a different error, both in the latest image and in 2.2.0.

2024-08-08T14:13:11.002710Z  INFO text_generation_launcher: Args {
    model_id: "google/gemma-2-9b-it",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: None,
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: None,
    max_total_tokens: None,
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "text-generation-inference2-755f5778bf-k9b86",
    port: 8080,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/data",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
    disable_usage_stats: false,
    disable_crash_reports: false,
}
2024-08-08T14:13:11.002898Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-08-08T14:13:11.175329Z  INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-08-08T14:13:11.175401Z  INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-08-08T14:13:11.175410Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-08-08T14:13:11.175416Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-08-08T14:13:11.175944Z  INFO download: text_generation_launcher: Starting check and download process for google/gemma-2-9b-it
2024-08-08T14:13:14.195643Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-08-08T14:13:15.387056Z  INFO download: text_generation_launcher: Successfully downloaded weights for google/gemma-2-9b-it
2024-08-08T14:13:15.387857Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-08-08T14:13:25.483545Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:13:35.578015Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:13:45.585828Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:13:55.588556Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:14:05.686872Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:14:15.778758Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:14:25.780629Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:14:35.785662Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:14:40.223288Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-08-08T14:14:40.277449Z  INFO shard-manager: text_generation_launcher: Shard ready in 84.886492494s rank=0
2024-08-08T14:14:40.283759Z  INFO text_generation_launcher: Starting Webserver
2024-08-08T14:14:40.320297Z  INFO text_generation_router: router/src/main.rs:228: Using the Hugging Face API
2024-08-08T14:14:40.320346Z  INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-08-08T14:14:40.839350Z  INFO text_generation_router: router/src/main.rs:577: Serving revision 4efc01a1a58107f8c7f68027f5d8e475dfc34a6f of model google/gemma-2-9b-it
2024-08-08T14:14:41.433827Z  INFO text_generation_router: router/src/main.rs:357: Using config Some(Gemma2)
2024-08-08T14:14:41.433853Z  WARN text_generation_router: router/src/main.rs:384: Invalid hostname, defaulting to 0.0.0.0
2024-08-08T14:14:41.439120Z  INFO text_generation_router::server: router/src/server.rs:1572: Warming up model
2024-08-08T14:14:42.767141Z  INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
2024-08-08T14:14:42.772236Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 125, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1196, in warmup
    self.cuda_graph_warmup(bs, max_s, max_bt)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1051, in cuda_graph_warmup
    self.model.forward(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_gemma2_modeling.py", line 490, in forward
    hidden_states = self.model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_gemma2_modeling.py", line 427, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_gemma2_modeling.py", line 355, in forward
    attn_output = self.self_attn(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_gemma2_modeling.py", line 254, in forward
    attn_output = paged_attention(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/attention/cuda.py", line 115, in paged_attention
    raise RuntimeError("Paged attention doesn't support softcapping")
RuntimeError: Paged attention doesn't support softcapping
2024-08-08T14:14:42.995889Z ERROR warmup{max_input_length=4095 max_prefill_tokens=4145 max_total_tokens=4096 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
Error: WebServer(Warmup(Generation("CANCELLED")))
2024-08-08T14:14:43.119606Z ERROR text_generation_launcher: Webserver Crashed
2024-08-08T14:14:43.119690Z  INFO text_generation_launcher: Shutting down shards


ErikKaum commented Aug 9, 2024

Ah, it seems like this one doesn't have softcapping (#2273).
I'd recommend using the latest TGI version.

Would that work for you?
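
For context, Gemma 2's config enables attention logit softcapping, which is exactly what paged_attention in that image rejects during warmup. A quick way to see this on the config itself (a sketch only; the field names below are what I'd expect from the Gemma 2 config, and the repo is gated):

```python
# Sketch: inspect the softcapping settings that trigger the warmup error above.
# Assumptions: access to the gated google/gemma-2-9b-it repo, and that the
# config exposes these field names (taken from the Gemma 2 config as I
# understand it).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("google/gemma-2-9b-it")
print(getattr(config, "attn_logit_softcapping", None))   # attention-logit softcap (e.g. 50.0)
print(getattr(config, "final_logit_softcapping", None))  # final-logit softcap (e.g. 30.0)
# A non-None attn_logit_softcapping is what makes TGI pass a softcap value into
# paged_attention, and the kernel in the failing image raises on it.
```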

@SMAntony

> Ah, it seems like this one doesn't have softcapping (#2273). I'd recommend using the latest TGI version.
>
> Would that work for you?

Can you take a look at #2763?
