Can inference be run on consumer hardware? #8

Open
GrahamboJangles opened this issue Jul 2, 2023 · 8 comments

@GrahamboJangles

GrahamboJangles commented Jul 2, 2023

AMD? CPU? Single GPU?

Is this all possible via FastChat?

@DachengLi1
Owner

@GrahamboJangles It is already in FastChat. https://github.com/lm-sys/FastChat#longchat

We currently test it on a single A100 GPU and it works pretty well. We are adding more support to let it run more efficiently. Let me know whether it works on your hardware, and we can improve the system support!

@GrahamboJangles
Author

@DachengLi1 I have 2 RX6800s, I'm guessing that they are not yet supported?

@DachengLi1
Owner

Regarding the RX series, please see the discussion here. Inference is backed by FastChat, and it seems people have gotten AMD cards working. Can you run the following (there is no load-8-bit support yet):

python3 -m fastchat.serve.cli --model-path lmsys/longchat-7b-16k

and let me know if it works for you? Also feel free to submit an issue in FastChat regarding this.
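
Before launching the CLI, it may help to confirm what PyTorch itself can see, since FastChat relies on it for GPU access. The following is a minimal sketch, not part of FastChat; the function name `describe_torch_backend` is hypothetical. It assumes that a ROCm build of PyTorch exposes AMD GPUs through the `torch.cuda` API and sets `torch.version.hip`, while a CUDA build sets `torch.version.cuda`.

```python
def describe_torch_backend():
    """Return a one-line summary of the GPU backend PyTorch was built for.

    Hypothetical helper for illustration only; it is not part of FastChat.
    """
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"

    # ROCm builds set torch.version.hip; CUDA builds set torch.version.cuda.
    hip = getattr(torch.version, "hip", None)
    cuda = getattr(torch.version, "cuda", None)

    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__}: no GPU visible to PyTorch"

    # On ROCm builds, AMD GPUs are also listed through the torch.cuda API.
    names = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
    build = f"ROCm/HIP {hip}" if hip else f"CUDA {cuda}"
    return f"PyTorch {torch.__version__} ({build}): {', '.join(names)}"


if __name__ == "__main__":
    print(describe_torch_backend())
```

If this reports no visible GPU, the problem is likely in the PyTorch installation rather than in FastChat or the model.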

@GrahamboJangles
Author

@DachengLi1 thank you for your help and quick responses.

I ran that command and this was the output:

python -m fastchat.serve.cli --model-path longchat-7b-16k

Traceback (most recent call last):
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\transformers\utils\import_utils.py", line 1146, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\transformers\models\llama\modeling_llama.py", line 31, in <module>
    from ...modeling_utils import PreTrainedModel
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\transformers\modeling_utils.py", line 83, in <module>
    from accelerate import __version__ as accelerate_version
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\accelerate\__init__.py", line 7, in <module>
    from .accelerator import Accelerator
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\accelerate\accelerator.py", line 33, in <module>
    from .tracking import LOGGER_TYPE_TO_CLASS, GeneralTracker, filter_trackers
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\accelerate\tracking.py", line 45, in <module>
    import wandb
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\wandb\__init__.py", line 26, in <module>
    from wandb import sdk as wandb_sdk
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\wandb\sdk\__init__.py", line 5, in <module>
    from . import wandb_helper as helper  # noqa: F401
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\wandb\sdk\wandb_helper.py", line 6, in <module>
    from .lib import config_util
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\wandb\sdk\lib\config_util.py", line 7, in <module>
    from wandb.util import load_yaml
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\wandb\util.py", line 52, in <module>
    import sentry_sdk  # type: ignore
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\sentry_sdk\__init__.py", line 1, in <module>
    from sentry_sdk.hub import Hub, init
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\sentry_sdk\hub.py", line 8, in <module>
    from sentry_sdk.scope import Scope
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\sentry_sdk\scope.py", line 7, in <module>
    from sentry_sdk.utils import logger, capture_internal_exceptions
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\sentry_sdk\utils.py", line 887, in <module>
    HAS_REAL_CONTEXTVARS, ContextVar = _get_contextvars()
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\sentry_sdk\utils.py", line 857, in _get_contextvars
    if not _is_contextvars_broken():
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\sentry_sdk\utils.py", line 791, in _is_contextvars_broken
    import gevent  # type: ignore
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\gevent\__init__.py", line 86, in <module>
    from gevent._hub_local import get_hub
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\gevent\_hub_local.py", line 101, in <module>
    import_c_accel(globals(), 'gevent.__hub_local')
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\gevent\_util.py", line 148, in import_c_accel
    mod = importlib.import_module(cname)
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "src\\gevent\\_hub_local.py", line 1, in init gevent._gevent_c_hub_local
ValueError: greenlet.greenlet size changed, may indicate binary incompatibility. Expected 160 from C header, got 40 from PyObject

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\fastchat\serve\cli.py", line 26, in <module>
    from fastchat.model.model_adapter import add_model_args
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\fastchat\model\__init__.py", line 1, in <module>
    from fastchat.model.model_adapter import (
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\fastchat\model\model_adapter.py", line 16, in <module>
    from transformers import (
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\transformers\utils\import_utils.py", line 1137, in __getattr__
    value = getattr(module, name)
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\transformers\utils\import_utils.py", line 1136, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "E:\Coding_and_Scripting\Pyhon3.10.4\lib\site-packages\transformers\utils\import_utils.py", line 1148, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
greenlet.greenlet size changed, may indicate binary incompatibility. Expected 160 from C header, got 40 from PyObject
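
The traceback above fails while importing `gevent`, reached through the chain `transformers` → `accelerate` → `wandb` → `sentry_sdk` → `gevent`, so the error comes from the Python environment rather than from the model code. A "size changed, may indicate binary incompatibility" message usually means a compiled extension (here, gevent's) was built against a different version of `greenlet` than the one installed. As a first diagnostic step, the sketch below lists the installed versions of the packages in that chain using only the standard library; the helper name `installed_versions` is hypothetical.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Packages that appear in the import chain of the traceback.
    chain = ["greenlet", "gevent", "sentry-sdk", "wandb", "accelerate", "transformers"]
    for name, version in installed_versions(chain).items():
        print(f"{name}: {version or 'not installed'}")
```

A commonly suggested remedy, offered here as an assumption rather than a confirmed fix for this report, is to reinstall `greenlet` and `gevent` together (for example `pip install --upgrade --force-reinstall greenlet gevent`) so that the compiled extension and the library match again.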

@DachengLi1
Owner

@GrahamboJangles Thanks for trying it out! Can you submit this as an issue to FastChat? I will also ask the FastChat team to look into it there.

@GrahamboJangles
Author

@DachengLi1 Absolutely! Thanks again for your help.

@sejalchopra97

@DachengLi1 I was trying to run inference with LongChat-7b-16k on an A100 machine with a 40GB GPU, and I got a CUDA out-of-memory error. The input texts, read from a parquet file, were around 9k tokens each. Can you share the roadmap for upcoming efficiency gains, and any ETA, so that I can run inference with fewer resources?

@DachengLi1
Owner

@sejalchopra97 For now you can run 9k tokens with flash attention support (but that path does not support the KV cache, so it will be slow). We just got a member working on it on the vLLM side; once she gets it done, we can update here. @LiuXiaoxuanPKU, let me know if you have any suggestions!
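
To see why a 9k-token prompt can strain a 40GB GPU, and why flash attention helps, a back-of-envelope estimate is useful. The sketch below is an illustration under stated assumptions, not a measurement: it assumes the LLaMA-7B shape (32 layers, 32 attention heads, hidden size 4096, roughly 6.7 billion parameters), fp16 values, batch size 1, and a 9,000-token input. It ignores other activations, allocator fragmentation, and any fp32 upcasting, all of which add to the real footprint.

```python
GIB = 1024 ** 3


def attention_scores_bytes(seq_len, num_heads, bytes_per_value=2):
    """Size of one layer's full seq_len x seq_len score matrix across all heads."""
    return num_heads * seq_len * seq_len * bytes_per_value


def kv_cache_bytes(seq_len, num_layers, hidden_size, bytes_per_value=2):
    """Size of the cached keys and values (hence the factor 2) for all layers."""
    return 2 * num_layers * seq_len * hidden_size * bytes_per_value


def weights_bytes(num_params, bytes_per_value=2):
    """Size of the model weights at the given precision."""
    return num_params * bytes_per_value


if __name__ == "__main__":
    # Assumed LLaMA-7B shape and a 9,000-token prompt (see the note above).
    seq_len, layers, heads, hidden = 9_000, 32, 32, 4096
    params = 6_700_000_000  # approximate parameter count

    print(f"weights:            {weights_bytes(params) / GIB:.1f} GiB")
    print(f"KV cache:           {kv_cache_bytes(seq_len, layers, hidden) / GIB:.1f} GiB")
    print(f"scores (one layer): {attention_scores_bytes(seq_len, heads) / GIB:.1f} GiB")
```

Under these assumptions, the score matrix of a single layer is larger than the whole KV cache, because it grows quadratically with sequence length. Flash attention avoids materializing that matrix in full, which is consistent with the suggestion above that 9k tokens can fit when it is enabled.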
