
Enable common device abstraction for 8bits/4bits #898

Conversation


@jianan-gu commented Dec 5, 2023

As part of the plan in RFC #894, this PR adds a device abstraction that allows support for non-CUDA devices to be added to bitsandbytes (for the 8-bit/4-bit functionality) in a common and straightforward way.

To create the device backend abstraction, this PR introduces a common backend folder containing the key kernel interface class that each backend implements, a backend registration interface for adding new device backends, and a kernel dispatching mechanism.

Key interfaces from bitsandbytes.functional that are in scope for 8-bit and 4-bit:
| F.igemmlt | F.double_quant | F.mm_dequant | F.transform | F.extract_outliers | F.quantize_4bit | F.dequantize_4bit |

[Figure: overview of the proposed backend abstraction]

These common backends will then be used in bitsandbytes.functional to implement the 8-bit/4-bit functionality.

Currently, the only backend is CUDA, the same as without this PR.
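For readers unfamiliar with the pattern, here is a minimal sketch of what such a backend abstraction could look like. Class, method, and registry names are illustrative and do not mirror the exact code in this PR.

```python
# Illustrative sketch only: names and signatures are hypothetical,
# not the exact interfaces introduced by this PR.
from abc import ABC, abstractmethod

import torch


class Backend(ABC):
    """Kernel interface each device backend implements for the 8-bit/4-bit ops."""

    @abstractmethod
    def quantize_4bit(self, A: torch.Tensor, blocksize: int = 64, quant_type: str = "fp4"):
        ...

    @abstractmethod
    def dequantize_4bit(self, A: torch.Tensor, quant_state, blocksize: int = 64):
        ...

    # ... likewise for igemmlt, double_quant, mm_dequant, transform, extract_outliers


_backends: dict = {}  # device type ("cuda", "cpu", ...) -> Backend instance


def register_backend(device_type: str, backend: Backend) -> None:
    """Registration interface: vendors add their backend here."""
    _backends[device_type] = backend


def get_backend(device_type: str) -> Backend:
    """Dispatch mechanism: look up the backend for a given device type."""
    if device_type not in _backends:
        raise NotImplementedError(f"No bitsandbytes backend registered for '{device_type}'")
    return _backends[device_type]
```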

@jgong5

jgong5 commented Dec 5, 2023

@yao-matrix @jiqing-feng

@jianan-gu
Author

Hi @TimDettmers,
Could you please review this PR and share any comments or suggestions? Thanks!

@Titus-von-Koeller
Collaborator

Thanks for your contribution! We'll look into it and get back to you soon.

@TimDettmers
Collaborator

Thank you so much for this contribution. We discussed internally how to best integrate this and other libraries. We think it is best to abstract the extern C interface so that none (or very little) of the Python code needs to be changed. This has the advantage that no new tests would need to be written to test the functionality of other devices.

In the coming weeks, we will work on splitting the extern C "god interface" into more manageable sub-interfaces for (1) 4-bit, (2) 8-bit, and (3) 8-bit optimizers. Each vendor can then implement this specific extern C interface to provide the desired sub-functionality.
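As a rough illustration of the idea (not an actual bitsandbytes API): a vendor could ship a shared library exposing only the sub-interface it supports, and the Python side would probe for it, e.g. via ctypes. The library and symbol names below are made up.

```python
# Hypothetical sketch of probing a vendor library for a 4-bit sub-interface.
# The library name and C symbols are invented for illustration.
import ctypes as ct

lib = ct.cdll.LoadLibrary("libbitsandbytes_vendor.so")  # hypothetical vendor build

# A sub-interface is simply a set of C entry points the vendor chooses to provide.
HAS_4BIT = all(hasattr(lib, sym) for sym in ("vendor_quantize_4bit", "vendor_dequantize_4bit"))
HAS_8BIT_OPTIM = hasattr(lib, "vendor_optimizer_update_8bit")


def quantize_4bit_via_c(a_ptr, absmax_ptr, out_ptr, blocksize: int, n: int) -> None:
    # Thin wrapper over one entry point of the (hypothetical) 4-bit sub-interface.
    lib.vendor_quantize_4bit(a_ptr, absmax_ptr, out_ptr, ct.c_int32(blocksize), ct.c_int32(n))
```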

What do you think about this approach? @rickardp, @abhilash1910, and @arlo-phoenix: this is also relevant for your integrations. Please give us feedback on the extern C approach so we can work out the details.

@TimDettmers added the "high priority (first issues that will be worked on)" and "waiting for info" labels Jan 2, 2024
@rickardp
Contributor

rickardp commented Jan 2, 2024

(Quoting @TimDettmers's comment above in full.)

That sounds like a good approach to me. The only thing I remember is that there is currently some specific code around device selection. Not sure if it is still needed or not.

@jiqing-feng
Contributor

jiqing-feng commented Mar 11, 2024

Hi @Titus-von-Koeller. It's exciting to see that BNB is going to support more platforms! Either device abstraction design is acceptable to us as long as one can be applied soon, so we can start the follow-up work to optimize the CPU path. Thanks : )

@abhilash1910
Contributor

I think the common device abstraction would be applicable in this case, although a hybrid combination of solutions will ultimately be needed. This is because, from the compiler side, the CUDA runtime should be agnostic to the SYCL runtime (although common headers/utility methods should be shared). Maybe separate compiler-specific kernels could be linked to separate pybinds/CMake targets while keeping the interface common? See the progress in #1077.

@Titus-von-Koeller
Collaborator

@jiqing-feng Just to give you a short interim update: I've spent quite some time looking into the matter this week and am in the middle of discussions with Tim about this. There are still a few open questions. I'll get back to you with something concrete ASAP.

@TimDettmers
Collaborator

Cross-posted reply to #898

  1. Device common abstraction interface

The four sub-interfaces look good to me. We probably want to create higher-level abstractions for 4-bit quantization/dequantization, and some other functions, such as percentile clipping and the non-tensor-core 8-bit matrix multiplication, should be deprecated.

It seems more people like the tensor-driven dispatch. I find it difficult to read and debug, and it can lead to more accidental bugs compared to a singleton. There are also further problems. For example, for a paged tensor, the device is cpu but it needs to be executed on a cuda device. We can create special branching functions for this case. But then we already have a special branch for AMD and paged tensors. A singleton would have no branching at all, and bugs can be avoided if the user forgets to push the tensor to the correct device. I think a singleton is far superior for reducing bugs and enhancing the user experience.

For a tensor type dispatch, how would you handle the case where the user uses the wrong device? If a tensor is on the CPU, I guess it would silently be executed on the CPU (if supported). This would be slow without the user knowing that something is wrong, and we have no way of detecting that something is wrong because we are just dispatching based on device type.
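For concreteness, a minimal sketch of tensor-driven dispatch with a special branch for paged tensors (a hypothetical helper, not code from this PR):

```python
# Hypothetical dispatch helper: choose a backend from the input tensor's device,
# with an explicit branch for paged tensors, which report device "cpu" but must
# execute on CUDA.
import torch

_backends = {}  # device type -> backend implementation (as in the registry sketch above)


def dispatch_backend(A: torch.Tensor):
    if getattr(A, "is_paged", False):
        return _backends["cuda"]          # special case: paged tensors run on CUDA
    if A.device.type not in _backends:
        raise NotImplementedError(f"No backend for device '{A.device.type}'")
    return _backends[A.device.type]       # plain device-type dispatch otherwise
```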

  2. Device-specific functions/class

Some functions specific to the device would be specializations in the base class and would not be provided in the interface. So these should be "private" functions for the base class.

I think a runtime singleton would be overkill. The case where someone has multiple accelerators of different device types would be very rare (probably only developers have this). We should have a special manual hack for this rather than degrading the overall system for this use case. So, a static one-time singleton would be preferable from my view.
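A static, one-time singleton as described could be as simple as the following sketch (the selection logic is illustrative):

```python
# Illustrative static singleton: the backend is chosen once at import/first use
# and reused everywhere, with no per-call branching.
import torch

_BACKEND = None


def get_singleton_backend():
    global _BACKEND
    if _BACKEND is None:
        # One-time choice; a manual override would be needed for the rare
        # multi-accelerator-type setups mentioned above.
        _BACKEND = "cuda" if torch.cuda.is_available() else "cpu"
    return _BACKEND
```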

  3. Device common tool functions/classes

Agreed. I am not sure if all device types would work the same way. For example, does get_ptr() generalize to all devices? It might be a special case though, and if something is wrong we can fix it later. So I think moving everything to something like utils.py might work well.

I've noticed that the other implementation, #1077, which has the same scope, has just started to be reviewed. Have you decided which design will be applied? It would be nice if we could get your feedback soon; we would love to help make progress on both designs. Thx!

The PR #1107 was based on a discussion that I had with @Titus-von-Koeller. It was my preferred view at the time, but we should discuss what would be best for everyone and decide based on that.

Cross-posted reply to #1077

I think there's already a need to track a device index, so it seems feasible to add a device type, unless there's something I'm misunderstanding. E.g. I already see the signature void cprefetch(void *ptr, size_t bytes, int device) along with a page_deviceid property. Would we just be able to look at tensor.is_paged to resolve that?

I think this might be the way to go. If we use a singleton we do not need this, but for a dispatch scenario we need to do this and it is probably the best solution to do an if-else on tensor.is_paged.
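In code, that if-else could look roughly like the sketch below; is_paged and page_deviceid are the tensor attributes referenced above, while the helper name is made up.

```python
# Hypothetical helper: resolve the execution device for dispatch, treating
# paged tensors as CUDA tensors via their page_deviceid.
import torch


def resolve_execution_device(t: torch.Tensor) -> torch.device:
    if getattr(t, "is_paged", False):
        # Paged buffers report "cpu" but carry the CUDA device index they
        # are meant to execute on.
        return torch.device("cuda", getattr(t, "page_deviceid", 0))
    return t.device
```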

On the other hand, even with NVIDIA's "virtual UMA", wouldn't it be possible to first allocate the tensor on the device it is meant to go to, and then access it from the CPU?

This would be more general, and we could even abstract it to disk -> unified memory architectures like Apple silicon. It requires more work, but would be a great feature for Apple silicon to work with larger models on smaller Macbooks and the iPhone etc. The main problem here is that it requires quite a bit of work, but I think it could well be worth it to create this abstraction as it is more general and quite powerful (it can be used beyond optimizers, for example also for matrix multiplication / weights).

My main concern with this vs the original PR is the singleton backend. It feels like a fundamental architectural decision. I say this because I think, in addition to the actual code changes, assumptions about this will be made in many places (as well as in code using this library, making backtracking on this decision "breaking"). Is it certain that this is not coming back to bite us in the end? Are we sure we do not want to support heterogeneous device setups?

I think this is the main advantage of a non-static singleton. I would optimize for the user side rather than the developer side. A singleton can prevent a lot of accidental bugs and enhances the user experience for cases where the user forgot to move to the correct device. This can become a problem in particular if we support CPU devices.

I think the main discussion point should be which architectural design will give the best user experience. That is really what I want to optimize for with the design. Let's discuss.

PS: @Titus-von-Koeller has more details on the sub-interfaces and what will be deprecated; we will post them soon.

@jianan-gu
Author

Thanks for your replies! :) @TimDettmers

Cross-posted reply to #898

  1. Device common abstraction interface

The four sub-interfaces look good to me. We probably want to create higher-level abstractions for 4-bit quantization/dequantization, and some other functions, such as percentile clipping and the non-tensor-core 8-bit matrix multiplication, should be deprecated.

Got it; we will follow these four sub-interfaces first.

It seems more people like the tensor-driven dispatch. I find it difficult to read and debug, and it can lead to more accidental bugs compared to a singleton. There are also further problems. For example, for a paged tensor, the device is cpu but it needs to be executed on a cuda device. We can create special branching functions for this case. But then we already have a special branch for AMD and paged tensors. A singleton would have no branching at all, and bugs can be avoided if the user forgets to push the tensor to the correct device. I think a singleton is far superior for reducing bugs and enhancing the user experience.

Yes, when dispatching on the wrong device, a singleton is clearer than tensor-driven dispatch. However, is it appropriate for BNB to assume which devices are wrong or correct? For example, for Hugging Face users, choosing and setting the correct device for the BNB runtime is already taken care of by the transformers library (to(device)), and BNB serves as a quantization backend supporting multiple platforms. Besides, special branching functions for cases like paged tensors are no extra effort for tensor-driven dispatch, since a singleton also has to take care of them (whether it branches on them or not).

For a tensor type dispatch, how would you handle the case where the user uses the wrong device? If a tensor is on the CPU, I guess it would silently be executed on the CPU (if supported). This would be slow without the user knowing that something is wrong, and we have no way of detecting that something is wrong because we are just dispatching based on device type.

Any backend library (like PyTorch) that supports multiple platforms faces the same question. The key here is whether it is necessary, or whether we even have the ability, to define the wrong or correct device for users from the point of view of a library serving multiple backends. This is also the main limitation of using a singleton: when importing and initializing BNB, we do not know which device users intend to run on in order to initialize the backend singleton.

Is there a good approach to handling this limitation of initializing the backend as a singleton? PR #1107 implements a one-time static init, which seems insufficient.
For example, on a CPU+CUDA system, we would assume (what if the assumption is not right?) that CUDA is the correct device and initialize the singleton with CUDA in BNB. Then, if a user (having forgotten to push tensors to the CUDA device) runs on CPU, BNB stops them and reports that CPU is wrong; if the user did intend to check CPU performance here, that also hurts the user experience.

Tensor-driven dispatch is clear from the point of view of serving as a quantization backend that supports multiple platforms, for the common use case of calling BNB from the transformers library via the to(device) information. Besides, to improve the user experience, the scope of tensor-driven dispatch should ensure there is no conflict where input tensors are on device A (cpu) but have to be executed on device B (cuda), and we could add checks to catch any potential conflicts (see the sketch below).
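A sketch of the kind of check mentioned here, assuming a hypothetical helper that validates all tensor arguments before dispatch:

```python
# Hypothetical consistency check: refuse to dispatch if the input tensors sit on
# different devices, instead of silently running on an unintended one.
import torch


def check_same_device(*tensors: torch.Tensor) -> torch.device:
    devices = {t.device for t in tensors if t is not None}
    if not devices:
        raise ValueError("check_same_device() requires at least one tensor")
    if len(devices) > 1:
        raise RuntimeError(
            f"Input tensors are on different devices: {sorted(str(d) for d in devices)}. "
            "Move them to a single device before calling bitsandbytes."
        )
    return devices.pop()
```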

  2. Device-specific functions/class

Some functions specific to the device would be specializations in the base class and would not be provided in the interface. So these should be "private" functions for the base class.

I think a runtime singleton would be overkill. The case where someone has multiple accelerators of different device types would be very rare (probably only developers have this). We should have a special manual hack for this rather than degrading the overall system for this use case. So, a static one-time singleton would be preferable from my view.

Got it, so for the scope of device-specific functions/class, we would go with a static one-time singleton.

  3. Device common tool functions/classes

Agreed. I am not sure if all device types would work the same way. For example, does get_ptr() generalize to all devices? It might be a special case though, and if something is wrong we can fix it later. So I think moving everything to something like utils.py might work well.

Got it, thanks; we will refactor utils.py to hold the device-common tool functions/classes.

@jiqing-feng
Contributor

jiqing-feng commented Mar 19, 2024

I agree. BNB sits at the same level as PyTorch, since both are fundamental libraries that operate at the op level. Here are my opinions:

  1. It would be more acceptable for users if we followed the PyTorch design.
  2. If we want to support data and execution on different devices, entirely decided by users, we might need two singletons: one for the data device and one for the execution device, which would be more confusing to users.
  3. In most cases, executing the op on the data device carries no risk. Only a few ops, like paged buffers, need multi-device support; we had better handle those inside the API and report them to users. Users need to be explicitly aware of these multi-device ops so they know what they are doing and whether they used the correct device.

In conclusion, following the PyTorch design (illustrated below) is the safest choice for us, and we can apply it without much struggle. I would like to hear opinions from all of you @TimDettmers @Titus-von-Koeller @rickardp @akx @abhilash1910 @jgong5 @matthewdouglas @arlo-phoenix @younesbelkada
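For reference, this is the PyTorch behavior being appealed to: ops execute on the device their inputs live on, and mixing devices raises an explicit error rather than silently falling back.

```python
import torch

a = torch.randn(4, 4)                     # CPU tensors -> the matmul runs on CPU
b = torch.randn(4, 4)
print((a @ b).device)                     # cpu

if torch.cuda.is_available():
    a_gpu, b_gpu = a.to("cuda"), b.to("cuda")
    print((a_gpu @ b_gpu).device)         # cuda:0
    # a_gpu @ b                           # would raise a device-mismatch RuntimeError
```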

@jiqing-feng
Contributor

jiqing-feng commented Mar 25, 2024

Hi @Titus-von-Koeller @TimDettmers

Referring to the comment: I am glad we finally agreed and decided to apply tensor-driven dispatch.

This PR would be a good start.

Do you think we should merge this PR, since it has been open and under discussion for 4 months?
I think the best way to figure out the best method is to develop while using it, rather than just talking about it.

Hope we can see the progress soon. Thanks to all of you : )

@matthewdouglas
Member

I need to go back and review the changes in the PR again, but in general I just want to be sure we're not making changes to the public API that could cause any problems.

Some of this may be redundant here, but here are my notes on what I'd consider the main part of the "public" API. I'm focusing only on bitsandbytes.functional as F, as this is where I see most of the changes in this PR.

  • F.batched_igemm
  • F.dequantize_4bit
    • F.dequantize_fp4, F.dequantize_nf4
  • F.dequantize_blockwise
  • F.double_quant
  • F.estimate_quantiles
  • F.extract_outliers
  • F.gemv_4bit
  • F.igemm
  • F.igemmlt
  • F.mm_dequant
  • F.quantize_4bit
    • F.quantize_fp4, F.quantize_nf4
  • F.quantize_blockwise
  • F.QuantState - I noticed this one is moved in the PR, so we need to be careful about that.

Then there are these, which are more internal/backend-specific. I think they are probably safe to move around, but it may be worth re-exporting them from bitsandbytes.functional with a deprecation warning if that's going to be done (see the sketch at the end of this comment).

  • F.check_matmul
  • F.elementwise_func
    • F.fill, F.arange, F._mul
  • F.get_colrow_absmax
  • F.get_paged
  • F.get_ptr - I see some discussion around applicability/portability. Is on the RTD page.
  • F.get_transform_buffer
  • F.get_transform_func
  • F.histogram_scatter_add_2d
  • F.is_on_gpu
  • F.nvidia_transform
  • F.pipeline_test
  • F.post_call
  • F.pre_call
  • F.prefetch_tensor
  • F.transform
  • F.CUBLAS_Context
  • F.Cusparse_Context

Some that have appeared in public documentation:

  • F.create_dynamic_map
  • F.create_linear_map
  • F.create_quantile_map
  • F.dequantize
    • F.dequantize_no_absmax
  • F.optimizer_update_32bit
  • F.optimizer_update_8bit
  • F.optimizer_update_8bit_blockwise
  • F.percentile_clipping
  • F.quantize
    • F.quantize_no_absmax

Some that I'm not too sure about where they fit:

  • F.dequant_min_max
  • F.get_4bit_type
  • F.spm_coo
  • F.spm_coo_very_sparse
  • F.COOSparseTensor
  • F.CSRSparseTensor
  • F.CSCSparseTensor

Not every backend necessarily has to implement everything in bitsandbytes.functional to be useful, nor should it need to implement every optimization under the hood (e.g. F.igemmlt vs. F.igemm vs. F.batched_igemm). The main thing is to continue to expose the same Python interface for compatibility.
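If things do get moved, a deprecation-friendly re-export could look like the sketch below (module-level __getattr__ per PEP 562; the new location shown is hypothetical).

```python
# Hypothetical sketch for bitsandbytes/functional.py: keep old names importable
# while warning that they have moved. The target module path is illustrative.
import warnings


def __getattr__(name):  # PEP 562: called for attributes not found in the module
    if name == "get_ptr":
        warnings.warn(
            "bitsandbytes.functional.get_ptr has moved; please import it from "
            "bitsandbytes.utils instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        from bitsandbytes.utils import get_ptr  # hypothetical new location
        return get_ptr
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```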

@Titus-von-Koeller changed the base branch from main to multi-backend-refactor March 27, 2024 18:21
@Titus-von-Koeller
Collaborator

Titus-von-Koeller commented Mar 27, 2024

Hi @jiqing-feng,

I'd like to clarify that the decision to go with tensor-driven dispatch is not final; there have been arguments for and against it across the community. I've personally tried championing it with Tim, which you can see in his last response in this PR.

Tim is the owner of bitsandbytes and will need to give the final OK once we have concretized the approach sufficiently. I’m confident that we will accomplish this, but we still need to put in significant work before then.


To summarize, I do think it makes sense to go forward with a tensor-driven dispatch, and we should work together to make sure this refactor doesn’t break any existing functionality and that we're all happy with the outcome. To this end, I feel we need to collaborate on a series of follow-up PRs.


Let me explain:

As BNB's lead maintainer, my absolute main concern is stability above everything else, and I still have concerns on that front which need addressing before this can be fully finalized.

Due to the complexity of the codebase and the knowledge it requires, and the fact that this introduces a fundamental design change, we need to take it slowly so I can fully understand the changes and their implications.

I’m very happy with the work you’ve already done on this PR and therefore I will merge it. However, the merge will be to a distinct branch off of main to make sure we can work on this without affecting BNB's releasability. This branch is called multi-backend-refactor.

Please resolve any conflicts between this branch and multi-backend-refactor, so that we can go ahead with the merge.

We’ll merge changes to main back into multi-backend-refactor regularly and, whenever we’re at a place where we’re certain the current status of the refactor is non-disruptive to main, we’ll also merge multi-backend-refactor into main to avoid the branches deviating too much from each other. When there’s no more work to be done on the underlying device abstraction, we’ll delete multi-backend-refactor altogether.


Until then, all backend related PRs should be addressed at the multi-backend-refactor branch. PRs to multi-backend-refactor should be as atomic as possible, i.e. focused on a single type of improvement; so that they are easy to review and discuss in isolation.


I'm still working out details of what I believe needs changing, which I will share with you shortly in this GitHub Discussion, where I would like us to centralize all general discussion around the refactoring as much as possible. My feedback will also address further comments from @jianan-gu, @matthewdouglas, and you.

I’m looking forward to further collaboration with the community and to making sure we enable the multi-backend support as soon as we can. To this end, we would like to ask Intel to allocate an engineer to work with us on setting up a BNB GitHub runner with Intel hardware (cc @yao-matrix). We’re currently also setting up NVIDIA GPU-enabled runners, and this will be crucial for the repo to get immediate feedback on code correctness across platforms.

@Titus-von-Koeller
Collaborator

Titus-von-Koeller commented Mar 27, 2024

P.S. Be sure to install the pre-commit hooks as they will make sure formatting is applied as specified in the repo.

This likely accounts for the majority of the merge conflicts, due to the recently merged PR #1081.

@jiqing-feng
Contributor

jiqing-feng commented Mar 28, 2024

(Quoting @Titus-von-Koeller's comment above in full.)

Thanks! Merging into multi-backend-refactor is okay; we can build from source to enable it. I've already tested it on language models and found no breakage or regression. BTW, do you think we should release a pip package for the multi-backend branch after the CPU ops are done? It would be better to have a preview package so users can try it and we can find more issues before the official release.

@jianan-gu
Author

jianan-gu commented Mar 28, 2024

(Quoting @matthewdouglas's comment above in full.)

Hi, thanks for your comments; we will cross-check them against the scope we listed as a follow-up. Also, QuantState is linked back into bitsandbytes.functional for compatibility.

@jiqing-feng
Contributor

Hi @Titus-von-Koeller. Could you please check whether there is anything we need to change before merging? Thx!

@Titus-von-Koeller merged commit 2ffa367 into bitsandbytes-foundation:multi-backend-refactor Apr 3, 2024
2 checks passed
@Verdagon

Verdagon commented Apr 3, 2024

Thanks for all your hard work on this @jianan-gu! Really excited for this, and looking forward to what it might unlock down the line =)

@Titus-von-Koeller
Collaborator

BTW, do you think we should release a pip package for the multi-backend branch after the CPU ops are done? It would be better to have a preview package so users can try it and we can find more issues before the official release.

Yes, I think we should publish nightly releases of both main and multi-backend-refactor. I'm looking into this.

Thanks for all your hard work on this @jianan-gu!

Yes, agreed, thank you very much ❤️
