llama : offload to RPC in addition to other backends #7640
Conversation
The way this is supposed to work is that backends need to check the buffer type of the tensors to determine if they can perform the copy, and return false if they cannot.
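For illustration, a minimal sketch of that idea for a hypothetical "mydev" backend (the callback name and signature are paraphrased from the ggml-backend buffer interface and may differ between versions): the copy callback checks the buffer type of the source tensor, handles the cases it understands, and returns false otherwise so the caller can fall back to a copy staged through host memory.

#include "ggml.h"
#include "ggml-backend.h"

// Sketch only (hypothetical "mydev" backend): decide whether this buffer can
// perform the copy based on where the source tensor lives.
static bool mydev_buffer_cpy_tensor(ggml_backend_buffer_t buffer,
                                    const struct ggml_tensor * src,
                                    struct ggml_tensor * dst) {
    (void) dst;
    if (ggml_backend_buffer_is_host(src->buffer)) {
        // host -> device: always supported
        // ... upload src->data into dst's device allocation ...
        return true;
    }
    if (ggml_backend_buffer_get_type(src->buffer) == ggml_backend_buffer_get_type(buffer)) {
        // device -> device within the same buffer type
        // ... issue a device-to-device copy ...
        return true;
    }
    // src is in a buffer type this backend does not understand (e.g. an RPC
    // buffer) - report that the copy was not performed
    return false;
}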
It crashes in …
The …
rgerganov#1 should fix it.
@slaren thanks, I have verified your fix and merged it |
…uffer
- always initialize views in the view_src buffer
- add RPC backend to Makefile build
- add endpoint to all RPC object names
Co-authored-by: slaren <slarengh@gmail.com>
My changes have been reviewed and the CI run was successful.
I don't have any privileges in this project and my code is being reviewed like everybody else's. You'd better have some arguments when saying something about the quality of my changes.
When someone is proposing a change, it is their responsibility to address review comments and make sure that the change can be applied on current …
My change has been reviewed and approved by a core maintainer and the CI run was successful.
void ggml_backend_view_init(struct ggml_tensor * tensor) {
    GGML_ASSERT(tensor->buffer == NULL);
    GGML_ASSERT(tensor->view_src != NULL);
    GGML_ASSERT(tensor->view_src->buffer != NULL);
    GGML_ASSERT(tensor->view_src->data != NULL);

-   tensor->buffer = buffer;
+   tensor->buffer = tensor->view_src->buffer;
    tensor->data = (char *)tensor->view_src->data + tensor->view_offs;
-   ggml_backend_buffer_init_tensor(buffer, tensor);
+   ggml_backend_buffer_init_tensor(tensor->buffer, tensor);
}
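With this change the explicit buffer argument is gone and a view is always initialized in the buffer of its view_src. A minimal usage sketch (ctx, base and the view tensor are hypothetical; base is assumed to already live in a backend buffer):

#include "ggml.h"
#include "ggml-backend.h"

// Sketch: create a view of an already-allocated tensor and initialize its
// backend buffer/data. Previously ggml_backend_view_init() also took the
// destination buffer as a parameter; now it is taken from view->view_src.
static void init_view_example(struct ggml_context * ctx, struct ggml_tensor * base) {
    struct ggml_tensor * view = ggml_view_1d(ctx, base, ggml_nelements(base), 0);
    ggml_backend_view_init(view);
}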
This breaks the latest SYCL support. Could I know whether all the other backends assume that the buffer is already bound to the right tensor? Of course, I know this is an issue with SYCL itself; we are maintaining SYCL in our spare time and are still begging for official support from the company :) I will look into it and try to fix it soon.
@slaren @rgerganov for awareness
It's the same issue that was fixed for the Vulkan backend in #7806; there are more details there about why this happens and why the change was necessary. The best way to fix this for the SYCL backend would be to remove the extras entirely, in the same way they were removed from the CUDA backend.
we are maintaining SYCL in our spare time and are still begging for official support from the company
Having a dedicated machine that runs the CI with the SYCL backend would be very helpful.
…g#7640)" (ggml-org#7981) This reverts commit bde7cd3.
I see that you reverted this merge. Does it mean that it will not work with other backends? I'm using Vulkan instead of CUDA (for many reasons), and actually I can see that …
I also encountered this problem. I compiled the RPC server with the OpenCL backend for Android, but this is what was output when I ran it. The standalone llama-cli can use the OpenCL backend, though.
@zhouwg ok, I have deleted the comment.
This patch adds support for offloading layers to RPC servers in addition to other non-RPC backends. For example, if you build with -DLLAMA_CUDA=ON -DLLAMA_RPC=ON, then you can offload to the local GPU and remote server(s):

$ bin/main -m ../models/ggml-model-f16.gguf -p "Hello, my name is" -n 64 -ngl 99 -s 1236 --rpc localhost:50052
...
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA T1200 Laptop GPU, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0,31 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors:   CPU buffer size =  125,00 MiB
llm_load_tensors: CUDA0 buffer size = 1008,19 MiB
llm_load_tensors:   RPC buffer size =  965,16 MiB
..........................................................................................
I have tried to follow the existing patterns in llama.cpp and introduced device numbers for RPC servers, which always come last.

When copying tensors, we need to handle the case where src and dst are not on the same backend. For CUDA I had to build with -DLLAMA_CUDA_NO_PEER_COPY=ON to make it work.
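As a reference for that cross-backend case, here is a minimal sketch of a copy between tensors owned by different backends, staged through host memory with the public ggml-backend API. This is not the code from this PR, and the helper name is made up.

#include <stdlib.h>
#include "ggml.h"
#include "ggml-backend.h"

// Hypothetical helper: copy src into dst when the two tensors live in buffers
// owned by different backends and no direct device-to-device path is available.
static void copy_tensor_via_host(const struct ggml_tensor * src, struct ggml_tensor * dst) {
    const size_t nbytes = ggml_nbytes(src);

    void * staging = malloc(nbytes);
    ggml_backend_tensor_get(src, staging, 0, nbytes);   // device/RPC -> host
    ggml_backend_tensor_set(dst, staging, 0, nbytes);   // host -> device/RPC
    free(staging);
}

ggml_backend_tensor_copy() in ggml-backend already implements a fallback along these lines when the source and destination buffers cannot copy directly.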