This experiment measures the cold start time of a function, i.e., the time it takes for the function to be invoked for the first time. The measurement is repeated for a number of iterations, and the statistics (median, average, min, and max) over all iterations are reported.
- Build the docker image and enter the function container
```bash
cd $PATH_TO_STREAMBOX/experiment/time
bash run.sh
```
- In the container, run the scripts
```bash
bash run.sh
bash stat.sh
```
- Output looks like:

```
> bash run.sh
Running iteration 1/10...
Running iteration 2/10...
...
Running iteration 10/10...
> bash stat.sh
Stats for context_init_times_torch:
Median: 4481.4
Average: 4472.76
Min: 4107.215404510498
Max: 4896.044015884399
...
```
According to our experimental results, the end-to-end cold start latency of a GPU function exceeds 5 seconds, which aligns with the observations made in the paper. A significant portion of this cold start time is consumed by the initialization of the GPU runtime: 4472.76 ms on average, with a maximum of 4896.04 ms. Such latencies are unacceptable for latency-sensitive serverless functions.
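For reference, the GPU runtime initialization could be timed with a sketch like the one below. This is an illustration only, not the artifact's actual `run.sh`:

```python
# Illustrative sketch (not the artifact's run.sh): time CUDA context creation,
# which dominates the GPU runtime's share of the cold start.
import time
import torch

start = time.perf_counter()
torch.cuda.init()           # force CUDA context (GPU runtime) initialization
torch.cuda.synchronize()    # block until the runtime is actually ready
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"context init: {elapsed_ms:.2f} ms")

# The context persists for the lifetime of the process, so each iteration must
# run in a fresh process (or container) to observe a true cold start.
```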
In this experiment, we evaluate the GPU memory footprint of serverless inference systems, focusing on the function-isolation approach.
- Run the script

```bash
cd $PATH_TO_STREAMBOX/experiment/memory
sudo bash run.sh
```

- Output looks like:
```
CUDA_VISIBLE_DEVICES=3
...
Compute Mode < Exclusive Process >
...
nvidia-cuda-mps-control daemon started
Set MPS thread percentage to 10
Successfully built a1b2c3d4e5f6
Successfully tagged torch:streambox
// run the docker containers
a1b2c3d4e5f6
CONTAINER ID   IMAGE             COMMAND           CREATED         STATUS         PORTS   NAMES
a1b2c3d4e5f6   torch:streambox   "python app.py"   3 seconds ago   Up 2 seconds           torch-mps-0
a1b2c3d4e5f7   torch:streambox   "python app.py"   3 seconds ago   Up 2 seconds           torch-mps-1
...
a1b2c3d4e5fn   torch:streambox   "python app.py"   3 seconds ago   Up 2 seconds           torch-mps-n

// Logs from torch-mps-0:
...
Driver Version: 465.19.01
CUDA_VISIBLE_DEVICES: 3
--Begin--
Current GPU memory info: Used: 500MB, Free: 15500MB, Total: 16000MB
torch.cuda.current_device(): 0
Running on Tesla V100-PCIE-16GB
Number of GPUs: 1
--After torch init--
Current GPU memory info: Used: 1900MB, Free: 15200MB, Total: 16000MB
--After model init--
Current GPU memory info: Used: 2000MB, Free: 13500MB, Total: 16000MB
Pytorch Context GPU Memory Usage: 1500MB

// Logs from torch-mps-n:
...

// Resetting MPS and Compute Mode...
```
Using the DNN inference function (ResNet), we find that the GPU runtime occupies over 95% of the overall memory footprint, which results in significant redundancy and low deployment density. Each function occupies 2GB of GPU memory, which is unacceptable for a serverless system; the memory footprint of the GPU runtime is thus a major bottleneck of the function-isolation approach.
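The per-container logs above can be approximated with a sketch like the following, which queries GPU memory via NVML around context and model creation. The `resnet50` model and device index 0 are assumptions; the artifact's `app.py` may differ:

```python
# Illustrative sketch of the per-container measurement (not the artifact's
# app.py): report GPU memory before and after CUDA context and model creation.
import pynvml
import torch
import torchvision

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # device index 0 is assumed

def mem_info():
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)   # whole-GPU view, in bytes
    return (f"Used: {info.used // 2**20}MB, Free: {info.free // 2**20}MB, "
            f"Total: {info.total // 2**20}MB")

print("--Begin--", mem_info())              # before any CUDA context exists
torch.cuda.init()                           # create the CUDA context
torch.cuda.synchronize()
print("--After torch init--", mem_info())  # the context alone costs gigabytes
model = torchvision.models.resnet50().cuda().eval()   # the ResNet function
print("--After model init--", mem_info())
```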
In this experiment, we evaluate the communication overhead of the function-IPC approach by measuring its latency under different message sizes.
- Build the docker image and enter the function container
```bash
cd $PATH_TO_STREAMBOX/experiment/communication/function-ipc
bash run.sh
```
- Output looks like:
```
> bash run.sh
...
torch.cuda.current_device(): 0
Running on Tesla V100-SXM2-32GB
Number of GPUs: 1
Warming up GPU...
GPU warmup done!
Time taken to transfer data from device to host: xxx ms
...
```
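For reference, a minimal sketch of such a device-to-host transfer measurement in PyTorch is shown below; the message sizes are assumptions, not the script's actual configuration:

```python
# Illustrative sketch: time a device-to-host copy for several message sizes as
# a proxy for the function-IPC transfer cost. The sizes below are assumptions.
import time
import torch

warm = torch.randn(1 << 20, device="cuda")            # GPU warmup
warm.cpu()
torch.cuda.synchronize()

for mb in (1, 16, 64, 256):
    x = torch.empty(mb * 2**20 // 4, device="cuda")   # float32 tensor, ~mb MB
    torch.cuda.synchronize()
    start = time.perf_counter()
    y = x.cpu()                                       # device-to-host transfer
    torch.cuda.synchronize()
    ms = (time.perf_counter() - start) * 1000
    print(f"Time taken to transfer {mb} MB from device to host: {ms:.3f} ms")
```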
In this experiment, we evaluate the communication overhead of the CUDA IPC approach. CUDA IPC is a CUDA Runtime API mechanism that allows multiple processes to share device memory on a single GPU.
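As an illustration of the mechanism (a sketch only; the artifact's app1/app2 call the CUDA Runtime API directly), PyTorch exports and opens CUDA IPC handles under the hood whenever a CUDA tensor is passed between processes via `torch.multiprocessing`:

```python
# Illustrative sketch only; the artifact's app1/app2 use the CUDA Runtime API
# (cudaIpcGetMemHandle / cudaIpcOpenMemHandle) directly. PyTorch performs the
# same handle export/open under the hood when a CUDA tensor crosses processes.
import torch
import torch.multiprocessing as mp

def consumer(q):
    x = q.get()                     # opens the CUDA IPC handle in this process
    print("consumer sum:", x.sum().item())

if __name__ == "__main__":
    mp.set_start_method("spawn")    # required for CUDA with multiprocessing
    q = mp.Queue()
    p = mp.Process(target=consumer, args=(q,))
    p.start()
    x = torch.ones(256 * 2**20 // 4, device="cuda")   # ~256 MB float32 buffer
    q.put(x)                        # exports an IPC handle; no data is copied
    p.join()                        # keep x alive until the consumer is done
```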
- Build the docker image and enter the function container
```bash
cd $PATH_TO_STREAMBOX/experiment/communication/cudaipc
bash run.sh
```
- Output files look like:
```
app1.txt
---
get IPC handle1 time = 25.095000 us
get IPC handle2 time = 15.128000 us
waiting for app2
data1 size = 268.435455 MB
end-to-end time = 17632.712000 us
...

app2.txt
---
IPC transfer 2 times cost = 0.000000 us
IPC open handle 1 time = 334.944000 us
IPC open handle 2 time = 292.726013 us
NO1 kernel + handle1 time = 40.992001 us
NO2 kernel + handle1 time = 22.560001 us
NO3 kernel + handle1 time = 23.360001 us
NO1 kernel + handle2 time = 24.224001 us
NO2 kernel + handle2 time = 21.472000 us
NO3 kernel + handle2 time = 20.799999 us
...
```