This guide explains how to run a benchmarking tool that compares DataFrame operations in cuDF and pandas. You can run the tool through a Command Line Interface (CLI) or a Jupyter Notebook. By the end of this guide, you will be able to view benchmark timings, speedups, and performance visualizations for pandas and cuDF operations.
What do self-driving cars, recommender systems, and spam filters have in common? They are all powered by Artificial Intelligence. In recent years, data scientists and engineers have been tasked with “cleaning up” large datasets so that the data can be fed into Artificial Intelligence models. One of the most popular tools for this cleanup today is pandas, a CPU-based DataFrame library. However, a major drawback of pandas is that it performs poorly on large datasets.
In response to this need, NVIDIA released cuDF, a GPU-based DataFrame library. Since cuDF and pandas have nearly identical syntax (refer to graphic below), it is easy for data scientists and engineers to adapt to the new library. To demonstrate cuDF’s computational advantage, we have developed a tool that automatically benchmarks the time that cuDF and pandas take for common DataFrame operations. The data used by the tool comes from the New York City taxi dataset, divided into month-by-month chunks. The tool is available in two formats: a Command Line Interface (CLI) and a Jupyter Notebook.
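As a rough illustration of the shared syntax, the sketch below runs one function body against whichever of the two libraries is installed. The column names and toy values are made up for this example and are not taken from the benchmarking tool. The point is only that `DataFrame`, column arithmetic, and `groupby(...).mean()` are spelled the same way in both libraries.

```python
from importlib import import_module
from importlib.util import find_spec


def mean_km_by_passengers(lib):
    """Run the same DataFrame code with the library module passed in."""
    # Toy data; the column names are illustrative assumptions.
    df = lib.DataFrame({
        "passenger_count": [1, 1, 2, 2],
        "trip_distance": [1.0, 3.0, 2.0, 6.0],
    })
    # Column arithmetic: convert miles to kilometres.
    df["trip_km"] = df["trip_distance"] * 1.609344
    # Group by a category and apply an aggregate function.
    return df.groupby("passenger_count")["trip_km"].mean()


# Only use the libraries that are actually installed on this machine.
installed = [name for name in ("pandas", "cudf") if find_spec(name) is not None]
results = {name: mean_km_by_passengers(import_module(name)) for name in installed}

for name, result in results.items():
    print(f"--- {name} ---")
    print(result)
```

On a machine without a GPU, only the pandas branch runs. Inside the RAPIDS container, both should run from the same function without changes.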
To proceed to the next steps, these instructions assume that you already have the following installed:
GPU: NVIDIA Pascal™ or better with compute capability 6.0+
CUDA: Version 10.0+ (with NVIDIA drivers)
OS: Ubuntu 16.04/18.04
Docker: Docker CE v19.03+ and nvidia-container-toolkit (if using docker container)
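Before continuing, it can help to confirm that the relevant command-line tools are present. The following is a minimal sketch that only checks whether each command is on the `PATH`. It does not verify versions or GPU compute capability, and the mapping of commands to prerequisites is an assumption made for this example.

```python
import shutil

# Commands implied by the prerequisites above (an illustrative mapping).
TOOLS = {
    "nvidia-smi": "NVIDIA driver",
    "docker": "Docker CE",
    "lsb_release": "Ubuntu release info",
}


def find_tools(tools=TOOLS):
    """Return {command: absolute path, or None if not on PATH}."""
    return {cmd: shutil.which(cmd) for cmd in tools}


for cmd, path in find_tools().items():
    status = path if path else "NOT FOUND"
    print(f"{cmd:12s} ({TOOLS[cmd]}): {status}")
```

A `NOT FOUND` entry means the corresponding prerequisite probably still needs to be installed.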
- Go to https://rapids.ai/start.html and scroll down to the RAPIDS release selector
Method: select Docker + Examples
Release: select Stable Release
Packages: select All Packages or cuDF
Linux: Run lsb_release -a to find out your computer or VM’s Ubuntu version.
Python / CUDA: Select the versions on your computer or VM
- Copy the command generated from the RAPIDS release selector into the terminal to create your Docker container (refer to image below)
- Once inside the Docker container, navigate to the outermost directory (using cd ..) and then run bash /rapids/utils/start_jupyter.sh (refer to image below)
- Visit localhost:8888 or (IP address of your VM):8888 in your browser. An instance of JupyterLab should appear.
- Click Terminal (under Other). In the Terminal window, install the gcsfs Python package using pip install gcsfs. If the tool reports other missing packages while it runs, install them with pip install (packagename), although this is unlikely to be necessary.
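To see ahead of time which packages are missing, a short check like the one below can be run in the same environment. This is a sketch: `gcsfs` comes from the step above, while `cudf` and `pandas` are included only because the tool compares them. The full list of the tool's dependencies is not given here, so adjust the names as needed.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Assumed package list; extend it with anything else the tool reports.
missing = missing_packages(["gcsfs", "cudf", "pandas"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages are importable.")
```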
- From this repository, download the folder cli_files. The folder contains cudf_benchmarking_v2.py, generate_graphs_v2.py, load_data_v2.py, and rcid_to_meaning.csv. On the left-hand side of the JupyterLab window, there is a menu of files. Drag the cli_files folder into this menu.
- Open a Terminal window in JupyterLab by navigating to the menu bar at the top and selecting File → New → Terminal.
- Before running the benchmarking tool, decide on its two required parameters: the number of months of data to use and the operation to benchmark. The number of months can be anywhere from 1 to 36. The operations are represented in the shorthand notation below:
Notation | Description |
---|---|
af | Apply numerical function on a column |
aggfunc | Apply aggregate function on a column |
gb | Group by category + Apply Aggregate function |
gt5 | Sort rows based on column value |
dd | Drop duplicate rows |
dc | Drop column from DataFrame |
fc | Filter rows from DataFrame ( by column value ) |
ld | Load data into DataFrame |
rtc | Rewrite DataFrame to .csv file |
cdf | Concatenate DataFrames together |
fnv | Fill null (missing) values in DataFrame |
merge | Perform an inner join on two DataFrames |
all | Perform all operations above (with the exception of cdf) |
- Run the following command in your terminal. (months) and (operation) are the months and operation you decided on in the previous step. Note: You need to navigate into the cli_files directory before running this command.
python cudf_benchmarking_v2.py -num_months (months) -operation (operation)
- After running the command, the results show up on the command line. If you entered ‘all’ for the operation, visualizations are also generated. The image filenames of the visualizations and the contents of each are listed below:
Filename | Visualization |
---|---|
cudf_vs_pandas_p1.png | This visualization is a double bar chart that compares the time it takes for less intensive DataFrame operations to finish in pandas versus cuDF. |
cudf_vs_pandas_p2.png | This visualization is a double bar chart that compares the time it takes for more intensive DataFrame operations to finish in pandas versus cuDF. |
speedups.png | This visualization is a bar chart that looks at how many times faster cuDF is (as compared to pandas) for a given operation. |
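The two parameters described in this section (1 to 36 months, and one of the shorthand operations) can be validated before any data is loaded. The sketch below shows one way to do that with `argparse`. It is a hypothetical illustration and not the actual argument handling in cudf_benchmarking_v2.py. Only the flag names and allowed values are taken from this guide.

```python
import argparse

# Shorthand operations listed in the notation table above.
OPERATIONS = ("af", "aggfunc", "gb", "gt5", "dd", "dc", "fc",
              "ld", "rtc", "cdf", "fnv", "merge", "all")


def months(value):
    """Parse num_months and enforce the documented 1 to 36 range."""
    n = int(value)
    if not 1 <= n <= 36:
        raise argparse.ArgumentTypeError("num_months must be between 1 and 36")
    return n


def build_parser():
    """Build a parser that accepts the two flags used in this guide."""
    parser = argparse.ArgumentParser(description="cuDF vs. pandas benchmark (sketch)")
    parser.add_argument("-num_months", type=months, required=True)
    parser.add_argument("-operation", choices=OPERATIONS, required=True)
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["-num_months", "1", "-operation", "all"])
print(args.num_months, args.operation)
```

With this approach, an out-of-range month count or an unknown operation stops the run with a usage message instead of failing partway through a long data load.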
- From this repository, download the notebook cudf_vs_pandas_benchmarking.ipynb. Drag the file into the files menu (on the left side of the interface) in JupyterLab.
- Open the file by double-clicking its name. Run each cell by pressing the ▷ button in the menu inside the notebook, continuing until you reach the “Loading the Data for Benchmarking” section. Running each cell ensures that the functions used for benchmarking are loaded into the program’s memory.
- In this cell (image below), enter the number of months of data and the operation you want to benchmark. Refer to the Running the Tool - Command Line (CLI) section to double-check that your inputs are valid.
As an example, let's run the benchmarking tool on one month of taxi data (~2 GB) and look at how all of the operations perform.
CLI
Using the CLI, the command would look like:
python cudf_benchmarking_v2.py -num_months 1 -operation all
Jupyter Notebook
In the Jupyter Notebook, set num_months = 1 and operation = 'all' in the appropriate cell (refer to Step 3 in Running the Tool - Jupyter Notebook).
Once this step is done, the performance numbers and subsequent visualizations are visible.
Performance Numbers
Loading data through cudf ...
Elapsed time: 49.0966 seconds
Loading data through pandas ...
Elapsed time: 143.9255 seconds
Data Loading Speedup (cuDF): 2.931X
Pandas: Applying a numerical function (converting mi to km):
Elapsed time: 2.3508 seconds
CuDF: Applying a numerical function (converting mi to km):
Elapsed time: 0.0802 seconds
Speedup: 29.307X
Pandas: Applying an aggregate function:
Elapsed time: 0.0594 seconds
CuDF: Applying an aggregate function:
Elapsed time: 0.0008 seconds
Speedup (cuDF): 70.023X
.
.
.
(and so on)
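The "Elapsed time" and "Speedup" figures above can be read as a wall-clock measurement and a simple ratio (pandas time divided by cuDF time). The sketch below shows that arithmetic under those assumptions. The helper names are hypothetical and are not the tool's own functions.

```python
import time


def timed(func, *args, **kwargs):
    """Run func once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def speedup(pandas_seconds, cudf_seconds):
    """How many times faster cuDF was than pandas for the same operation."""
    return pandas_seconds / cudf_seconds


# Data-loading times copied from the output above.
print(f"Data Loading Speedup (cuDF): {speedup(143.9255, 49.0966):.3f}X")

# Timing a stand-in workload (plain Python, not a DataFrame operation).
_, elapsed = timed(sum, range(1_000_000))
print(f"Elapsed time: {elapsed:.4f} seconds")
```

The data-loading ratio reproduces the 2.931X shown above. For very short operations, the printed times are rounded to four decimals, so recomputing a speedup from them (for example 0.0594 / 0.0008) will not exactly match the reported value.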
Visualizations
The visualizations are below. The description of each visualization is in Step 5 of the Running the Tool - Command Line (CLI) section: