[FEA] Speed up RF -> FIL conversion for inference #2399
The time for the first prediction includes conversion to FIL format. The FIL-converted tree is cached after the first call. If you have a tiny dataset and don't want to pay the setup cost, you can use the CPU-based predict call, which is much lower throughput but lower latency.
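A hypothetical sketch of the caching behaviour described here (the class and attribute names are illustrative, not cuML's actual internals): the expensive conversion runs once, on the first predict, and the converted model is reused on every later call.

```python
class LazilyConvertedForest:
    # Illustrative stand-in for the RF -> FIL caching described above;
    # none of these names are cuML's real internals.
    def __init__(self, trees):
        self._trees = trees
        self._fil_model = None       # populated on the first predict
        self.conversion_count = 0

    def _convert_to_fil(self):
        self.conversion_count += 1   # stands in for the costly conversion step
        return tuple(self._trees)

    def predict(self, n_rows):
        if self._fil_model is None:  # pay the setup cost only once
            self._fil_model = self._convert_to_fil()
        return [len(self._fil_model)] * n_rows

forest = LazilyConvertedForest(trees=["t0", "t1", "t2"])
forest.predict(2)                    # slow path: triggers the conversion
forest.predict(2)                    # fast path: reuses the cached model
assert forest.conversion_count == 1
```

This is why the second `%time` in the benchmarks below is so much cheaper than the first.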
I'm going to rephrase this as a feature request to speed up FIL translation and leave it in the FEA queue.
Thanks @JohnZed! That sounds great.
Note that #2263 was merged yesterday, which speeds up serialization of RF objects. We should run the benchmark again to obtain a new measurement for RF->FIL conversion.
@hcho3 looks like the serialization changes made a meaningful improvement. The example below is from the 2020-07-01 nightly as of 3 PM EDT. It looks to be about 4 seconds shaved off, or about 1/3 of the time.

```python
import cupy as cp
from sklearn.datasets import make_classification
from cuml.ensemble import RandomForestClassifier as gpu_rf

X, y = make_classification(
    n_samples=1000000,
    n_features=20,
    n_informative=18,
    n_classes=2,
    random_state=0,
)
n_trees = 300

X = X.astype("float32")
y = y.astype("int32")
gX, gy = cp.asarray(X), cp.asarray(y)

clf1 = gpu_rf(n_estimators=n_trees)
clf1.fit(gX, gy)

%time clf1.predict(gX)
%time clf1.predict(gX)
```

```
CPU times: user 7.57 s, sys: 1.21 s, total: 8.77 s
Wall time: 7.87 s
CPU times: user 878 ms, sys: 345 ms, total: 1.22 s
Wall time: 1.22 s
array([1, 0, 1, ..., 0, 1, 1], dtype=int32)
```

The conversion slowdown appears strongly related to the number of features. While not surprising, it's interesting to see it play out. I wonder if there is an inflection point.

```python
import cupy as cp
from sklearn.datasets import make_classification
from cuml.ensemble import RandomForestClassifier as gpu_rf

n_trees = 300

for nfeat in [5, 10, 15, 20]:
    X, y = make_classification(
        n_samples=1000000,
        n_features=nfeat,
        n_informative=nfeat - 2,
        n_classes=2,
        random_state=0,
    )
    X = X.astype("float32")
    y = y.astype("int32")
    gX, gy = cp.asarray(X), cp.asarray(y)

    clf1 = gpu_rf(n_estimators=n_trees)
    clf1.fit(gX, gy)

    print(f"{nfeat} Features")
    %time clf1.predict(gX)
    print()
```

```
5 Features
CPU times: user 1.33 s, sys: 35.9 ms, total: 1.36 s
Wall time: 404 ms

10 Features
CPU times: user 5.45 s, sys: 687 ms, total: 6.14 s
Wall time: 5.24 s

15 Features
CPU times: user 6.19 s, sys: 785 ms, total: 6.97 s
Wall time: 6.43 s

20 Features
CPU times: user 7.23 s, sys: 570 ms, total: 7.8 s
Wall time: 6.88 s
```

Paying the one-time cost is probably more impactful in a cross-validation workflow, in which potentially many unique models call predict over their lifecycle. We'd end up with a linear lower bound on total time of
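As a rough illustration of that linear lower bound (the function and its arguments are made up for this sketch; the ~6.6 s conversion and ~1.2 s warm-predict figures are eyeballed from the runs above):

```python
def cv_inference_floor(n_models, conversion_s, predict_s, predicts_per_model=1):
    # Every unique model trained during cross-validation pays the one-time
    # RF -> FIL conversion cost once, so total inference time is bounded
    # below by a quantity linear in the number of models.
    return n_models * (conversion_s + predicts_per_model * predict_s)

# e.g. a 20-candidate grid search with 5-fold CV trains 100 models;
# with ~6.6 s conversion and ~1.2 s per warm predict:
floor_s = cv_inference_floor(100, 6.6, 1.2)
assert abs(floor_s - 780.0) < 1e-6  # ~13 minutes spent mostly on conversion
```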
Update: Pickle protocol 5 speeds up RF -> FIL conversion further. It uses a technique called "out-of-band serialization" to speed up conversion between NumPy arrays and bytes.

**Benchmark setup**

**Benchmark results**

As noted in #2263, most of the run time is consumed by RF->FIL conversion.

**How to opt into Pickle 5**

There are two options:

```bash
conda install -c rapidsai -c nvidia -c rapidsai-nightly -c conda-forge cloudpickle pickle5

# Install development versions of Dask and Distributed
conda remove --force distributed dask
git clone https://github.com/dask/dask.git
cd dask
python -m pip install .
cd ..
git clone https://github.com/dask/distributed.git
cd distributed
python setup.py install
```

Special thanks to @jakirkham, who brought Pickle 5 to Dask.
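The out-of-band mechanism mentioned above comes from PEP 574 (pickle protocol 5, Python 3.8+). A minimal stdlib-only illustration, adapted from the PEP's `ZeroCopyByteArray` example: a type opts in by returning a `PickleBuffer` from `__reduce_ex__`, and the pickler then hands the large buffer to `buffer_callback` instead of copying it into the pickle stream.

```python
import pickle

class ZeroCopyByteArray(bytearray):
    # bytearray subclass that opts into pickle-5 out-of-band buffers (PEP 574).
    def __reduce_ex__(self, protocol):
        if protocol >= 5:
            return type(self)._reconstruct, (pickle.PickleBuffer(self),), None
        else:
            return type(self)._reconstruct, (bytearray(self),)

    @classmethod
    def _reconstruct(cls, obj):
        with memoryview(obj) as m:
            obj = m.obj
            if isinstance(obj, cls):
                return obj
            return cls(obj)

payload = ZeroCopyByteArray(b"\x00" * (1 << 20))  # 1 MiB stand-in for tree data
buffers = []
meta = pickle.dumps(payload, protocol=5, buffer_callback=buffers.append)
# The 1 MiB buffer went to buffer_callback, so `meta` holds only metadata.
restored = pickle.loads(meta, buffers=buffers)
assert restored == payload and len(meta) < 1024
```

For plain NumPy arrays, NumPy itself implements this protocol, which is what Dask/Distributed exploit here.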
As a note, the Distributed change (dask/distributed#3849) will be part of the 2.21.0 release.
Awesome benchmark and summary @hcho3. Do you have a sense of how the 73 seconds for prediction compares to sklearn's random forest on the same data?
I saw that only one core was used during the conversion; maybe a multiprocess task could speed up the inference?
Probably not. Most of the savings here is avoiding copies. Once that is done, which I believe is already the case here, we are just passing pointers and metadata around until it goes over the wire. Though feel free to correct me if I'm missing anything here Philip 🙂
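A tiny stdlib illustration of the "pointers and metadata" point above: a `memoryview` (and slices of it) shares memory with the underlying buffer rather than copying it, which is the behaviour zero-copy serialization relies on.

```python
data = bytearray(b"abcdef" * 1000)    # a large buffer, e.g. serialized tree data

view = memoryview(data)               # no copy: just a pointer plus metadata
chunk = view[:3000]                   # slicing a memoryview is also zero-copy

data[0] = ord("Z")                    # mutate the underlying buffer ...
assert chunk[0] == ord("Z")           # ... and the slice sees it: shared memory
assert bytes(chunk[:6]) == b"Zbcdef"  # materializing with bytes() is the copy
```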
We need more perf benchmarks before we can conclusively say what's causing the slowdown.
Okay... I have another question: is it possible to run inference without conversion? In some financial scenarios, we compare the labels and preds only a few times (even just once).
This conversion step is an essential part of our inference code at the moment, but speeding it up is currently my focus and number-one priority. The short version of my profiling findings is that our use of
A brief update on profiling and where we're at with this:

**Profiling setup**

I'm currently profiling the single-GPU case only; I'll be looking at the MNMG class independently later. All reported results below are for randomly generated (but consistent) data with 100,000 samples and 20 features. An RF classifier is trained with 300 trees, and prediction is run once on the same data used for training. I've done some investigation with other parameters, but I will stick to reporting these (apparently fairly representative) results unless otherwise noted. The relevant method for this issue is

**Treelite mainline results**

On current Treelite mainline,
For the moment, we will ignore the deletion method. Breaking the other methods down further, it became clear that
**Treelite POC with open-addressing/local-probing hash table**

With this profiling data available, I created a proof-of-concept PR for Treelite that moves from an

Reviewing the same data presented for mainline, we have:
And for the low-level breakdown:
The moral of the story is that linked lists are evil, and hence so are stdlib maps ;). The overall runtime for

**Still to be done**

With profiling results from the POC, we see that
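The open-addressing/local-probing idea above can be sketched in a few lines (a toy illustration in Python, not Treelite's actual implementation): all entries live in one flat array, so a lookup probes contiguous slots instead of chasing the pointer chains of node-based containers like `std::map`.

```python
class OpenAddressingMap:
    # Toy open-addressing hash map with linear (local) probing. For brevity
    # it never resizes, so keep the entry count well below `capacity`.
    _EMPTY = object()

    def __init__(self, capacity=64):
        self._keys = [self._EMPTY] * capacity
        self._vals = [None] * capacity

    def _slot(self, key):
        # Start at the hashed slot; on collision, probe the next slot over,
        # which stays in the same cache lines rather than jumping pointers.
        i = hash(key) % len(self._keys)
        while self._keys[i] is not self._EMPTY and self._keys[i] != key:
            i = (i + 1) % len(self._keys)
        return i

    def __setitem__(self, key, val):
        i = self._slot(key)
        self._keys[i], self._vals[i] = key, val

    def __getitem__(self, key):
        i = self._slot(key)
        if self._keys[i] is self._EMPTY:
            raise KeyError(key)
        return self._vals[i]

m = OpenAddressingMap()
for node_id in range(32):
    m[node_id] = node_id * 10
assert m[7] == 70 and m[31] == 310
```

The cache-friendliness of this layout, rather than any asymptotic difference, is what drives the speedup seen in the POC numbers.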
Another update based on the same profiling setup. My general approach has been to develop cheap/quick PoCs exploring different possible avenues for conversion speedup and to use them to guide further exploration and long-term development decisions. A quick summary of a few areas of investigation and the runtime for a single prediction (

EDIT: Removing earlier results based on a build against an incorrect Treelite library to avoid confusion.

Based on offline discussion today, I'll be looking at parallelization of the TL->FIL conversion and ensuring that at least the FastMap + RF->TL parallelization + TL->FIL pre-allocation approach makes it into 0.18. I'll then turn my attention to migrating the direct RF->FIL conversion from a PoC to a more robust and optimized implementation. This may make it into 0.18 as an experimental feature, but it may also get pushed back to 0.19.
Closed by #3395 — there is more room for optimization, but this is by far the most important speedup we need.
The brief version of the final result: we obtained about a 21.23x speedup relative to baseline for the parameters described above. I'll do one more "matrix" of runs with a variety of tree depths, numbers of features, etc., and post that table here for a final comparison.
In today's nightly (cuml commit f1f1c7f6a), the `predict` method of the random forest classifier takes quite a bit of time the first time it's called on a 1M-row binary classification dataset, but is much faster the second time. Perhaps this could be related to #1922?

After this, I added a print statement in the `predict` method to see if it's using the GPU path, which it appears to be.

cuml/python/cuml/ensemble/randomforestclassifier.pyx, lines 869 to 871 in 4b3213d