Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Faster Treelite serialization #2263

Merged
merged 13 commits into from
Jun 17, 2020

Conversation

hcho3
Copy link
Contributor

@hcho3 hcho3 commented May 14, 2020

Summary

Speed up serialization of Treelite model objects and reduce overhead in multi-GPU RF prediction.

Features

  • Flat object layout: now Tree is an array of POD objects (Node). As a consequence, all queries for a give node should now be made via methods of Tree class.
  • Binary serialization without depending on Protobuf
  • Eliminate the use of tempfile.

Benchmark setup

  • Using this script by @Salonijain27. See the link for instructions.
  • AWS EC2 instance g4dn.12xlarge, T4 GPU (16 GB GDDR6) X 4
  • Benchmark setting: n_gpus=2, n_gb=2, n_features=20, depth=25, n_estimators=10. This leads to a forest consisting of 10 depth-25 trees, and we run through 25 million data rows.

Benchmark Results

Aggregate

Before After
269.0 sec 109.4 sec (2.46x speedup)

Breakdown by components

Notice that some of the overhead is still yet to be explained. Also note that the actual prediction time is a small portion of the total run time. Due to the nature of distributed algorithm, timing measures are approximate.

Current cuML

  Master Worker 0 Worker 1
RF->Treelite   32.6 32.8
Treelite->Protobuf   10.8 11.3
(Copy models workers -> master)      
Protobuf->Treelite   26.6 23.8
Concatenate Treelite handles 23.3    
(Copy model master -> workers)      
Protobuf->Treelite   52.2 52.3
Treelite->FIL   8.6 8.5
FIL predict   2.9 1.1

This PR

  Master Worker 0 Worker 1
RF->Treelite   22.1 22.5
Treelite->Bytes   < 0.001 < 0.001
(Copy models workers -> master)      
Bytes->Treelite   < 0.001 < 0.001
Concatenate Treelite handles 3.1    
(Copy model master -> workers)      
Bytes->Treelite   < 0.001 < 0.001
Treelite->FIL   4.8 4.7
FIL predict   4.9 1.1

cpp/cmake/Dependencies.cmake Outdated Show resolved Hide resolved
@JohnZed
Copy link
Contributor

JohnZed commented May 14, 2020

Very promising! I suspect that the RF -> treelite can be accelerated a lot in a future rev too... I don't know why it should have to be much longer than treelite-> fil in general if we move to convert the representation efficiently.

@dantegd dantegd added the 2 - In Progress Currenty a work in progress label May 14, 2020
Copy link
Contributor

@JohnZed JohnZed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a fairly quick pass for the first rev... will review again in more detail as things get wrapped up.
Overall, I think it's great and looks pretty clear. The changes to the core cython code are smaller than I expected - it fits in well with the existing pattern, so I think it's pretty understandable.
I don't fully get the serialization to frames - looks like it simply passing through the binary format used within tl? Seems reasonable to me, but I haven't used this style of conversion before with Py_buffers etc.

cpp/cmake/Dependencies.cmake Show resolved Hide resolved
cpp/src/fil/fil.cu Show resolved Hide resolved
cpp/test/sg/rf_treelite_test.cu Show resolved Hide resolved
@@ -515,18 +509,21 @@ class RandomForestClassifier(Base):
to a shared file. Cuml issue #1854 has been created to track this.
"""
def _tl_model_handles(self, model_bytes):
cdef ModelHandle cuml_model_ptr = NULL
cdef uintptr_t tl_handle_int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In another pr, I proposed renaming this something like _alloc_and_convert_model to make it clear that the caller needs to free the result.

"""
Returns the self.model_pbuf_bytes.
Returns the self.model_bytes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe clarify the type and update the rest of docstring: "Returns the treelite binary format representation of this model."

@hcho3
Copy link
Contributor Author

hcho3 commented May 14, 2020

I don't fully get the serialization to frames - looks like it simply passing through the binary format used within tl?

Treelite objects now exposes the Python buffer protocol interface, so that we can transparently convert Treelite objects to memory views with zero overhead. In init_from_frames(), we fetch the buffer interface from the Treelite object and cast it into memory view. It is O(1) because it amounts to pointer casting.

@jakirkham gave me valuable advice for implementing the Python buffer protocol.

Copy link
Contributor

@Salonijain27 Salonijain27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, its great to see the time required for predict drop so much! I have a few suggestions and questions

cpp/test/sg/fil_test.cu Show resolved Hide resolved
python/cuml/ensemble/randomforestclassifier.pyx Outdated Show resolved Hide resolved
python/cuml/ensemble/randomforestregressor.pyx Outdated Show resolved Hide resolved
python/cuml/ensemble/randomforestclassifier.pyx Outdated Show resolved Hide resolved
@hcho3 hcho3 marked this pull request as ready for review May 15, 2020 20:45
@hcho3 hcho3 requested review from a team as code owners May 15, 2020 20:45
@Salonijain27 Salonijain27 added the 4 - Waiting on Author Waiting for author to respond to review label May 15, 2020
@hcho3 hcho3 force-pushed the fast_treelite_serializer branch from 117d313 to a935bee Compare May 20, 2020 00:30
@hcho3
Copy link
Contributor Author

hcho3 commented May 20, 2020

This is quite a strange error:



=================================== FAILURES ===================================

_________________________ test_real_algos_runner[FIL] __________________________



algo_name = 'FIL'



    @pytest.mark.parametrize('algo_name', ['UMAP-Supervised',

                                           'DBSCAN',

                                           'LogisticRegression',

                                           'ElasticNet',

                                           'FIL'])

    def test_real_algos_runner(algo_name):

        pair = algorithms.algorithm_by_name(algo_name)

    

        if (algo_name == 'UMAP' and not has_umap()) or \

           (algo_name == 'FIL' and not has_xgboost()):

            pytest.xfail()

    

        runner = AccuracyComparisonRunner(

            [20], [5], dataset_name='classification', test_fraction=0.20

        )

        results = runner.run(pair)[0]

        print(results)

>       assert results["cuml_acc"] is not None

E       KeyError: 'cuml_acc'



cuml/test/test_benchmark.py:190: KeyError

@hcho3 hcho3 changed the title [WIP] Faster Treelite serialization [REVIEW] Faster Treelite serialization May 21, 2020
@hcho3
Copy link
Contributor Author

hcho3 commented May 21, 2020

I managed to fix the failing benchmark test. Marking this as ready for review.

cpp/src/fil/fil.cu Outdated Show resolved Hide resolved
@hcho3
Copy link
Contributor Author

hcho3 commented Jun 14, 2020

List of changes made to Treelite:

Once all tests pass, I will go ahead and release 0.92 version of Treelite.

@hcho3 hcho3 changed the title [WIP] Faster Treelite serialization [REVIEW] Faster Treelite serialization Jun 15, 2020
ci/gpu/build.sh Outdated Show resolved Hide resolved
Copy link
Member

@jakirkham jakirkham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exciting to see all of the progress here, @hcho3! 😄

Added a couple of comments about how we might simplify things a bit here. Please let me know if you have any questions 🙂

python/cuml/ensemble/randomforest_shared.pyx Show resolved Hide resolved
python/cuml/ensemble/randomforestregressor.pyx Outdated Show resolved Hide resolved
python/cuml/ensemble/randomforestregressor.pyx Outdated Show resolved Hide resolved
python/cuml/ensemble/randomforestregressor.pyx Outdated Show resolved Hide resolved
python/cuml/dask/ensemble/base.py Outdated Show resolved Hide resolved
python/cuml/ensemble/randomforest_shared.pyx Outdated Show resolved Hide resolved
python/cuml/ensemble/randomforest_shared.pyx Outdated Show resolved Hide resolved
python/cuml/ensemble/randomforest_shared.pyx Outdated Show resolved Hide resolved
python/cuml/ensemble/randomforest_shared.pyx Outdated Show resolved Hide resolved
python/cuml/ensemble/randomforestclassifier.pyx Outdated Show resolved Hide resolved
@hcho3
Copy link
Contributor Author

hcho3 commented Jun 16, 2020

I submitted Treelite 0.92 to conda-forge: conda-forge/staged-recipes#11926. Fingers crossed.

Copy link
Contributor

@JohnZed JohnZed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I have only small questions/suggestions.
Also, are there additional unit tests that would be helpful here? Serializing+deserializing model variants (e.g. classification/regression/multiclass) and ensuring we get the properties right? Not sure...

cpp/cmake/Dependencies.cmake Outdated Show resolved Hide resolved
if (task_category > 2) {
// Multi-class classification
TREELITE_CHECK(TreeliteModelBuilderSetModelParam(
model_builder, "pred_transform", "max_index"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't multiclass currently disabled until #2248 goes in?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. I included the line because it was already part of the current codebase.

python/cuml/ensemble/randomforestclassifier.pyx Outdated Show resolved Hide resolved
@hcho3
Copy link
Contributor Author

hcho3 commented Jun 16, 2020

@JohnZed The Treelite repo contains several round-trip tests for the new serializer: https://github.com/dmlc/treelite/blob/master/tests/cpp/test_serializer.cc

@hcho3
Copy link
Contributor Author

hcho3 commented Jun 16, 2020

@jakirkham I addressed all your comments, except the one about casting buffer frames.

Copy link
Member

@jakirkham jakirkham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grouping together suggested format_str encoding changes for clearer discussion.

python/cuml/ensemble/randomforest_shared.pyx Show resolved Hide resolved
python/cuml/ensemble/randomforest_shared.pyx Outdated Show resolved Hide resolved
model: uintptr_t
) -> Dict[str, Union[List[str], List[np.ndarray]]]:
frames = get_frames(model)
header = {'format_str': [x.format.encode('utf-8') for x in frames],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
header = {'format_str': [x.format.encode('utf-8') for x in frames],
header = {'format_str': [x.format for x in frames],

Copy link
Contributor Author

@hcho3 hcho3 Jun 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and applied your suggestion.

Is it preferable to pass str to pickle, rather than bytes? I'd like to understand your reasoning behind this suggestion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was thinking about this in the context of moving from pickle to Dask serialization down the road (assuming that is still the plan). Typically the header consists of things like int, str, dict, list, and tuple. Generally things that are MsgPack serializable. Typically bytes and memoryviews are reserved for frames instead.

Was unsure at first whether bytes would work in the header. However after playing with things a bit bytes may work. MsgPack is at least able to handle them with the flags that Dask is using.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Thanks for your explanation. I agree that built-in types like str would be well supported by Dask serializer.

Co-authored-by: jakirkham <jakirkham@gmail.com>
@jakirkham
Copy link
Member

Thanks for all of the work here @hcho3! 😄 Looks good

Sounds like we are going to handle switching to the treelite Conda package in another PR. Is that right?

@hcho3
Copy link
Contributor Author

hcho3 commented Jun 16, 2020

Sounds like we are going to handle switching to the treelite Conda package in another PR. Is that right?

Yes, I’ll work on it after this PR is merged.

@hcho3
Copy link
Contributor Author

hcho3 commented Jun 17, 2020

The failing test in the CI should be fixed by #2432

Copy link
Contributor

@JohnZed JohnZed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@JohnZed JohnZed merged commit f231111 into rapidsai:branch-0.15 Jun 17, 2020
@hcho3 hcho3 deleted the fast_treelite_serializer branch June 17, 2020 19:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Blocked Cannot progress due to external reasons 2 - In Progress Currenty a work in progress 4 - Waiting on Author Waiting for author to respond to review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants