Only check dask.DataFrame dtypes of columns actually used #1236
Fixes #1235.
In our dask DataFrame workflows we use a predicted `dtype` for the return value, and previously we calculated one that suited all columns of the DataFrame. This fix restricts the calculation to only the columns that are actually used.

In terms of implementation, the columns used have already been identified in the `compile_components` function, so we just need to return them to all callers; the dask workflow now uses only those columns.

I have been deliberately conservative here. With up-to-date dependent packages the predicted `dtype` doesn't matter at all: I can put anything in here and datashader works as expected. But given that this code does some potentially risky things with dask internals, I do not want to change it any more than necessary.
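To illustrate the idea (not datashader's actual code; `predict_dtype` and `used_columns` are hypothetical names), a minimal sketch of predicting a result dtype from only the columns a computation touches, so that unrelated columns cannot skew the prediction:

```python
import numpy as np
import pandas as pd

def predict_dtype(df, used_columns):
    # Combine the dtypes of only the columns actually used,
    # instead of all columns of the DataFrame.
    return np.result_type(*(df[c].dtype for c in used_columns))

df = pd.DataFrame({
    "x": np.array([0.0, 1.0], dtype="float64"),
    "y": np.array([2.0, 3.0], dtype="float64"),
    "label": ["a", "b"],  # object column, irrelevant to the aggregation
})

# Considering every column would pull in the object dtype;
# restricting to the used columns keeps the prediction numeric.
print(predict_dtype(df, ["x", "y"]))
```

In the dask case such a predicted dtype serves as the `meta` for the lazy graph, which is why restricting it to the columns identified by `compile_components` is sufficient.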