Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Categorify inference and testing #1874

Merged
merged 8 commits into from
Apr 29, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions cpp/nvtabular/inference/categorify.cc
Original file line number Diff line number Diff line change
Expand Up @@ -337,12 +337,12 @@ namespace nvtabular
// this operator currently only supports CPU arrays
.def_property_readonly("supports", [](py::object self)
{
py::object supports = py::module_::import("nvtabular").attr("graph").attr("base_operator").attr("Supports");
py::object supports = py::module_::import("nvtabular").attr("graph").attr("operator").attr("Supports");
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The base_operator.py file was renamed to operator.py in NVIDIA-Merlin/core#359. Therefore, this fix should be valid for >=23.08

return supports.attr("CPU_DICT_ARRAY");
})
.def_property_readonly("supported_formats", [](py::object self)
{
py::object supported = py::module_::import("nvtabular").attr("graph").attr("base_operator").attr("DataFormats");
py::object supported = py::module_::import("nvtabular").attr("graph").attr("operator").attr("DataFormats");
return supported.attr("NUMPY_DICT_ARRAY");
});
}
Expand Down
13 changes: 11 additions & 2 deletions nvtabular/ops/operator.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,15 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
from merlin.dag import BaseOperator, ColumnSelector # noqa pylint: disable=unused-import
from merlin.dag import ( # noqa pylint: disable=unused-import
BaseOperator,
ColumnSelector,
DataFormats,
)

Operator = BaseOperator

# Avoid TENSOR_TABLE by default (for now)
class Operator(BaseOperator):
@property
def supported_formats(self):
return DataFormats.PANDAS_DATAFRAME | DataFormats.CUDF_DATAFRAME
Comment on lines +26 to +27
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With NVIDIA-Merlin/systems#389 applied to systems, I was seeing inference errors in Groupby. This is because the Groupby operator calls sort_values, and sorting is not supported for merlins TensorTable abstraction.

Since I am not entirely sure which NVTabular operations are supported with TensorTable, I figured the safest fix was to assume none of the operations are supported for now. (Any thoughts on this @jperez999 ?)

In a follow-up to this PR, it probably makes sense to add GeneralOperator = BaseOperator above this. That way, any operator with known support for TensorTable could inherit from GeneralOperator instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oliverholworthy and @jperez999 do you have a chance to look at Rick's comments above? thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be alright. All of the operators in nvtabular were created with data frames in mind. If we ever decide to add in tensor table, we can make the change then. If this speeds up the runs I say we should do it and solves breaking issues we should execute.

5 changes: 5 additions & 0 deletions tests/unit/ops/test_categorify.py
Original file line number Diff line number Diff line change
Expand Up @@ -734,3 +734,8 @@ def test_categorify_inference():
output_tensors = inference_op.transform(cats.input_columns, input_tensors)
for key in input_tensors:
assert output_tensors[key].dtype == np.dtype("int64")

# Check results are consistent with python code path
expect = workflow.transform(df)
got = pd.DataFrame(output_tensors)
assert_eq(expect, got)
Comment on lines +738 to +741
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was only.checking the data type of the result. Now it also checks if the result is correct.

Loading