
Data loader merlin graph transforms and embeddings #37

Closed
wants to merge 6 commits

Conversation

jperez999
Collaborator

This PR adds the ability to run merlin graph transforms over the batches of data that come out of the data loader. The operator introduced here is the embedding operator, which allows batch-level addition of embedding representations for records.
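As a sketch of the idea described above (the class name, constructor arguments, and batch format here are illustrative assumptions, not the exact API in this PR), a batch-level embedding lookup operator might look like:

```python
import numpy as np

class EmbeddingOperator:
    """Illustrative stand-in for the embedding operator described above:
    given a batch, look up embedding rows for an id column and attach
    them as a new feature. All names here are assumptions."""

    def __init__(self, embeddings, lookup_key, embedding_name="embeddings"):
        self.embeddings = embeddings          # (num_ids, dim) table
        self.lookup_key = lookup_key          # column holding the ids
        self.embedding_name = embedding_name  # name of the added column

    def transform(self, batch):
        ids = batch[self.lookup_key]
        batch[self.embedding_name] = self.embeddings[ids]
        return batch

# usage: apply the operator to each batch coming out of a data loader
table = np.arange(12.0).reshape(4, 3)            # 4 ids, dim 3
op = EmbeddingOperator(table, lookup_key="item_id")
out = op.transform({"item_id": np.array([0, 2, 3])})
print(out["embeddings"].shape)  # (3, 3)
```

The key point is that the lookup happens per batch, after the loader has produced the batch, rather than being baked into the dataset itself.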

@@ -438,21 +467,21 @@ def _to_tensor(self, df):
tensor in the appropriate library, with an optional
dtype kwarg to do explicit casting if need be
"""
raise NotImplementedError
return df.to_cupy()
Collaborator Author

The leftover `raise NotImplementedError` should be removed now that the method is implemented.


def _get_device_ctx(self, dev):
"""
One of the mandatory functions a child class needs
to implement. Maps from a GPU index to a framework
context object for placing tensors on specific GPUs
"""
raise NotImplementedError
return cp.cuda.Device(dev)
Collaborator Author

The leftover `raise NotImplementedError` should be removed now that the method is implemented.


def _cast_to_numpy_dtype(self, dtype):
"""
Get the numpy dtype from the framework dtype.
"""
raise NotImplementedError
return dtype
Collaborator Author

The leftover `raise NotImplementedError` should be removed now that the method is implemented.

# are all operators going to need to know about lists as tuples?
# seems like we could benefit from an object here that encapsulates
# both lists and scalar tensor types?
if self.transforms:
Collaborator Author

We should think about creating a comprehensive Column class that can be sub-classed into ScalarColumn and ListColumn. This would hide the tuple format behind a DataFrame-series-style interface that is friendlier to the other parts of merlin, i.e. the graph. The use case: suppose I want to do some in-batch processing on a list column after the data loader. It would be easier to abstract that tuple representation (values, nnz) so the user does not have to worry about keeping track of all that.
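The abstraction proposed above could be sketched as follows (a minimal illustration only; the class names and the offsets-based ragged storage are assumptions about one way to hide the (values, nnz) tuple):

```python
import numpy as np

class Column:
    """Hypothetical base class: hide tensor storage details behind a
    small series-like interface."""
    def __init__(self, values):
        self.values = np.asarray(values)
    def __len__(self):
        return len(self.values)

class ScalarColumn(Column):
    """One value per row; indexing is direct."""
    def __getitem__(self, i):
        return self.values[i]

class ListColumn(Column):
    """Ragged list column stored as flat values plus row offsets, so
    callers never juggle the (values, nnz) tuple themselves."""
    def __init__(self, values, offsets):
        super().__init__(values)
        self.offsets = np.asarray(offsets)
    def __len__(self):
        return len(self.offsets) - 1
    def __getitem__(self, i):
        # slice out row i's values from the flat buffer
        return self.values[self.offsets[i]:self.offsets[i + 1]]

col = ListColumn([1, 2, 3, 4, 5], offsets=[0, 2, 5])
print(list(col[1]))  # [3, 4, 5]
```

Graph operators could then be written against the `Column` interface and work on both scalar and list columns without special-casing the tuple format.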

@jperez999
Collaborator Author

This PR requires a core change in https://github.com/NVIDIA-Merlin/core/pull/152/files

@jperez999
Collaborator Author

rerun tests



class TFEmbeddingOperator(BaseOperator):
"""Create an operator that will apply a tf embedding table to supplied indices.
Collaborator Author

Most of these are repeated with small tweaks; it would be nice to converge so we don't have three operators for the same thing that merely take different inputs.

from merlin.schema import ColumnSchema, Schema, Tags


class TorchEmbeddingOperator(BaseOperator):
Collaborator Author

Same as in the TensorFlow case: many of the operators differ only slightly, but to avoid confusion and let users understand the uses and use cases more clearly, we have kept these operators separate. It would be good to move to a state where we have just one operator for this (as previously stated).
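One possible way the per-framework operators could converge (purely a sketch; the dispatch mechanism and all names here are assumptions) is a single operator that keeps the lookup logic in NumPy and converts to the framework's tensor type only at the boundary:

```python
import numpy as np

def _to_framework(array, framework):
    """Convert a NumPy result to the requested framework's tensor type.
    Only the NumPy path runs here; the TF/Torch branches illustrate how
    a merged operator could defer framework-specific work (assumption)."""
    if framework == "numpy":
        return array
    if framework == "tensorflow":
        import tensorflow as tf  # imported lazily, only if requested
        return tf.convert_to_tensor(array)
    if framework == "torch":
        import torch
        return torch.from_numpy(array)
    raise ValueError(f"unknown framework: {framework}")

class UnifiedEmbeddingOperator:
    """Hypothetical single operator replacing the per-framework variants:
    shared lookup logic, framework chosen by a constructor argument."""
    def __init__(self, embeddings, framework="numpy"):
        self.embeddings = np.asarray(embeddings)
        self.framework = framework
    def transform(self, ids):
        return _to_framework(self.embeddings[np.asarray(ids)],
                             self.framework)

op = UnifiedEmbeddingOperator(np.eye(4), framework="numpy")
print(op.transform([1, 3]).shape)  # (2, 4)
```

The trade-off noted in the thread still applies: a single operator is less discoverable per framework, which is why the PR keeps them separate for now.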



@pytest.fixture(scope="session")
def rev_embedding_ids(embedding_ids, tmpdir_factory):
Collaborator Author

rev_embedding_ids is used to ensure that id_lookup works correctly: here the indexes are reversed ([99999:1]), whereas in embedding_ids above they are [1:99999]. This lets us use the enumeration of batches to pull out the values that should be in the embeddings and assert they match what came back in each batch.
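The check described above can be sketched like this (a toy illustration with made-up small sizes; the real fixture uses ids up to 99999, and the variable names are assumptions):

```python
import numpy as np

num_ids, dim, batch_size = 10, 4, 3
embedding_ids = np.arange(num_ids)     # forward ids: [0, 1, ..., 9]
rev_ids = embedding_ids[::-1]          # reversed:    [9, 8, ..., 0]
table = np.random.rand(num_ids, dim)   # embedding rows indexed by id

for i, start in enumerate(range(0, num_ids, batch_size)):
    batch_ids = rev_ids[start:start + batch_size]
    looked_up = table[batch_ids]
    # enumeration gives the forward positions of this batch; reversing
    # them derives the ids it should contain, independently of the
    # slicing above, so the returned rows can be checked against the table
    forward = np.arange(start, min(start + batch_size, num_ids))
    assert np.array_equal(looked_up, table[num_ids - 1 - forward])
```

Because the ids are reversed, a lookup that silently ignored id_lookup and used positional order would return the wrong rows and fail the assertion.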

@jperez999
Collaborator Author

The repository moved from private to public, so this PR needs a new fork.

@jperez999 jperez999 closed this Nov 3, 2022
@jperez999 jperez999 mentioned this pull request Nov 3, 2022