Correct Integration of TT-Embedding to DLRM #10

TimJZ opened this issue Feb 13, 2021 · 11 comments


TimJZ commented Feb 13, 2021

Hi, I'm currently trying to integrate TT-Embedding into the original DLRM code base, and I've successfully reproduced the result shown in the README. However, I'm not quite sure what the essential changes are.

Right now I'm replacing the original EmbeddingBag (within create_emb in dlrm_s_pytorch.py) in DLRM with TTEmbeddingBag, but I'm having trouble figuring out the correct parameters for it. The parameters I'm using right now are:

    EE = TTEmbeddingBag(
        n,
        m,
        tt_ranks=[12, 14],
        sparse=False,
        use_cache=False,
        weight_dist="uniform",
    )

I left tt_p_shapes and tt_q_shapes blank since each layer's embedding dimension and number of embeddings are different.
The paper mentions that the TT-ranks used were [8, 16, 32, 64], but I wasn't able to use that setting, since it fails the assertion len(self.tt_p_shapes) <= 4. Therefore I used the same parameters as the example ([12, 14]).

This results in a CUDA illegal memory access error at line 174 of tt_embeddings_ops.py. The full error message is attached below:

Traceback (most recent call last):
  File "dlrm_s_pytorch.py", line 1013, in <module>
    Z = dlrm_wrap(X, lS_o, lS_i, use_gpu, device)
  File "dlrm_s_pytorch.py", line 866, in dlrm_wrap
    return dlrm(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "dlrm_s_pytorch.py", line 385, in forward
    return self.parallel_forward(dense_x, lS_o, lS_i)
  File "dlrm_s_pytorch.py", line 470, in parallel_forward
    ly = self.apply_emb(lS_o, lS_i, self.emb_l)
  File "dlrm_s_pytorch.py", line 328, in apply_emb
    V = E(sparse_index_group_batch, sparse_offset_group_batch)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/dlrm/tt_embeddings_ops.py", line 801, in forward
    output = TTLookupFunction.apply(
  File "/mnt/dlrm/tt_embeddings_ops.py", line 174, in forward
    output = tt_embeddings.tt_forward(
RuntimeError: CUDA error: an illegal memory access was encountered
  1. I think this is caused by incorrect parameters, and I'm wondering if anyone could help me out here.
  2. I'm also wondering whether any additional changes need to be made to DLRM other than replacing the EmbeddingBag.

Thanks!

@bilgeacun (Contributor)

Hi @TimJZ,

  1. In DLRM, you can decompose an embedding table of size 9994222 x 64 as follows:
    9994222 < 200 * 200 * 250 (tt_p_shapes)
    64 = 4 * 4 * 4 (tt_q_shapes)

Hence the shapes of the three tensor cores would be:
(1, 200, 4, R1), (R1, 200, 4, R2), and (R2, 250, 4, 1)

Specifying tt_p_shapes and tt_q_shapes is optional; the library will find values for them automatically. You do need to specify the R1 and R2 values as tt_ranks. In the paper, we set the ranks to 8, 16, 32, and 64 one at a time (with R1 = R2), i.e. [8,8], [16,16], etc. Can you try these arguments? (There is a sketch with these shapes spelled out right after this list.)

  2. No additional changes are needed.
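
As a concrete illustration, here is a minimal sketch of that instantiation with the shapes written out explicitly (the positional argument order n, m, tt_ranks, tt_p_shapes, tt_q_shapes follows the patch later in this thread; treat it as a sketch, not a verified call):

    from tt_embeddings_ops import TTEmbeddingBag

    # 9994222 x 64 table factored as above:
    #   9994222 < 200 * 200 * 250   (tt_p_shapes)
    #   64      = 4 * 4 * 4         (tt_q_shapes)
    EE = TTEmbeddingBag(
        9994222,          # n: number of embeddings
        64,               # m: embedding dimension
        [8, 8],           # tt_ranks: R1 = R2 = 8 (try [16, 16], [32, 32], ... the same way)
        [200, 200, 250],  # tt_p_shapes (optional; None lets the library choose)
        [4, 4, 4],        # tt_q_shapes (optional; None lets the library choose)
        sparse=False,
        use_cache=False,
        weight_dist="approx-normal",
    )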


TimJZ commented Feb 19, 2021

Thank you very much for your response! I've tried the parameters you mentioned, but I'm still getting the same error. I suspect this is a version-specific error. Could you please tell me which version of DLRM you were using when testing TT-Embedding? Thanks!


bilgeacun commented Feb 22, 2021

For the latest version of @facebookresearch/DLRM (1302c71624fa9dbe7f0c75fea719d5e58d33e059), this patch made it work for me:

+from tt_embeddings_ops import TTEmbeddingBag
+
 # from torchviz import make_dot
 # import torch.nn.functional as Functional
 # from torch.nn.parameter import Parameter
@@ -243,7 +247,14 @@ class DLRM_Net(nn.Module):
             n = ln[i]
 
             # construct embedding operator
-            if self.qr_flag and n > self.qr_threshold:
+            if True:
+                EE = TTEmbeddingBag(n, m, [8,8],
+                        None, None,
+                        sparse=False,
+                        weight_dist="approx-normal",
+                        use_cache=False)
+            # construct embedding operator
+            elif self.qr_flag and n > self.qr_threshold:
                 EE = QREmbeddingBag(
                     n,
                     m,
@@ -407,14 +418,24 @@ class DLRM_Net(nn.Module):
             # We are using EmbeddingBag, which implicitly uses sum operator.
             # The embeddings are represented as tall matrices, with sum
             # happening vertically across 0 axis, resulting in a row vector
-            # E = emb_l[k]
+            E = emb_l[k]
 
             if v_W_l[k] is not None:
                 per_sample_weights = v_W_l[k].gather(0, sparse_index_group_batch)
             else:
                 per_sample_weights = None
 
-            if self.quantize_emb:
+            if (isinstance(E, TTEmbeddingBag)):
+                l = sparse_index_group_batch.shape[0]
+                ll = torch.empty(1, dtype=torch.long)
+                ll[0]=l
+                if (sparse_offset_group_batch.is_cuda):
+                    ll = ll.to(torch.device("cuda"))
+                sparse_offset = torch.cat((sparse_offset_group_batch, ll), dim=0)
+
+                V = E(sparse_index_group_batch,sparse_offset)
+                ly.append(V)
+            elif self.quantize_emb:
                 s1 = self.emb_l_q[k].element_size() * self.emb_l_q[k].nelement()
                 s2 = self.emb_l_q[k].element_size() * self.emb_l_q[k].nelement()
                 print("quantized emb sizes:", s1, s2)

And I ran it with a command like this:
python dlrm_s_pytorch.py --use-gpu --arch-sparse-feature-size=16 --arch-mlp-bot="13-512-256-64-16" --arch-mlp-top="512-256-1" --data-generation=dataset --data-set=kaggle --raw-data-file=./input/train.txt --processed-data-file=./input/kaggleAdDisplayChallenge_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.05 --mini-batch-size=128 --print-freq=1024 --print-time --test-freq=102400 --test-num-workers=16

Note that this makes all embeddings TT embeddings; you can make only some of them TT by changing the if True statement above (a sketch of that follows below).
Could you try this and see if it works for you?
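
A minimal sketch of what gating TT on table size could look like inside create_emb, in place of the if True above (TT_ROW_THRESHOLD is purely illustrative, not a tuned or recommended value; n, m, and nn come from the surrounding dlrm_s_pytorch.py code):

    TT_ROW_THRESHOLD = 1_000_000  # illustrative cutoff only

    if n > TT_ROW_THRESHOLD:
        # large table: compress with TT
        EE = TTEmbeddingBag(n, m, [8, 8],
                            None, None,
                            sparse=False,
                            weight_dist="approx-normal",
                            use_cache=False)
    else:
        # small table: keep DLRM's stock embedding (or its QR/quantized branches)
        EE = nn.EmbeddingBag(n, m, mode="sum", sparse=True)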

@latifisalar

Hi, I have been facing the "RuntimeError: CUDA error: an illegal memory access was encountered" error as well.
I am trying to train the DLRM model on the Terabyte dataset. I have made the changes you mentioned in the previous post, but I am hitting the same error in the sequential_forward function.

I have tried PyTorch 1.8.0 with CUDA 11.0 and 10.2; both result in the same error:

Traceback (most recent call last):
  File "dlrm_s_pytorch_ttemb.py", line 1891, in <module>
    run()
  File "dlrm_s_pytorch_ttemb.py", line 1570, in run
    ndevices=ndevices,
  File "dlrm_s_pytorch_ttemb.py", line 142, in dlrm_wrap
    return dlrm(X.to(device), lS_o, lS_i)
  File "/home/salar/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "dlrm_s_pytorch_ttemb.py", line 529, in forward
    return self.sequential_forward(dense_x, lS_o, lS_i)
  File "dlrm_s_pytorch_ttemb.py", line 601, in sequential_forward
    ly = self.apply_emb(lS_o, lS_i, self.emb_l, self.v_W_l)
  File "dlrm_s_pytorch_ttemb.py", line 430, in apply_emb
    ll = ll.to(d)
RuntimeError: CUDA error: an illegal memory access was encountered
done

Here is the command I used:
python3 dlrm_s_pytorch_ttemb.py --arch-sparse-feature-size=64 --arch-mlp-bot="13-512-256-64" --arch-mlp-top="512-512-256-1" --max-ind-range=10000000 --data-generation=dataset --data-set=terabyte --processed-data-file=/data4/salar/terabyte/terabyte_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --print-freq=1024 --print-time --test-mini-batch-size=4096 --test-num-workers=16 --use-gpu --test-freq=10240 --memory-map --data-sub-sample-rate=0.875 --raw-data-file=/data4/salar/terabyte/day --mini-batch-size=2048 --mlperf-logging

I was printing the contents of sparse_index_group_batch, sparse_offset_group_batch, and the embedding output to see what the possible issue could be, and I observed that the error occurs when the current batch has lots of zero values in the sparse_index_group_batch tensor. Not sure if it's related, but I wanted to mention it in case it helps.

I would really appreciate it if you could help me figure out what the possible issue could be.

Thanks

@latifisalar

Update: the illegal memory access was being caused by the smaller embedding tables, which have very low numbers of entries. By applying TTEmbedding only to the bigger embeddings, I was able to get through the apply_emb function. However, I am now facing a new issue when calling torch.cuda.synchronize():
Traceback (most recent call last):
  File "dlrm_s_pytorch_ttemb.py", line 1900, in <module>
    run()
  File "dlrm_s_pytorch_ttemb.py", line 1549, in run
    current_time = time_wrap(use_gpu)
  File "dlrm_s_pytorch_ttemb.py", line 122, in time_wrap
    torch.cuda.synchronize()
  File "/home/salar/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 402, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: device-side assert triggered

Would it be possible to point me to how to reproduce the exact results reported in the paper for the Terabyte dataset?

Thanks


TimJZ commented Mar 20, 2021

For the latest version of @facebookresearch/DLRM (1302c71624fa9dbe7f0c75fea719d5e58d33e059), this patch made it work for me: [quoted patch and command omitted; see the comment above]
Could you try this and see if it works for you?

Thank you very much for your reply!
I've tried it with the patch applied and got the following error:

Traceback (most recent call last):
  File "dlrm_s_pytorch.py", line 1887, in <module>
    run()
  File "dlrm_s_pytorch.py", line 1566, in run
    ndevices=ndevices,
  File "dlrm_s_pytorch.py", line 138, in dlrm_wrap
    return dlrm(X.to(device), lS_o, lS_i)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "dlrm_s_pytorch.py", line 534, in forward
    return self.parallel_forward(dense_x, lS_o, lS_i)
  File "dlrm_s_pytorch.py", line 688, in parallel_forward
    ly = self.apply_emb(lS_o, lS_i, self.emb_l, self.v_W_l)
  File "dlrm_s_pytorch.py", line 432, in apply_emb
    sparse_offset = torch.cat((sparse_offset_group_batch, ll), dim=0)
RuntimeError: All input tensors must be on the same device. Received cuda:1 and cuda:0 

Could you please give me some insight into what might be going wrong?

I'm using PyTorch 1.6.0a0+9907a3e and CUDA 11.0.167.

Since PyTorch 1.6 has no appropriate API for float gpuAtomicAdd(&cache_optimizer_state[idx], g_avg_square) (line 1711 in tt_embeddings_cuda.cu), I've used the void version of the function and assigned the calculated value to old_sum_square_grads after the function call. I don't think this is the source of the error, though.

Thanks!


bilgeacun commented Mar 20, 2021

Received cuda:1 and cuda:0

@TimJZ it looks like you are using two devices. We have only tested DLRM on a single GPU so far; it should fit on a single device with 16 GB of memory when training DLRM with the Terabyte and Kaggle datasets. Can you try running on a single device (i.e. by setting export CUDA_VISIBLE_DEVICES=0)?
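
(For completeness: the mismatch comes from ll = ll.to(torch.device("cuda")) in the patch, which always resolves to cuda:0 even when sparse_offset_group_batch lives on another GPU. An untested sketch of building the length tensor directly on the offsets' device, which only addresses the torch.cat error and says nothing about whether the TT kernels themselves support multi-GPU:)

    # Untested sketch: allocate the extra offset on the same device as the offsets
    # instead of the default "cuda" (which resolves to cuda:0).
    ll = torch.empty(1, dtype=torch.long, device=sparse_offset_group_batch.device)
    ll[0] = sparse_index_group_batch.shape[0]
    sparse_offset = torch.cat((sparse_offset_group_batch, ll), dim=0)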


TimJZ commented Mar 27, 2021

Received cuda:1 and cuda:0

@TimJZ it looks like you are using two devices. We have only tested DLRM on a single GPU so far; it should fit on a single device with 16 GB of memory when training DLRM with the Terabyte and Kaggle datasets. Can you try running on a single device (i.e. by setting export CUDA_VISIBLE_DEVICES=0)?

I've tried it on a single GPU, but I'm consistently getting an illegal memory access error after the loop runs 6 times:

Traceback (most recent call last):
  File "dlrm_s_pytorch.py", line 1888, in <module>
    run()
  File "dlrm_s_pytorch.py", line 1567, in run
    ndevices=ndevices,
  File "dlrm_s_pytorch.py", line 138, in dlrm_wrap
    return dlrm(X.to(device), lS_o, lS_i)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "dlrm_s_pytorch.py", line 532, in forward
    return self.sequential_forward(dense_x, lS_o, lS_i)
  File "dlrm_s_pytorch.py", line 604, in sequential_forward
    ly = self.apply_emb(lS_o, lS_i, self.emb_l, self.v_W_l)
  File "dlrm_s_pytorch.py", line 431, in apply_emb
    ll = ll.to(torch.device("cuda"))
RuntimeError: CUDA error: an illegal memory access was encountered

The GPU I'm using is a Tesla V100-SXM2 with 32 GB of memory.

@latifisalar

If you have --mlperf-logging in your arguments, remove it. I was facing the same issue, and it seems to be caused by enabling MLPerf logging.


TimJZ commented Mar 27, 2021

If you have --mlperf-logging in your arguments, remove it. I was facing the same issue, and it seems to be caused by enabling MLPerf logging.

I actually did not use --mlperf-logging, but thanks for the feedback! I'm wondering if it's because I was using the mlperf binloader.


TimJZ commented Apr 11, 2021

@latifisalar @bilgeacun
I'm wondering if you have any updates on this issue? Thanks!
