Debertav2 debertav3 TPU : socket closed #18276

Closed
2 of 4 tasks
Shiro-LK opened this issue Jul 25, 2022 · 3 comments

@Shiro-LK

System Info

  • transformers version: 4.20.1
  • Platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.13
  • Huggingface_hub version: 0.8.1
  • PyTorch version (GPU?): 1.12.0+cu113 (False)
  • Tensorflow version (GPU?): 2.8.2 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: yes, TPU (TF 2.8.2 on Colab / TF 2.4 on Kaggle; TPU v2 and v3)

Who can help?

@Rocketknight1

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I tried to launch a script for a simple classification problem but got the error "socket closed". I tried with DeBERTa small and base, so I doubt it is a memory error. Moreover, I tried on both Kaggle (TPU v3) and Colab (TPU v2). The same script with a RoBERTa base model works perfectly fine. The sequence length I used was 128.

I created the model with:

def get_model() -> tf.keras.Model:
    backbone = TFAutoModel.from_pretrained(cfg.model_name)
    input_ids = tf.keras.layers.Input(
        shape=(cfg.max_length,),
        dtype=tf.int32,
        name="input_ids",
    )
    attention_mask = tf.keras.layers.Input(
        shape=(cfg.max_length,),
        dtype=tf.int32,
        name="attention_mask",
    )

    x = backbone({"input_ids": input_ids, "attention_mask": attention_mask})[0]
    x = x[:, 0, :]  # first-token pooling  # tf.concat([, feature], axis=1)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid", dtype="float32")(x)
    return tf.keras.Model(
        inputs=[input_ids, attention_mask],
        outputs=outputs,
    )
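For reference, the pooling pattern in get_model (take the first token's hidden state, then a sigmoid head) can be exercised without downloading a checkpoint by swapping the backbone for a stand-in embedding layer. A minimal sketch; MAX_LENGTH, the vocabulary size, and the hidden size are hypothetical placeholders for cfg:

```python
import tensorflow as tf

MAX_LENGTH = 128  # stands in for cfg.max_length

input_ids = tf.keras.layers.Input(shape=(MAX_LENGTH,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.layers.Input(shape=(MAX_LENGTH,), dtype=tf.int32, name="attention_mask")

# Stand-in for the TFAutoModel backbone: produces (batch, seq_len, hidden).
hidden = tf.keras.layers.Embedding(30522, 64)(input_ids)
cls = hidden[:, 0, :]  # first-token ("[CLS]") pooling, as in get_model above
outputs = tf.keras.layers.Dense(1, activation="sigmoid", dtype="float32")(cls)
model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)
```

This builds a model with the same input/output signature, so the data pipeline and training loop can be debugged in isolation from the transformer itself.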

It also seems that the embedding layer is not compatible with bfloat16:

InvalidArgumentError: Exception encountered when calling layer "embeddings" (type TFDebertaV2Embeddings).

cannot compute Mul as input #1(zero-based) was expected to be a bfloat16 tensor but is a float tensor

https://colab.research.google.com/drive/1T4GGCfYy7lAFrgapOtY0KBXPcnEPeTQz?usp=sharing
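The Mul dtype error above can be reproduced outside DeBERTa: under a bfloat16 mixed-precision policy the embedding output is bfloat16, and multiplying it by a float32 tensor (such as a mask built with the default dtype) fails unless that tensor is cast first. A minimal sketch, assuming TF 2.x; the layer sizes and mask values are illustrative only:

```python
import tensorflow as tf

# Compute in bfloat16, as on TPU; variables stay float32.
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

embeddings = tf.keras.layers.Embedding(100, 8)   # illustrative sizes
ids = tf.constant([[1, 2, 3]])
mask = tf.constant([[1.0, 1.0, 0.0]])            # float32 by default

x = embeddings(ids)                              # bfloat16 under the policy
# x * tf.expand_dims(mask, -1)                   # raises: cannot compute Mul ... expected bfloat16
y = x * tf.cast(tf.expand_dims(mask, -1), x.dtype)  # casting the mask avoids the error

tf.keras.mixed_precision.set_global_policy("float32")  # reset for anything that follows
```

This suggests the failure comes from a float32 tensor created inside the embeddings layer rather than from the Embedding op itself.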

Expected behavior

Regular training, as with RoBERTa. On GPU, the same script works and uses 3 or 4 GB of memory.

@Shiro-LK Shiro-LK added the bug label Jul 25, 2022
@Rocketknight1
Member

Hi @Shiro-LK, we're seeing other reports of issues with DeBERTa running slowly on TPU with TF - see #18239. I'm not sure what the cause of the "socket closed" error is, though; the other user got it to run, but saw a large slowdown in one of the layers.

@Shiro-LK
Author

@Rocketknight1 Thanks for the reply. Yes, I have just looked at it, but it does not seem to use the Keras "model.fit" function, so I wonder if that's the issue.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Sep 1, 2022