Debertav2 debertav3 TPU : socket closed #18276

Closed
2 of 4 tasks
Shiro-LK opened this issue Jul 25, 2022 · 3 comments

@Shiro-LK

System Info

  • transformers version: 4.20.1
  • Platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.13
  • Huggingface_hub version: 0.8.1
  • PyTorch version (GPU?): 1.12.0+cu113 (False)
  • Tensorflow version (GPU?): 2.8.2 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: yes, TPU (TF 2.8.2 on Colab / TF 2.4 on Kaggle; TPU v2 and v3)

Who can help?

@Rocketknight1

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I tried to launch a script for a simple classification problem but got the error "socket closed". I tried with DeBERTa small and base, so I doubt it is a memory error. Moreover, I tried on both Kaggle (TPU v3) and Colab (TPU v2). The same script with a RoBERTa base model works perfectly fine. The sequence length I used was 128.

I created the model with:

def get_model() -> tf.keras.Model:
    backbone = TFAutoModel.from_pretrained(cfg.model_name)
    input_ids = tf.keras.layers.Input(
        shape=(cfg.max_length,),
        dtype=tf.int32,
        name="input_ids",
    )
    attention_mask = tf.keras.layers.Input(
        shape=(cfg.max_length,),
        dtype=tf.int32,
        name="attention_mask",
    )

    x = backbone({"input_ids": input_ids, "attention_mask": attention_mask})[0]
    x = x[:, 0, :]  # first-token pooling  # tf.concat([, feature], axis=1)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid", dtype="float32")(x)
    return tf.keras.Model(
        inputs=[input_ids, attention_mask],
        outputs=outputs,
    )
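For reference, the pooling pattern in get_model (take the first token's hidden state, then a sigmoid head) can be exercised without downloading a checkpoint by swapping the backbone for a stand-in embedding layer. A minimal sketch; MAX_LENGTH, the vocabulary size, and the hidden size are hypothetical placeholders for cfg:

```python
import tensorflow as tf

MAX_LENGTH = 128  # stands in for cfg.max_length

input_ids = tf.keras.layers.Input(shape=(MAX_LENGTH,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.layers.Input(shape=(MAX_LENGTH,), dtype=tf.int32, name="attention_mask")

# Stand-in for the TFAutoModel backbone: produces (batch, seq_len, hidden).
hidden = tf.keras.layers.Embedding(30522, 64)(input_ids)
cls = hidden[:, 0, :]  # first-token ("[CLS]") pooling, as in get_model above
outputs = tf.keras.layers.Dense(1, activation="sigmoid", dtype="float32")(cls)
model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)
```

This builds a model with the same input/output signature, so the data pipeline and training loop can be debugged in isolation from the transformer itself.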

It also seems that the embedding layer is not compatible with bfloat16:

InvalidArgumentError: Exception encountered when calling layer "embeddings" (type TFDebertaV2Embeddings).

cannot compute Mul as input #1(zero-based) was expected to be a bfloat16 tensor but is a float tensor

https://colab.research.google.com/drive/1T4GGCfYy7lAFrgapOtY0KBXPcnEPeTQz?usp=sharing
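The Mul dtype error above can be reproduced outside DeBERTa: under a bfloat16 mixed-precision policy the embedding output is bfloat16, and multiplying it by a float32 tensor (such as a mask built with the default dtype) fails unless that tensor is cast first. A minimal sketch, assuming TF 2.x; the layer sizes and mask values are illustrative only:

```python
import tensorflow as tf

# Compute in bfloat16, as on TPU; variables stay float32.
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

embeddings = tf.keras.layers.Embedding(100, 8)   # illustrative sizes
ids = tf.constant([[1, 2, 3]])
mask = tf.constant([[1.0, 1.0, 0.0]])            # float32 by default

x = embeddings(ids)                              # bfloat16 under the policy
# x * tf.expand_dims(mask, -1)                   # raises: cannot compute Mul ... expected bfloat16
y = x * tf.cast(tf.expand_dims(mask, -1), x.dtype)  # casting the mask avoids the error

tf.keras.mixed_precision.set_global_policy("float32")  # reset for anything that follows
```

This suggests the failure comes from a float32 tensor created inside the embeddings layer rather than from the Embedding op itself.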

Expected behavior

Regular training, as with RoBERTa. On GPU, the same script works and uses 3 or 4 GB of memory.

@Shiro-LK Shiro-LK added the bug label Jul 25, 2022
@Rocketknight1
Member

Hi @Shiro-LK, we're seeing other reports of issues with DeBERTa running slowly on TPU with TF - see #18239. I'm not sure what the cause of the "socket closed" error is, though; the other user got it to run, but saw a large slowdown in one of the layers.

@Shiro-LK
Author

@Rocketknight1 Thanks for the reply. Yes, I have just looked at it, but it does not seem to use the Keras "model.fit" function, so I wonder if that's the issue.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Sep 1, 2022