I attempted to reproduce the experiment in Section 5.2.1 of the paper, where DeepONet is used to solve the advection equation. I used the same model architecture and training parameters as described in the paper, with a trunk net of size 4×512 and a branch net of size 2×512, trained for 250,000 iterations. However, the training time and memory usage differ significantly from the results in the paper. With mixed-precision training, the training time was 2680.338209 seconds and the memory usage was 736 MB; with fp32 training, the training time was 3127.268575 seconds and the memory usage was 736 MB. My environment is TensorFlow 2.13.1 and DeepXDE 1.10.1, and I trained on an NVIDIA GeForce RTX 3090 GPU.

When I ran advec_mixed_prec.py, I encountered the error 'The global policy can only be set in TensorFlow 2 or if V2 dtype behavior has been set. To enable V2 dtype behavior, call "tf.compat.v1.keras.layers.enable_v2_dtype_behavior()".' I therefore added tf.compat.v1.keras.layers.enable_v2_dtype_behavior() before policy = mixed_precision.Policy('mixed_float16'). Apart from this and the training parameter settings in the main function, advec_mixed_prec.py and Advection.py are identical to the versions on GitHub. Below is the mixed-precision setup from the main function of the code I used for training.
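This is a minimal sketch of just the policy-related lines; the model construction and training calls are omitted, and the mixed_precision import path and the set_global_policy call are inferred from the error message rather than copied verbatim.

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision  # import path assumed

# Workaround added before creating the policy, as suggested by the error:
tf.compat.v1.keras.layers.enable_v2_dtype_behavior()

policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_global_policy(policy)

# Under mixed_float16, computations run in float16 while variables stay float32.
print(policy.compute_dtype, policy.variable_dtype)  # float16 float32
```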
Hello @zhuyy0810, thank you for bringing this up. I just reran the Advection.py code and was able to reproduce the results for runtime and memory. For advec_mixed_prec.py, there were some issues in the code regarding the floating-point precision of the inputs, which I have fixed, and I can now reproduce the results.
The code has to be run with TensorFlow 2 to get the reported results. The error you provided shows that either you did not use TF2 or something is wrong with your TF2 setup.
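As a quick check, something like the sketch below should confirm that TF2 behavior is active before the policy is set; the DeepXDE backend note is an assumption about how you run the scripts.

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

print(tf.__version__)          # should be 2.x
print(tf.executing_eagerly())  # True under TF2 behavior; False indicates
                               # TF1-style graph mode is in effect

# Assumption: if the scripts run through DeepXDE with the tensorflow.compat.v1
# backend, TF1 dtype behavior applies and set_global_policy() raises the error
# quoted above; selecting the TF2 backend (DDE_BACKEND=tensorflow) avoids it.
print(mixed_precision.global_policy())  # currently active dtype policy
```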
The time needed to run this code on an NVIDIA GeForce RTX 3090 GPU is definitely much less than what you reported. Please try again and let me know.