Training on Titanic has a NaN loss #615
Comments
Here is a code snippet returning a NaN loss during training:

```ts
const serverUrl = new URL('http://localhost:8080/')
const tasks = await fetchTasks(serverUrl)
const task = tasks.get('titanic') as Task
const dataset = await loadTitanicData(task)
const model = await getModel()

model.compile({
  optimizer: 'sgd',
  loss: 'categoricalCrossentropy',
  metrics: ['accuracy']
})

await model.fitDataset(dataset.train.preprocess().batch().dataset, { epochs: 1 })
```
This commit refactored the data preprocessing structure and dropped support for tabular data preprocessing.
Though the text loader is based on the CSV (tabular) loader, shouldn't it still work?
I didn't see any dependency between text and tabular preprocessing. From what I've found, the text preprocessing is standalone (it only tokenizes and handles padding), while the tabular preprocessing doesn't have any function implemented (text and image do have some).

I managed to solve the issue by implementing a very temporary preprocessing step that handles the missing values which were causing the NaNs. Since the last preprocessing refactor, preprocessing only handles data row by row, which makes it very impractical to drop rows or to use dataset-wide aggregations (to standardize features, for example).

The model training also made the weights diverge to NaN in some cases because the default learning rate was too high; I lowered it from 0.01 to 0.001.
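For reference, here is a minimal sketch of what such a row-wise sanitization step could look like, assuming rows arrive as plain objects of numeric fields; the function name and the zero-fill default are illustrative, not Disco's actual implementation:

```ts
// Hypothetical row-wise sanitization: replace missing or NaN numeric fields
// so they don't propagate into the loss. Zero-filling is only an example;
// imputing a column mean would require a dataset-wide pass, which the
// row-by-row preprocessing described above makes impractical.
function sanitizeRow (row: Record<string, number | undefined>): Record<string, number> {
  const clean: Record<string, number> = {}
  for (const [key, value] of Object.entries(row)) {
    clean[key] = (value === undefined || Number.isNaN(value)) ? 0 : value
  }
  return clean
}

// Applied row by row, e.g.: rawDataset.map(row => sanitizeRow(row))
```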
Could this be related to the issue I had when training on bigger GPT models, such as GPT-2 and above, or did the sanitization preprocessing step entirely fix it? Even outside of Disco, I would consistently get NaN loss values.
@peacefulotter the preprocessing was not enough; I also had to decrease the learning rate, which was making the weights diverge. Have you tried fine-tuning the learning rate? I saw papers estimating the learning rate proportionally to the model's number of weights.
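For what it's worth, in TF.js the learning rate can be set explicitly by passing an optimizer instance instead of the `'sgd'` string; a minimal sketch (the layer sizes are placeholders, only the compile step matters):

```ts
import * as tf from '@tensorflow/tfjs'

// Placeholder model; only the optimizer setting is the point here.
const model = tf.sequential()
model.add(tf.layers.dense({ inputShape: [8], units: 16, activation: 'relu' }))
model.add(tf.layers.dense({ units: 2, activation: 'softmax' }))

model.compile({
  optimizer: tf.train.sgd(0.001), // explicit rate, lowered from 0.01
  loss: 'categoricalCrossentropy',
  metrics: ['accuracy']
})
```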
When training on the titanic task (and potentially other tasks), the loss is NaN at every epoch. The output of `model.predict` is only composed of NaN values, so there are no accuracy improvements throughout the epochs. This happens for every training format: the browser UI, local, federated, etc.
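One illustrative way to confirm the symptom is to count NaN entries in the model output; a sketch using the TF.js API (the helper name is made up):

```ts
import * as tf from '@tensorflow/tfjs'

// Counts NaN entries in a model's predictions for a given input batch.
function countNaNPredictions (model: tf.LayersModel, inputs: tf.Tensor): number {
  const predictions = model.predict(inputs) as tf.Tensor
  return tf.isNaN(predictions).cast('int32').sum().dataSync()[0]
}
```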