
Dataset Viewer issue for jonas/undp_jobs_raw #731

Closed

AndreaFrancis opened this issue Jan 30, 2023 · 4 comments

Labels: bug (Something isn't working)

@AndreaFrancis (Contributor)

Link

https://huggingface.co/datasets/jonas/undp_jobs_raw

Description

When opening the preview panel, this error message is shown:

'update' command document too large
Error code:   UnexpectedError

Other datasets with the same issue:

  • SamAct/medium_cleaned split train
  • grasshoff/lhc_sents split train
  • Sangmun/wiki_doc_preprocessed_withmaxlength split train
  • DavidVivancos/MindBigData2022_Imagenet_IN splits test and train
  • heanu/soda splits test and validation
  • DavidVivancos/MindBigData_Imagenet_IN split train
  • DavidVivancos/MindBigData2022_Imagenet_IN_Spct split test

While investigating the issue, I saw the following logs:
DEBUG: 2023-01-30 13:16:36,215 - root - the size of the first 10 rows (1087032102) is above the max number of bytes (-7173342), they will be truncated
DEBUG: 2023-01-30 13:16:40,076 - root - the size of the rows is now (11944186) after truncating row idx=0

It shows that the remaining space for the document, without considering the rows, is negative (-7173342), which means that the columns alone already take more space than is accepted.
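To make the arithmetic explicit (a reconstruction from the log line above; the exact accounting inside the server is an assumption):

# Values taken from the DEBUG log above.
first_10_rows_bytes = 1_087_032_102  # size of the first 10 rows
remaining_budget = -7_173_342        # rows_max_bytes - size(response without rows)

# remaining_budget is already negative before a single row is added, so the
# features/columns part of the response alone exceeds the byte cap by ~7 MB.
# With 103630 columns, even a few dozen bytes per feature name and type are
# enough to blow past any reasonable cap.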

When looking at the CSV file for the jonas/undp_jobs_raw dataset, it looks like the dataset has 103630 columns.

>>> import pandas as pd
>>> huge_ds = pd.read_csv('undp_jobs.csv')
>>> len(huge_ds.columns)
103630
>>> huge_ds.head(1)
                                                   0                                                  1  ...           104998                                             104999
0  {'content': ['hiv and sti clinical consultant ...  {'content': ['internship- pacific digital econ...  ...  {'content': []}  {'content': ['deputy head of resident coordina...

[1 rows x 103630 columns]
>>> 

We need to find a way to keep the columns and still store the response in the cache without issues.

@AndreaFrancis AndreaFrancis added the bug Something isn't working label Jan 30, 2023
@AndreaFrancis (Contributor, Author)

#self-assign

@AndreaFrancis (Contributor, Author)

Based on a couple of suggestions from @severo, I thought this could work:

  1. Get the response without rows: {dataset, config, split, features, rows=[]}
  2. If the response size is > rows_max_bytes: take only the first N features (features_max_number)
  3. While the response size is still > rows_max_bytes: drop features progressively from the end until the response size is < rows_max_bytes
  4. If the new response size is < rows_max_bytes: get the first M rows (rows_min_number)
  5. If the response + rows size is > rows_max_bytes: truncate the rows; else, add the remaining rows until the end, or until the bytes threshold is reached
  6. If the response + truncated rows size is > rows_max_bytes: drop rows progressively from the end until the response size is < rows_max_bytes

I know this sounds kind of complicated, but I think it strikes a balance for the goal of showing the dataset preview (features + rows). A sketch of the idea follows.
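Here is a minimal sketch of these steps; it is not the actual implementation. utf8_size, the parameter names, and the shapes of features and rows are assumptions, and the per-cell truncation of step 5 is simplified to dropping whole rows.

import json

def utf8_size(obj):
    # Size of the JSON-serialized payload, as it would be stored in the cache.
    return len(json.dumps(obj).encode("utf-8"))

def shrink_response(dataset, config, split, features, rows,
                    rows_max_bytes, features_max_number, rows_min_number):
    # Step 1: response skeleton without rows.
    response = {"dataset": dataset, "config": config, "split": split,
                "features": [], "rows": []}
    # Step 2: keep at most features_max_number features.
    kept = list(features[:features_max_number])
    response["features"] = kept
    # Step 3: drop features from the end until the row-less response fits.
    while kept and utf8_size(response) > rows_max_bytes:
        kept.pop()
    if not kept:
        return None  # not even a single feature fits in the budget
    # Step 4: take the first rows_min_number rows, projected onto the kept features.
    names = {f["name"] for f in kept}  # assumes each feature is a {"name": ...} dict
    projected = [{k: v for k, v in row.items() if k in names} for row in rows]
    response["rows"] = projected[:rows_min_number]
    # Steps 5 and 6 (simplified): if too big, drop rows from the end;
    # otherwise, keep appending rows until the byte threshold is reached.
    if utf8_size(response) > rows_max_bytes:
        while response["rows"] and utf8_size(response) > rows_max_bytes:
            response["rows"].pop()
    else:
        for row in projected[rows_min_number:]:
            response["rows"].append(row)
            if utf8_size(response) > rows_max_bytes:
                response["rows"].pop()
                break
    return response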

I will prepare a PR proposal for this approach, but please feel free to provide feedback.

@severo (Collaborator)

severo commented Jan 31, 2023

Nice.


Before giving my feedback, I just want to give context and my feelings about the "truncation".

These cases where we cannot even display the first 100 rows, because there are too many columns or because the cells are too heavy, are annoying.

We are truncating the response in a non-standard, ad-hoc way, just to make it manageable for the dataset viewer on the Hub. Moreover, it means that the dataset viewer has to do specific checks and manage these cases. Currently:

  • it displays a message when the number of rows is lower than 100
  • it displays a TRUNCATED... suffix in the cell

We now have to manage the case where the number of columns is too big. I'm afraid that the code, both in /first-rows and in the clients, will become even more complex, because we will have to report how many columns have been truncated (maybe with their types and names, i.e. will we also truncate the "features" part of the response?) and manage this in the viewer.


That's why it made me think a bit: maybe we have to take a step back and rethink this endpoint. It's really aimed at the dataset viewer, and only needed while the parquet files are not available. Later, we will be able to query the contents we need and show any content in the viewer (see #13).

Maybe we could postpone this issue until we have the new version of the viewer, which will support random access to the rows. And meanwhile, just:

  • return 0 rows,
  • or simpler: return an explicit error like: too many columns. The maximum supported number of columns is XXX (see the sketch below)
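A minimal sketch of the second option, assuming a hypothetical MAX_COLUMNS threshold and error class (neither is the actual server configuration):

MAX_COLUMNS = 1_000  # hypothetical threshold, not the real server setting

class TooManyColumnsError(Exception):
    pass

def check_features(features):
    # Fail fast with an explicit, actionable error instead of truncating.
    if len(features) > MAX_COLUMNS:
        raise TooManyColumnsError(
            f"too many columns. The maximum supported number of columns is {MAX_COLUMNS}."
        )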

@AndreaFrancis (Contributor, Author)

The fix was applied to the current version of datasets-server; it worked for the following datasets:

  • jonas/undp_jobs_raw train split
  • DavidVivancos/MindBigData2022_Imagenet_IN test split
  • DavidVivancos/MindBigData_Imagenet_IN train split
  • DavidVivancos/MindBigData_Imagenet_IN train and test split
  • DavidVivancos/MindBigData2022_Imagenet_IN_Spct test split
[Screenshot from 2023-02-08 14-54-10]

For the following ones, the process got stuck in the "started" status, but that is because of the zombie jobs issue:

  • SamAct/medium_cleaned
  • grasshoff/lhc_sents
  • Sangmun/wiki_doc_preprocessed_withmaxlength
  • DavidVivancos/MindBigData2022_Imagenet_IN - train

Those should be retried once #741 is fixed.
