
Dataset Viewer issue for jonas/undp_jobs_raw #731

Closed

AndreaFrancis opened this issue Jan 30, 2023 · 4 comments

Labels: bug (Something isn't working)

@AndreaFrancis (Contributor)

Link

https://huggingface.co/datasets/jonas/undp_jobs_raw

Description

When opening the preview panel, this error message is shown:

'update' command document too large
Error code:   UnexpectedError

Other datasets with the same issue:

  • SamAct/medium_cleaned split train
  • grasshoff/lhc_sents split train
  • Sangmun/wiki_doc_preprocessed_withmaxlength split train
  • DavidVivancos/MindBigData2022_Imagenet_IN splits test and train
  • heanu/soda splits test and validation
  • DavidVivancos/MindBigData_Imagenet_IN split train
  • DavidVivancos/MindBigData2022_Imagenet_IN_Spct split test

While investigating the issue, I saw the following logs:
DEBUG: 2023-01-30 13:16:36,215 - root - the size of the first 10 rows (1087032102) is above the max number of bytes (-7173342), they will be truncated
DEBUG: 2023-01-30 13:16:40,076 - root - the size of the rows is now (11944186) after truncating row idx=0

It shows that the remaining space for the document, without considering the rows, is negative (-7173342), which means that the columns alone already take more space than is accepted.
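To make the arithmetic explicit (a reconstruction from the log line above; the exact accounting inside the server is an assumption):

# Values taken from the DEBUG log above.
first_10_rows_bytes = 1_087_032_102  # size of the first 10 rows
remaining_budget = -7_173_342        # rows_max_bytes - size(response without rows)

# remaining_budget is already negative before a single row is added, so the
# features/columns part of the response alone exceeds the byte cap by ~7 MB.
# With 103630 columns, even a few dozen bytes per feature name and type are
# enough to blow past any reasonable cap.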

When looking at the CSV file for the jonas/undp_jobs_raw dataset, it looks like the dataset has 103630 columns.

>>> import pandas as pd
>>> huge_ds = pd.read_csv('undp_jobs.csv')
>>> len(huge_ds.columns)
103630
>>> huge_ds.head(1)
                                                   0                                                  1  ...           104998                                             104999
0  {'content': ['hiv and sti clinical consultant ...  {'content': ['internship- pacific digital econ...  ...  {'content': []}  {'content': ['deputy head of resident coordina...

[1 rows x 103630 columns]
>>> 

We need to find a way to keep the columns and still store the response in the cache without issues.

@AndreaFrancis AndreaFrancis added the bug Something isn't working label Jan 30, 2023
@AndreaFrancis (Contributor, Author)

#self-assign

@AndreaFrancis (Contributor, Author)

Based on a couple of suggestions from @severo, I thought this could work:

  1. Get the response without rows: {dataset, config, split, features, rows=[]}
  2. If the response size is > rows_max_bytes: take only the first N features (features_max_number)
  3. While the response size is still > rows_max_bytes: drop features progressively from the end until the response size is < rows_max_bytes
  4. If the new response size is < rows_max_bytes: get the first M rows (rows_min_number)
  5. If the response + rows size is > rows_max_bytes: truncate the rows; else, add the remaining rows until the end, or until the bytes threshold is reached
  6. If the response + truncated rows size is > rows_max_bytes: drop rows progressively from the end until the response size is < rows_max_bytes

I know this sounds kind of complicated, but I think it strikes a balance for the goal of showing the dataset preview (features + rows). A sketch of the idea follows.
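Here is a minimal sketch of these steps; it is not the actual implementation. utf8_size, the parameter names, and the shapes of features and rows are assumptions, and the per-cell truncation of step 5 is simplified to dropping whole rows.

import json

def utf8_size(obj):
    # Size of the JSON-serialized payload, as it would be stored in the cache.
    return len(json.dumps(obj).encode("utf-8"))

def shrink_response(dataset, config, split, features, rows,
                    rows_max_bytes, features_max_number, rows_min_number):
    # Step 1: response skeleton without rows.
    response = {"dataset": dataset, "config": config, "split": split,
                "features": [], "rows": []}
    # Step 2: keep at most features_max_number features.
    kept = list(features[:features_max_number])
    response["features"] = kept
    # Step 3: drop features from the end until the row-less response fits.
    while kept and utf8_size(response) > rows_max_bytes:
        kept.pop()
    if not kept:
        return None  # not even a single feature fits in the budget
    # Step 4: take the first rows_min_number rows, projected onto the kept features.
    names = {f["name"] for f in kept}  # assumes each feature is a {"name": ...} dict
    projected = [{k: v for k, v in row.items() if k in names} for row in rows]
    response["rows"] = projected[:rows_min_number]
    # Steps 5 and 6 (simplified): if too big, drop rows from the end;
    # otherwise, keep appending rows until the byte threshold is reached.
    if utf8_size(response) > rows_max_bytes:
        while response["rows"] and utf8_size(response) > rows_max_bytes:
            response["rows"].pop()
    else:
        for row in projected[rows_min_number:]:
            response["rows"].append(row)
            if utf8_size(response) > rows_max_bytes:
                response["rows"].pop()
                break
    return response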

I will prepare a PR proposal for this approach, but please feel free to provide feedback.

@severo (Collaborator)

severo commented Jan 31, 2023

Nice.


Before giving my feedback, I just want to give context and my feelings about the "truncation".

These cases where we cannot even display the first 100 rows, because there are too many columns or because the cells are too heavy, are annoying.

We are truncating the response in a non-standard, ad-hoc way, just to make it manageable for the dataset viewer on the Hub. Moreover, it means that the dataset viewer has to do specific checks and manage these cases. Currently:

  • it displays a message when the number of rows is lower than 100
  • it displays a TRUNCATED... suffix in the cell

We now have to manage the case where the number of columns is too big. I'm afraid that the code, both in /first-rows and in the clients, will become even more complex, because we will have to report how many columns have been truncated (maybe with their types and names, i.e. will we also truncate the "features" part of the response?) and manage this in the viewer.


That's why it made me think a bit: maybe we have to take a step back and rethink this endpoint. It's really aimed at the dataset viewer, and only needed while the parquet files are not available. Later, we will be able to query the contents we need and show any content in the viewer (see #13).

Maybe we could postpone this issue until we have the new version of the viewer, which will support random access to the rows. And meanwhile, just:

  • return 0 rows,
  • or simpler: return an explicit error like: too many columns. The maximum supported number of columns is XXX (see the sketch below)
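A minimal sketch of the second option, assuming a hypothetical MAX_COLUMNS threshold and error class (neither is the actual server configuration):

MAX_COLUMNS = 1_000  # hypothetical threshold, not the real server setting

class TooManyColumnsError(Exception):
    pass

def check_features(features):
    # Fail fast with an explicit, actionable error instead of truncating.
    if len(features) > MAX_COLUMNS:
        raise TooManyColumnsError(
            f"too many columns. The maximum supported number of columns is {MAX_COLUMNS}."
        )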

@AndreaFrancis (Contributor, Author)

The fix was applied to the current version of datasets-server; it worked for the following datasets:

  • jonas/undp_jobs_raw train split
  • DavidVivancos/MindBigData2022_Imagenet_IN test split
  • DavidVivancos/MindBigData_Imagenet_IN train split
  • DavidVivancos/MindBigData_Imagenet_IN train and test split
  • DavidVivancos/MindBigData2022_Imagenet_IN_Spct test split
[Screenshot from 2023-02-08 14-54-10]

For the following ones, the process got stuck in the "started" status, but that is because of the zombie jobs issue:

  • SamAct/medium_cleaned
  • grasshoff/lhc_sents
  • Sangmun/wiki_doc_preprocessed_withmaxlength
  • DavidVivancos/MindBigData2022_Imagenet_IN - train

Those should be retried once #741 is fixed.
