Dataset Viewer issue for jonas/undp_jobs_raw #731
Comments
#self-assign
Based on a couple of suggestions from @severo, I thought this could work:
I will prepare a PR proposal for this approach, but please feel free to provide feedback.
Nice. Before giving my feedback, I just want to give some context and my feelings about the "truncation". These cases where we cannot even display the first 100 rows, because there are too many columns or because the cells are too heavy, are annoying. We are truncating the response in a non-standard, ad-hoc way, just to make it manageable for the dataset viewer on the Hub. Worse, it means that the dataset viewer has to do specific checks to handle these cases. Currently:

We now also have to handle the case where the number of columns is too big. I'm afraid that the code, both in /first-rows and in the clients, will become even more complex, because we will have to report how many columns have been truncated (maybe with their types and names; i.e., will we also truncate the "features" part of the response?) and handle this in the viewer (a hypothetical payload shape is sketched after this comment).

That's why it made me think a bit: maybe we have to take a step back and rethink this endpoint. It is really aimed at the dataset viewer, and only needed while the parquet files are not available. Later, we will be able to query the contents we need and show any content in the viewer (see #13). Maybe we could postpone this issue until we ship the new version of the viewer, which will support random access to the rows. And meanwhile, just:
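To make the column-truncation bookkeeping discussed above concrete, here is a minimal sketch of what reporting truncated columns could look like. Everything here is hypothetical: the MAX_COLUMNS cap, the truncate_columns helper, and the num_columns_truncated field are illustrative assumptions, not the actual /first-rows API.

```python
# Hypothetical sketch: none of these names come from the real datasets-server code.
MAX_COLUMNS = 1_000  # assumed cap on how many columns the response keeps

def truncate_columns(features: list[dict], rows: list[dict]) -> dict:
    """Keep the first MAX_COLUMNS columns and report how many were dropped."""
    kept = features[:MAX_COLUMNS]
    kept_names = {feature["name"] for feature in kept}
    return {
        "features": kept,  # the "features" part of the response is truncated too
        "rows": [{k: v for k, v in row.items() if k in kept_names} for row in rows],
        # Extra bookkeeping that every client (the viewer included) must now handle:
        "num_columns_truncated": max(0, len(features) - MAX_COLUMNS),
    }
```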
The fix was applied to the current version of datasets-server; it worked for the following datasets:
For the following ones, the process got stuck in "started" status, but that is because of the zombie jobs issue:
Those should be retried once #741 is fixed.
Link
https://huggingface.co/datasets/jonas/undp_jobs_raw
Description
When opening the preview panel, this message is shown:
Other datasets with the same issue:
While investigating the issue, I saw the following logs:
DEBUG: 2023-01-30 13:16:36,215 - root - the size of the first 10 rows (1087032102) is above the max number of bytes (-7173342), they will be truncated
DEBUG: 2023-01-30 13:16:40,076 - root - the size of the rows is now (11944186) after truncating row idx=0
This shows that the space remaining for the response, before the rows are even considered, is negative (-7173342): the column metadata alone takes more space than the maximum allowed.
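A minimal sketch of how that negative number can arise, assuming the budget for rows is computed by subtracting the serialized size of the column metadata from a fixed response limit (the names and the limit below are assumptions, not the actual datasets-server code):

```python
import json

ROWS_MAX_BYTES = 1_000_000  # hypothetical fixed byte budget for the whole response

def remaining_bytes_for_rows(features: list[dict]) -> int:
    # Whatever the column metadata consumes once serialized is no longer
    # available for rows; with ~100k columns this alone can exceed the budget,
    # making the remaining row budget negative, as seen in the logs above.
    features_size = len(json.dumps(features).encode("utf-8"))
    return ROWS_MAX_BYTES - features_size
```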
Looking at the CSV file for the jonas/undp_jobs_raw dataset, it appears the dataset has 103630 columns.
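For reference, the column count can be confirmed from the header alone, without loading the data (the file name below is a placeholder):

```python
import pandas as pd

# nrows=0 reads only the header row, so even a very wide CSV stays cheap.
header = pd.read_csv("undp_jobs_raw.csv", nrows=0)
print(len(header.columns))  # expected: 103630
```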
We need to find a way to keep the columns and store them in the cache without issues.