Update datasets dependency to 2.13.0 version #1372

albertvillanova · 2023-06-15T09:48:39Z

After the datasets 2.13.0 release, update the dependencies on it.

Note that I have also removed the explicit dependency on datasets from services/api (see commit a2c0cd9).

This is analogous to what was previously done on services/worker.

See discussion: Update datasets dependency to 2.12.0 version #1147 (comment)
See commit: 4163a18

Fix #1370.

codecov-commenter · 2023-06-15T09:53:35Z

Codecov Report

Patch coverage has no change; project coverage change: -0.72% ⚠️

Comparison is base (3dad303) 90.02% compared to head (da071e3) 89.30%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1372      +/-   ##
==========================================
- Coverage   90.02%   89.30%   -0.72%     
==========================================
  Files         184      114      -70     
  Lines       10812     6491    -4321     
==========================================
- Hits         9733     5797    -3936     
+ Misses       1079      694     -385

| Flag | Coverage Δ | |
|---|---|---|
| jobs_cache_maintenance | `99.08% <ø> (ø)` | |
| jobs_mongodb_migration | `83.49% <ø> (ø)` | |
| libs_libcommon | `92.01% <ø> (+0.66%)` | ⬆️ |
| services_admin | `86.05% <ø> (ø)` | |
| services_api | `87.73% <ø> (ø)` | |
| services_worker | `?` | |

Flags with carried forward coverage won't be shown.

see 72 files with indirect coverage changes


severo · 2023-06-15T13:45:34Z

Once deployed, I'll query the database to get the list of datasets that should be refreshed (affected by one of the two bugs fixed by 2.13.0) and launch a refresh for them.

severo · 2023-06-15T14:03:14Z

For huggingface/datasets#5938:

> db.cachedResponsesBlue.aggregate(
    [
        {$match: {kind: "dataset-config-names", error_code: "ConfigNamesError", "details.cause_exception": "OSError"}},
        {$group: {_id: null, datasets: {$addToSet: "$dataset"}}}
    ]
);
{ _id: null,
  datasets: 
   [ 'LambdaTests/VQAv2Validation_ViT_H_14_A_T_C_Q_benchmarks_partition_global_2_10000000',
     'mteb/amazon_reviews_multi',
     'davanstrien/ia-loaded',
     'mask-distilled-one-sec-cv12/chunk_114',
     'results-sd-v1-5-sd-v2-1-if-v1-0-karlo/7b29f2d3',
     'Brizape/Variome_split_0404',
     'james-burton/imdb_gross',
     'results-sd-v1-5-sd-v2-1-if-v1-0-karlo/aaa3977f',
     'Brizape/Variome_ibo',
     'LambdaTests/VQAv2_sample_validation_benchmarks_partition_global_10_loca_2',
     'trojblue/bad_ai',
     'autoevaluate/autoeval-eval-billsum-default-37bdaa-1564755702',
     'CVasNLPExperiments/VQAv2_sample_validation_google_flan_t5_xl_mode_Q_rices_ns_200',
     'omarelsayeed/dddddd',
     'christinacdl/OFF_HATE_TOXIC_ENGLISH',
     'mask-distilled-one-sec-cv12/chunk_137',
     'drumwell/skatedecks',
     'strombergnlp/x-stance',
     'joey234/mmlu-logical_fallacies-rule-neg',
     'GazeLocation/stimuli_data_arrays',
     'tgsc/embedding-data-NQ-train_pairs-pt',
     'LambdaTests/VQAv2Validation_ViT_H_14_A_T_C_Q_benchmarks_partition_global_26_10000000',
     'tomekkorbak/pile-curse-chunk-27',
     'ydmeira/segment-pokemon',
     'foldl/99problems',
     'alzoubi36/privaseer',
     'mHossain/final_train_v2_420000',
     'arbml/sudanese_dialect_speech',
     'autoevaluate/autoeval-staging-eval-project-be45ecbd-7284772',
     'Cainiao-AI/LaDe-D',
     'offchan/fill50k',
     'emilylearning/cond_ft_subreddit_on_reddit__prcnt_100__test_run_False__bert-base-uncased',
     'scholarly360/indian_ipo_prospectus_data',
     'OpenCovenant/fjalet-Albanian-vocabulary',
     'mask-distilled-one-sec-cv12/chunk_58',
     'doushabao4766/wnut_17_ner_k_V3',
     'MicPie/adaptable_cluster27',
     'coeuslearning/product_ads',
     'sajjadrauf/VQA',
     'LangChainDatasets/sql-qa-chinook',
     'LambdaTests/VQAv2_sample_validation_benchmarks_partition_global_12_loca_4',
     'results-sd-v1-5-sd-v2-1-if-v1-0-karlo/0f1659c6',
     'EJaalborg2022/beer_reviews_label_drift_neg',
     'cakiki/dockerfile_paths',
     'Sree1994/babylm_childstories',
     'oyxy2019/BertTokenizer_THUCNews_10000_to_lm_datasets',
     'tsf-tsf/qa',
     'joey234/mmlu-electrical_engineering-neg',
     'autoevaluate/autoeval-eval-lener_br-lener_br-280a5d-1776961679',
     'jlbaker361/avatar-lite_captioned-augmented',
     'roa7n/patched_test_p_40_f_UCH_m1_predictions',
     'keremberke/construction-safety-object-detection',
     'Sampson2022/demo3',
     'pratultandon/tokenized-recipe-nlg-gpt2-ingredients-to-recipe-end',
     'rjac/DepressionDetection',
     'LambdaTests/VQAv2Validation_ViT_H_14_A_T_C_Q_benchmarks_partition_global_4_500',
     'joey234/mmlu-miscellaneous-rule-neg',
     'roa7n/patched_test_p_150_f_UCH_m1_predictions',
     'jet-universe/top_landscape',
     'mask-distilled-one-sec-cv12/chunk_17',
     'IdoAi/FypDatasetWithSplitsRgb',
     'xOrfe/Match3',
     'GV05/dreambooth-hackathon-images',
     'results-sd-v1-5-sd-v2-1-if-v1-0-karlo/2c4acff4',
     'TrainThenObtain-ai/Utra-mini-GPT-4',
     'wurongbo/wurongbo',
     'myradeng/diffusion_db_5k_val_v3',
     'Kartik14Singh/ifd' 
] }

severo · 2023-06-15T14:04:45Z

It should also fix issues where images were shown as {"bytes": ...}, thanks to huggingface/datasets#5921.

I'm not sure how we could query them? @lhoestq ?

lhoestq · 2023-06-15T14:59:38Z

This query seems to include many datasets made of parquet files but without dataset_info:

{"kind": "config-info", "content.error": "Dict key must be str"}

This should include the parquet image datasets like https://huggingface.co/datasets/philippemo/dummy_dataset_without_schema_12_06

lhoestq · 2023-06-15T15:05:40Z

^ we might need to fix the related bug though

Traceback (most recent call last):
  File "/src/services/worker/src/worker/job_manager.py", line 167, in process
    if len(orjson_dumps(content)) > self.worker_config.content_max_bytes:
  File "/src/libs/libcommon/src/libcommon/utils.py", line 79, in orjson_dumps
    return orjson.dumps(content, option=orjson.OPT_UTC_Z, default=orjson_default)
TypeError: Dict key must be str
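
To narrow down which part of a response triggers the `Dict key must be str` error above, one option is to walk the content before serializing it and report every dict key that is not a string. The sketch below is only an illustration: `find_non_str_keys` and the `sample` content are hypothetical and not part of libcommon. orjson also documents an `OPT_NON_STR_KEYS` option, but whether using it is the right fix here is not established in this thread.

```python
from typing import Any, Iterator


def find_non_str_keys(content: Any, path: str = "$") -> Iterator[str]:
    """Yield a path for every dict key in `content` that is not a str."""
    if isinstance(content, dict):
        for key, value in content.items():
            child = f"{path}[{key!r}]"
            if not isinstance(key, str):
                yield child
            # Keep walking: nested values may hide further offending keys.
            yield from find_non_str_keys(value, child)
    elif isinstance(content, (list, tuple)):
        for index, item in enumerate(content):
            yield from find_non_str_keys(item, f"{path}[{index}]")


# Hypothetical response content: the integer key 0 is the kind of key
# that a strict JSON serializer rejects.
sample = {"features": [{"name": "image", "stats": {0: "a", "ok": "b"}}]}
offending = list(find_non_str_keys(sample))
print(offending)
```

Running such a check just before the `orjson_dumps(content)` call would point at the exact location of the offending key instead of only the generic `TypeError`.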

severo · 2023-06-15T15:06:04Z

Hmmm, indeed, this dataset shows {"bytes":"/9j/4SMNRXhpZgAATU0AKgAAAAgABwESAAMAAAABAAEAAAEaAAUAAAABAAAAYgEbAAUAAAABAAAAagEoAAMAAAABAA(...TRUNCATED) in the image column, and it has UnexpectedError:

 {
  "error": "Dict key must be str",
  "cause_exception": "TypeError",
  "cause_message": "Dict key must be str",
  "cause_traceback": [
    "Traceback (most recent call last):\n",
    " File \"/src/services/worker/src/worker/job_manager.py\", line 167, in process\n if len(orjson_dumps(content)) > self.worker_config.content_max_bytes:\n",
    " File \"/src/libs/libcommon/src/libcommon/utils.py\", line 79, in orjson_dumps\n return orjson.dumps(content, option=orjson.OPT_UTC_Z, default=orjson_default)\n",
    "TypeError: Dict key must be str\n"
  ]
}

I will use this to get a list of datasets. But it looks like an additional bug we should take care of! Opening an issue.

severo · 2023-06-15T15:09:56Z

> db.cachedResponsesBlue.aggregate(
    [
        { $match: { kind: "config-parquet-and-info", "details.error": "Dict key must be str", "details.cause_exception": "TypeError" } },
        { $group: { _id: null, datasets: { $addToSet: "$dataset" } } },
    ]
);
< { _id: null,
  datasets: 
   [ 'HuggingFaceH4/stack-exchange-preferences',
     'fahamu/ioi',
     'lishuyang/recipepairs',
     'Rapidinnovation/instrcuction-dataset',
     'SotiriosKastanas/chinese_try',
     'autoevaluate/autoeval-eval-futin__feed-top_en-c0540d-2175569974',
     'ChristophSchuhmann/books',
     'hlky/aitemplate',
     'mcemilg/laion2B-multi-turkish-subset',
     'crowdsource/nowcasting-test',
     'marianna13/zlib',
     'JetQin/seven-wonders',
     'autoevaluate/autoeval-staging-eval-project-glue-4805e982-13995915',
     'autoevaluate/autoeval-staging-eval-project-xsum-8dc1621c-12925734',
     'marianna13/laion1B-nolang-joined-translated-to-en-hr',
     'autoevaluate/autoeval-eval-lener_br-lener_br-b36dee-1776161641',
     'autoevaluate/autoeval-staging-eval-squad_v2-squad_v2-00af64-15586150',
     'autoevaluate/autoeval-staging-eval-project-xsum-69daf1dd-12935743',
     'ArmelR/stack-exchange-instruction',
     'autoevaluate/autoeval-eval-futin__guess-vi-4200fb-2012366604',
     'autoevaluate/autoeval-staging-eval-project-xsum-f0ba0c18-12915726',
     'hlky/lexica-aperture-v3',
     'autoevaluate/autoeval-staging-eval-project-f0d30a26-9815308',
     'mammut/mammut-corpus-venezuela',
     'WasuratS/ECMWF_Thailand_Land_Air_Temperatures',
     'autoevaluate/autoeval-eval-project-jnlpba-37dc127e-1276948841',
     'mikex86/stackoverflow-posts',
     'autoevaluate/autoeval-staging-eval-project-c76b0e96-8395129',
     'marianna13/improved_aesthetics_4.5plus-ultra-hr',
     'autoevaluate/autoeval-eval-indonli-indonli-717ea6-1995866375',
     'ChristophSchuhmann/Chess-Selfplay2',
     'debatelab/parquet-stream',
     'lvwerra/stack-exchange-paired',
     'autoevaluate/autoeval-eval-project-squad_v2-1e2c143e-1305549899',
     'LysandreJik/test-16340052901609',
     'autoevaluate/autoeval-staging-eval-project-d60b4e7e-7574887',
     'autoevaluate/autoeval-staging-eval-project-emotion-af6a16fe-14025918',
     'xiyuez/im-feeling-curious',
     'autoevaluate/autoeval-eval-futin__feed-top_en-c0540d-2175569969',
     'Wauplin/user-preferences-from-space',
     'autoevaluate/autoeval-staging-eval-cnn_dailymail-3.0.0-0b05dc-15886185',
     'Haidra-Org/AI-Horde-Ratings',
     'jlohding/sp500-edgar-10k',
     'alxfgh/fsmol',
     'autoevaluate/autoeval-eval-futin__feed-top_vi-b5257d-2174969944',
     'autoevaluate/autoeval-staging-eval-squad_v2-squad_v2-76c05b-14906065',
     'musiki/dwset',
     'philippemo/dummy_dataset_without_schema_12_06',
     'autoevaluate/autoeval-eval-futin__feed-top_en_-e32ef4-2240271545',
     'autoevaluate/autoeval-staging-eval-multi_nli-default-4a02ee-14425976',
     'lhoestq/multi-configs',
     'autoevaluate/autoeval-staging-eval-billsum-default-3fec5f-14625986',
     'pacovaldez/stackoverflow-questions',
     'xiyuez/instruction-narrative-poems-on-frederick-douglass',
     'autoevaluate/autoeval-staging-eval-project-squad_v2-e06b4410-11855584',
     'Hypoxiic/wikipedia-summary-subset1k',
     'autoevaluate/autoeval-eval-project-banking77-77f5d7e6-1267748583',
     '7eu7d7/HCP-Diffusion-datas',
     'laion/laion2B-en-safety' ] }
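
The two mongo shell queries in this thread share the same shape: a `$match` on the cache entry fields, followed by a `$group` that collects the distinct dataset names with `$addToSet`. As a rough sketch, the same pipelines could be built in Python and passed to a driver. The helper name `affected_datasets_pipeline` is hypothetical, and the commented pymongo usage assumes a `db` handle that is not defined here.

```python
from typing import Any, Dict, List


def affected_datasets_pipeline(match: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Build a $match + $group pipeline that collects distinct dataset names."""
    return [
        {"$match": dict(match)},
        {"$group": {"_id": None, "datasets": {"$addToSet": "$dataset"}}},
    ]


# Filters copied from the two queries in this thread.
stale_file_handle = affected_datasets_pipeline(
    {
        "kind": "dataset-config-names",
        "error_code": "ConfigNamesError",
        "details.cause_exception": "OSError",
    }
)
dict_key_error = affected_datasets_pipeline(
    {
        "kind": "config-parquet-and-info",
        "details.error": "Dict key must be str",
        "details.cause_exception": "TypeError",
    }
)

# With pymongo (assumed, not imported here), a pipeline could be run as:
#   result = next(db["cachedResponsesBlue"].aggregate(dict_key_error), None)
#   datasets = sorted(result["datasets"]) if result else []
print(dict_key_error)
```

Keeping the `$group` stage in one place avoids a mismatch between the output field name and the name the calling code expects.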

albertvillanova · 2023-06-15T15:57:57Z

Let's cross our fingers and hope that huggingface/datasets#5938 really fixes the "Stale file handle" error... 😅

severo · 2023-06-15T19:07:51Z

Yes @albertvillanova, they have all been fixed! 👏

e.g. https://huggingface.co/datasets/rjac/DepressionDetection

severo · 2023-06-15T19:13:17Z

For the other bug... hmmm, not sure, we still have the error in all the datasets of the list.

e.g.
https://datasets-server.huggingface.co/parquet?dataset=philippemo/dummy_dataset_without_schema_12_06&config=philippemo--dummy_dataset_without_schema_12_06

severo · 2023-06-15T19:13:26Z

cc @lhoestq

severo · 2023-06-15T20:49:22Z

I just merged and deployed #1375, which fixes #1374.

After launching the refresh for the affected datasets, all have been fixed! Well done @lhoestq 👏

albertvillanova added 5 commits June 15, 2023 11:05

Update datasets dep version to 2.13.0 in libcommon

252117c

Update poetry lock file

865b407

Remove explicit dependency on datasets from api

a2c0cd9

Update poetry lock file

7da3195

Update libcommon in poetry lock files

da071e3

severo approved these changes Jun 15, 2023


severo mentioned this pull request Jun 15, 2023

Truncated cells seem to prevent conversion to parquet #1374

Closed

albertvillanova merged commit 6865d8f into main Jun 15, 2023

albertvillanova deleted the fix-1370 branch June 15, 2023 15:59
