Update datasets dependency to 2.13.0 version #1372

Merged
merged 5 commits into from
Jun 15, 2023

Conversation

albertvillanova
Member

After the datasets 2.13.0 release, update the dependencies on it.

Note that I have also removed the explicit dependency on datasets from services/api. This is analogous to what was previously done on services/worker.

Fix #1370.

@codecov-commenter

codecov-commenter commented Jun 15, 2023

Codecov Report

Patch coverage has no change and project coverage change: -0.72 ⚠️

Comparison is base (3dad303) 90.02% compared to head (da071e3) 89.30%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1372      +/-   ##
==========================================
- Coverage   90.02%   89.30%   -0.72%     
==========================================
  Files         184      114      -70     
  Lines       10812     6491    -4321     
==========================================
- Hits         9733     5797    -3936     
+ Misses       1079      694     -385     
Flag Coverage Δ
jobs_cache_maintenance 99.08% <ø> (ø)
jobs_mongodb_migration 83.49% <ø> (ø)
libs_libcommon 92.01% <ø> (+0.66%) ⬆️
services_admin 86.05% <ø> (ø)
services_api 87.73% <ø> (ø)
services_worker ?
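
The headline percentages in the report can be checked from the Hits and Lines rows. A minimal sketch, assuming the report truncates to two decimals rather than rounding (5797/6491 is about 89.308%); `coverage_percent` is a hypothetical helper written for this check, not part of Codecov:

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage, truncated (not rounded) to two decimals."""
    # Integer arithmetic keeps the truncation exact before the final division.
    return (hits * 10_000 // lines) / 100


base = coverage_percent(9733, 10812)  # base commit 3dad303
head = coverage_percent(5797, 6491)   # head commit da071e3
delta = round(head - base, 2)         # project coverage change
```

Under that assumption the values match the reported 90.02%, 89.30% and -0.72. The drop coincides with the services_worker flag reporting no coverage in this run.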

see 72 files with indirect coverage changes


@severo
Collaborator

severo commented Jun 15, 2023

Once deployed, I'll query the database to get the list of datasets that should be refreshed (affected by one of the two bugs fixed by 2.13.0) and launch a refresh for them.

@severo
Collaborator

severo commented Jun 15, 2023

For huggingface/datasets#5938:

> db.cachedResponsesBlue.aggregate(
    [
        {$match: {kind: "dataset-config-names", error_code: "ConfigNamesError", "details.cause_exception": "OSError"}},
        {$group: {_id: null, datasets: {$addToSet: "$dataset"}}}
    ]
);
{ _id: null,
  datasets: 
   [ 'LambdaTests/VQAv2Validation_ViT_H_14_A_T_C_Q_benchmarks_partition_global_2_10000000',
     'mteb/amazon_reviews_multi',
     'davanstrien/ia-loaded',
     'mask-distilled-one-sec-cv12/chunk_114',
     'results-sd-v1-5-sd-v2-1-if-v1-0-karlo/7b29f2d3',
     'Brizape/Variome_split_0404',
     'james-burton/imdb_gross',
     'results-sd-v1-5-sd-v2-1-if-v1-0-karlo/aaa3977f',
     'Brizape/Variome_ibo',
     'LambdaTests/VQAv2_sample_validation_benchmarks_partition_global_10_loca_2',
     'trojblue/bad_ai',
     'autoevaluate/autoeval-eval-billsum-default-37bdaa-1564755702',
     'CVasNLPExperiments/VQAv2_sample_validation_google_flan_t5_xl_mode_Q_rices_ns_200',
     'omarelsayeed/dddddd',
     'christinacdl/OFF_HATE_TOXIC_ENGLISH',
     'mask-distilled-one-sec-cv12/chunk_137',
     'drumwell/skatedecks',
     'strombergnlp/x-stance',
     'joey234/mmlu-logical_fallacies-rule-neg',
     'GazeLocation/stimuli_data_arrays',
     'tgsc/embedding-data-NQ-train_pairs-pt',
     'LambdaTests/VQAv2Validation_ViT_H_14_A_T_C_Q_benchmarks_partition_global_26_10000000',
     'tomekkorbak/pile-curse-chunk-27',
     'ydmeira/segment-pokemon',
     'foldl/99problems',
     'alzoubi36/privaseer',
     'mHossain/final_train_v2_420000',
     'arbml/sudanese_dialect_speech',
     'autoevaluate/autoeval-staging-eval-project-be45ecbd-7284772',
     'Cainiao-AI/LaDe-D',
     'offchan/fill50k',
     'emilylearning/cond_ft_subreddit_on_reddit__prcnt_100__test_run_False__bert-base-uncased',
     'scholarly360/indian_ipo_prospectus_data',
     'OpenCovenant/fjalet-Albanian-vocabulary',
     'mask-distilled-one-sec-cv12/chunk_58',
     'doushabao4766/wnut_17_ner_k_V3',
     'MicPie/adaptable_cluster27',
     'coeuslearning/product_ads',
     'sajjadrauf/VQA',
     'LangChainDatasets/sql-qa-chinook',
     'LambdaTests/VQAv2_sample_validation_benchmarks_partition_global_12_loca_4',
     'results-sd-v1-5-sd-v2-1-if-v1-0-karlo/0f1659c6',
     'EJaalborg2022/beer_reviews_label_drift_neg',
     'cakiki/dockerfile_paths',
     'Sree1994/babylm_childstories',
     'oyxy2019/BertTokenizer_THUCNews_10000_to_lm_datasets',
     'tsf-tsf/qa',
     'joey234/mmlu-electrical_engineering-neg',
     'autoevaluate/autoeval-eval-lener_br-lener_br-280a5d-1776961679',
     'jlbaker361/avatar-lite_captioned-augmented',
     'roa7n/patched_test_p_40_f_UCH_m1_predictions',
     'keremberke/construction-safety-object-detection',
     'Sampson2022/demo3',
     'pratultandon/tokenized-recipe-nlg-gpt2-ingredients-to-recipe-end',
     'rjac/DepressionDetection',
     'LambdaTests/VQAv2Validation_ViT_H_14_A_T_C_Q_benchmarks_partition_global_4_500',
     'joey234/mmlu-miscellaneous-rule-neg',
     'roa7n/patched_test_p_150_f_UCH_m1_predictions',
     'jet-universe/top_landscape',
     'mask-distilled-one-sec-cv12/chunk_17',
     'IdoAi/FypDatasetWithSplitsRgb',
     'xOrfe/Match3',
     'GV05/dreambooth-hackathon-images',
     'results-sd-v1-5-sd-v2-1-if-v1-0-karlo/2c4acff4',
     'TrainThenObtain-ai/Utra-mini-GPT-4',
     'wurongbo/wurongbo',
     'myradeng/diffusion_db_5k_val_v3',
     'Kartik14Singh/ifd' 
] }
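
The shell query above could also be scripted. The following is a minimal sketch in Python, assuming a hypothetical helper `affected_datasets_pipeline` that only builds the aggregation pipeline. With pymongo, the result would be passed to `Collection.aggregate`; the connection is not created here, and the `db` object in the comment is an assumption.

```python
from typing import Any, Dict, List


def affected_datasets_pipeline(kind: str, filters: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Build a pipeline collecting the distinct dataset names of matching cache entries."""
    match: Dict[str, Any] = {"kind": kind, **filters}
    return [
        {"$match": match},
        # $addToSet deduplicates dataset names across the matched entries.
        {"$group": {"_id": None, "datasets": {"$addToSet": "$dataset"}}},
    ]


pipeline = affected_datasets_pipeline(
    "dataset-config-names",
    {"error_code": "ConfigNamesError", "details.cause_exception": "OSError"},
)
# With pymongo, assuming `db` is an already connected Database (hypothetical here):
# result = next(db.cachedResponsesBlue.aggregate(pipeline), None)
```

Keeping the pipeline construction separate from the database call makes it easy to reuse the same helper for other error filters.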

@severo
Collaborator

severo commented Jun 15, 2023

It should also fix the issues where images were shown as {"bytes": ...}, thanks to huggingface/datasets#5921.

I'm not sure how we could query them? @lhoestq ?

@lhoestq
Member

lhoestq commented Jun 15, 2023

This query seems to include many datasets made of parquet files but without dataset_info:

{"kind": "config-info", "content.error": "Dict key must be str"}

This should include the parquet image datasets like https://huggingface.co/datasets/philippemo/dummy_dataset_without_schema_12_06

@lhoestq
Member

lhoestq commented Jun 15, 2023

^ we might need to fix the related bug though

Traceback (most recent call last):
  File "/src/services/worker/src/worker/job_manager.py", line 167, in process
    if len(orjson_dumps(content)) > self.worker_config.content_max_bytes:
  File "/src/libs/libcommon/src/libcommon/utils.py", line 79, in orjson_dumps
    return orjson.dumps(content, option=orjson.OPT_UTC_Z, default=orjson_default)
TypeError: Dict key must be str
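
The traceback shows orjson rejecting a dict that has non-string keys. One possible workaround, sketched here under assumptions, is to convert the keys to strings before serializing. `stringify_keys` is a hypothetical helper for illustration, not the fix applied in this project; orjson also provides an `OPT_NON_STR_KEYS` option that accepts some non-str key types.

```python
from typing import Any


def stringify_keys(value: Any) -> Any:
    """Recursively convert dict keys to str so that only string keys remain."""
    if isinstance(value, dict):
        # Caveat: keys such as 1 and "1" collapse into the same string key.
        return {str(k): stringify_keys(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [stringify_keys(item) for item in value]
    return value


# Made-up content with integer keys, only to exercise the helper.
content = {"features": {0: "image", 1: "label"}, "rows": [{2: "x"}]}
cleaned = stringify_keys(content)
```

The cleaned structure contains only string keys at every nesting level, so a strict JSON serializer would no longer fail on the key type.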

@severo
Collaborator

severo commented Jun 15, 2023

Hmmm, indeed, this dataset shows {"bytes":"/9j/4SMNRXhpZgAATU0AKgAAAAgABwESAAMAAAABAAEAAAEaAAUAAAABAAAAYgEbAAUAAAABAAAAagEoAAMAAAABAA(...TRUNCATED) in the image column, and it has UnexpectedError:

 {
  "error": "Dict key must be str",
  "cause_exception": "TypeError",
  "cause_message": "Dict key must be str",
  "cause_traceback": [
    "Traceback (most recent call last):\n",
    " File \"/src/services/worker/src/worker/job_manager.py\", line 167, in process\n if len(orjson_dumps(content)) > self.worker_config.content_max_bytes:\n",
    " File \"/src/libs/libcommon/src/libcommon/utils.py\", line 79, in orjson_dumps\n return orjson.dumps(content, option=orjson.OPT_UTC_Z, default=orjson_default)\n",
    "TypeError: Dict key must be str\n"
  ]
}

I will use this to get a list of datasets. But it looks like an additional bug we should take care of! Opening an issue

@severo
Collaborator

severo commented Jun 15, 2023

> db.cachedResponsesBlue.aggregate(
    [
        { $match: { kind: "config-parquet-and-info", "details.error": "Dict key must be str", "details.cause_exception": "TypeError" } },
        { $group: { _id: null, datasets: { $addToSet: "$dataset" } } },
    ]
);
< { _id: null,
  datasets: 
   [ 'HuggingFaceH4/stack-exchange-preferences',
     'fahamu/ioi',
     'lishuyang/recipepairs',
     'Rapidinnovation/instrcuction-dataset',
     'SotiriosKastanas/chinese_try',
     'autoevaluate/autoeval-eval-futin__feed-top_en-c0540d-2175569974',
     'ChristophSchuhmann/books',
     'hlky/aitemplate',
     'mcemilg/laion2B-multi-turkish-subset',
     'crowdsource/nowcasting-test',
     'marianna13/zlib',
     'JetQin/seven-wonders',
     'autoevaluate/autoeval-staging-eval-project-glue-4805e982-13995915',
     'autoevaluate/autoeval-staging-eval-project-xsum-8dc1621c-12925734',
     'marianna13/laion1B-nolang-joined-translated-to-en-hr',
     'autoevaluate/autoeval-eval-lener_br-lener_br-b36dee-1776161641',
     'autoevaluate/autoeval-staging-eval-squad_v2-squad_v2-00af64-15586150',
     'autoevaluate/autoeval-staging-eval-project-xsum-69daf1dd-12935743',
     'ArmelR/stack-exchange-instruction',
     'autoevaluate/autoeval-eval-futin__guess-vi-4200fb-2012366604',
     'autoevaluate/autoeval-staging-eval-project-xsum-f0ba0c18-12915726',
     'hlky/lexica-aperture-v3',
     'autoevaluate/autoeval-staging-eval-project-f0d30a26-9815308',
     'mammut/mammut-corpus-venezuela',
     'WasuratS/ECMWF_Thailand_Land_Air_Temperatures',
     'autoevaluate/autoeval-eval-project-jnlpba-37dc127e-1276948841',
     'mikex86/stackoverflow-posts',
     'autoevaluate/autoeval-staging-eval-project-c76b0e96-8395129',
     'marianna13/improved_aesthetics_4.5plus-ultra-hr',
     'autoevaluate/autoeval-eval-indonli-indonli-717ea6-1995866375',
     'ChristophSchuhmann/Chess-Selfplay2',
     'debatelab/parquet-stream',
     'lvwerra/stack-exchange-paired',
     'autoevaluate/autoeval-eval-project-squad_v2-1e2c143e-1305549899',
     'LysandreJik/test-16340052901609',
     'autoevaluate/autoeval-staging-eval-project-d60b4e7e-7574887',
     'autoevaluate/autoeval-staging-eval-project-emotion-af6a16fe-14025918',
     'xiyuez/im-feeling-curious',
     'autoevaluate/autoeval-eval-futin__feed-top_en-c0540d-2175569969',
     'Wauplin/user-preferences-from-space',
     'autoevaluate/autoeval-staging-eval-cnn_dailymail-3.0.0-0b05dc-15886185',
     'Haidra-Org/AI-Horde-Ratings',
     'jlohding/sp500-edgar-10k',
     'alxfgh/fsmol',
     'autoevaluate/autoeval-eval-futin__feed-top_vi-b5257d-2174969944',
     'autoevaluate/autoeval-staging-eval-squad_v2-squad_v2-76c05b-14906065',
     'musiki/dwset',
     'philippemo/dummy_dataset_without_schema_12_06',
     'autoevaluate/autoeval-eval-futin__feed-top_en_-e32ef4-2240271545',
     'autoevaluate/autoeval-staging-eval-multi_nli-default-4a02ee-14425976',
     'lhoestq/multi-configs',
     'autoevaluate/autoeval-staging-eval-billsum-default-3fec5f-14625986',
     'pacovaldez/stackoverflow-questions',
     'xiyuez/instruction-narrative-poems-on-frederick-douglass',
     'autoevaluate/autoeval-staging-eval-project-squad_v2-e06b4410-11855584',
     'Hypoxiic/wikipedia-summary-subset1k',
     'autoevaluate/autoeval-eval-project-banking77-77f5d7e6-1267748583',
     '7eu7d7/HCP-Diffusion-datas',
     'laion/laion2B-en-safety' ] }

@albertvillanova
Member Author

albertvillanova commented Jun 15, 2023

Let's cross fingers and hope that huggingface/datasets#5938 really fixes the "Stale file handle" error... 😅

@albertvillanova albertvillanova merged commit 6865d8f into main Jun 15, 2023
@albertvillanova albertvillanova deleted the fix-1370 branch June 15, 2023 15:59
@severo
Collaborator

severo commented Jun 15, 2023

Yes @albertvillanova, they have all been fixed! 👏

eg https://huggingface.co/datasets/rjac/DepressionDetection

[Screenshot, 2023-06-15 at 21:06:37]

@severo
Collaborator

severo commented Jun 15, 2023

For the other bug... hmmm, not sure, we still have the error in all the datasets of the list.

eg.
https://datasets-server.huggingface.co/parquet?dataset=philippemo/dummy_dataset_without_schema_12_06&config=philippemo--dummy_dataset_without_schema_12_06

@severo
Collaborator

severo commented Jun 15, 2023

cc @lhoestq

@severo
Collaborator

severo commented Jun 15, 2023

I just merged and deployed #1375, which fixes #1374.

After launching the refresh for the affected datasets, all have been fixed! Well done @lhoestq 👏

Development

Successfully merging this pull request may close these issues.

Update datasets to 2.13.0
4 participants