Commit

Config-level parquet-and-dataset-info (#985)
* make parquet-and-dataset-info a config-level job, rename

* fix processing graph config and tests

* refactor: rename something that is too hard to recall on a Friday evening

* update api config

* update tests for config-parquet-and-info, fix refactored names

* add custom error for missing parameter in request (dataset/config/split) and raise it in all job runners consistently
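
  A minimal sketch of what such an error could look like; the class name, attributes, and the guard function below are illustrative assumptions, not the exact code from this PR:

  from http import HTTPStatus

  class ParameterMissingError(Exception):
      # hypothetical sketch: the real job runners attach an error code and an
      # HTTP status so the API layer can serialize the failure consistently
      def __init__(self, message: str):
          self.status_code = HTTPStatus.BAD_REQUEST
          self.code = "ParameterMissingError"
          super().__init__(message)

  def check_parameters(dataset: str | None, config: str | None) -> None:
      # every config-level job runner would perform the same guard
      if dataset is None:
          raise ParameterMissingError("'dataset' parameter is required")
      if config is None:
          raise ParameterMissingError("'config' parameter is required")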

* fix outdated error classes names

* fix config-parquet: pass config parameter to it

* fix config-parquet test: pass config parameter to it

* test endpoints when config is provided (but not required) too

* fix names of error classes raised by workers in docstrings

* change step version from 2 to 1

* refactor (rename) ParquetAndInfoConfig params

* get back /parquet-and-dataset-info

* get back tests for old step /parquet-and-dataset-info

* rename env vars: change the PARQUET_AND_DATASET_INFO_ prefix to PARQUET_AND_INFO_

* get back env template and fix blocked datasets for worker tests

* fix error classes names in docstrings

* Update services/worker/src/worker/job_runners/config/parquet_and_info.py

Co-authored-by: Sylvain Lesage <sylvain.lesage@huggingface.co>

* remove files of legacy configs if any

* get configs only with datasets library, not from cache

* add test for removing files for configs that do not exist anymore

* fix processing graph test

* update version of processing steps that are dependent on config-parquet-and-info

* remove unused error code from split-first-rows-from-streaming

* fix outdated params in test for config-parquet-and-info

* rename in /chart: parquetAndDatasetInfo -> parquetAndInfo

* take config-names from cache if possible, make all changes in a single commit

* get back previous versions of dependent processing steps

* get config names from /config-names cache, don't use datasets lib as a fallback + update test
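
  As a rough illustration of that lookup (the response layout is an assumption modeled on the cached /config-names step, and `get_response` is assumed to be libcommon's simple-cache helper):

  from http import HTTPStatus
  from libcommon.simple_cache import get_response  # assumed cache helper

  def get_config_names(dataset: str) -> list[str]:
      # read the upstream /config-names response from the cache; no fallback
      # to the datasets library: if the previous step failed, this step fails
      response = get_response(kind="/config-names", dataset=dataset)
      if response["http_status"] != HTTPStatus.OK:
          raise RuntimeError(f"/config-names failed for dataset={dataset!r}")
      return [item["config"] for item in response["content"]["config_names"]]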

* fix test for config-parquet-and-info: add upserting response for previous step /config-names

* clone from initial commit instead of main, delete all the files except for:

the current config's files created by this step, the other configs' files, and .gitattributes
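
  A sketch of the resulting keep/delete rule, assuming parquet files live under a top-level <config>/ directory in the target repo (the layout and helper name are assumptions):

  def files_to_delete(repo_files: list[str], config_names: list[str]) -> list[str]:
      # keep .gitattributes and any file under a still-existing config directory;
      # everything else (e.g. files of legacy configs) gets deleted
      def keep(path: str) -> bool:
          if path == ".gitattributes":
              return True
          return path.split("/", 1)[0] in config_names
      return [path for path in repo_files if not keep(path)]

  # e.g. with config_names=["default"], a stale "old_config/train.parquet" is deleted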

* fix processing graph test:

put /parquet-and-dataset-info back as a first step since it's not deleted yet

* update test that checks correctness of files pushed to the repo

* delete previous files for current config while pushing new ones

* add config-parquet-and-info to processing graph test

---------

Co-authored-by: Sylvain Lesage <sylvain.lesage@huggingface.co>
polinaeterna and severo authored Apr 11, 2023
1 parent 401cb3f commit af1a46c
Showing 42 changed files with 2,089 additions and 397 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/_e2e_tests.yml
@@ -38,7 +38,7 @@ jobs:
# ^ hard coded, see e2e/tests/fixtures/hub.py
LOG_LEVEL: "DEBUG"
FIRST_ROWS_MAX_NUMBER: "4"
-PARQUET_AND_DATASET_INFO_COMMITTER_HF_TOKEN: "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"
+PARQUET_AND_INFO_COMMITTER_HF_TOKEN: "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"
PORT_REVERSE_PROXY: "8000"
PROMETHEUS_MULTIPROC_DIR: "/tmp"
WORKER_SLEEP_SECONDS: "1"
@@ -77,7 +77,7 @@ jobs:
# ^ hard coded, see e2e/tests/fixtures/hub.py
LOG_LEVEL: "DEBUG"
FIRST_ROWS_MAX_NUMBER: "4"
-PARQUET_AND_DATASET_INFO_COMMITTER_HF_TOKEN: "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"
+PARQUET_AND_INFO_COMMITTER_HF_TOKEN: "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"
PORT_REVERSE_PROXY: "8000"
PROMETHEUS_MULTIPROC_DIR: "/tmp"
WORKER_SLEEP_SECONDS : "1"
2 changes: 1 addition & 1 deletion chart/env/dev.yaml
@@ -82,7 +82,7 @@ log:
# Log level
level: "DEBUG"

-parquetAndDatasetInfo:
+parquetAndInfo:
maxDatasetSize: "500_000_000"

# --- jobs (pre-install/upgrade hooks) ---
2 changes: 1 addition & 1 deletion chart/env/prod.yaml
@@ -83,7 +83,7 @@ log:
firstRows:
maxBytes: "200_000"

-parquetAndDatasetInfo:
+parquetAndInfo:
maxDatasetSize: "5_000_000_000"
blockedDatasets: "matallanas/linustechtips-transcript-audio-wav,KnutJaegersberg/Interpretable_word_embeddings_large_cskg,ashraf-ali/quran-data,cjvt/cc_gigafida,cmudrc/porous-microstructure-strain-fields,dlwh/MultiLegalPile_Wikipedia_Shuffled,izumaru/os2-datasets,joelito/MultiLegalPile_Wikipedia_Filtered,leviethoang/VBVLSP,nyanko7/yandere-images,severo/wit,texturedesign/td01_natural-ground-textures,Tristan/olm-october-2022-tokenized-1024-exact-dedup-only,Whispering-GPT/linustechtips-transcript-audio,beyond/chinese_clean_passages_80m,bigscience/xP3,dalle-mini/YFCC100M_OpenAI_subset,galman33/gal_yair_166000_256x256_fixed,matallanas/linustechtips-transcript-audio-mp3,mwitiderrick/arXiv,sjpmpzx/qm_ly_gy_soundn,tilos/ASR-CCANTCSC,matallanas/linustechtips-transcript-audio-ogg,VIMA/VIMA-Data,severo/wit,wmt/europarl,chrisjay/mnist-adversarial-dataset,mwitiderrick/arXiv,HuggingFaceM4/TextCaps,CristianaLazar/librispeech5k_train,texturedesign/td01_natural-ground-textures,cjvt/cc_gigafida,Yehor/ukrainian-tts-lada,YWjimmy/PeRFception-v1,SDbiaseval/dataset-dalle,Pinguin/images,DTU54DL/librispeech5k-augmentated-train-prepared,CristianaLazar/librispeech500,abdusahmbzuai/masc_dev,anonymousdeepcc/DeepCC,bigcode/the-stack-username-to-repo,bigscience/massive-probing-results,dgrnd4/stanford_dog_dataset,gigant/romanian_speech_synthesis_0_8_1,helena-balabin/sentences,icelab/ntrs_meta,joefox/Mozilla_Common_Voice_ru_test_noise,m-aliabbas/idrak_splitted_amy_1,marinone94/nst_sv,mbarnig/lb-de-fr-en-pt-12800-TTS-CORPUS,momilla/Ethereum_transacitons,nev/anime-giph,openclimatefix/nimrod-uk-1km-validation,raghav66/whisper-gpt,strombergnlp/broad_twitter_corpus,z-uo/female-LJSpeech-italian,Champion/vpc2020_clear_anon_speech,DelgadoPanadero/Pokemon,GEM/references,HuggingFaceM4/FairFace,Karavet/ILUR-news-text-classification-corpus,Voicemod/LibriTTS-100-preproc,YWjimmy/PeRFception-v1-1,albertvillanova/TextCaps,allenai/c4,dog/punks,chenghao/scielo_books,YWjimmy/PeRFception-v1-2,bigcode/the-stack-dedup,openclimatefix/era5,Carlisle/msmarco-passage-non-abs,SetFit/mnli,valurank/PoliticalBias_AllSides_Txt,Biomedical-TeMU/ProfNER_corpus_classification,LeoFeng/MLHW_6,pragnakalp/squad_v2_french_translated,textvqa,polinaeterna/vox_lingua,nishita/ade20k-sample,oyk100/ChaSES-data,YWjimmy/PeRFception-v1-3,YWjimmy/PeRFception-ScanNet,ChaiML/AnthropicRLHFPreferenceData,voidful/librispeech_asr_text,Isma/librispeech_1000_seed_42,Graphcore/vqa-lxmert,Tevatron/wikipedia-curated-corpus,adamlin/daily_dialog,cameronbc/synthtiger,clarin-pl/multiwiki_90k,echarlaix/vqa-lxmert,gigant/african_accented_french,Graphcore/vqa,echarlaix/vqa,jimregan/clarinpl_studio,GEM/xsum,Tevatron/wikipedia-squad-corpus,mulcyber/europarl-mono,nateraw/wit,bigscience/P3,tau/mrqa,uva-irlab/trec-cast-2019-multi-turn,vblagoje/wikipedia_snippets_streamed,Tevatron/wikipedia-wq-corpus,malteos/paperswithcode-aspects,Samip/Scotch,iluvvatar/RuREBus,nateraw/quickdraw,tau/scrolls,qanastek/MASSIVE,TalTechNLP/VoxLingua107,shanya/crd3,HugoLaurencon/libri_light,jerpint/imagenette,Leyo/TGIF,DFKI-SLT/few-nerd,crystina-z/msmarco-passage-dl20,HuggingFaceM4/epic_kitchens_100,HuggingFaceM4/yttemporal180m,andreagasparini/librispeech_train_other_only,allenai/nllb,biglam/nls_chapbook_illustrations,winvoker/lvis,Lacito/pangloss,indonesian-nlp/librivox-indonesia,Graphcore/gqa-lxmert,nanom/splittedspanish3bwc,cahya/librivox-indonesia,asapp/slue,sil-ai/audio-keyword-spotting,tner/wikiann,rogerdehe/xfund,arpelarpe/nota,mwhanna/ACT-Thor,sanchit-gandhi/librispeech_asr_clean,echarlaix/gqa-lxmert,shunk031/cocostuff,gigant/m-ailabs_speech_dataset_fr,jimregan/clarinpl_sejmsenat,1aurent/icdar-2011,marinone94/nst_no,jamescalam/unsplash-25k-images,stas/openwebtext-10k,florianbussmann/train_tickets-yu2020pick,benschill/brain-tumor-collection,imvladikon/paranames,PolyAI/evi,bengaliAI/cvbn,Sreyan88/librispeech_asr,superb,mozilla-foundation/common_voice_10_0,darkproger/librispeech_asr,kresnik/librispeech_asr_test,Lehrig/Monkey-Species-Collection,HuggingFaceM4/TGIF,crystina-z/miracl-bm25-negative,cats_vs_dogs,biglam/gallica_literary_fictions,common_language,competition_math,cornell_movie_dialog,evidence_infer_treatment,hebrew_projectbenyehuda,lj_speech,mc4,muchocine,opus_euconst,tab_fact,the_pile,tapaco,turkic_xwmt,web_nlg,vctk,mathaillah/BeritaHoaks-NonHoaks,universal_morphologies,LanceaKing/asvspoof2019,andreagasparini/librispeech_train_clean_only,nuprl/MultiPL-E,SLPL/naab-raw,mteb/results,SocialGrep/the-reddit-climate-change-dataset,bigscience-biomedical/anat_em,crystina-z/xor-tydi-corpus,qanastek/QUAERO,TomTBT/pmc_open_access_section,jamescalam/movielens-25m-ratings,HuggingFaceM4/charades,Tevatron/xor-tydi-corpus,khalidalt/tydiqa-primary,nvm472001/cvdataset-layoutlmv3,Lehrig/GTZAN-Collection,mteb/tatoeba-bitext-mining,sled-umich/Action-Effect,HamdiJr/Egyptian_hieroglyphs,joelito/lextreme,cooleel/xfund_de,oscar,mozilla-foundation/common_voice_7_0,KETI-AIR/vqa,Livingwithmachines/MapReader_Data_SIGSPATIAL_2022,NLPC-UOM/document_alignment_dataset-Sinhala-Tamil-English,miracl/miracl,Muennighoff/flores200,Murple/mmcrsc,mesolitica/dbp,CodedotAI/code_clippy,keshan/clean-si-mc4,yhavinga/ccmatrix,metashift,google/fleurs,HugoLaurencon/libri_light_bytes,biwi_kinect_head_pose,ami,bigscience-biomedical/ebm_pico,HuggingFaceM4/general-pmd-synthetic-testing,crystina-z/mmarco,robertmyers/pile_v2,bigbio/anat_em,biglam/early_printed_books_font_detection,nateraw/imagenet-sketch,jpwahle/dblp-discovery-dataset,andreagasparini/librispeech_test_only,crystina-z/mmarco-corpus,mozilla-foundation/common_voice_6_0,biglam/brill_iconclass,bigscience-biomedical/evidence_inference,HuggingFaceM4/cm4-synthetic-testing,SocialGrep/ten-million-reddit-answers,bnl_newspapers,multilingual_librispeech,openslr,GEM/BiSECT,Graphcore/gqa,SaulLu/Natural_Questions_HTML_reduced_all,ccdv/cnn_dailymail,mozilla-foundation/common_voice_1_0,huggan/anime-faces,Biomedical-TeMU/ProfNER_corpus_NER,MorVentura/TRBLLmaker,student/celebA,Rodion/uno_sustainable_development_goals,Nart/parallel-ab-ru,HuggingFaceM4/VQAv2,mesolitica/noisy-ms-en-augmentation,nateraw/rice-image-dataset,tensorcat/wikipedia-japanese,angelolab/ark_example,RAYZ/Mixed-Dia,ywchoi/mdpi_sept10,TomTBT/pmc_open_access_figure,society-ethics/lila_camera_traps,autoevaluator/shoes-vs-sandals-vs-boots,cjvt/slo_collocations,parambharat/mile_dataset,rossevine/tesis,ksaml/Stanford_dogs,nuprl/MultiPL-E-raw-data,ZihaoLin/zhlds,ACL-OCL/acl-anthology-corpus,mozilla-foundation/common_voice_2_0,Biomedical-TeMU/SPACCC_Sentence-Splitter,nateraw/rice-image-dataset-2,mesolitica/noisy-en-ms-augmentation,bigbio/ctebmsp,bigbio/distemist,nlphuji/vasr,parambharat/malayalam_asr_corpus,cjvt/sloleks,DavidVivancos/MindBigData2022_Imagenet_IN_Spct,KokeCacao/oracle,keremberke/nfl-object-detection,lafi23333/ds,Lykon/OnePiece,kaliansh/sdaia,sil-ai/audio-kw-in-context,andite/riyo-tag,ilhanemirhan/eee543,backslashlim/LoRA-Datasets,hr16/Miwano-Rag,ccdv/mediasum,mozilla-foundation/common_voice_3_0,mozilla-foundation/common_voice_4_0,bigbio/ebm_pico,parambharat/kannada_asr_corpus,parambharat/telugu_asr_corpus,Abuelnour/json_1000_Scientific_Paper,reazon-research/reazonspeech,shunk031/livedoor-news-corpus,mesolitica/translated-SQUAD,SamAct/medium_cleaned,EfaceD/ElysiumInspirations,cahya/fleurs,guangguang/azukijpg,genjib/LAVISHData,rohitp1/librispeech_asr_clean,azraahmadi/autotrain-data-xraydatasetp2,HuggingFaceM4/COCO,bio-datasets/e3c,nateraw/auto-cats-and-dogs,keremberke/smoke-object-detection,ds4sd/DocLayNet,nlphuji/utk_faces,corentinm7/MyoQuant-SDH-Data,xglue,grasshoff/lhc_sents,HugoLaurencon/IIIT-5K,alkzar90/CC6204-Hackaton-Cub-Dataset,RaphaelOlivier/whisper_adversarial_examples,bruno-cotrim/arch-max,keshan/multispeaker-tts-sinhala,Tevatron/beir-corpus,fcakyon/gun-object-detection,ccdv/arxiv-summarization,keremberke/protective-equipment-detection,mozilla-foundation/common_voice_5_0,nlphuji/winogavil,Poupou/Gitcoin-Grant-DataBuilder,orieg/elsevier-oa-cc-by,castorini/msmarco_v1_passage_doc2query-t5_expansions,inseq/divemt_attributions,crystina-z/msmarco-passage-dl19,mozilla-foundation/common_voice_5_1,matchbench/dbp15k-fr-en,keremberke/garbage-object-detection,crystina-z/no-nonself-mrtydi,ashraq/dhivehi-corpus,zyznull/dureader-retrieval-ranking,zyznull/msmarco-passage-corpus,zyznull/msmarco-passage-ranking,Tevatron/wikipedia-squad,Tevatron/wikipedia-trivia-corpus,NeuroSenko/senko_anime_full,plncmm/wl-disease,plncmm/wl-family-member"
supportedDatasets: "bigcode/the-stack"
32 changes: 16 additions & 16 deletions chart/templates/_envWorker.tpl
@@ -36,12 +36,12 @@
value: {{ .Values.firstRows.minNumber| quote }}
- name: FIRST_ROWS_COLUMNS_MAX_NUMBER
value: {{ .Values.firstRows.columnsMaxNumber| quote }}
-# specific to the /parquet-and-dataset-info job runner
-- name: PARQUET_AND_DATASET_INFO_BLOCKED_DATASETS
-value: {{ .Values.parquetAndDatasetInfo.blockedDatasets | quote }}
-- name: PARQUET_AND_DATASET_INFO_COMMIT_MESSAGE
-value: {{ .Values.parquetAndDatasetInfo.commitMessage | quote }}
-- name: PARQUET_AND_DATASET_INFO_COMMITTER_HF_TOKEN
+# specific to the /parquet-and-dataset-info and config-parquet-and-info job runners
+- name: PARQUET_AND_INFO_BLOCKED_DATASETS
+value: {{ .Values.parquetAndInfo.blockedDatasets | quote }}
+- name: PARQUET_AND_INFO_COMMIT_MESSAGE
+value: {{ .Values.parquetAndInfo.commitMessage | quote }}
+- name: PARQUET_AND_INFO_COMMITTER_HF_TOKEN
{{- if .Values.secrets.userHfToken.fromSecret }}
valueFrom:
secretKeyRef:
@@ -51,14 +51,14 @@
{{- else }}
value: {{ .Values.secrets.userHfToken.value }}
{{- end }}
-- name: PARQUET_AND_DATASET_INFO_MAX_DATASET_SIZE
-value: {{ .Values.parquetAndDatasetInfo.maxDatasetSize | quote }}
-- name: PARQUET_AND_DATASET_INFO_SOURCE_REVISION
-value: {{ .Values.parquetAndDatasetInfo.sourceRevision | quote }}
-- name: PARQUET_AND_DATASET_INFO_SUPPORTED_DATASETS
-value: {{ .Values.parquetAndDatasetInfo.supportedDatasets | quote }}
-- name: PARQUET_AND_DATASET_INFO_TARGET_REVISION
-value: {{ .Values.parquetAndDatasetInfo.targetRevision | quote }}
-- name: PARQUET_AND_DATASET_INFO_URL_TEMPLATE
-value: {{ .Values.parquetAndDatasetInfo.urlTemplate | quote }}
+- name: PARQUET_AND_INFO_MAX_DATASET_SIZE
+value: {{ .Values.parquetAndInfo.maxDatasetSize | quote }}
+- name: PARQUET_AND_INFO_SOURCE_REVISION
+value: {{ .Values.parquetAndInfo.sourceRevision | quote }}
+- name: PARQUET_AND_INFO_SUPPORTED_DATASETS
+value: {{ .Values.parquetAndInfo.supportedDatasets | quote }}
+- name: PARQUET_AND_INFO_TARGET_REVISION
+value: {{ .Values.parquetAndInfo.targetRevision | quote }}
+- name: PARQUET_AND_INFO_URL_TEMPLATE
+value: {{ .Values.parquetAndInfo.urlTemplate | quote }}
{{- end -}}
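
On the worker side these variables are read into a config object; a minimal sketch using the environs library, with placeholder defaults (the real defaults and field set may differ):

from dataclasses import dataclass
from typing import Optional

from environs import Env

@dataclass
class ParquetAndInfoConfig:
    # a sketch of the renamed config; field names mirror the env vars above
    blocked_datasets: str = ""
    commit_message: str = "Update parquet files"
    committer_hf_token: Optional[str] = None
    max_dataset_size: int = 100_000_000
    source_revision: str = "main"
    supported_datasets: str = ""
    target_revision: str = "refs/convert/parquet"
    url_template: str = "/datasets/%s/resolve/%s/%s"

    @classmethod
    def from_env(cls) -> "ParquetAndInfoConfig":
        env = Env(expand_vars=True)
        # the PARQUET_AND_INFO_ prefix matches the renamed variables above
        with env.prefixed("PARQUET_AND_INFO_"):
            return cls(
                blocked_datasets=env.str("BLOCKED_DATASETS", ""),
                commit_message=env.str("COMMIT_MESSAGE", "Update parquet files"),
                committer_hf_token=env.str("COMMITTER_HF_TOKEN", None),
                max_dataset_size=env.int("MAX_DATASET_SIZE", 100_000_000),
                source_revision=env.str("SOURCE_REVISION", "main"),
                supported_datasets=env.str("SUPPORTED_DATASETS", ""),
                target_revision=env.str("TARGET_REVISION", "refs/convert/parquet"),
                url_template=env.str("URL_TEMPLATE", "/datasets/%s/resolve/%s/%s"),
            )
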
2 changes: 1 addition & 1 deletion chart/values.yaml
@@ -145,7 +145,7 @@ firstRows:
# Max number of columns in the /first-rows endpoint response
columnsMaxNumber: 1_000

-parquetAndDatasetInfo:
+parquetAndInfo:
# comma-separated list of the blocked datasets. Defaults to empty.
blockedDatasets: ""
# the git commit message when the parquet files are uploaded to the Hub. Defaults to `Update parquet files`.
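
For reference, a comma-separated value like blockedDatasets is typically split into a list before any membership check; a small sketch (the helper names are illustrative, not the repo's exact API):

def parse_csv_list(raw: str) -> list[str]:
    # "a,b,c" -> ["a", "b", "c"]; "" -> [] (nothing blocked)
    return [name.strip() for name in raw.split(",") if name.strip()]

def raise_if_blocked(dataset: str, blocked_datasets: str) -> None:
    # illustrative check; the real job runner raises a dedicated error type
    if dataset in parse_csv_list(blocked_datasets):
        raise ValueError(f"dataset {dataset!r} is blocked from parquet conversion")
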
2 changes: 1 addition & 1 deletion e2e/Makefile
@@ -14,7 +14,7 @@ export COMMON_HF_TOKEN := hf_app_datasets-server_token
export LOG_LEVEL := DEBUG
export FIRST_ROWS_MAX_NUMBER := 4
export MONGO_PORT := 27050
-export PARQUET_AND_DATASET_INFO_COMMITTER_HF_TOKEN := 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
+export PARQUET_AND_INFO_COMMITTER_HF_TOKEN := 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
export PORT_REVERSE_PROXY := 9000
export PROMETHEUS_MULTIPROC_DIR := /tmp
export WORKER_SLEEP_SECONDS := 1
5 changes: 4 additions & 1 deletion e2e/tests/test_11_auth.py
@@ -38,10 +38,13 @@ def test_auth_e2e(
f"/config-names?dataset={dataset}",
f"/splits?dataset={dataset}",
f"/first-rows?dataset={dataset}&config={config}&split={split}",
f"/parquet-and-dataset-info?dataset={dataset}",
f"/parquet-and-dataset-info?dataset={dataset}&config={config}",
f"/parquet?dataset={dataset}",
f"/parquet?dataset={dataset}&config={config}",
f"/dataset-info?dataset={dataset}",
f"/dataset-info?dataset={dataset}&config={config}",
f"/size?dataset={dataset}",
f"/size?dataset={dataset}&config={config}",
]
for endpoint in endpoints:
poll_until_ready_and_assert(
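
The e2e test polls each endpoint until the cache entry exists; a hedged stand-in for that helper (the real poll_until_ready_and_assert lives in the e2e test utilities and differs in details, including how "not ready" is signalled):

import time

import requests

def poll_until_ready_sketch(url: str, expected_status: int, timeout_s: float = 60.0) -> None:
    # keep polling while the server still answers "not computed yet"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        response = requests.get(url)
        if response.status_code != 500:  # assumption: 500 means "not ready yet"
            assert response.status_code == expected_status, url
            return
        time.sleep(1)
    raise AssertionError(f"{url} still not ready after {timeout_s}s")
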
12 changes: 9 additions & 3 deletions libs/libcommon/src/libcommon/config.py
@@ -11,6 +11,7 @@
from libcommon.constants import (
PROCESSING_STEP_CONFIG_INFO_VERSION,
PROCESSING_STEP_CONFIG_NAMES_VERSION,
+PROCESSING_STEP_CONFIG_PARQUET_AND_INFO_VERSION,
PROCESSING_STEP_CONFIG_PARQUET_VERSION,
PROCESSING_STEP_CONFIG_SIZE_VERSION,
PROCESSING_STEP_DATASET_INFO_VERSION,
@@ -139,13 +140,18 @@ class ProcessingGraphConfig:
"required_by_dataset_viewer": True,
"job_runner_version": PROCESSING_STEP_SPLIT_FIRST_ROWS_FROM_STREAMING_VERSION,
},
"config-parquet-and-info": {
"input_type": "config",
"requires": "/config-names",
"job_runner_version": PROCESSING_STEP_CONFIG_PARQUET_AND_INFO_VERSION,
},
"/parquet-and-dataset-info": {
"input_type": "dataset",
"job_runner_version": PROCESSING_STEP_PARQUET_AND_DATASET_INFO_VERSION,
},
"config-parquet": {
"input_type": "config",
"requires": "/parquet-and-dataset-info",
"requires": "config-parquet-and-info",
"job_runner_version": PROCESSING_STEP_CONFIG_PARQUET_VERSION,
},
"split-first-rows-from-parquet": {
@@ -160,7 +166,7 @@
},
"config-info": {
"input_type": "config",
"requires": "/parquet-and-dataset-info",
"requires": "config-parquet-and-info",
"job_runner_version": PROCESSING_STEP_CONFIG_INFO_VERSION,
},
"dataset-info": {
Expand All @@ -175,7 +181,7 @@ class ProcessingGraphConfig:
},
"config-size": {
"input_type": "config",
"requires": "/parquet-and-dataset-info",
"requires": "config-parquet-and-info",
"job_runner_version": PROCESSING_STEP_CONFIG_SIZE_VERSION,
},
"dataset-size": {
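
The "requires" edges above drive the parent/child relations that the graph tests below assert on; a simplified sketch of how a step's ancestors follow from the specification (the real ProcessingGraph does a full traversal since a step can have several parents):

SPECIFICATION = {
    "/config-names": {"input_type": "dataset"},
    "config-parquet-and-info": {"input_type": "config", "requires": "/config-names"},
    "config-parquet": {"input_type": "config", "requires": "config-parquet-and-info"},
}

def ancestors(step: str) -> list[str]:
    # follow the single "requires" edge up to the root step
    chain: list[str] = []
    parent = SPECIFICATION[step].get("requires")
    while parent is not None:
        chain.append(parent)
        parent = SPECIFICATION[parent].get("requires")
    return list(reversed(chain))

assert ancestors("config-parquet") == ["/config-names", "config-parquet-and-info"]
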
1 change: 1 addition & 0 deletions libs/libcommon/src/libcommon/constants.py
@@ -18,6 +18,7 @@
PROCESSING_STEP_DATASET_SPLIT_NAMES_FROM_STREAMING_VERSION = 1
PROCESSING_STEP_PARQUET_AND_DATASET_INFO_VERSION = 1
PROCESSING_STEP_SPLIT_FIRST_ROWS_FROM_PARQUET_VERSION = 1
+PROCESSING_STEP_CONFIG_PARQUET_AND_INFO_VERSION = 1
PROCESSING_STEP_SPLIT_FIRST_ROWS_FROM_STREAMING_VERSION = 2
PROCESSING_STEP_SPLIT_NAMES_FROM_DATASET_INFO_VERSION = 2
PROCESSING_STEP_SPLIT_NAMES_FROM_STREAMING_VERSION = 2
43 changes: 28 additions & 15 deletions libs/libcommon/tests/test_processing_steps.py
@@ -65,11 +65,12 @@ def graph() -> ProcessingGraph:
@pytest.mark.parametrize(
"step_name,children,ancestors",
[
("/config-names", ["/split-names-from-streaming"], []),
("/config-names", ["/split-names-from-streaming", "config-parquet-and-info"], []),
("config-parquet-and-info", ["config-parquet", "config-info", "config-size"], ["/config-names"]),
(
"/split-names-from-dataset-info",
["dataset-split-names-from-dataset-info", "split-first-rows-from-streaming", "dataset-split-names"],
["/parquet-and-dataset-info", "config-info"],
["/config-names", "config-parquet-and-info", "config-info"],
),
(
"/split-names-from-streaming",
@@ -79,45 +80,57 @@
(
"dataset-split-names-from-dataset-info",
[],
["/parquet-and-dataset-info", "config-info", "/split-names-from-dataset-info"],
["/config-names", "config-parquet-and-info", "config-info", "/split-names-from-dataset-info"],
),
("dataset-split-names-from-streaming", [], ["/config-names", "/split-names-from-streaming"]),
(
"dataset-split-names",
["dataset-is-valid"],
[
"/parquet-and-dataset-info",
"/config-names",
"config-parquet-and-info",
"config-info",
"/split-names-from-dataset-info",
"/config-names",
"/split-names-from-streaming",
],
),
("split-first-rows-from-parquet", ["dataset-is-valid"], ["config-parquet", "/parquet-and-dataset-info"]),
(
"split-first-rows-from-parquet",
["dataset-is-valid"],
["config-parquet", "/config-names", "config-parquet-and-info"],
),
(
"split-first-rows-from-streaming",
["dataset-is-valid"],
[
"/config-names",
"/split-names-from-streaming",
"/split-names-from-dataset-info",
"/parquet-and-dataset-info",
"config-parquet-and-info",
"config-info",
],
),
("/parquet-and-dataset-info", ["config-parquet", "config-info", "config-size"], []),
("config-parquet", ["split-first-rows-from-parquet", "dataset-parquet"], ["/parquet-and-dataset-info"]),
("dataset-parquet", [], ["/parquet-and-dataset-info", "config-parquet"]),
("config-info", ["dataset-info", "/split-names-from-dataset-info"], ["/parquet-and-dataset-info"]),
("dataset-info", [], ["/parquet-and-dataset-info", "config-info"]),
("config-size", ["dataset-size"], ["/parquet-and-dataset-info"]),
("dataset-size", [], ["/parquet-and-dataset-info", "config-size"]),
("/parquet-and-dataset-info", [], []),
(
"config-parquet",
["split-first-rows-from-parquet", "dataset-parquet"],
["/config-names", "config-parquet-and-info"],
),
("dataset-parquet", [], ["/config-names", "config-parquet-and-info", "config-parquet"]),
(
"config-info",
["dataset-info", "/split-names-from-dataset-info"],
["/config-names", "config-parquet-and-info"],
),
("dataset-info", [], ["/config-names", "config-parquet-and-info", "config-info"]),
("config-size", ["dataset-size"], ["/config-names", "config-parquet-and-info"]),
("dataset-size", [], ["/config-names", "config-parquet-and-info", "config-size"]),
(
"dataset-is-valid",
[],
[
"/config-names",
"/parquet-and-dataset-info",
"config-parquet-and-info",
"dataset-split-names",
"config-info",
"config-parquet",
(diffs for the remaining 32 changed files are not shown)
