Run multi-dataset tests #104
We can't even submit the multi-dataset query because of our timeout issue:
|
Changed timeout to |
Crash during the 4th dataset submission:
(venv) [bash][gwatts]:idap-200gbps-atlas > python servicex/servicex_materialize_branches.py -v --distributed-client scheduler --dask-scheduler 'tcp://dask-gwatts-2e1782e2-0.af-jupyter:8786' --dask-profile --dataset multi_data --query xaod_small --num-files 0
0000.0420 - INFO - root - Using release 22.2.107 for type information.
0000.0780 - WARNING - func_adl.type_based_replacement - Unknown type for name len
0000.8158 - INFO - root - Running over 4 datasets, 142.636 TB and 19,074,862,754 events.
0000.8161 - INFO - root - Building ServiceX query
0000.8164 - INFO - root - Querying dataset data15_13TeV:data15_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp15_v01_p6026
0000.8165 - INFO - root - Querying dataset data16_13TeV:data16_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp16_v01_p6026
0000.8165 - INFO - root - Querying dataset data17_13TeV:data17_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp17_v01_p6026
0000.8166 - INFO - root - Querying dataset data18_13TeV:data18_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp18_v01_p6026
0000.8166 - INFO - root - Running on the full dataset(s).
0000.8166 - INFO - root - Starting ServiceX query
0000.8286 - INFO - servicex.servicex_client - Returning code generators from cache
Transform ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/?
Download/URLs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/?
0006.6497 - INFO - servicex.query - ServiceX Transform speed_test_data15_13TeV:data15_13TeV.periodAllYear.phys
Transform ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/?
Download/URLs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/10049 --:--
0027.8295 - INFO - servicex.query - ServiceX Transform speed_test_data17_13TeV:data17_13TeV.period
Transform ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/?
Download/URLs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/10049 --:--
0032.1429 - INFO - servicex.query - ServiceX Transform speed_test_data18_13TeV:data18_13TeV.period
Transform ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/?
Download/URLs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/55534 --:--
Traceback (most recent call last):
File "/home/gwatts/code/iris-hep/idap-200gbps-atlas/servicex/servicex_materialize_branches.py", line 508, in <module>
main(
File "/home/gwatts/code/iris-hep/idap-200gbps-atlas/servicex/servicex_materialize_branches.py", line 181, in main
dataset_files = query_servicex(
File "/home/gwatts/code/iris-hep/idap-200gbps-atlas/servicex/servicex_materialize_branches.py", line 148, in query_servicex
results = sx.deliver(spec)
File "/venv/lib/python3.9/site-packages/servicex/servicex_client.py", line 107, in deliver
results = group.as_signed_urls()
File "/venv/lib/python3.9/site-packages/make_it_sync/func_wrapper.py", line 63, in wrapped_call
return _sync_version_of_function(fn, *args, **kwargs)
File "/venv/lib/python3.9/site-packages/make_it_sync/func_wrapper.py", line 14, in _sync_version_of_function
return loop.run_until_complete(r)
File "/usr/AnalysisBaseExternals/25.2.2/InstallArea/x86_64-el9-gcc13-opt/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/venv/lib/python3.9/site-packages/servicex/dataset_group.py", line 76, in as_signed_urls_async
return await asyncio.gather(*self.tasks)
File "/venv/lib/python3.9/site-packages/servicex/query.py", line 521, in as_signed_urls_async
return await self.submit_and_download(
File "/venv/lib/python3.9/site-packages/servicex/query.py", line 260, in submit_and_download
self.request_id = await self.servicex.submit_transform(sx_request)
File "/venv/lib/python3.9/site-packages/servicex/servicex_adapter.py", line 120, in submit_transform
raise RuntimeError("ServiceX WebAPI Error during transformation "
RuntimeError: ServiceX WebAPI Error during transformation submission: 504 - <Response [504 Gateway Time-out]> |
To work around the timeout issues, I added retry code to the frontend:
It likely needs to be more specific (retrying only on the timeout error). |
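For reference, here is a minimal sketch of what "only the timeout error" could look like. It is not the retry code that was actually added to the frontend. The names `is_gateway_timeout` and `retry_on_timeout` are hypothetical, and it assumes the 504 is only visible as text inside the `RuntimeError` message, as in the traceback above. Any other exception is re-raised immediately.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def is_gateway_timeout(err: Exception) -> bool:
    # Assumption: the 504 only appears as text in the RuntimeError message
    # (see the traceback above), so this is a heuristic string match.
    return isinstance(err, RuntimeError) and "504" in str(err)


def retry_on_timeout(
    submit: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call submit(), retrying with exponential backoff only on a 504-style error."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except Exception as err:
            # Anything that is not a gateway timeout, or the final attempt,
            # propagates unchanged so real failures are not hidden.
            if not is_gateway_timeout(err) or attempt == max_attempts:
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Usage would be something like `retry_on_timeout(lambda: sx.deliver(spec))`, where `sx` and `spec` are the objects from `query_servicex` in the traceback. Whether retrying a whole `deliver` call is safe when some transforms have already been submitted is not something this sketch addresses.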
Here is the first time all 4 datasets got submitted. Sadly, someone else was running continuously in the background:
There is no way that number is right! |
On a quiet cluster, the full run went through!!!
(venv) [bash][gwatts]:idap-200gbps-atlas > python servicex/servicex_materialize_branches.py -v --distributed-client scheduler --dask-scheduler 'tcp://dask-gwatts-2e1782e2-0.af-jupyter:8786' --dask-profile --dataset multi_data --query xaod_small --num-files 0
0000.0374 - INFO - root - Using release 22.2.107 for type information.
0000.0732 - WARNING - func_adl.type_based_replacement - Unknown type for name len
0000.8248 - INFO - root - Running over 4 datasets, 142.636 TB and 19,074,862,754 events.
0000.8252 - INFO - root - Building ServiceX query
0000.8256 - INFO - root - Querying dataset data15_13TeV:data15_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp15_v01_p6026
0000.8256 - INFO - root - Querying dataset data16_13TeV:data16_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp16_v01_p6026
0000.8257 - INFO - root - Querying dataset data17_13TeV:data17_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp17_v01_p6026
0000.8257 - INFO - root - Querying dataset data18_13TeV:data18_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp18_v01_p6026
0000.8257 - INFO - root - Running on the full dataset(s).
0000.8258 - INFO - root - Starting ServiceX query
Transform ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/?
Download/URLs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/?
0006.8627 - INFO - servicex.query - ServiceX Transform speed_test_data15_13TeV:data15_13TeV.p
Transform ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/?
Download/URLs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/10049 --:--
0025.7301 - INFO - servicex.query - ServiceX Transform speed_test_data16_13TeV:da
Transform ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/?
Download/URLs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/10049 --:--
0030.2254 - INFO - servicex.query - ServiceX Transform speed_test_data17_13TeV:da
Transform ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/?
Download/URLs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/10049 --:--
0034.7581 - INFO - servicex.query - ServiceX Transform speed_test_data18_13TeV:da
Transform ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/?
Transform ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/?
Transform ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/?
Transform ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/?
Transform ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/?
Download/URLs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64803/64803 30:43
1949.7755 - INFO - root - Event rate for ServiceX: 00:32:28 time, 9787.25 kHz, Data rate: 585.49 Gbits/s
1949.7756 - INFO - root - Dataset speed_test_data15_13TeV:data15_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp15_v01_p6026 has 2893 files
1949.7756 - INFO - root - Dataset speed_test_data16_13TeV:data16_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp16_v01_p6026 has 3981 files
1949.7757 - INFO - root - Dataset speed_test_data17_13TeV:data17_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp17_v01_p6026 has 4204 files
1949.7757 - INFO - root - Dataset speed_test_data18_13TeV:data18_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp18_v01_p6026 has 4591 files
Traceback (most recent call last):
File "/home/gwatts/code/iris-hep/idap-200gbps-atlas/servicex/servicex_materialize_branches.py", line 508, in <module>
main(
File "/home/gwatts/code/iris-hep/idap-200gbps-atlas/servicex/servicex_materialize_branches.py", line 196, in main
report, n_events = dask.compute(*calculate_n_events(dataset_files, steps_per_file))
File "/venv/lib/python3.9/site-packages/dask/base.py", line 661, in compute
results = schedule(dsk, keys, **kwargs)
File "/venv/lib/python3.9/site-packages/distributed/client.py", line 2232, in _gather
raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: Attempted to run task ('<dask-awkward.lib.core.ArgsKwargsPackedFunction ob-c9e5eefd8340b50f30ff99510d27817b', 1078) on 4 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://172.16.160.82:37043. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html. |
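As a sanity check on the ServiceX numbers in the log above, the event rate and data rate can be roughly reproduced from the totals the script prints. This is a sketch, not the script's own calculation: `throughput` is a hypothetical helper, the elapsed time is taken from the rounded `00:32:28` in the log, and 1 TB is assumed to be 10^12 bytes.

```python
def throughput(n_events: int, size_tb: float, elapsed_s: float) -> tuple[float, float]:
    """Return (event rate in kHz, data rate in Gbit/s).

    Assumes decimal units: 1 TB = 1e12 bytes, 1 Gbit = 1e9 bits.
    """
    event_rate_khz = n_events / elapsed_s / 1e3
    data_rate_gbps = size_tb * 1e12 * 8 / elapsed_s / 1e9
    return event_rate_khz, data_rate_gbps


# Totals from the log: 4 datasets, 142.636 TB, 19,074,862,754 events.
# Elapsed time from the log line "00:32:28" (rounded to whole seconds).
elapsed_s = 32 * 60 + 28
khz, gbps = throughput(19_074_862_754, 142.636, elapsed_s)
print(f"{khz:.2f} kHz, {gbps:.2f} Gbits/s")
```

Both values come out within about 0.1% of the logged 9787.25 kHz and 585.49 Gbits/s. The small difference is what you would expect from the elapsed time being rounded to whole seconds. Note that this rate is computed from the full input size (142.636 TB), not from the bytes actually delivered.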
Ran on DASK without errors, but I had to limit it to 100 workers to make it work.
The 301-second-long green bars are clearly timeouts. It would be good to reduce the number of timeouts! |
With 500 workers:
(venv) [bash][gwatts]:idap-200gbps-atlas > python servicex/servicex_materialize_branches.py -v --distributed-client scheduler --dask-scheduler 'tcp://dask-gwatts-2e1782e2-0.af-jupyter:8786' --dask-profile --dataset multi_data --query xaod_small --num-files 0
0000.0598 - INFO - root - Registering retry HTTPFileSystem and HTTPFile with fsspec on DASK cluster
0000.7733 - INFO - root - Using release 22.2.107 for type information.
0000.8078 - WARNING - func_adl.type_based_replacement - Unknown type for name len
0001.5643 - INFO - root - Running over 4 datasets, 142.636 TB and 19,074,862,754 events.
0001.5646 - INFO - root - Building ServiceX query
0001.5650 - INFO - root - Querying dataset data15_13TeV:data15_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp15_v01_p6026
0001.5651 - INFO - root - Querying dataset data16_13TeV:data16_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp16_v01_p6026
0001.5651 - INFO - root - Querying dataset data17_13TeV:data17_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp17_v01_p6026
0001.5651 - INFO - root - Querying dataset data18_13TeV:data18_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp18_v01_p6026
0001.5652 - INFO - root - Running on the full dataset(s).
0001.5652 - INFO - root - Starting ServiceX query
0001.6020 - INFO - servicex.servicex_client - Returning code generators from cache
0001.6253 - INFO - servicex.query - Returning results from cache
0001.6438 - INFO - servicex.query - Returning results from cache
0001.6622 - INFO - servicex.query - Returning results from cache
0001.6810 - INFO - servicex.query - Returning results from cache
0001.6828 - INFO - root - Event rate for ServiceX not calculated since cached result was used
0001.6829 - INFO - root - Dataset speed_test_data15_13TeV:data15_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp15_v01_p6026 has 2893 files
0001.6829 - INFO - root - Dataset speed_test_data16_13TeV:data16_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp16_v01_p6026 has 3981 files
0001.6829 - INFO - root - Dataset speed_test_data17_13TeV:data17_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp17_v01_p6026 has 4204 files
0001.6830 - INFO - root - Dataset speed_test_data18_13TeV:data18_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp18_v01_p6026 has 4591 files
0001.6830 - INFO - root - Using `uproot.dask` to open files (splitting files 1 ways).
0409.5802 - INFO - root - Number of skimmed events: 87,463,684 (skim percent: 0.4585%)
0410.5958 - INFO - root - Starting build of DASK graphs
0415.0800 - INFO - root - Computing the total count
0855.2350 - INFO - root - Event rate for DASK Calculation: 00:07:20 time, 43336.73 kHz, Data rate: 2592.47 Gbits/s
0855.2352 - INFO - root - DASK event rate over actual events: 198.71 kHz
0855.2353 - INFO - root - speed_test_data15_13TeV:data15_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp15_v01_p6026: result = 16,310,072
0855.2353 - INFO - root - speed_test_data16_13TeV:data16_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp16_v01_p6026: result = 22,730,800
0855.2354 - INFO - root - speed_test_data17_13TeV:data17_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp17_v01_p6026: result = 25,561,943
0855.2354 - INFO - root - speed_test_data18_13TeV:data18_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp18_v01_p6026: result = 22,860,869
|
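The numbers in the 500-worker log are self-consistent, which can be checked with a few lines of arithmetic. This is a sketch under two assumptions, not the script's code. The elapsed time is taken as the difference between the "Computing the total count" timestamp (0415.0800) and the rate line (0855.2350), and 1 TB is assumed to be 10^12 bytes. The short dataset keys are just labels for illustration.

```python
# Per-dataset results copied from the log (skimmed event counts).
per_dataset = {
    "data15": 16_310_072,
    "data16": 22_730_800,
    "data17": 25_561_943,
    "data18": 22_860_869,
}
total_events = 19_074_862_754  # "Running over 4 datasets ... events"
total_tb = 142.636

# The per-dataset results should add up to the "skimmed events" count.
skimmed = sum(per_dataset.values())
skim_percent = 100 * skimmed / total_events

# Assumption: the DASK timing spans these two log timestamps (about 7m20s).
elapsed_s = 855.2350 - 415.0800

full_rate_khz = total_events / elapsed_s / 1e3      # rate over all input events
actual_rate_khz = skimmed / elapsed_s / 1e3         # rate over events DASK read
data_rate_gbps = total_tb * 1e12 * 8 / elapsed_s / 1e9

print(f"skimmed={skimmed:,} ({skim_percent:.4f}%)")
print(f"{full_rate_khz:.2f} kHz, {data_rate_gbps:.2f} Gbits/s, "
      f"actual {actual_rate_khz:.2f} kHz")
```

The point worth keeping in mind is that the 2592.47 Gbits/s figure is an effective rate relative to the original 142.636 TB of input. DASK itself only processes the roughly 0.46% of events that survive the ServiceX skim, which is what the "event rate over actual events" line reports.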
Run tests on SX with multiple datasets. See how fast we can go!