
Remove subset inputs #6

Merged · 6 commits · Jun 24, 2022

Conversation

@paigem (Contributor) commented Jun 9, 2022

Create 30-day time chunks (13 total: 12 x 30-day chunks and 1 x 5-day chunk) in an attempt to solve #5.

@cisaacstern could you merge this PR? Thanks for your help with this!
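The chunking arithmetic above can be sketched in a few lines of plain Python (illustrative only; the helper name is hypothetical, not the recipe code):

```python
# Illustration of the 30-day chunking described above: a 365-step year
# split into 30-day chunks yields twelve full chunks plus one 5-day
# remainder chunk, for 13 chunks total.

def time_chunks(n_steps: int, chunk_len: int) -> list:
    """Return the chunk sizes covering n_steps in chunks of chunk_len."""
    full, remainder = divmod(n_steps, chunk_len)
    sizes = [chunk_len] * full
    if remainder:
        sizes.append(remainder)
    return sizes

chunks = time_chunks(365, 30)
print(len(chunks), chunks)  # 13 chunks: twelve 30s and one 5
```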

@pangeo-forge-bot (Collaborator)

🎉 New recipe runs created for the following recipes at sha 60f6e9352e9eead4663231b2bc489f96eac2ad68:

@cisaacstern (Member)

@paigem, since we ended up with an error in #5 which may have been caught by an initial test, I think it may save us some effort if I run a test prior to merging. I'll do that now, and merge pending a successful result. Thanks for the PR!

@cisaacstern (Member)

/run recipe-rest recipe_run_id=353

@cisaacstern (Member)

Misspelled the last command!

@cisaacstern (Member)

/run recipe-test recipe_run_id=353

@pangeo-forge-bot (Collaborator)

✨ A test of your recipe cesm-atm-025deg is now running on Pangeo Forge Cloud!

I'll notify you with a comment on this thread when this test is complete. (This could be a little while...)

In the meantime, you can follow the logs for this recipe run at https://pangeo-forge.org/dashboard/recipe-run/353

@pangeo-forge-bot (Collaborator)

Pangeo Forge Cloud told me that our test of your recipe cesm-atm-025deg failed. But don't worry, I'm sure we can fix this!

To see what error caused the failure, please review the logs at https://pangeo-forge.org/dashboard/recipe-run/353

If you haven't yet tried pruning and running your recipe locally, I suggest trying that now.

Please report back on the results of your local testing in a new comment below, and a Pangeo Forge maintainer will help you with next steps!
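Since "pruning" comes up repeatedly in this thread: pruning cuts a recipe down to its first couple of inputs along the concatenation dimension so a test run stays small. A minimal sketch in plain Python (the file names here are made up; pangeo-forge-recipes itself exposes this via a `copy_pruned()` method on recipe objects):

```python
# Sketch of pruning: keep only the first two inputs along the concat
# dimension so a local test run is small and fast. File names are
# hypothetical, not the actual CESM source files.

inputs = [f"cesm.atm.h1.{year:04d}.nc" for year in range(78, 91)]

def prune(all_inputs, keep=2):
    """Return a shortened input list for a quick local test."""
    return all_inputs[:keep]

test_inputs = prune(inputs)
print(test_inputs)  # ['cesm.atm.h1.0078.nc', 'cesm.atm.h1.0079.nc']
```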

@cisaacstern (Member)

Hmm, the linked logs are not helpful without a solution for pangeo-forge/pangeo-forge.org#63.

Pulling logs from the backend directly, I'm seeing this rather opaque error:

Task 'store_chunk[336]': Exception encountered during task execution!
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/comm/tcp.py", line 205, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/prefect/engine/task_runner.py", line 861, in get_task_run_state
    value = prefect.utilities.executors.run_task_with_timeout(
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/prefect/utilities/executors.py", line 323, in run_task_with_timeout
    return task.run(*args, **kwargs)  # type: ignore
  File "/usr/local/lib/python3.9/site-packages/registrar/flow.py", line 113, in wrapper
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py", line 631, in store_chunk
    with lock_for_conflicts(lock_keys, timeout=config.lock_timeout):
  File "/srv/conda/envs/notebook/lib/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/utils.py", line 106, in lock_for_conflicts
    acquired = lock.acquire(timeout=timeout)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/lock.py", line 137, in acquire
    result = self.client.sync(
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/client.py", line 868, in sync
    return sync(
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/utils.py", line 332, in sync
    raise exc.with_traceback(tb)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/utils.py", line 315, in f
    result[0] = yield future
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/core.py", line 895, in send_recv_from_rpc
    result = await send_recv(comm=comm, op=key, **kwargs)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/core.py", line 672, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
    convert_stream_closed_error(self, e)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.lock_acquire local=tcp://10.60.0.5:58998 remote=tcp://dask-jovyan-d630fd13-4.pangeo-forge-columbia-staging-bakery:8786>: Stream is closed

@paigem, does the version of the recipe in this PR execute locally for you without error?

@paigem (Contributor, Author) commented Jun 10, 2022

Yes, @cisaacstern this recipe runs without errors locally.

Side note: the warning about the 255-character length limit gets printed many, many times in the local run, so it's difficult to sift through the output for other warnings. In this case, though, it looks like there are no errors and no other warnings either.
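When one warning floods the local output like this, deduplicating the captured warnings makes any stragglers visible. A generic stdlib sketch (not specific to pangeo-forge-recipes; the warning texts are made up):

```python
# Collapse repeated warnings so a lone different warning stands out.
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every emission
    for _ in range(100):
        warnings.warn("filename longer than 255 characters", UserWarning)
    warnings.warn("something else worth noticing", RuntimeWarning)

# 101 emissions collapse to 2 distinct (category, message) pairs.
unique = {(w.category.__name__, str(w.message)) for w in caught}
for category, message in sorted(unique):
    print(f"{category}: {message}")
```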

@pangeo-forge-bot (Collaborator)

🎉 New recipe runs created for the following recipes at sha 669c2bfe6403674553b4fdf96331266152f27e78:

@cisaacstern (Member)

/run recipe-test recipe_run_id=396

@pangeo-forge-bot (Collaborator)

When I tried to import your recipe module, I encountered this error

            line 47
        target_chunks=target_chunks,
                                    ^
    SyntaxError: unexpected EOF while parsing

Please correct your recipe module so that it's importable.

@pangeo-forge-bot (Collaborator)

🎉 New recipe runs created for the following recipes at sha f5c78aa30faaef25485aeb3fb5d4967bb3ebdf80:

@cisaacstern (Member)

/run recipe-test recipe_run_id=397

@pangeo-forge-bot (Collaborator)

✨ A test of your recipe cesm-atm-025deg is now running on Pangeo Forge Cloud!

I'll notify you with a comment on this thread when this test is complete. (This could be a little while...)

In the meantime, you can follow the logs for this recipe run at https://pangeo-forge.org/dashboard/recipe-run/397

@pangeo-forge-bot (Collaborator)

Pangeo Forge Cloud told me that our test of your recipe cesm-atm-025deg failed. But don't worry, I'm sure we can fix this!

To see what error caused the failure, please review the logs at https://pangeo-forge.org/dashboard/recipe-run/397

If you haven't yet tried pruning and running your recipe locally, I suggest trying that now.

Please report back on the results of your local testing in a new comment below, and a Pangeo Forge maintainer will help you with next steps!

@pangeo-forge-bot (Collaborator)

🎉 New recipe runs created for the following recipes at sha 20ab126604bbb34b8c23f4f20ac861bb8364c5d3:

@cisaacstern (Member)

Two reflections:

  1. I'd previously removed subset_inputs, but having just taken a closer look at the input file sizes (each ~1 GB), I do think subset_inputs makes sense, and I've restored it as {"time": 2}, a more modest level of subsetting than we currently have in main.
  2. I've just manually deleted the cached files for this recipe. It's occurred to me that my assumptions about cache-reuse with our current infrastructure may have been faulty: it's possible that subsequent recipe runs which attempt to re-use a cache from a prior recipe run may run into problems. We have a clean slate now for the cache.
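For scale, the back-of-envelope arithmetic behind restoring subset_inputs={"time": 2} (illustrative numbers only, not pangeo-forge-recipes internals):

```python
# Each ~1 GB input gets split into 2 pieces along time, so each
# store task handles roughly half the data at once.
input_size_gb = 1.0          # approximate size of each source file
subset_inputs = {"time": 2}  # split factor per dimension

n_pieces = 1
for factor in subset_inputs.values():
    n_pieces *= factor

piece_size_gb = input_size_gb / n_pieces
print(f"{n_pieces} pieces per input, ~{piece_size_gb:.1f} GB each")
```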

@cisaacstern (Member)

/run recipe-test recipe_run_id=399

@pangeo-forge-bot (Collaborator)

✨ A test of your recipe cesm-atm-025deg is now running on Pangeo Forge Cloud!

I'll notify you with a comment on this thread when this test is complete. (This could be a little while...)

In the meantime, you can follow the logs for this recipe run at https://pangeo-forge.org/dashboard/recipe-run/399

@pangeo-forge-bot (Collaborator)

Pangeo Forge Cloud told me that our test of your recipe cesm-atm-025deg failed. But don't worry, I'm sure we can fix this!

To see what error caused the failure, please review the logs at https://pangeo-forge.org/dashboard/recipe-run/399

If you haven't yet tried pruning and running your recipe locally, I suggest trying that now.

Please report back on the results of your local testing in a new comment below, and a Pangeo Forge maintainer will help you with next steps!

@cisaacstern (Member)

I've just increased worker memory in hopes that this may allow us to move past the KilledWorker issues we've seen. I'm going to try to re-run this recipe from the existing recipe run now.

@cisaacstern (Member)

/run recipe-test recipe_run_id=399

@pangeo-forge-bot (Collaborator)

✨ A test of your recipe cesm-atm-025deg is now running on Pangeo Forge Cloud!

I'll notify you with a comment on this thread when this test is complete. (This could be a little while...)

In the meantime, you can follow the logs for this recipe run at https://pangeo-forge.org/dashboard/recipe-run/399

@cisaacstern (Member)

This hasn't officially failed yet, but it does appear to be stalled and likely to time out.

Because we have a lot more memory available to us now, I'm going to remove the subset inputs again to make this more intuitive to debug.

@pangeo-forge-bot (Collaborator)

🎉 New recipe runs created for the following recipes at sha 591f2f790bb140aa697c48891fcc8d4125c12d78:

@cisaacstern (Member)

399 has not failed yet, but I expect it will.

I'm going to run 501 now, which I imagine will also fail, but because subsetting is now disabled (thanks to the larger memory allocation), it should be more intuitive to debug once it does.

@cisaacstern (Member)

/run recipe-test recipe_run_id=501

@pangeo-forge-bot (Collaborator)

✨ A test of your recipe cesm-atm-025deg is now running on Pangeo Forge Cloud!

I'll notify you with a comment on this thread when this test is complete. (This could be a little while...)

In the meantime, you can follow the logs for this recipe run at https://pangeo-forge.org/dashboard/recipe-run/501

@cisaacstern (Member)

I just manually cancelled 501. It had been hanging on a single store_chunk task since yesterday.

There's a lot of noise in the full traceback, but looking closely, it appears to have stalled while acquiring a lock. Note these lines in the full trace:

  |   | 2022-06-24T16:41:39.820282266Z stderr F   File "/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/utils.py", line 106, in lock_for_conflicts
  |   | 2022-06-24T16:41:39.820286606Z stderr F     acquired = lock.acquire(timeout=timeout)
Full traceback:

2022-06-24T16:41:38.38910185Z stderr F distributed.dask_worker - INFO - Exiting on signal 15
2022-06-24T16:41:38.389540753Z stderr F distributed.nanny - INFO - Closing Nanny at 'tcp://10.60.8.7:36169'
2022-06-24T16:41:38.394733293Z stderr F distributed.worker - INFO - Stopping worker at tcp://10.60.8.7:44651
[stdout copy of the identical 'store_chunk[11]' traceback omitted; stderr copy follows]
2022-06-24T16:41:39.820138647Z stderr F ERROR:prefect.CloudTaskRunner:Task 'store_chunk[11]': Exception encountered during task execution!
2022-06-24T16:41:39.820186395Z stderr F Traceback (most recent call last):
2022-06-24T16:41:39.820196449Z stderr F   File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/comm/tcp.py", line 205, in read
2022-06-24T16:41:39.820203736Z stderr F     frames_nbytes = await stream.read_bytes(fmt_size)
2022-06-24T16:41:39.820211269Z stderr F tornado.iostream.StreamClosedError: Stream is closed
2022-06-24T16:41:39.820217195Z stderr F
2022-06-24T16:41:39.820223646Z stderr F The above exception was the direct cause of the following exception:
2022-06-24T16:41:39.820228905Z stderr F
2022-06-24T16:41:39.82023335Z stderr F Traceback (most recent call last):
2022-06-24T16:41:39.820238096Z stderr F   File "/srv/conda/envs/notebook/lib/python3.9/site-packages/prefect/engine/task_runner.py", line 861, in get_task_run_state
2022-06-24T16:41:39.820244554Z stderr F     value = prefect.utilities.executors.run_task_with_timeout(
2022-06-24T16:41:39.820249631Z stderr F   File "/srv/conda/envs/notebook/lib/python3.9/site-packages/prefect/utilities/executors.py", line 323, in run_task_with_timeout
2022-06-24T16:41:39.820254158Z stderr F     return task.run(*args, **kwargs)  # type: ignore
2022-06-24T16:41:39.820258797Z stderr F   File "/usr/local/lib/python3.9/site-packages/registrar/flow.py", line 113, in wrapper
2022-06-24T16:41:39.820263903Z stderr F   File "/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py", line 631, in store_chunk
2022-06-24T16:41:39.820268316Z stderr F     with lock_for_conflicts(lock_keys, timeout=config.lock_timeout):
2022-06-24T16:41:39.820272887Z stderr F   File "/srv/conda/envs/notebook/lib/python3.9/contextlib.py", line 119, in __enter__
2022-06-24T16:41:39.82027737Z stderr F     return next(self.gen)
2022-06-24T16:41:39.820282266Z stderr F   File "/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/utils.py", line 106, in lock_for_conflicts
2022-06-24T16:41:39.820286606Z stderr F     acquired = lock.acquire(timeout=timeout)
2022-06-24T16:41:39.820291086Z stderr F   File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/lock.py", line 137, in acquire
2022-06-24T16:41:39.820295475Z stderr F     result = self.client.sync(
2022-06-24T16:41:39.820299846Z stderr F   File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/client.py", line 868, in sync
2022-06-24T16:41:39.820304218Z stderr F     return sync(
2022-06-24T16:41:39.820320311Z stderr F   File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/utils.py", line 332, in sync
2022-06-24T16:41:39.820324877Z stderr F     raise exc.with_traceback(tb)
2022-06-24T16:41:39.820329103Z stderr F   File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/utils.py", line 315, in f
2022-06-24T16:41:39.820333453Z stderr F     result[0] = yield future
2022-06-24T16:41:39.820337847Z stderr F   File "/srv/conda/envs/notebook/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
2022-06-24T16:41:39.820342096Z stderr F     value = future.result()
2022-06-24T16:41:39.820346316Z stderr F   File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/core.py", line 895, in send_recv_from_rpc
2022-06-24T16:41:39.820351913Z stderr F     result = await send_recv(comm=comm, op=key, **kwargs)
2022-06-24T16:41:39.820356729Z stderr F   File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/core.py", line 672, in send_recv
2022-06-24T16:41:39.820361286Z stderr F     response = await comm.read(deserializers=deserializers)
2022-06-24T16:41:39.82036562Z stderr F   File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/comm/tcp.py", line 221, in read
2022-06-24T16:41:39.820370048Z stderr F     convert_stream_closed_error(self, e)
2022-06-24T16:41:39.820374447Z stderr F   File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
2022-06-24T16:41:39.820378986Z stderr F     raise CommClosedError(f"in {obj}: {exc}") from exc
2022-06-24T16:41:39.820383506Z stderr F distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.lock_acquire local=tcp://10.60.8.7:57734 remote=tcp://dask-jovyan-779c8f5b-c.pangeo-forge-columbia-staging-bakery:8786>: Stream is closed
2022-06-24T16:41:39.995283823Z stderr F distributed.nanny - WARNING - Worker process still alive after 1 seconds, killing
2022-06-24T16:41:39.997992272Z stderr F distributed.dask_worker - INFO - End worker
2022-06-24T16:41:39.999204802Z stderr F distributed.process - INFO - reaping stray process <SpawnProcess name='Dask Worker process (from Nanny)' pid=313 parent=1 started daemon>
2022-06-24T16:41:40.196608195Z stderr F Exception in thread AsyncProcess Dask Worker process (from Nanny) watch process join:
2022-06-24T16:41:40.19677442Z stderr F Traceback (most recent call last):
2022-06-24T16:41:40.196795839Z stderr F   File "/srv/conda/envs/notebook/lib/python3.9/threading.py", line 973, in _bootstrap_inner

I'm going to try restoring target_chunks to the original {"time": 73}, which, as a divisor of 365 (the number of time steps per file), should prevent the need for any locking.

This is the setting we had initially run with, and which caused a KilledWorker in #5, but we have 3x the worker memory now, so perhaps all will work smoothly.
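The divisor argument can be checked with a little arithmetic: each input file holds 365 time steps and 365 % 73 == 0, so every input's writes land on whole target chunks and no two inputs ever touch the same chunk. A sketch (illustrative only, not pangeo-forge-recipes internals):

```python
# Which target chunks does each input file write into?
STEPS_PER_INPUT = 365  # time steps per source file

def chunks_touched(input_index, chunk_len):
    """Target-chunk indices covered by one input's time steps."""
    start = input_index * STEPS_PER_INPUT
    stop = start + STEPS_PER_INPUT
    return {t // chunk_len for t in range(start, stop)}

# Aligned chunk size (73 divides 365): neighboring inputs are disjoint,
# so no write locks are needed.
print(chunks_touched(0, 73) & chunks_touched(1, 73))  # set()

# Misaligned chunk size (e.g. 30): chunk 12 is shared between inputs
# 0 and 1, so concurrent writes to it must be serialized with a lock.
print(chunks_touched(0, 30) & chunks_touched(1, 30))  # {12}
```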

@pangeo-forge-bot (Collaborator)

🎉 New recipe runs created for the following recipes at sha ca47e1e6f5f0eb40bad9bbbe620ff3131f97493d:

@cisaacstern (Member)

/run recipe-test recipe_run_id=502

@pangeo-forge-bot (Collaborator)

✨ A test of your recipe cesm-atm-025deg is now running on Pangeo Forge Cloud!

I'll notify you with a comment on this thread when this test is complete. (This could be a little while...)

In the meantime, you can follow the logs for this recipe run at https://pangeo-forge.org/dashboard/recipe-run/502

@pangeo-forge-bot (Collaborator)

🥳 Hooray! The test execution of your recipe cesm-atm-025deg succeeded.

Here is a static representation of the dataset built by this recipe:

            <xarray.Dataset>
    Dimensions:    (time: 730, lat: 768, lon: 1152, ilev: 31, lev: 30, nbnd: 2)
    Coordinates:
      * ilev       (ilev) float64 2.255 5.032 10.16 18.56 ... 967.5 985.1 1e+03
      * lat        (lat) float64 -90.0 -89.77 -89.53 -89.3 ... 89.3 89.53 89.77 90.0
      * lev        (lev) float64 3.643 7.595 14.36 24.61 ... 936.2 957.5 976.3 992.6
      * lon        (lon) float64 0.0 0.3125 0.625 0.9375 ... 358.8 359.1 359.4 359.7
      * time       (time) object 0078-01-01 00:00:00 ... 0079-12-31 00:00:00
    Dimensions without coordinates: nbnd
    Data variables: (12/39)
        FLDS       (time, lat, lon) float32 dask.array<chunksize=(73, 768, 1152), meta=np.ndarray>
        FSNS       (time, lat, lon) float32 dask.array<chunksize=(73, 768, 1152), meta=np.ndarray>
        LHFLX      (time, lat, lon) float32 dask.array<chunksize=(73, 768, 1152), meta=np.ndarray>
        P0         float64 ...
        PSL        (time, lat, lon) float32 dask.array<chunksize=(73, 768, 1152), meta=np.ndarray>
        QREFHT     (time, lat, lon) float32 dask.array<chunksize=(73, 768, 1152), meta=np.ndarray>
        ...         ...
        nsteph     (time) float64 dask.array<chunksize=(73,), meta=np.ndarray>
        ntrk       float64 ...
        ntrm       float64 ...
        ntrn       float64 ...
        sol_tsi    (time) float64 dask.array<chunksize=(73,), meta=np.ndarray>
        time_bnds  (time, nbnd) object dask.array<chunksize=(73, 2), meta=np.ndarray>
    Attributes: (12/18)
        Conventions:      CF-1.0
        TITLE:            REMAPPED: UNSET                                        ...
        Version:          $Name$
        case:             hybrid_v5_rel04_BC5_ne120_t12_pop62                    ...
        creation_date:    Sat Sep 20 07:47:59 MDT 2014
        host:             ys0103          
        ...               ...
        revision_Id:      $Id$
        separator1:       ------- SOURCE FILE ATTRIBUTES --------
        separator2:       ---------------------------------------
        source:           CAM
        title:            UNSET                                                  ...
        topography_file:  /glade/p/cesm/cseg//inputdata/atm/cam/topo/USGS_gtopo30...

You can also open this dataset by running the following Python code:

import fsspec
import xarray as xr

dataset_public_url = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test/prod/recipe-run-502/pangeo-forge/cesm-atm-025deg-feedstock/cesm-atm-025deg.zarr'
mapper = fsspec.get_mapper(dataset_public_url)
ds = xr.open_zarr(mapper, consolidated=True)
ds

(Run it in a notebook or your Python interpreter of choice.)

Checklist

Please copy-and-paste the list below into a new comment on this thread, and check the boxes off as you've reviewed them.

Note: This test execution is limited to two increments in the concatenation dimension, so you should expect the length of that dimension (e.g., "time" or equivalent) to be 2.

- [ ] Are the dimension lengths correct?
- [ ] Are all of the expected variables present?
- [ ] Does plotting the data produce a plot that looks like your dataset?
- [ ] Can you run a simple computation/reduction on the data and produce a plausible result?

@cisaacstern (Member)

OK, so here is what it ultimately took to make this work:

  1. Increase worker memory from 4 GB -> 12 GB
  2. Remove subsetting inputs (not required with more worker memory)
  3. Keep original large time chunks, to avoid locking stalls (also possible via more memory)

Going to rename this PR accordingly, for clarity, then merge, which will trigger the next production build.

@cisaacstern cisaacstern changed the title Create smaller time chunks Remove subset inputs Jun 24, 2022
@cisaacstern cisaacstern merged commit a795a1b into pangeo-forge:main Jun 24, 2022
@rabernat

Charles thanks so much for your perseverance here! 🙏

@paigem (Contributor, Author) commented Jun 24, 2022

Amazing!! We have a successful test! Thank you @cisaacstern!!

@cisaacstern (Member)

[Screenshot: Pangeo Forge dashboard, Jun 24, 2022, 2:28 PM]

KilledWorker on the production run 😵‍💫 https://pangeo-forge.org/dashboard/recipe-run/503

I will look into this more closely and get back with some ideas. Thanks for your patience, Paige.

@paigem (Contributor, Author) commented Jun 24, 2022

Oh no!! So sorry about this @cisaacstern. Thanks for continuing to push this through!

Having the test data should go pretty far in the short term, so that's at least a good step!
