some notes on data transfer rates using .to_zarr #166
You might consider looking at the profile tab to see if we're blocked on
bandwidth (something like read or write from a network library) or
something else.
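A minimal sketch of grabbing that information programmatically (assuming a reasonably recent dask.distributed; the scheduler address and output filenames are placeholders):
from dask.distributed import Client, performance_report
client = Client('tcp://scheduler-address:8786')      # placeholder scheduler address
# record the task stream, bandwidth, and worker profile pages to a standalone HTML file
with performance_report(filename='to_zarr_report.html'):
    ds.to_zarr(store=gcsmap, encoding=encoding)
# or pull just the aggregated worker profile afterwards
client.profile(filename='to_zarr_profile.html')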
On Fri, Mar 16, 2018 at 4:52 PM, Ryan Abernathey wrote:
I am currently pushing a 5 TB dataset to GCS using xarray.to_zarr. The
dataset has the following structure
<xarray.Dataset>
Dimensions: (nv: 2, st_edges_ocean: 51, st_ocean: 50, time: 730, xt_ocean: 3600, xu_ocean: 3600, yt_ocean: 2700, yu_ocean: 2700)
Coordinates:
* xt_ocean (xt_ocean) float64 -279.9 -279.8 -279.7 -279.6 -279.5 ...
* yt_ocean (yt_ocean) float64 -81.11 -81.07 -81.02 -80.98 -80.94 ...
* st_ocean (st_ocean) float64 5.034 15.1 25.22 35.36 45.58 55.85 ...
* st_edges_ocean (st_edges_ocean) float64 0.0 10.07 20.16 30.29 40.47 ...
* nv (nv) float64 1.0 2.0
* time (time) float64 6.94e+04 6.94e+04 6.941e+04 6.941e+04 ...
* xu_ocean (xu_ocean) float64 -279.9 -279.8 -279.7 -279.6 -279.5 ...
* yu_ocean (yu_ocean) float64 -81.09 -81.05 -81.0 -80.96 -80.92 ...
Data variables:
temp (time, st_ocean, yt_ocean, xt_ocean) float32 dask.array<shape=(730, 50, 2700, 3600), chunksize=(1, 1, 2700, 3600)>
salt (time, st_ocean, yt_ocean, xt_ocean) float32 dask.array<shape=(730, 50, 2700, 3600), chunksize=(1, 1, 2700, 3600)>
u (time, st_ocean, yu_ocean, xu_ocean) float32 dask.array<shape=(730, 50, 2700, 3600), chunksize=(1, 1, 2700, 3600)>
v (time, st_ocean, yu_ocean, xu_ocean) float32 dask.array<shape=(730, 50, 2700, 3600), chunksize=(1, 1, 2700, 3600)>
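(For scale, the chunking above works out to roughly 39 MB per uncompressed chunk and on the order of 146,000 chunks across the four variables:)
chunk_bytes = 2700 * 3600 * 4            # one (1, 1, 2700, 3600) float32 chunk ~= 38.9 MB uncompressed
chunks_per_variable = 730 * 50           # one chunk per (time, st_ocean) pair = 36,500
total_chunks = 4 * chunks_per_variable   # 4 data variables -> ~146,000 objects to upload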
I use the following encoding
import zarr
compressor = zarr.Blosc(cname='zstd', clevel=3, shuffle=2)
encoding = {vname: {'compressor': compressor} for vname in ds.variables}
ds.to_zarr(store=gcsmap, encoding=encoding)
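(The gcsmap store is not defined above; for context, a minimal sketch of how such a mapping is typically created with a recent gcsfs — the project name and bucket path here are placeholders:)
import gcsfs
fs = gcsfs.GCSFileSystem(project='my-project', token='cloud')   # placeholder project / default cloud credentials
gcsmap = fs.get_mapper('my-bucket/ocean-dataset.zarr')          # MutableMapping usable as a zarr store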
Here is what the dashboard shows while this is happening.
[dashboard screenshot: https://user-images.githubusercontent.com/1197350/37543937-a7beeea4-2939-11e8-949b-0e6e3e424421.png]
The total data size is 5.6 TB. By monitoring the system via netdata, I can
see that the transfer rate is approx. 18.75 MB / s. At this rate it will
take about 86 hours to transfer the dataset.
In comparison, copying the raw netCDF files via Globus from the same server to Cheyenne gives a transfer rate of 114.07 MB / s, roughly 6x faster.
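(Back-of-the-envelope, using the figures above:)
direct_rate = 18.75e6                     # observed to_zarr -> GCS rate, bytes/s
globus_rate = 114.07e6                    # observed Globus rate for the raw netCDF files, bytes/s
total_bytes = 5.6e12                      # total dataset size
print(globus_rate / direct_rate)          # ~6.1x faster
print(total_bytes / direct_rate / 3600)   # roughly 83 hours at the direct rate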
I am trying to understand the bottlenecks here. If I give the cluster more
threads, it doesn't go any faster.
I am trying an alternate approach for uploading zarr data to GCS: I first dump the dataset to a regular zarr store on local disk, then copy the resulting files up to GCS in a separate step.
This appears much more stable and totally saturates my network at 111 MiB/s. The downside is that you need to have space to duplicate the data. I am now convinced that the bottleneck is in the direct to_zarr write path to GCS.
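(A minimal sketch of that two-step pattern; the local scratch path, bucket name, and the use of gsutil for the copy are assumptions, since the comment does not spell them out:)
# step 1: write the dataset to a local zarr store (needs ~5.6 TB of scratch space)
ds.to_zarr('/scratch/ocean-dataset.zarr', encoding=encoding)
# step 2: copy the store up to GCS with a parallel transfer tool, e.g.
#   gsutil -m rsync -r /scratch/ocean-dataset.zarr gs://my-bucket/ocean-dataset.zarr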
@rabernat - This is a very good diagnostic result and a nice alternative method for moving zarr datasets around. It may be useful, for comparative purposes, to try to understand how/why the direct to_zarr write to GCS is so much slower.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.