some notes on data transfer rates using .to_zarr #166

Closed
rabernat opened this issue Mar 16, 2018 · 5 comments
Comments

@rabernat (Member) commented Mar 16, 2018

I am currently pushing a 5 TB dataset to GCS using xarray.to_zarr. The dataset has the following structure:

<xarray.Dataset>
Dimensions:         (nv: 2, st_edges_ocean: 51, st_ocean: 50, time: 730, xt_ocean: 3600, xu_ocean: 3600, yt_ocean: 2700, yu_ocean: 2700)
Coordinates:
  * xt_ocean        (xt_ocean) float64 -279.9 -279.8 -279.7 -279.6 -279.5 ...
  * yt_ocean        (yt_ocean) float64 -81.11 -81.07 -81.02 -80.98 -80.94 ...
  * st_ocean        (st_ocean) float64 5.034 15.1 25.22 35.36 45.58 55.85 ...
  * st_edges_ocean  (st_edges_ocean) float64 0.0 10.07 20.16 30.29 40.47 ...
  * nv              (nv) float64 1.0 2.0
  * time            (time) float64 6.94e+04 6.94e+04 6.941e+04 6.941e+04 ...
  * xu_ocean        (xu_ocean) float64 -279.9 -279.8 -279.7 -279.6 -279.5 ...
  * yu_ocean        (yu_ocean) float64 -81.09 -81.05 -81.0 -80.96 -80.92 ...
Data variables:
    temp            (time, st_ocean, yt_ocean, xt_ocean) float32 dask.array<shape=(730, 50, 2700, 3600), chunksize=(1, 1, 2700, 3600)>
    salt            (time, st_ocean, yt_ocean, xt_ocean) float32 dask.array<shape=(730, 50, 2700, 3600), chunksize=(1, 1, 2700, 3600)>
    u               (time, st_ocean, yu_ocean, xu_ocean) float32 dask.array<shape=(730, 50, 2700, 3600), chunksize=(1, 1, 2700, 3600)>
    v               (time, st_ocean, yu_ocean, xu_ocean) float32 dask.array<shape=(730, 50, 2700, 3600), chunksize=(1, 1, 2700, 3600)>
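For scale, a quick back-of-the-envelope from the repr above: each chunk is 1 x 1 x 2700 x 3600 float32 values, i.e. 2700 * 3600 * 4 bytes, or about 39 MB uncompressed, and each variable is split into 730 * 50 = 36,500 chunks, so the four variables amount to roughly 146,000 objects and about 5.7 TB in total.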

I use the following encoding:

import zarr
# compress every variable with Blosc/zstd (bit-shuffle, compression level 3)
compressor = zarr.Blosc(cname='zstd', clevel=3, shuffle=2)
encoding = {vname: {'compressor': compressor} for vname in ds.variables}
# gcsmap is a gcsfs-backed mutable mapping pointing at the destination bucket
ds.to_zarr(store=gcsmap, encoding=encoding)
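For reference, gcsmap above is a gcsfs-backed mutable mapping for the destination bucket. A minimal sketch of constructing one (the project id is a placeholder and the exact bucket path is only illustrative; newer gcsfs versions also offer fs.get_mapper for the same purpose):

import gcsfs
# authenticate to GCP; the project id below is a placeholder
fs = gcsfs.GCSFileSystem(project='my-gcp-project')
# wrap a bucket prefix as a MutableMapping that zarr can write into
gcsmap = gcsfs.mapping.GCSMap('pangeo-data/cm2.6/control', gcs=fs, check=False, create=True)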

Here is what the dashboard shows while this is happening.

[screenshot of the dask dashboard during the upload]

The total data size is 5.6 TB. By monitoring the system via netdata, I can see that the transfer rate is approx. 18.75 MB / s. At this rate it will take about 86 hours to transfer the dataset.

In comparison, copying the raw netCDF files via Globus from the same server to Cheyenne gives a transfer rate of 114.07 MB/s, roughly 6x faster.

I am trying to understand the bottlenecks here. If I give the cluster more threads, it doesn't go any faster.

@mrocklin (Member) commented Mar 16, 2018 via email

@rabernat (Member, Author) commented Apr 3, 2018

I am trying an alternative approach for uploading zarr data to GCS: I first dump the dataset to a regular zarr DirectoryStore, then use gsutil to upload the files as objects:

gsutil -m cp -r control gs://pangeo-data/cm2.6/

This appears much more stable and totally saturates my network at 111 MiB/s. The downside is that you need to have space to duplicate the data.
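A minimal sketch of this two-step workflow, assuming the dataset and encoding from the first comment and a local scratch directory named control:

import zarr
# step 1: transcode to a local zarr DirectoryStore
# (requires enough scratch space for a full compressed copy of the dataset)
ds.to_zarr(store=zarr.DirectoryStore('control'), encoding=encoding)
# step 2: push the resulting directory tree to GCS from the shell:
#   gsutil -m cp -r control gs://pangeo-data/cm2.6/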

I am now convinced that the ds.to_zarr(gcsmap) approach is I/O bound on my system by xarray's speed at reading from disk. (It takes about the same amount of time to transcode to a zarr DirectoryStore as it does to copy directly to GCS.) The problem with ds.to_zarr(gcsmap) is that it is highly error-prone.

@jhamman (Member) commented Apr 3, 2018

@rabernat - This is a very good diagnostic result and a nice alternative method for moving zarr datasets around. It may be useful, for comparative purposes, to try to understand how/why gsutil is able to achieve a more robust transfer to GCS. Presumably, it is more flexible in terms of retries and timeouts (IIRC, both are configurable within dask).
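For example, something along these lines could be a starting point for experimenting with retries on the dask side (an untested sketch: it assumes to_zarr accepts compute=False and returns a dask Delayed, and uses the distributed client's retries keyword; the scheduler address is a placeholder):

from dask.distributed import Client
client = Client('tcp://scheduler-address:8786')  # placeholder address of the running scheduler
# build the write graph without executing it (compute=False needs a sufficiently recent xarray)
delayed_store = ds.to_zarr(store=gcsmap, encoding=encoding, compute=False)
# let the distributed scheduler retry failed tasks before giving up
client.compute(delayed_store, retries=3).result()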

stale bot commented Jun 25, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Jun 25, 2018
stale bot commented Jul 2, 2018

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

stale bot closed this as completed Jul 2, 2018