Commit 1a0bbb5

Update documentation

actions-user committed Aug 18, 2023
1 parent b237718 commit 1a0bbb5

Showing 7 changed files with 15 additions and 19 deletions.

8 changes: 4 additions & 4 deletions _sources/advanced.rst.txt
@@ -7,7 +7,7 @@ Using Dask
Scanning and combining datasets can be computationally intensive and may
require a lot of bandwidth for some data formats. Where the target data
contains many input files, it makes sense to parallelise the job with
-dask and maybe disrtibuted the workload on a cluster to get additional
+dask and maybe distribute the workload on a cluster to get additional
CPUs and network performance.

Simple parallel
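
A minimal sketch of parallel single-file scanning with ``dask.delayed`` (the
file list and storage options below are assumptions, not part of this commit)::

    import dask
    import fsspec
    from kerchunk.hdf import SingleHdf5ToZarr

    urls = ["s3://bucket/a.nc", "s3://bucket/b.nc"]  # hypothetical inputs

    @dask.delayed
    def scan(url):
        # open each remote file and build one reference set per input
        with fsspec.open(url, "rb", anon=True) as f:
            return SingleHdf5ToZarr(f, url).translate()

    refs_list = list(dask.compute(*[scan(u) for u in urls]))
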
@@ -41,7 +41,7 @@ Tree reduction

In some cases, the combine process can itself be slow or memory hungry.
In such cases, it is useful to combine the single-file reference sets in
-batches (which reducec a lot of redundancy between them) and then
+batches (which reduce a lot of redundancy between them) and then
combine the results of the batches. This technique is known as tree
reduction. An example of doing this by hand can be seen `here`_.
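
A minimal sketch of the batched combine, assuming ``refs_list`` holds the
single-file reference dicts (the batch size and ``concat_dims`` here are
placeholders)::

    from kerchunk.combine import MultiZarrToZarr

    batch_size = 10
    partials = []
    for i in range(0, len(refs_list), batch_size):
        # first level: collapse each batch into one intermediate reference set
        mzz = MultiZarrToZarr(refs_list[i : i + batch_size], concat_dims=["time"])
        partials.append(mzz.translate())

    # second level: combine the much smaller intermediate sets
    final_refs = MultiZarrToZarr(partials, concat_dims=["time"]).translate()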

@@ -106,13 +106,13 @@ Parquet Storage

JSON is very convenient as a storage format for references, because it is
simple, human-readable and ubiquitously supported. However, it is not the most
-efficient in terns of storage size of parsing speed. For python, in particular,
+efficient in terms of storage size of parsing speed. For python, in particular,
it comes with the added downside of repeated strings becoming separate python
string instances, greatly inflating memory footprint at load time.

To overcome these problems, and in particular keep down the memory use for the
end-user of kerchunked data, we can convert references to be stored in parquet,
-and use them with ``fsspec.implementations.reference.DRFererenceFileSystem``,
+and use them with ``fsspec.implementations.reference.ReferenceFileSystem``,
an alternative new implementation designed to work only with parquet input.
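
A minimal sketch of the round trip, assuming a combined reference dict
``final_refs`` and a recent fsspec (the output path and storage options are
placeholders)::

    from kerchunk.df import refs_to_dataframe
    from fsspec.implementations.reference import ReferenceFileSystem

    # write the references out as a directory of parquet files
    refs_to_dataframe(final_refs, "combined.parq")

    # open them lazily with the parquet-aware filesystem
    fs = ReferenceFileSystem(
        "combined.parq",
        remote_protocol="s3",
        remote_options={"anon": True},
    )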

The principle benefits of the parquet path are:
2 changes: 1 addition & 1 deletion _sources/test_example.rst.txt
@@ -87,7 +87,7 @@ This is what a user of the generated dataset would do. This person does not need
Since the invocation for xarray to read this data is a little involved, we recommend
declaring the data set in an ``intake`` catalog. Alternatively, you might split the command
-into mlutiple lines by first constructing the filesystem or mapper (you will see this in some
+into multiple lines by first constructing the filesystem or mapper (you will see this in some
examples).
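
A sketch of that multi-line form (the reference file name and storage options
are assumptions)::

    import fsspec
    import xarray as xr

    fs = fsspec.filesystem(
        "reference",
        fo="combined.json",  # hypothetical combined reference file
        remote_protocol="s3",
        remote_options={"anon": True},
    )
    ds = xr.open_dataset(
        fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False}
    )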

Note that, if the combining was done previously and saved to a JSON file, then the path to
5 changes: 2 additions & 3 deletions _sources/tutorial.rst.txt
@@ -8,7 +8,7 @@ Initially we create a pair of single file jsons for two ERA5 variables using ``K
Single file JSONs
-----------------

-The ``Kerchunk.hdf.SingleHdf5ToZarr`` method is used to create a single ``.json`` reference file for each file url passed to it. Here we use it to create a number of reference files for the ERA5 pubic dataset on `AWS <https://registry.opendata.aws/ecmwf-era5/>`__. We will compute a number of different times and variables to demonstrate different methods of combining them.
+The ``Kerchunk.hdf.SingleHdf5ToZarr`` method is used to create a single ``.json`` reference file for each file url passed to it. Here we use it to create a number of reference files for the ERA5 public dataset on `AWS <https://registry.opendata.aws/ecmwf-era5/>`__. We will compute a number of different times and variables to demonstrate different methods of combining them.

The Kerchunk package is still in a development phase and so changes frequently. Installing directly from the source code is recommended.
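
A minimal sketch of the single-file step (the ERA5 key shown is illustrative,
not taken from this commit)::

    import json

    import fsspec
    from kerchunk.hdf import SingleHdf5ToZarr

    url = "s3://era5-pds/2020/01/data/air_temperature_at_2_metres.nc"
    with fsspec.open(url, "rb", anon=True) as f:
        refs = SingleHdf5ToZarr(f, url).translate()

    with open("single.json", "w") as out:
        json.dump(refs, out)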

@@ -244,8 +244,7 @@ For more complex uses it is also possible to pass in a compiled ``regex`` functi
Here the ``new_dimension`` values have been populated by the compiled ``regex`` function ``ex`` which takes the file urls as input.
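
For illustration, such a pattern might be wired in like this (the regex and
dimension name are placeholders, and it is an assumption that ``coo_map``
accepts a compiled pattern applied to each url)::

    import re

    from kerchunk.combine import MultiZarrToZarr

    ex = re.compile(r"(\d{8})")  # pull an 8-digit date field from each url
    mzz = MultiZarrToZarr(
        refs_list,
        concat_dims=["new_dimension"],
        coo_map={"new_dimension": ex},
    )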

-To extract time information from file names, a custom function can be defined of the form ``(index, fs, var, fn) -> value`` to generate a valid ``datetime.datetime`` data type, typically using regular expressions. These datetime objects are then used to generate time coordinates through the
-``coo_dtypes`` argument in the ``MultiZarrToZarr`` function.
+To extract time information from file names, a custom function can be defined of the form ``(index, fs, var, fn) -> value`` to generate a valid ``datetime.datetime`` data type, typically using regular expressions. These datetime objects are then used to generate time coordinates through the ``coo_dtypes`` argument in the ``MultiZarrToZarr`` function.

Here's an example for file names following the pattern ``cgl_TOC_YYYYmmddHHMM_X21Y05_S3A_v1.1.0.json``:
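
A minimal sketch of such a function (the regex is an assumption based on the
pattern above)::

    import datetime
    import re

    def fn_to_time(index, fs, var, fn):
        # names look like cgl_TOC_YYYYmmddHHMM_X21Y05_S3A_v1.1.0.json;
        # grab the 12-digit timestamp field
        m = re.search(r"_(\d{12})_", fn)
        return datetime.datetime.strptime(m.group(1), "%Y%m%d%H%M")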

8 changes: 4 additions & 4 deletions advanced.html
@@ -96,7 +96,7 @@ <h2>Using Dask<a class="headerlink" href="#using-dask" title="Permalink to this
<p>Scanning and combining datasets can be computationally intensive and may
require a lot of bandwidth for some data formats. Where the target data
contains many input files, it makes sense to parallelise the job with
-dask and maybe disrtibuted the workload on a cluster to get additional
+dask and maybe distribute the workload on a cluster to get additional
CPUs and network performance.</p>
<div class="section" id="simple-parallel">
<h3>Simple parallel<a class="headerlink" href="#simple-parallel" title="Permalink to this headline"></a></h3>
@@ -125,7 +125,7 @@ <h3>Simple parallel<a class="headerlink" href="#simple-parallel" title="Permalin
<h3>Tree reduction<a class="headerlink" href="#tree-reduction" title="Permalink to this headline"></a></h3>
<p>In some cases, the combine process can itself be slow or memory hungry.
In such cases, it is useful to combine the single-file reference sets in
-batches (which reducec a lot of redundancy between them) and then
+batches (which reduce a lot of redundancy between them) and then
combine the results of the batches. This technique is known as tree
reduction. An example of doing this by hand can be seen <a class="reference external" href="https://gist.github.com/peterm790/5f901453ed7ac75ac28ed21a7138dcf8">here</a>.</p>
<p>We also provide <a class="reference internal" href="reference.html#kerchunk.combine.auto_dask" title="kerchunk.combine.auto_dask"><code class="xref py py-func docutils literal notranslate"><span class="pre">kerchunk.combine.auto_dask()</span></code></a> as a convenience. This
@@ -179,12 +179,12 @@ <h2>Archive Files<a class="headerlink" href="#archive-files" title="Permalink to
<h2>Parquet Storage<a class="headerlink" href="#parquet-storage" title="Permalink to this headline"></a></h2>
<p>JSON is very convenient as a storage format for references, because it is
simple, human-readable and ubiquitously supported. However, it is not the most
-efficient in terns of storage size of parsing speed. For python, in particular,
+efficient in terms of storage size of parsing speed. For python, in particular,
it comes with the added downside of repeated strings becoming separate python
string instances, greatly inflating memory footprint at load time.</p>
<p>To overcome these problems, and in particular keep down the memory use for the
end-user of kerchunked data, we can convert references to be stored in parquet,
-and use them with <code class="docutils literal notranslate"><span class="pre">fsspec.implementations.reference.DRFererenceFileSystem</span></code>,
+and use them with <code class="docutils literal notranslate"><span class="pre">fsspec.implementations.reference.ReferenceFileSystem</span></code>,
an alternative new implementation designed to work only with parquet input.</p>
<p>The principle benefits of the parquet path are:</p>
<ul class="simple">
2 changes: 1 addition & 1 deletion searchindex.js

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion test_example.html
@@ -165,7 +165,7 @@ <h2>Using the output<a class="headerlink" href="#using-the-output" title="Permal
</div>
<p>Since the invocation for xarray to read this data is a little involved, we recommend
declaring the data set in an <code class="docutils literal notranslate"><span class="pre">intake</span></code> catalog. Alternatively, you might split the command
-into mlutiple lines by first constructing the filesystem or mapper (you will see this in some
+into multiple lines by first constructing the filesystem or mapper (you will see this in some
examples).</p>
<p>Note that, if the combining was done previously and saved to a JSON file, then the path to
it should replace <code class="docutils literal notranslate"><span class="pre">out</span></code>, above, along with a <code class="docutils literal notranslate"><span class="pre">target_options</span></code> for any additional
7 changes: 2 additions & 5 deletions tutorial.html
@@ -98,7 +98,7 @@ <h1>Tutorial<a class="headerlink" href="#tutorial" title="Permalink to this head
<p>Initially we create a pair of single file jsons for two ERA5 variables using <code class="docutils literal notranslate"><span class="pre">Kerchunk.hdf.SingleHdf5ToZarr</span></code>. This ERA5 dataset is free to access and so it is possible to replicate this workflow on a local machine without credentials.</p>
<div class="section" id="single-file-jsons">
<h2>Single file JSONs<a class="headerlink" href="#single-file-jsons" title="Permalink to this headline"></a></h2>
-<p>The <code class="docutils literal notranslate"><span class="pre">Kerchunk.hdf.SingleHdf5ToZarr</span></code> method is used to create a single <code class="docutils literal notranslate"><span class="pre">.json</span></code> reference file for each file url passed to it. Here we use it to create a number of reference files for the ERA5 pubic dataset on <a class="reference external" href="https://registry.opendata.aws/ecmwf-era5/">AWS</a>. We will compute a number of different times and variables to demonstrate different methods of combining them.</p>
+<p>The <code class="docutils literal notranslate"><span class="pre">Kerchunk.hdf.SingleHdf5ToZarr</span></code> method is used to create a single <code class="docutils literal notranslate"><span class="pre">.json</span></code> reference file for each file url passed to it. Here we use it to create a number of reference files for the ERA5 public dataset on <a class="reference external" href="https://registry.opendata.aws/ecmwf-era5/">AWS</a>. We will compute a number of different times and variables to demonstrate different methods of combining them.</p>
<p>The Kerchunk package is still in a development phase and so changes frequently. Installing directly from the source code is recommended.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>!pip install git+https://github.com/fsspec/kerchunk
</pre></div>
@@ -292,10 +292,7 @@ <h3>Using coo_map<a class="headerlink" href="#using-coo-map" title="Permalink to
</pre></div>
</div>
<p>Here the <code class="docutils literal notranslate"><span class="pre">new_dimension</span></code> values have been populated by the compiled <code class="docutils literal notranslate"><span class="pre">regex</span></code> function <code class="docutils literal notranslate"><span class="pre">ex</span></code> which takes the file urls as input.</p>
<dl class="simple">
<dt>To extract time information from file names, a custom function can be defined of the form <code class="docutils literal notranslate"><span class="pre">(index,</span> <span class="pre">fs,</span> <span class="pre">var,</span> <span class="pre">fn)</span> <span class="pre">-&gt;</span> <span class="pre">value</span></code> to generate a valid <code class="docutils literal notranslate"><span class="pre">datetime.datetime</span></code> data type, typically using regular expressions. These datetime objects are then used to generate time coordinates through the</dt><dd><p><code class="docutils literal notranslate"><span class="pre">coo_dtypes</span></code> argument in the <code class="docutils literal notranslate"><span class="pre">MultiZarrToZarr</span></code> function.</p>
</dd>
</dl>
<p>To extract time information from file names, a custom function can be defined of the form <code class="docutils literal notranslate"><span class="pre">(index,</span> <span class="pre">fs,</span> <span class="pre">var,</span> <span class="pre">fn)</span> <span class="pre">-&gt;</span> <span class="pre">value</span></code> to generate a valid <code class="docutils literal notranslate"><span class="pre">datetime.datetime</span></code> data type, typically using regular expressions. These datetime objects are then used to generate time coordinates through the <code class="docutils literal notranslate"><span class="pre">coo_dtypes</span></code> argument in the <code class="docutils literal notranslate"><span class="pre">MultiZarrToZarr</span></code> function.</p>
<p>Here’s an example for file names following the pattern <code class="docutils literal notranslate"><span class="pre">cgl_TOC_YYYYmmddHHMM_X21Y05_S3A_v1.1.0.json</span></code>:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">fn_to_time</span><span class="p">(</span><span class="n">index</span><span class="p">,</span> <span class="n">fs</span><span class="p">,</span> <span class="n">var</span><span class="p">,</span> <span class="n">fn</span><span class="p">):</span>
<span class="kn">import</span> <span class="nn">re</span>
