[hailtop.batch] add default_regions to hb.Batch, improve docs (#14224)
`hb.Batch` now supports `default_regions`, which completes the natural
hierarchy: config, envvar, backend, batch, job. I went a little hog
wild with examples. I think we should have more examples everywhere!

The ServiceBackend doc page also had several basic formatting issues
which I addressed.
danking authored Feb 2, 2024
1 parent 7f12473 commit dace919
Showing 4 changed files with 203 additions and 44 deletions.
110 changes: 86 additions & 24 deletions hail/python/hailtop/batch/backend.py
@@ -413,42 +413,100 @@ async def _async_close(self):


class ServiceBackend(Backend[bc.Batch]):
"""Backend that executes batches on Hail's Batch Service on Google Cloud.
Examples
--------

Create and use a backend that bills to the Hail Batch billing project named "my-billing-account"
and stores temporary intermediate files in "gs://my-bucket/temporary-files":

>>> import hailtop.batch as hb
>>> service_backend = hb.ServiceBackend(
... billing_project='my-billing-account',
... remote_tmpdir='gs://my-bucket/temporary-files/'
... ) # doctest: +SKIP
>>> b = hb.Batch(backend=service_backend) # doctest: +SKIP
>>> j = b.new_job() # doctest: +SKIP
>>> j.command('echo hello world!') # doctest: +SKIP
>>> b.run() # doctest: +SKIP
>>> service_backend.close() # doctest: +SKIP

Same as above, but set the billing project and temporary intermediate folders via a
configuration file::

    cat >my-batch-script.py <<EOF
    import hailtop.batch as hb
    b = hb.Batch(backend=hb.ServiceBackend())
    j = b.new_job()
    j.command('echo hello world!')
    b.run()
    EOF
    hailctl config set batch/billing_project my-billing-account
    hailctl config set batch/remote_tmpdir gs://my-bucket/temporary-files/
    python3 my-batch-script.py

Same as above, but also specify the use of the :class:`.ServiceBackend` via the configuration
file::

    cat >my-batch-script.py <<EOF
    import hailtop.batch as hb
    b = hb.Batch()
    j = b.new_job()
    j.command('echo hello world!')
    b.run()
    EOF
    hailctl config set batch/billing_project my-billing-account
    hailctl config set batch/remote_tmpdir gs://my-bucket/temporary-files/
    hailctl config set batch/backend service
    python3 my-batch-script.py

Create a backend which stores temporary intermediate files in
"https://my-account.blob.core.windows.net/my-container/tempdir":

>>> service_backend = hb.ServiceBackend(
...     billing_project='my-billing-account',
...     remote_tmpdir='https://my-account.blob.core.windows.net/my-container/tempdir'
... ) # doctest: +SKIP

Require all jobs in all batches in this backend to execute in us-central1:

>>> b = hb.Batch(backend=hb.ServiceBackend(regions=['us-central1']))

Same as above, but using a configuration file::

    hailctl config set batch/regions us-central1
    python3 my-batch-script.py

Same as above, but using the ``HAIL_BATCH_REGIONS`` environment variable::

    export HAIL_BATCH_REGIONS=us-central1
    python3 my-batch-script.py

Permit jobs to execute in *either* us-central1 or us-east1:

>>> b = hb.Batch(backend=hb.ServiceBackend(regions=['us-central1', 'us-east1']))

Same as above, but using a configuration file::

    hailctl config set batch/regions us-central1,us-east1

Allow reading or writing to buckets even though they are "cold" storage:

>>> b = hb.Batch(
...     backend=hb.ServiceBackend(
...         gcs_bucket_allow_list=['cold-bucket', 'cold-bucket2'],
...     ),
... )
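
Allow jobs to run in any region (a sketch using :attr:`.ServiceBackend.ANY_REGION`, which is
defined below):

>>> b = hb.Batch(backend=hb.ServiceBackend(regions=hb.ServiceBackend.ANY_REGION))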

Parameters
----------
billing_project:
Name of billing project to use.
bucket:
Name of bucket to use. Should not include the ``gs://`` prefix. Cannot be used with
`remote_tmpdir`. Temporary data will be stored in the "/batch" folder of this bucket.
This argument is deprecated. Use `remote_tmpdir` instead.
remote_tmpdir:
Temporary data will be stored in this cloud storage folder. Cannot be used with deprecated
argument `bucket`. Paths should match a GCS URI like gs://<BUCKET_NAME>/<PATH> or an ABS
URI of the form https://<ACCOUNT_NAME>.blob.core.windows.net/<CONTAINER_NAME>/<PATH>.
google_project:
This argument is deprecated. Use `gcs_requester_pays_configuration` instead.
gcs_requester_pays_configuration : either :class:`str` or :class:`tuple` of :class:`str` and :class:`list` of :class:`str`, optional
If a string is provided, configure the Google Cloud Storage file system to bill usage to the
project identified by that string. If a tuple is provided, configure the Google Cloud
@@ -458,15 +516,19 @@ class ServiceBackend(Backend[bc.Batch]):
The authorization token to pass to the batch client.
Should only be set for user delegation purposes.
regions:
Cloud regions in which jobs may run. :attr:`.ServiceBackend.ANY_REGION` indicates jobs may
run in any region. If unspecified or ``None``, the ``batch/regions`` Hail configuration
variable is consulted. See examples above. If none of these variables are set, then jobs may
run in any region. :meth:`.ServiceBackend.supported_regions` lists the available regions.
gcs_bucket_allow_list:
A list of buckets that the :class:`.ServiceBackend` should be permitted to read from or write to, even if their
default policy is to use "cold" storage. Should look like ``["bucket1", "bucket2"]``.
default policy is to use "cold" storage.
"""

ANY_REGION: ClassVar[List[str]] = ['any_region']
"""A special value that indicates a job may run in any region."""

@staticmethod
def supported_regions():
"""
24 changes: 17 additions & 7 deletions hail/python/hailtop/batch/batch.py
@@ -24,7 +24,8 @@ class Batch:
--------
Create a batch object:

>>> import hailtop.batch as hb
>>> p = hb.Batch()
Create a new job that prints "hello":
@@ -35,6 +36,10 @@ class Batch:
>>> p.run()

Require all jobs in this batch to execute in us-central1:

>>> b = hb.Batch(backend=hb.ServiceBackend(), default_regions=['us-central1'])
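
A job may still override the batch default with :meth:`.Job.regions` (a sketch; this override
behavior is exercised by the tests added in this commit):

>>> j = b.new_job()
>>> j.regions(['us-east1'])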

Notes
-----
@@ -77,6 +82,9 @@ class Batch:
default_storage:
Storage setting to use by default if not specified by a job. Only
applicable for the :class:`.ServiceBackend`. See :meth:`.Job.storage`.
default_regions:
Cloud regions in which jobs may run. When unspecified or ``None``, use the regions attribute of
:class:`.ServiceBackend`. See :class:`.ServiceBackend` for details.
default_timeout:
Maximum time in seconds for a job to run before being killed. Only
applicable for the :class:`.ServiceBackend`. If `None`, there is no
@@ -157,6 +165,7 @@ def __init__(
default_memory: Optional[Union[int, str]] = None,
default_cpu: Optional[Union[float, int, str]] = None,
default_storage: Optional[Union[int, str]] = None,
default_regions: Optional[List[str]] = None,
default_timeout: Optional[Union[float, int]] = None,
default_shell: Optional[str] = None,
default_python_image: Optional[str] = None,
@@ -195,6 +204,9 @@ def __init__(
self._default_memory = default_memory
self._default_cpu = default_cpu
self._default_storage = default_storage
self._default_regions = default_regions
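# Fall back to the backend's regions when the batch does not specify default_regions.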
if self._default_regions is None and isinstance(self._backend, _backend.ServiceBackend):
self._default_regions = self._backend.regions
self._default_timeout = default_timeout
self._default_shell = default_shell
self._default_python_image = default_python_image
@@ -316,14 +328,13 @@ def new_bash_job(
j.cpu(self._default_cpu)
if self._default_storage is not None:
j.storage(self._default_storage)
if self._default_regions is not None:
j.regions(self._default_regions)
if self._default_timeout is not None:
j.timeout(self._default_timeout)
if self._default_spot is not None:
j.spot(self._default_spot)

self._jobs.append(j)
return j

@@ -388,14 +399,13 @@ def hello(name):
j.cpu(self._default_cpu)
if self._default_storage is not None:
j.storage(self._default_storage)
if self._default_regions is not None:
j.regions(self._default_regions)
if self._default_timeout is not None:
j.timeout(self._default_timeout)
if self._default_spot is not None:
j.spot(self._default_spot)

self._jobs.append(j)
return j

83 changes: 71 additions & 12 deletions hail/python/hailtop/batch/docs/service.rst
@@ -232,22 +232,15 @@ error messages in the terminal window.
Submitting a Batch to the Service
---------------------------------

.. warning::

    To avoid substantial network costs, ensure your jobs and data reside in the same `region`_.

To execute a batch on the Batch service rather than locally, first
construct a :class:`.ServiceBackend` object with a billing project and
bucket for storing intermediate files. Your service account must have read
and write access to the bucket.


Next, pass the :class:`.ServiceBackend` object to the :class:`.Batch` constructor
with the parameter name `backend`.

@@ -257,7 +250,7 @@ and execute the following batch:

.. code-block:: python

    >>> import hailtop.batch as hb
    >>> backend = hb.ServiceBackend('my-billing-project', remote_tmpdir='gs://my-bucket/batch/tmp/') # doctest: +SKIP
    >>> b = hb.Batch(backend=backend, name='test') # doctest: +SKIP
    >>> j = b.new_job(name='hello') # doctest: +SKIP
@@ -276,6 +269,72 @@ have previously set them with ``hailctl``:

A trial billing project is automatically created for you with the name {USERNAME}-trial

.. _region:

Regions
-------

Data and compute both reside in a physical location. In Google Cloud Platform, the location of data
is controlled by the location of the containing bucket. ``gcloud`` can determine the location of a
bucket::

    gcloud storage buckets describe gs://my-bucket

If your compute resides in a different location from the data it reads or writes, then you will
accrue substantial `network charges <https://cloud.google.com/storage/pricing#network-pricing>`__.

To avoid network charges, ensure all your data is in one region and specify that region in one of
the following five ways. As a running example, we consider data stored in `us-central1`. The
options are listed from highest to lowest precedence; a combined sketch follows the list.

1. :meth:`.Job.regions`:

   .. code-block:: python

       >>> b = hb.Batch(backend=hb.ServiceBackend())
       >>> j = b.new_job()
       >>> j.regions(['us-central1'])

2. The ``default_regions`` parameter of :class:`.Batch`:

   .. code-block:: python

       >>> b = hb.Batch(backend=hb.ServiceBackend(), default_regions=['us-central1'])

3. The ``regions`` parameter of :class:`.ServiceBackend`:

   .. code-block:: python

       >>> b = hb.Batch(backend=hb.ServiceBackend(regions=['us-central1']))

4. The ``HAIL_BATCH_REGIONS`` environment variable:

   .. code-block:: sh

       export HAIL_BATCH_REGIONS=us-central1
       python3 my-batch-script.py

5. The ``batch/regions`` configuration variable:

   .. code-block:: sh

       hailctl config set batch/regions us-central1
       python3 my-batch-script.py
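
When several of these are set, the highest-precedence option wins. A sketch combining the three
Python-level options (the job-level call determines where this job may run; the region names are
illustrative):

.. code-block:: python

    >>> b = hb.Batch(
    ...     backend=hb.ServiceBackend(regions=['us-east1']),
    ...     default_regions=['us-west1'],
    ... )
    >>> j = b.new_job()
    >>> j.regions(['us-central1'])  # overrides both defaults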

.. warning::

    If none of the five options above are specified, your job may run in *any* region!

In Google Cloud Platform, the location of a multi-region bucket is considered *different* from any
region within that multi-region. For example, if a VM in the `us-central1` region reads data from a
bucket in the `us` multi-region, this incurs network charges because `us` is not considered equal to
`us-central1`.

Container (aka Docker) images are a form of data. In Google Cloud Platform, we recommend storing
your images in a multi-regional artifact registry which, at the time of writing, does not incur
network charges in the manner described above despite being "multi-regional".


Using the UI
------------
30 changes: 29 additions & 1 deletion hail/python/test/hailtop/batch/test_batch_service_backend.py
@@ -798,7 +798,7 @@ async def foo(i, j):


def test_specify_job_region(backend: ServiceBackend):
b = batch(backend)
j = b.new_job('region')
possible_regions = backend.supported_regions()
j.regions(possible_regions)
@@ -809,6 +809,34 @@ def test_specify_job_region(backend: ServiceBackend):
assert res_status['state'] == 'success', str((res_status, res.debug_info()))


def test_job_regions_controls_job_execution_region(backend: ServiceBackend):
the_region = backend.supported_regions()[0]

b = batch(backend)
j = b.new_job()
j.regions([the_region])
j.command('true')
res = b.run()

assert res
job_status = res.get_job(1).status()
assert job_status['status']['region'] == the_region, str((job_status, res.debug_info()))


def test_job_regions_overrides_batch_regions(backend: ServiceBackend):
the_region = backend.supported_regions()[0]

b = batch(backend, default_regions=['some-other-region'])
j = b.new_job()
j.regions([the_region])
j.command('true')
res = b.run()

assert res
job_status = res.get_job(1).status()
assert job_status['status']['region'] == the_region, str((job_status, res.debug_info()))
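

# A possible companion test (a sketch, not part of this commit): when a job does not
# call j.regions(), the batch-level default_regions alone should determine where it runs.
def test_batch_default_regions_controls_job_execution_region(backend: ServiceBackend):
    the_region = backend.supported_regions()[0]

    b = batch(backend, default_regions=[the_region])
    j = b.new_job()
    j.command('true')
    res = b.run()

    assert res
    job_status = res.get_job(1).status()
    assert job_status['status']['region'] == the_region, str((job_status, res.debug_info()))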


def test_always_copy_output(backend: ServiceBackend, output_tmpdir: str):
output_path = os.path.join(output_tmpdir, 'test_always_copy_output.txt')

