Eliminate RepositoryPlugin.list_partitions (#6531) #6554

Merged
44 changes: 15 additions & 29 deletions OPERATOR.rst
@@ -439,40 +439,26 @@ Export AWS Inspector findings
Adding snapshots to ``dev``
^^^^^^^^^^^^^^^^^^^^^^^^^^^

When adding a new snapshot to dev, the operator should also add the snapshot to
sandbox, but with an appropriate prefix.
When adding a new snapshot to ``dev`` or ``anvildev``, the operator should also
add the snapshot to ``sandbox`` or ``anvilbox``, respectively.

To determine the prefix:

#. Go to `TDR dev in the Google Cloud Console`_. Authenticate with your personal
(…@ucsc.edu) account.

#. Run queries such as ::

SELECT COUNT(*) FROM `<TDR_PROJECT_NAME>.<SNAPSHOT_NAME>.links` where starts_with(links_id, '4')

in order to find the shortest prefix that yields 64 or more links (the amount
required by the integration test). By convention, prefixes start with 42.

.. _TDR dev in the Google Cloud Console: https://console.cloud.google.com/bigquery?project=platform-hca-dev
The ``post_deploy_tdr.py`` script will fail if the computed common prefix
contains an unacceptable number of subgraphs. If the script reports that the
common prefix is too long, truncate it by 1 character. If it's too short, append
1 arbitrary hexadecimal character. Pass the updated prefix as a keyword argument
to the ``mksrc`` function for the affected source(s), including a partition
prefix length of 1. Then refresh the environment and re-attempt the deployment.
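
The adjustment described above can be sketched in Python as follows. This is a
minimal illustration, not code from the repository: ``adjust_prefix`` and
``with_partition_prefix_length`` are hypothetical helper names, and the
``<common prefix>/<partition prefix length>`` spelling of the value passed to
``mksrc`` is an assumption inferred from the ``prefix + '/0'`` expression in the
previous version of ``environment.py``.

```python
import random

HEX_DIGITS = '0123456789abcdef'


def adjust_prefix(prefix: str, too_long: bool) -> str:
    """
    Return the next common prefix to try after post_deploy_tdr.py rejects
    the current one. (Hypothetical helper, not part of the repository.)
    """
    if too_long:
        if not prefix:
            raise ValueError('An empty prefix cannot be truncated')
        # Too long: too few subgraphs match, so drop the last character
        return prefix[:-1]
    # Too short: too many subgraphs match, so append any one hex digit
    return prefix + random.choice(HEX_DIGITS)


def with_partition_prefix_length(prefix: str, length: int = 1) -> str:
    # Assumed spelling of the value passed as mksrc(..., prefix=...)
    return f'{prefix}/{length}'


# A prefix reported as too long loses its last character
print(with_partition_prefix_length(adjust_prefix('42a', too_long=True)))
```

Under these assumptions, the printed value is what would be passed as the
``prefix`` keyword argument of ``mksrc`` for the affected source before
refreshing the environment and retrying the deployment.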

Adding snapshots to ``prod``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Unless specifically agreed with the system admin (tech lead), PRs which update
or add new snapshots to ``prod`` should be filed against the ``prod`` branch
instead of ``develop``. When deciding whether to perform snapshot channges
directly to ``prod`` or include them in a routine promotion, the system admin
considers the scope of changes to be promoted. It would be a mistake to promote
large changes in combination with snapshots because that would make it difficult
to diagnose whether indexing failures are caused by the changes or the
snapshots.

Add new or updated snapshots on an ad hoc basis, when requested. Do not sync
with regular promotions.

Add a checklist item at the end of the operator's PR checklist to file a
back-merge PR from ``prod`` to ``develop``.
We decide on a case-by-case basis whether PRs which update or add new snapshots
to ``prod`` should be filed against the ``prod`` branch instead of ``develop``.
When deciding whether to perform snapshot changes directly to ``prod`` or
include them in a routine promotion, the system admin considers the scope of
changes to be promoted. It would be a mistake to promote large changes in
combination with snapshots because that would make it difficult to diagnose
whether indexing failures are caused by the changes or the snapshots.

Removing catalogs from ``prod`` and setting a new default
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
11 changes: 11 additions & 0 deletions UPGRADING.rst
@@ -19,6 +19,17 @@ branch that does not have the listed changes, the steps would need to be
reverted. This is all fairly informal and loosely defined. Hopefully we won't
have too many entries in this file.

#6531 Eliminate RepositoryPlugin.list_partitions
================================================

The subgraph counts of indexed sources are no longer tracked in the source tree.
For each of your personal deployments, in ``environment.py``: update the
``mksrc`` function, remove the ``subgraphs`` parameter from all of its call
sites, update the ``prefix`` parameter where it is passed, and remove any functions
used to construct prefixes, e.g. ``common_prefix()``. Be careful to preserve any
flags such as ``ma`` or ``pop``. As always, use the sandbox deployment's
``environment.py`` as a model when upgrading personal deployments.
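
As a rough illustration of the resulting shape, the sketch below assembles the
updated sandbox ``mksrc`` from this PR into a standalone form with indentation
restored, followed by migrated call sites. The ``__future__`` import, the
variable names and the example prefix value ``'4/1'`` are assumptions added for
illustration. Your own deployment's function may differ, so treat the sandbox
``environment.py`` as the authority.

```python
from __future__ import annotations

from typing import Literal

ma = 1  # managed access
pop = 2  # remove snapshot


def mksrc(source_type: Literal['bigquery', 'parquet'],
          google_project,
          snapshot,
          flags: int = 0,
          /,
          prefix: str = ''
          ) -> tuple[str, str | None]:
    # The project name is the snapshot name without its first and last
    # three underscore-separated components
    project = '_'.join(snapshot.split('_')[1:-3])
    assert flags <= ma | pop
    # A popped snapshot maps its project to None instead of a source spec
    source = None if flags & pop else ':'.join([
        'tdr',
        source_type,
        'gcp',
        google_project,
        snapshot,
        prefix
    ])
    return project, source


snapshot = 'ANVIL_CCDG_Sample_1_20230228_ANV5_202302281520'

# Before: mksrc('bigquery', 'datarepo-dev-42c70e6a', snapshot, 28)
# After: the subgraph count is dropped from the call site
plain = mksrc('bigquery', 'datarepo-dev-42c70e6a', snapshot)

# Flags stay positional, an explicit prefix is passed by keyword
managed = mksrc('bigquery', 'datarepo-dev-42c70e6a', snapshot, ma, prefix='4/1')
removed = mksrc('bigquery', 'datarepo-dev-42c70e6a', snapshot, pop)
```

Note that flags such as ``ma`` and ``pop`` keep their position immediately
after the snapshot name. Only the former ``subgraphs`` argument between them
disappears, which is why the flags are easy to drop by accident when editing
call sites.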


#6570 Upgrade dependencies 2024-09-16
=====================================
28 changes: 5 additions & 23 deletions deployments/anvilbox/environment.py
@@ -9,44 +9,26 @@

is_sandbox = True


def common_prefix(n: int) -> str:
"""
For a given number of subgraphs, return a common prefix that yields around
16 subgraphs.

>>> [common_prefix(n) for n in (0, 1, 31, 32, 33, 512+15, 512+16, 512+17)]
['', '', '', '', '1', 'f', '01', '11']
"""
hex_digits = '0123456789abcdef'
m = len(hex_digits)
# Double threshold to lower probability that no subgraphs match the prefix
return hex_digits[n % m] + common_prefix(n // m) if n > 2 * m else ''


ma = 1 # managed access
pop = 2 # remove snapshot


def mksrc(source_type: Literal['bigquery', 'parquet'],
google_project,
snapshot,
subgraphs,
flags: int = 0,
/,
prefix: Optional[str] = None
prefix: str = ''
) -> tuple[str, str | None]:
project = '_'.join(snapshot.split('_')[1:-3])
assert flags <= ma | pop
if prefix is None:
prefix = common_prefix(subgraphs)
source = None if flags & pop else ':'.join([
'tdr',
source_type,
'gcp',
google_project,
snapshot,
prefix + '/0'
prefix
])
return project, source

@@ -73,9 +55,9 @@ def mkdict(previous_catalog: dict[str, str],


anvil_sources = mkdict({}, 3, mkdelta([
mksrc('bigquery', 'datarepo-dev-e53e74aa', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),
mksrc('bigquery', 'datarepo-dev-42c70e6a', 'ANVIL_CCDG_Sample_1_20230228_ANV5_202302281520', 28),
mksrc('bigquery', 'datarepo-dev-97ad270b', 'ANVIL_CMG_Sample_1_20230225_ANV5_202302281509', 25)
mksrc('bigquery', 'datarepo-dev-e53e74aa', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732'),
mksrc('bigquery', 'datarepo-dev-42c70e6a', 'ANVIL_CCDG_Sample_1_20230228_ANV5_202302281520'),
mksrc('bigquery', 'datarepo-dev-97ad270b', 'ANVIL_CMG_Sample_1_20230225_ANV5_202302281509')
]))


25 changes: 7 additions & 18 deletions deployments/anvildev/environment.py
@@ -7,27 +7,16 @@
Optional,
)


def partition_prefix_length(n: int) -> int:
"""
For a given number of subgraphs, return a partition prefix length that is
expected to rarely exceed 512 subgraphs per partition.

>>> [partition_prefix_length(n) for n in (0, 1, 512, 513, 16 * 512, 16 * 513 )]
[0, 0, 0, 1, 1, 2]
"""
return 1 + partition_prefix_length(n // 16) if n > 512 else 0


ma = 1 # managed access
pop = 2 # remove snapshot


def mksrc(source_type: Literal['bigquery', 'parquet'],
google_project,
snapshot,
subgraphs,
flags: int = 0
flags: int = 0,
/,
prefix: str = ''
) -> tuple[str, str | None]:
project = '_'.join(snapshot.split('_')[1:-3])
assert flags <= ma | pop
@@ -37,7 +26,7 @@ def mksrc(source_type: Literal['bigquery', 'parquet'],
'gcp',
google_project,
snapshot,
'/' + str(partition_prefix_length(subgraphs))
prefix
])
return project, source

@@ -64,9 +53,9 @@ def mkdict(previous_catalog: dict[str, str],


anvil_sources = mkdict({}, 3, mkdelta([
mksrc('bigquery', 'datarepo-dev-e53e74aa', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),
mksrc('bigquery', 'datarepo-dev-42c70e6a', 'ANVIL_CCDG_Sample_1_20230228_ANV5_202302281520', 28),
mksrc('bigquery', 'datarepo-dev-97ad270b', 'ANVIL_CMG_Sample_1_20230225_ANV5_202302281509', 25)
mksrc('bigquery', 'datarepo-dev-e53e74aa', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732'),
mksrc('bigquery', 'datarepo-dev-42c70e6a', 'ANVIL_CCDG_Sample_1_20230228_ANV5_202302281520'),
mksrc('bigquery', 'datarepo-dev-97ad270b', 'ANVIL_CMG_Sample_1_20230225_ANV5_202302281509')
]))

