Checklist: Metacat K8s 3.0.0 to 3.1.0 Upgrade Steps


IMPORTANT NOTES - before you begin...

  1. PURPOSE: This ordered checklist is for upgrading an existing K8s Metacat v3.0.0 instance to Metacat v3.1.0
    • Very Important: Before starting a migration, you must have a fully-functioning k8s installation of Metacat version 3.0.0/helm chart 1.1.x. Upgrade from other versions is not supported.
  2. Some references below are specific to NCEAS infrastructure (e.g. CephFS storage); adjust as needed for your own installation.
  3. Assumptions: you have a working knowledge of Kubernetes deployment, including working with yaml files, helm and kubectl commands, and your kubectl context is set for the target deployment location

1. Values: Edit your existing values override file

e.g. see the values-dev-cluster-example.yaml file.

  • Remove any uid or gid overrides, since we're now adopting the defaults of 59996 for postgres and 59997 for metacat
  • Temporarily disable probes until hashstore conversion is done

2. Prep Before Upgrading

  • Set storage.hashstore.disableConversion: true in your values override file, so the hashstore converter won't run yet
  • In the metacat database, verify that all the systemmetadata.checksum_algorithm entries are on the list of supported algorithms (NOTE: syntax matters! E.g. sha-1 is OK, but sha1 isn't):
    kubectl exec ${RELEASE_NAME}-postgresql-0 -- bash -c "psql -U metacat << EOF
      SELECT DISTINCT checksum_algorithm FROM systemmetadata WHERE checksum_algorithm NOT IN
    # then manually update each to the correct syntax; e.g:
    kubectl exec ${RELEASE_NAME}-postgresql-0 -- bash -c "psql -U metacat << EOF
      UPDATE systemmetadata SET checksum_algorithm='SHA-1' WHERE checksum_algorithm='SHA1';
    EOF"
    # ...etc
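
The manual corrections above can be generated rather than typed one at a time. The following is a minimal sketch, assuming you have already identified each non-conforming value and its intended replacement. The helper name `gen_checksum_fix_sql` is hypothetical (it is not part of Metacat), and the second pair in the usage comment is illustrative only; confirm every replacement against the supported-algorithms list first.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Metacat): print one UPDATE statement per
# "OLD NEW" pair of checksum_algorithm values passed as arguments.
gen_checksum_fix_sql() {
  if [ $(( $# % 2 )) -ne 0 ]; then
    echo "usage: gen_checksum_fix_sql OLD NEW [OLD NEW ...]" >&2
    return 1
  fi
  local old new
  while [ $# -gt 0 ]; do
    # Double any single quotes so the values are safe inside SQL literals.
    old=$(printf '%s' "$1" | sed "s/'/''/g")
    new=$(printf '%s' "$2" | sed "s/'/''/g")
    printf "UPDATE systemmetadata SET checksum_algorithm='%s' WHERE checksum_algorithm='%s';\n" "$new" "$old"
    shift 2
  done
}

# Example (review the generated SQL before sending it to the database):
#   gen_checksum_fix_sql SHA1 SHA-1 SHA256 SHA-256 \
#     | kubectl exec -i "${RELEASE_NAME}-postgresql-0" -- psql -U metacat
```

Printing the statements first, instead of executing them directly, leaves room to inspect the SQL before it touches the systemmetadata table.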


= = = = = = = = Downtime starts here = = = = = = = =

  • Change ownership ON CEPHFS as follows:

    ## postgres (59996:59996) in postgresql data directory
    sudo chown -R 59996:59996 /mnt/ceph/repos/REPO-NAME/postgresql
    ## tomcat (59997:59997) in metacat directory
    sudo chown -R 59997:59997 data dataone documents logs
  • ...then ensure all metacat data and documents files have g+rw permissions; otherwise, the hashstore converter can't create hard links:

    sudo chmod -R g+rw data documents dataone
  • Run helm upgrade, debug any startup and configuration issues, and allow the hashstore conversion to finish.

    NOTE: while hashstore conversion is still in progress, metacatUI is expected to display "Oops! It looks like there was a problem retrieving your search results.", and /metacat/d1/mn/v2/ API calls are expected to return "Metacat has not been configured".

    See "Monitor Hashstore Conversion Progress and Completion", below, for how to detect when hashstore conversion finishes.
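
The g+rw requirement above can be checked before running helm upgrade, since a single file without group read/write may stop the converter from creating a hard link. This is a minimal sketch, assuming a POSIX-compatible `find` on the storage host; the function name `list_missing_group_rw` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list every file or directory under the given paths
# that does NOT have both group-read and group-write set.
# "-perm -060" matches entries with both bits (0040 and 0020) set;
# the leading "!" negates that match.
list_missing_group_rw() {
  find "$@" ! -perm -060 -print
}

# Example, run from the metacat directory (prefix with sudo if needed):
#   list_missing_group_rw data documents dataone
# No output means every entry already has g+rw.
```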

= = = = = = = = Downtime ends here = = = = = = = =

  • When hashstore conversion has finished, re-enable probes and helm upgrade to apply changes


Monitor Hashstore Conversion Progress and Completion

  • To monitor progress, check the number of rows in the checksums table. The total should reach approximately 5 * (total objects), not accounting for conversion errors; the total object count can be found from https://HOSTNAME/CONTEXT/d1/mn/v2/object
    # get number of entries in `checksums` table -- should be approx 5*(total objects)
    kubectl exec ${RELEASE_NAME}-postgresql-0 -- bash -c "psql -U metacat << EOF
      select count(*) from checksums;
    EOF"
  • To detect when hashstore conversion finishes:
    kubectl exec ${RELEASE_NAME}-postgresql-0 -- bash -c "psql -U metacat << EOF
      select storage_upgrade_status from version_history where status='1';
    EOF"
    # ...OR CHECK LOGS
    # If log4j root level is INFO
    egrep "\[INFO\]: The conversion took [0-9]+ minutes.*HashStoreUpgrader:upgrade"
    # If log4j root level is WARN, can also grep for this, if errors:
    egrep "\[WARN\]: The conversion is complete"
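
The two checks above can be wrapped in small helpers. This is a rough sketch only: both function names are hypothetical, the 5-rows-per-object ratio is the approximation stated above (so the percentage may not land exactly on 100), and the pod name in the usage comment is an assumption that may differ in your deployment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate conversion progress as a whole-number percentage.
#   $1 = current row count from the checksums table
#   $2 = total object count reported by the d1/mn/v2/object endpoint
# Uses the approximate ratio of 5 checksum rows per object.
conversion_progress_pct() {
  local rows=$1 objects=$2
  if [ "$objects" -le 0 ]; then
    echo "total object count must be positive" >&2
    return 1
  fi
  echo $(( rows * 100 / (5 * objects) ))
}

# Hypothetical helper: read log text on stdin and exit 0 if either of the
# completion messages listed above is present, non-zero otherwise.
log_shows_conversion_done() {
  grep -Eq '\[INFO\]: The conversion took [0-9]+ minutes.*HashStoreUpgrader:upgrade|\[WARN\]: The conversion is complete'
}

# Example (ASSUMPTION: the metacat pod is named ${RELEASE_NAME}-0):
#   conversion_progress_pct 2500 1000      # prints 50
#   kubectl logs "${RELEASE_NAME}-0" | log_shows_conversion_done \
#     && echo "conversion finished"
```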

Fix Hashstore Error - PID Doesn't Exist in identifier Table:

# If you see this in the metacat logs:
Pid <autogen pid> is missing system metadata. Since the pid starts with autogen and looks like to be
created by DataONE api, it should have the systemmetadata. Please look at the systemmetadata and
identifier table to figure out the real pid.

Steps to resolve:

  1. Given the docid, get all revisions:
    select * from identifier where docid='<docid>';
  2. Look for pid beginning 'autogen', and note its revision number
  3. pid should be the obsoleted_by from the previous revision's system metadata:
    select obsoleted_by from systemmetadata where guid='<previous revision pid>';
  4. Check by looking at obsoletes from the following revision, if one exists:
    select obsoletes from systemmetadata where guid='<following revision pid>';
  5. Check whether the systemmetadata table has an entry for the autogen pid:
    select checksum from systemmetadata where guid='<autogen pid>';
    ...and the checksum matches that of the original file, found in:
    /var/metacat/(data or documents)/<'autogen' docid>.<revision number>

= = = If these exist and do not match, STOP HERE AND INVESTIGATE FURTHER! = = =

  1. If an autogen-pid entry was found, update it with the new pid:
    update systemmetadata set guid='<pid from steps 3 & 4>' where guid='<autogen pid>';
  2. Replace the 'autogen' pid with the real pid in the 'identifier' table:
    update identifier set guid='<pid from steps 3 & 4>' where guid='<autogen pid>';
  3. Set the hashstore conversion status back to pending:
    update version_history set storage_upgrade_status='pending' where status='1';
    ...and restart the metacat pod to re-run the hashstore conversion and generate the correct sysmeta file in hashstore
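
Once the real pid has been confirmed, the three updates above can be generated from the two identifiers so the same values are used consistently. This is a minimal sketch; the helper name `gen_autogen_fix_sql` is hypothetical, and it only prints SQL for review rather than executing anything. If no systemmetadata entry exists for the autogen pid, the first statement matches no rows.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the three UPDATE statements from the steps above.
#   $1 = the 'autogen' pid found in the identifier table
#   $2 = the real pid confirmed via obsoleted_by / obsoletes
gen_autogen_fix_sql() {
  if [ $# -ne 2 ] || [ -z "$1" ] || [ -z "$2" ]; then
    echo "usage: gen_autogen_fix_sql AUTOGEN_PID REAL_PID" >&2
    return 1
  fi
  local autogen real
  # Double any single quotes so the pids are safe inside SQL literals.
  autogen=$(printf '%s' "$1" | sed "s/'/''/g")
  real=$(printf '%s' "$2" | sed "s/'/''/g")
  printf "update systemmetadata set guid='%s' where guid='%s';\n" "$real" "$autogen"
  printf "update identifier set guid='%s' where guid='%s';\n" "$real" "$autogen"
  printf "update version_history set storage_upgrade_status='pending' where status='1';\n"
}

# Example (review the output, then pipe it to psql as in the earlier commands):
#   gen_autogen_fix_sql '<autogen pid>' '<pid from steps 3 & 4>'
```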

Monitor Indexing Progress via RabbitMQ Dashboard:

  • Enable port forwarding:

    kubectl port-forward service/${RELEASE_NAME}-rabbitmq-headless 15672:15672
  • ...then browse to http://localhost:15672 and log in with username metacat-rmq-guest and the RabbitMQ password from the metacat Secrets, which can be retrieved with:

    secret_name=$(kubectl get secrets | egrep ".*\-metacat-secrets" | awk '{print $1}')
    rmq_pwd=$(kubectl get secret "$secret_name" \
            -o jsonpath="{.data.rabbitmq-password}" | base64 -d)
    echo "rmq_pwd: $rmq_pwd"
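
If the dashboard is not convenient, the remaining indexing backlog can also be totalled from the command line. This is only a sketch under stated assumptions: the RabbitMQ pod name `${RELEASE_NAME}-rabbitmq-0` is a guess based on the service name above, `rabbitmqctl` is assumed to be available inside that pod, and the helper name `sum_queue_messages` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "queue-name <whitespace> message-count" rows on
# stdin and print the total number of messages across all queues.
# Rows whose last field is not a plain integer (e.g. headers) are ignored.
sum_queue_messages() {
  awk '$NF ~ /^[0-9]+$/ { total += $NF } END { print total + 0 }'
}

# Example (ASSUMPTION: the RabbitMQ pod is named ${RELEASE_NAME}-rabbitmq-0):
#   kubectl exec "${RELEASE_NAME}-rabbitmq-0" -- \
#     rabbitmqctl --quiet list_queues name messages | sum_queue_messages
```

A total that keeps falling toward zero suggests that indexing is working through its queue.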