Add GCP implementation #423

Merged · 111 commits into NREL:develop · Mar 29, 2024

Conversation

@nweires (Collaborator) commented Dec 8, 2023

Pull Request Description

Adds a GCP implementation of Buildstock Batch. This includes:

  • gcp/README.md - Overview of the architecture of the job on GCP.
  • gcp/gcp.py - The bulk of the implementation code. It roughly follows the structure of the AWS implementation; where possible, shared logic has already been moved into docker_base.py to avoid duplicated code (see the sketch after this list).
  • docs/ - Documentation for installation and setup, configuration options, and running jobs.
  • schemas/v0.3.yaml - Schema for the new GCP-related project configuration options.
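A rough sketch of the code-sharing arrangement is below. The class and method names are illustrative only, not necessarily the real buildstockbatch API; the point is that provider-neutral Docker logic lives in docker_base.py and gcp/gcp.py overrides the GCP-specific pieces, mirroring the AWS implementation.

```python
# Illustrative sketch only; class and method names are hypothetical.
class DockerBatchBase:
    """Shared Docker-based batch logic (the role played by docker_base.py)."""

    def build_image(self):
        # Build the simulation Docker image locally (shared across providers).
        ...

    def push_image(self):
        # Provider-specific: push the image to that cloud's container registry.
        raise NotImplementedError


class GcpBatch(DockerBatchBase):
    """GCP-specific subclass (the role played by gcp/gcp.py)."""

    def push_image(self):
        # Push the built image to GCP Artifact Registry.
        ...
```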

I recommend doing a "Squash and Merge" of this PR - we don't need the (many!) individual commits that it includes.

Checklist

Not all may apply

  • Code changes (must work)
  • Tests exercising your feature/bug fix (check coverage report on Checks -> BuildStockBatch Tests -> Artifacts)
  • Coverage has increased or at least not decreased. Update minimum_coverage in .github/workflows/coverage.yml as necessary.
    • Added exception for the gcp/ directory, but shared code in docker_base.py has tests.
  • All other unit and integration tests passing
  • Update validation for project config yaml file changes
  • Update existing documentation
  • Run a small batch run on Kestrel/Eagle to make sure it all works if you made changes that will affect Kestrel/Eagle
  • Add to the changelog_dev.rst file and propose migration text in the pull request

lathanh and others added 30 commits October 19, 2023 21:55
This only supports building an image and pushing that image to GCP
Artifact Registry, so far.

This also shares and updates the Dockerfile used by the AWS script,
which was broken (the AWS script remains broken for other reasons).
Add the beginnings of support for buildstockbatch on GCP
* Add basic terraform file.

* Cleanup
Flesh out GCP documentation for installation and project definition
* Add postprocessing support to GCP

* Add postprocessing support to GCP
* Show batch job progress

* Show final status
* prep for release

* Update migration file for residential workflow changes.

* using semver to compare versions

* switching resstock branch to yml-resilience-args

* fixing version validation unit test

* Revert "switching resstock branch to yml-resilience-args"

This reverts commit eb676a5.

* updating .readthedocs.yml

* Adding black formatter pre-commit

* updating installation docs

* adding additional pre-commit hooks

* adding precommit to ci

* updating OpenStudio to 3.7.0-rc1

* updating line length for black

* adding black to dev extras_require

* Removing /tmp/scratch files at end of array job

* adding [dev] to development eagle environments

* removing local_project_dir and allowing a dir not to exist

* allow local singularity image to not exist

* mocking shutil.rmtree in eagle tests

* adding mocks to test if directories were deleted

* updating changelog

---------

Co-authored-by: Noel Merket <noel.merket@nrel.gov>
Co-authored-by: Joe Robertson <joseph.robertson@nrel.gov>
Merge in refactoring from NREL's branch
@nmerket (Member) left a comment

Looks pretty good. Made some comments below. A few other notes:

  • We'll need to bring this up to date with develop probably after the other PRs merge in.
  • On AWS and HPC we put the data in S3 and then run a data crawler on it so we can run sql queries on it in AWS Athena (Trino DB). I noticed you're not doing an equivalent thing for GCP here. Is that because there's no equivalent service or it just wasn't necessary for your use case?

Thanks for all your contributions here. Let me know how I can help get this across the finish line.

@nweires mentioned this pull request on Mar 27, 2024
@nweires changed the title from "DRAFT: GCP implementation" to "Add GCP implementation" on Mar 27, 2024
@nweires marked this pull request as ready for review on March 27, 2024 17:18
@nweires requested a review from nmerket on March 27, 2024 17:31
@nmerket (Member) left a comment

This looks really good. One note below on imports.

@@ -18,6 +18,7 @@ (postprocessing.py)
 from dask.dataframe.io.parquet import create_metadata_file
 from fsspec.implementations.local import LocalFileSystem
 from functools import partial
+from gcsfs import GCSFileSystem
@nmerket (Member):

postprocessing.py is used by all our implementations. gcsfs is only installed if the gcp extras are installed. This will cause import errors for all other implementations where they may not have installed those extras. It looks like none of these imported fsspec filesystem classes are even used in this file. We can just remove the imports.

Suggested change (remove this import):
- from gcsfs import GCSFileSystem
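As an aside: if a module shared by every implementation ever did genuinely need an optional dependency like gcsfs, a common alternative is to defer the import to the code path that uses it. A minimal sketch with a hypothetical helper name, assuming gcsfs is only needed for gs:// output paths (this is not the change made in this PR, which simply removes the unused import):

```python
def get_gcs_filesystem():
    """Return a gcsfs filesystem, importing gcsfs only when it is needed."""
    try:
        # Deferred import: only fails if this GCS-specific path is exercised
        # without the gcp extras installed.
        from gcsfs import GCSFileSystem
    except ImportError as err:
        raise ImportError(
            "gcsfs is required for GCS output; install buildstockbatch with the gcp extras"
        ) from err
    return GCSFileSystem()
```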

@nweires (Collaborator, Author):
Good call - fixed!

- Developer Installaion
+ Developer Installation
@nmerket (Member):
Thanks for fixing my typo.

- Instal the buildstockbatch python library as described in :ref:`bsb-python` for
+ Install the buildstockbatch python library as described in :ref:`bsb-python` for
@nmerket (Member):
Spelling is hard.

@nweires (Collaborator, Author) commented Mar 28, 2024

  • On AWS and HPC we put the data in S3 and then run a data crawler on it so we can run sql queries on it in AWS Athena (Trino DB). I noticed you're not doing an equivalent thing for GCP here. Is that because there's no equivalent service or it just wasn't necessary for your use case?

We could do the same (probably with BigQuery), but we just haven't needed it so far.
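For reference, a minimal sketch of what loading the Parquet output from GCS into BigQuery could look like, using the google-cloud-bigquery client. The project, bucket, dataset, and table names are placeholders, and none of this is part of this PR:

```python
from google.cloud import bigquery

# Placeholder names; not part of this PR.
client = bigquery.Client(project="my-gcp-project")
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)
load_job = client.load_table_from_uri(
    "gs://my-bucket/results/parquet/*.parquet",  # results written by postprocessing
    "my_dataset.buildstock_results",             # destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```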

@nweires requested a review from nmerket on March 29, 2024 14:55

@nmerket merged commit 92993de into NREL:develop on Mar 29, 2024 (5 checks passed)