feat: add public hub with writing access. (#94)
* feat: add public hub with writing access.

Add public hub.

Signed-off-by: Giorgio <giorgio.giannone@ibm.com>

* fix: refactor upload public hub.

Signed-off-by: Giorgio <giorgio.giannone@ibm.com>

* fix: update readme.

Signed-off-by: Giorgio Giannone <giorgio.giannone@ibm.com>

* fix: add _HUB to env variable in upload readme.

Signed-off-by: Giorgio Giannone <giorgio.giannone@ibm.com>

* docs: update README.md

* docs: update README.md

Co-authored-by: Matteo Manica <drugilsberg@gmail.com>
georgosgeorgos and drugilsberg authored Jun 22, 2022
1 parent e31ea58 commit 65f7267
Showing 3 changed files with 95 additions and 38 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -269,16 +269,16 @@ Run the algorithm via `gt4sd-inference` (again the model produced in the example
gt4sd-inference --algorithm_name PaccMannGP --algorithm_application PaccMannGPGenerator --algorithm_version fast-example-v0 --number_of_samples 5 --target '{"molwt": {"target": 60.0}}'
```

-### Uploading a trained algorithm on a server via the CLI command
+### Uploading a trained algorithm on a public hub via the CLI command

-If you have access to a server (local or cloud) you can upload your trained models easily. The syntax follow the saving pipeline using `gt4sd-upload`:
+You can upload trained and finetuned models easily in the public hub using `gt4sd-upload`. The syntax follows the saving pipeline:

```sh
gt4sd-upload --training_pipeline_name paccmann-vae-trainer --model_path /tmp/gt4sd-paccmann-gp --training_name fast-example --target_version fast-example-v0 --algorithm_application PaccMannGPGenerator
```

-**NOTE:** GT4SD default COS credentials for model syncing are read-only. To upload your own models, you can rely on a self-hosted/custom COS storage and configure the following environment variables accordingly: `GT4SD_S3_HOST`, `GT4SD_S3_ACCESS_KEY`, `GT4SD_S3_SECRET_KEY`, `GT4SD_S3_SECURE`, and `GT4SD_S3_BUCKET`.
-An example on self-hosting locally a COS (minio) where to upload your models can be found [here](https://gt4sd.github.io/gt4sd-core/source/gt4sd_server_upload_md.md).
+**NOTE:** GT4SD can be configured to upload models to a custom or self-hosted COS.
+An example on self-hosting locally a COS (minio) where to upload your models can be found [here](https://gt4sd.github.io/gt4sd-core/source/gt4sd_server_upload_md.html).
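As an editorial aside (not part of the commit): the custom-COS configuration described in the NOTE can also be prepared from Python rather than shell `export` lines, using the `*_HUB` environment variables this commit introduces in the docs page below. The host, keys, and bucket values here are placeholders for a self-hosted minio instance.

```python
import os

# Placeholder values for a hypothetical self-hosted minio instance;
# the variable names are the *_HUB settings documented in this commit.
hub_settings = {
    "GT4SD_S3_HOST_HUB": "127.0.0.1:9000",
    "GT4SD_S3_ACCESS_KEY_HUB": "my-access-key",
    "GT4SD_S3_SECRET_KEY_HUB": "my-secret-key",
    "GT4SD_S3_SECURE_HUB": "False",
    "GT4SD_S3_BUCKET_HUB": "gt4sd-cos-hub-algorithms-artifacts",
}
os.environ.update(hub_settings)
# gt4sd's settings layer reads these from the environment, so a
# subsequent `gt4sd-upload` run in this environment would target
# the custom COS instead of the default hub.
```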

### Additional examples

Expand Down
36 changes: 18 additions & 18 deletions docs/source/gt4sd_server_upload_md.md
Expand Up @@ -18,13 +18,13 @@ Here we report an example of how you can setup a custom minio server on localhos
### 1) Set environment variables

```sh
-export GT4SD_S3_SECRET_KEY=''
-export GT4SD_S3_ACCESS_KEY=''
-export GT4SD_S3_HOST='127.0.0.1:9000'
-export GT4SD_S3_SECURE=False
-export GT4SD_S3_BUCKET='gt4sd-cos-algorithms-artifacts'
-export GT4SD_S3_BUCKET_MODELS='gt4sd-cos-algorithms-models'
-export GT4SD_S3_BUCKET_DATA='gt4sd-cos-algorithms-data'
+export GT4SD_S3_SECRET_KEY_HUB=''
+export GT4SD_S3_ACCESS_KEY_HUB=''
+export GT4SD_S3_HOST_HUB='127.0.0.1:9000'
+export GT4SD_S3_SECURE_HUB=False
+export GT4SD_S3_BUCKET_HUB='gt4sd-cos-algorithms-artifacts'
+export GT4SD_S3_BUCKET_MODELS_HUB='gt4sd-cos-algorithms-models'
+export GT4SD_S3_BUCKET_DATA_HUB='gt4sd-cos-algorithms-data'
```

set `GT4SD_S3_SECURE` `True` or `False` if https/http server.
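The exported `True`/`False` string is parsed into a boolean by the settings layer (pydantic `BaseSettings` in `src/gt4sd/configuration.py`). A minimal standalone sketch of that parsing, with a hypothetical helper name:

```python
import os

def parse_secure_flag(name: str = "GT4SD_S3_SECURE_HUB", default: bool = True) -> bool:
    """Interpret an exported True/False string as a boolean (hypothetical helper)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes"}

os.environ["GT4SD_S3_SECURE_HUB"] = "False"
print(parse_secure_flag())  # -> False: plain-HTTP local minio, no TLS
```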
@@ -51,8 +51,8 @@ services:
env_file:
- env/.env.dev
environment:
-MINIO_ACCESS_KEY: "${GT4SD_S3_ACCESS_KEY}"
-MINIO_SECRET_KEY: "${GT4SD_S3_SECRET_KEY}"
+MINIO_ACCESS_KEY: "${GT4SD_S3_ACCESS_KEY_HUB}"
+MINIO_SECRET_KEY: "${GT4SD_S3_SECRET_KEY_HUB}"
command: server /export
createbuckets:
image: minio/mc
@@ -63,12 +63,12 @@ services:
# ensure there is a file in the artifacts bucket
entrypoint: >
/bin/sh -c "
-/usr/bin/mc config host add myminio http://cos:9000 ${GT4SD_S3_ACCESS_KEY} ${GT4SD_S3_SECRET_KEY};
-/usr/bin/mc mb myminio/${GT4SD_S3_BUCKET};
-/usr/bin/mc mb myminio/${GT4SD_S3_BUCKET_DATA};
-/usr/bin/mc mb myminio/${GT4SD_S3_BUCKET_MODELS};
+/usr/bin/mc config host add myminio http://cos:9000 ${GT4SD_S3_ACCESS_KEY_HUB} ${GT4SD_S3_SECRET_KEY_HUB};
+/usr/bin/mc mb myminio/${GT4SD_S3_BUCKET_HUB};
+/usr/bin/mc mb myminio/${GT4SD_S3_BUCKET_DATA_HUB};
+/usr/bin/mc mb myminio/${GT4SD_S3_BUCKET_MODELS_HUB};
echo 'this is an artifact' >> a_file.txt;
-/usr/bin/mc cp a_file.txt myminio/${GT4SD_S3_BUCKET}/a_file.txt;
+/usr/bin/mc cp a_file.txt myminio/${GT4SD_S3_BUCKET_HUB}/a_file.txt;
exit 0;
"
```
@@ -85,9 +85,9 @@ Add the new server to the minio configuration file (`~/.mc/config.json`):
"version": "10",
"aliases": {
"myminio": {
"url": "${GT4SD_S3_HOST}",
"accessKey": "${GT4SD_S3_ACCESS_KEY}",
"secretKey": "${GT4SD_S3_SECRET_KEY}",
"url": "${GT4SD_S3_HOST_HUB}",
"accessKey": "${GT4SD_S3_ACCESS_KEY_HUB}",
"secretKey": "${GT4SD_S3_SECRET_KEY_HUB}",
"api": "s3v4",
"path": "auto"
},
@@ -99,7 +99,7 @@ Add the new server to the minio configuration file (`~/.mc/config.json`):
and add `myminio` to the list of servers:

```sh
-mc alias set myminio $GT4SD_S3_HOST $GT4SD_S3_ACCESS_KEY $GT4SD_S3_SECRET_KEY
+mc alias set myminio $GT4SD_S3_HOST_HUB $GT4SD_S3_ACCESS_KEY_HUB $GT4SD_S3_SECRET_KEY_HUB
```
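As an illustration (not part of the commit), the `myminio` alias entry of `~/.mc/config.json` shown above can be generated from the exported `*_HUB` variables with the standard library alone; the helper name and the fallback host are hypothetical.

```python
import json
import os

def minio_alias_entry() -> dict:
    """Build the "myminio" alias entry of ~/.mc/config.json (hypothetical helper)."""
    return {
        "myminio": {
            "url": os.environ.get("GT4SD_S3_HOST_HUB", "127.0.0.1:9000"),
            "accessKey": os.environ.get("GT4SD_S3_ACCESS_KEY_HUB", ""),
            "secretKey": os.environ.get("GT4SD_S3_SECRET_KEY_HUB", ""),
            "api": "s3v4",
            "path": "auto",
        }
    }

# Render the full config fragment in the shape minio's mc client expects.
print(json.dumps({"version": "10", "aliases": minio_alias_entry()}, indent=2))
```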

### 4) run docker
89 changes: 73 additions & 16 deletions src/gt4sd/configuration.py
@@ -40,19 +40,27 @@ class GT4SDConfiguration(BaseSettings):
"""GT4SDConfiguration settings from environment variables.
Default configurations for gt4sd including a read-only COS for algorithms' artifacts.
+Default configuration for gt4sd hub including a read-write COS for algorithms' artifacts uploaded by users.
"""

gt4sd_local_cache_path: str = os.path.join(os.path.expanduser("~"), ".gt4sd")
gt4sd_local_cache_path_algorithms: str = "algorithms"
gt4sd_max_number_of_stuck_calls: int = 50
gt4sd_max_number_of_samples: int = 1000000
gt4sd_max_runtime: int = 86400

gt4sd_s3_host: str = "s3.par01.cloud-object-storage.appdomain.cloud"
gt4sd_s3_access_key: str = "6e9891531d724da89997575a65f4592e"
gt4sd_s3_secret_key: str = "5997d63c4002cc04e13c03dc0c2db9dae751293dab106ac5"
gt4sd_s3_secure: bool = True
gt4sd_s3_bucket: str = "gt4sd-cos-algorithms-artifacts"

+gt4sd_s3_host_hub: str = "s3.par01.cloud-object-storage.appdomain.cloud"
+gt4sd_s3_access_key_hub: str = "d9536662ebcf462f937efb9f58012830"
+gt4sd_s3_secret_key_hub: str = "934d1f3afdaea55ac586f6c2f729ac2ba2694bb8e975ee0b"
+gt4sd_s3_secure_hub: bool = True
+gt4sd_s3_bucket_hub: str = "gt4sd-cos-hub-algorithms-artifacts"

class Config:
# immutable and in turn hashable, that is required for lru_cache
frozen = True
@@ -64,7 +72,6 @@ def get_instance() -> "GT4SDConfiguration":


gt4sd_configuration_instance = GT4SDConfiguration.get_instance()

logger.info(
f"using as local cache path: {gt4sd_configuration_instance.gt4sd_local_cache_path}"
)
@@ -75,20 +82,20 @@ def get_instance() -> "GT4SDConfiguration":


def upload_to_s3(target_filepath: str, source_filepath: str):
"""Upload an algorithm in source_filepath in target_filepath on a bucket.
"""Upload an algorithm in source_filepath in target_filepath on a bucket in the model hub.
Args:
target_filepath: path to save the objects in s3.
source_filepath: path to the file to sync.
"""
try:
upload_file_to_s3(
-host=gt4sd_configuration_instance.gt4sd_s3_host,
-access_key=gt4sd_configuration_instance.gt4sd_s3_access_key,
-secret_key=gt4sd_configuration_instance.gt4sd_s3_secret_key,
-bucket=gt4sd_configuration_instance.gt4sd_s3_bucket,
+host=gt4sd_configuration_instance.gt4sd_s3_host_hub,
+access_key=gt4sd_configuration_instance.gt4sd_s3_access_key_hub,
+secret_key=gt4sd_configuration_instance.gt4sd_s3_secret_key_hub,
+bucket=gt4sd_configuration_instance.gt4sd_s3_bucket_hub,
target_filepath=target_filepath,
source_filepath=source_filepath,
-secure=gt4sd_configuration_instance.gt4sd_s3_secure,
+secure=gt4sd_configuration_instance.gt4sd_s3_secure_hub,
)
except S3SyncError:
logger.exception("error in syncing the cache with S3")
@@ -108,7 +115,9 @@ def sync_algorithm_with_s3(prefix: Optional[str] = None) -> str:
gt4sd_configuration_instance.gt4sd_local_cache_path,
gt4sd_configuration_instance.gt4sd_local_cache_path_algorithms,
)

try:
+# sync with the public bucket
sync_folder_with_s3(
host=gt4sd_configuration_instance.gt4sd_s3_host,
access_key=gt4sd_configuration_instance.gt4sd_s3_access_key,
Expand All @@ -118,6 +127,16 @@ def sync_algorithm_with_s3(prefix: Optional[str] = None) -> str:
prefix=prefix,
secure=gt4sd_configuration_instance.gt4sd_s3_secure,
)
+# sync with the public bucket hub
+sync_folder_with_s3(
+host=gt4sd_configuration_instance.gt4sd_s3_host_hub,
+access_key=gt4sd_configuration_instance.gt4sd_s3_access_key_hub,
+secret_key=gt4sd_configuration_instance.gt4sd_s3_secret_key_hub,
+bucket=gt4sd_configuration_instance.gt4sd_s3_bucket_hub,
+folder_path=folder_path,
+prefix=prefix,
+secure=gt4sd_configuration_instance.gt4sd_s3_secure_hub,
+)
except S3SyncError:
logger.exception("error in syncing the cache with S3")
return os.path.join(folder_path, prefix) if prefix is not None else folder_path
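The double-sync pattern added here fills one local cache folder from two buckets in sequence (read-only public bucket first, then the hub bucket), with a single shared error handler. A standalone sketch under stated assumptions: the bucket coordinates and the per-bucket sync function are stand-ins, not the gt4sd API.

```python
import logging

logger = logging.getLogger(__name__)

def sync_cache(folder: str, sources: list, sync_one=lambda folder, src: None) -> str:
    """Sync `folder` from each bucket in `sources`, in order (stand-in sketch).

    Later sources can overwrite files fetched from earlier ones; any failure
    is logged once and the local folder path is still returned.
    """
    try:
        for source in sources:
            sync_one(folder, source)  # one bucket's coordinates per source
    except Exception:
        logger.exception("error in syncing the cache with S3")
    return folder
```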
@@ -138,23 +157,61 @@ def get_cached_algorithm_path(prefix: Optional[str] = None) -> str:
)


+def get_algorithm_subdirectories_from_s3_coordinates(
+host: str,
+access_key: str,
+secret_key: str,
+bucket: str,
+secure: bool = True,
+prefix: Optional[str] = None,
+) -> Set[str]:
+"""Wrapper to initialize a client and list the directories in a bucket."""
+client = GT4SDS3Client(
+host=host, access_key=access_key, secret_key=secret_key, secure=secure
+)
+return client.list_directories(bucket=bucket, prefix=prefix)


def get_algorithm_subdirectories_with_s3(prefix: Optional[str] = None) -> Set[str]:
"""Get algorithms in the s3 buckets.
Args:
prefix: the relative path in the bucket (both
on S3 and locally) to match files to download. Defaults to None.
Returns:
Set: set of available algorithms on s3 with that prefix.
"""
try:
-host = gt4sd_configuration_instance.gt4sd_s3_host
-access_key = gt4sd_configuration_instance.gt4sd_s3_access_key
-secret_key = gt4sd_configuration_instance.gt4sd_s3_secret_key
-secure = gt4sd_configuration_instance.gt4sd_s3_secure
-client = GT4SDS3Client(
-host=host, access_key=access_key, secret_key=secret_key, secure=secure
+# directories in the read-only public bucket
+dirs = get_algorithm_subdirectories_from_s3_coordinates(
+host=gt4sd_configuration_instance.gt4sd_s3_host,
+access_key=gt4sd_configuration_instance.gt4sd_s3_access_key,
+secret_key=gt4sd_configuration_instance.gt4sd_s3_secret_key,
+bucket=gt4sd_configuration_instance.gt4sd_s3_bucket,
+secure=gt4sd_configuration_instance.gt4sd_s3_secure,
+prefix=prefix,
)

+# directories in the write public-hub bucket
+dirs_hub = get_algorithm_subdirectories_from_s3_coordinates(
+host=gt4sd_configuration_instance.gt4sd_s3_host_hub,
+access_key=gt4sd_configuration_instance.gt4sd_s3_access_key_hub,
+secret_key=gt4sd_configuration_instance.gt4sd_s3_secret_key_hub,
+bucket=gt4sd_configuration_instance.gt4sd_s3_bucket_hub,
+secure=gt4sd_configuration_instance.gt4sd_s3_secure_hub,
+prefix=prefix,
+)
-bucket = gt4sd_configuration_instance.gt4sd_s3_bucket
-return client.list_directories(bucket=bucket, prefix=prefix)

+# set of directories in the public bucket and public hub bucket
+versions = dirs.union(dirs_hub)
+return versions

except Exception:
logger.exception("generic syncing error")
raise S3SyncError(
"CacheSyncingError",
f"error in getting directories of prefix={prefix} with host={host} access_key={access_key} secret_key={secret_key} secure={secure} bucket={bucket}",
f"error in getting directories of prefix={prefix}",
)
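The version-merging logic this commit adds can be sketched standalone: listings from the read-only public bucket and the writable hub bucket are combined with a set union, so an algorithm version is visible if it exists in either bucket. The bucket listings below are stand-in literals, not real S3 responses.

```python
def merge_available_versions(public_dirs: set, hub_dirs: set) -> set:
    """Union of version directories from the public and hub buckets (sketch)."""
    return public_dirs.union(hub_dirs)

# Stand-in listings: one version only in the public bucket, one shared,
# one uploaded by a user to the hub bucket.
public_dirs = {"v0", "fast-example-v0"}
hub_dirs = {"fast-example-v0", "user-model-v1"}
print(sorted(merge_available_versions(public_dirs, hub_dirs)))
# -> ['fast-example-v0', 'user-model-v1', 'v0']
```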


