Configuration in BigFlow is managed as Python code in `bigflow.Config` objects, following the Configuration as Code principle. You will use `bigflow.Config` to configure individual workflows as well as the GCP deployment of your project.

Here is an example of a `Config`:
```python
from bigflow import Config

deployment_config = Config(
    name='dev',
    properties={
        'gcp_project_id': '{env}-project-id',
        'docker_repository_project': 'my-shared-docker-project-id',
        'docker_repository': 'europe-west1-docker.pkg.dev/{docker_repository_project}/my-analytics',
        'vault_endpoint': 'https://example.com/vault',
        'dags_bucket': 'europe-west1-my-1234-bucket',
    },
).add_configuration(
    name='prod',
    properties={
        'dags_bucket': 'europe-west1-my-4321-bucket',
    },
)
```
`Config` is a combination of Python dicts with some extra features that are useful for configuring things:

- The `Config` object holds multiple named configurations, one per environment (here `dev` and `prod`).
- Properties with constant values (here `vault_endpoint`, `docker_repository_project`, and `docker_repository`) are defined only once, in the master configuration (here the `dev` configuration is the master). They are inherited by the other configurations.
- Properties with different values per environment (here `dags_bucket`) are defined explicitly in each configuration.
Test it:

```python
print(deployment_config)
```

```text
final properties:

dev config:
{ 'dags_bucket': 'europe-west1-my-1234-bucket',
  'docker_repository': 'europe-west1-docker.pkg.dev/my-shared-docker-project-id/my-analytics',
  'docker_repository_project': 'my-shared-docker-project-id',
  'gcp_project_id': 'dev-project-id',
  'vault_endpoint': 'https://example.com/vault'}

prod config:
{ 'dags_bucket': 'europe-west1-my-4321-bucket',
  'docker_repository': 'europe-west1-docker.pkg.dev/my-shared-docker-project-id/my-analytics',
  'docker_repository_project': 'my-shared-docker-project-id',
  'gcp_project_id': 'prod-project-id',
  'vault_endpoint': 'https://example.com/vault'}
```
You can get the config for a particular environment by calling the `resolve(env)` method. It returns a `dict` instance containing all resolved properties for the specified environment. The environment name may be omitted; in that case, BigFlow will try to resolve the current environment name by inspecting command-line arguments and/or the OS environment (looking for the `bf_env` variable).
```python
config = deployment_config.resolve()
print(config['dags_bucket'])
print(config['gcp_project_id'])
```
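You can also resolve a specific environment by name. For example, for the `deployment_config` defined above:

```python
prod_config = deployment_config.resolve('prod')
print(prod_config['dags_bucket'])  # => europe-west1-my-4321-bucket
```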
String properties are interpolated using values from other properties. Thanks to that, your `Config` can stay concise. For example, the `docker_repository` property is resolved from:

```python
'docker_repository_project': 'my-shared-docker-project-id',
'docker_repository': 'europe-west1-docker.pkg.dev/{docker_repository_project}/my-analytics'
```

to:

```python
'docker_repository': 'europe-west1-docker.pkg.dev/my-shared-docker-project-id/my-analytics'
```
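You can check the resolved value yourself; the expected output matches the `dev` config printed earlier:

```python
print(deployment_config.resolve('dev')['docker_repository'])
# => europe-west1-docker.pkg.dev/my-shared-docker-project-id/my-analytics
```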
Interpolation is contextual, meaning that it is aware of the current environment. For example:
```python
config = Config(
    name='dev',
    properties={
        'offers': 'fake_offers',
        'transactions': '{offers}_transactions'
    }).add_configuration(
    name='prod',
    properties={
        'offers': 'real_offers'
    })

print(config)
```

```text
final properties:

dev config:
{'offers': 'fake_offers', 'transactions': 'fake_offers_transactions'}

prod config:
{'offers': 'real_offers', 'transactions': 'real_offers_transactions'}
```
There is a special placeholder, `env`, for properties containing the environment name. For example, the `gcp_project_id` property is resolved from:

```python
'gcp_project_id': '{env}-project-id'
```

to `dev-project-id` on `dev` and to `prod-project-id` on `prod`.
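The same can be checked against the `deployment_config` defined earlier:

```python
print(deployment_config.resolve('dev')['gcp_project_id'])   # => dev-project-id
print(deployment_config.resolve('prod')['gcp_project_id'])  # => prod-project-id
```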
In a `Config` object you can define a master configuration. Any property defined in the master configuration is inherited by the other configurations. By default, the configuration defined in the `Config` init method is the master one:
```python
config = Config(
    name='dev',
    properties={
        'my_project': 'my_value'
    }).add_configuration(
    name='prod',
    properties={})

print(config)
```

output:

```text
dev config:
{'my_project': 'my_value'}

prod config:
{'my_project': 'my_value'}
```
You can disable property inheritance by setting the `is_master` flag to `False`:
```python
config = Config(
    name='dev',
    is_master=False,
    properties={
        'my_project': 'my_value'
    }).add_configuration(
    name='prod',
    properties={})

print(config)
```

output:

```text
dev config:
{'my_project': 'my_value'}

prod config:
{}
```
A default configuration is used when no environment name is given. By default, the configuration defined in the `Config` init method is also the default one. It is chosen while resolving properties when `env_name` is `None`. For example:
```python
config = Config(
    name='dev',
    properties={
        'my_project': 'I_am_{env}'
    }).add_configuration(
    name='prod',
    properties={})

print(config.pretty_print())
print(config.pretty_print('dev'))
```

output:

```text
dev config:
{'my_project': 'I_am_dev'}

dev config:
{'my_project': 'I_am_dev'}
```
You can change the roles by setting the `is_default` flag:
```python
config = Config(
    name='dev',
    is_default=False,
    properties={
        'my_project': 'I_am_{env}'
    }).add_configuration(
    name='prod',
    is_default=True,
    properties={})

print(config.pretty_print())
print(config.pretty_print('dev'))
```

output:

```text
prod config:
{'my_project': 'I_am_prod'}

dev config:
{'my_project': 'I_am_dev'}
```
Secrets shouldn't be stored in source code, but you still need them in your configuration. Typically, they are stored safely on CI/CD servers and passed around as operating system environment variables. The `Config` object supports that approach. When a property is not defined explicitly in Python, it is resolved from an OS environment variable. For example:
```python
import os
from bigflow import Config

os.environ['bf_my_secret'] = '123456'

config = Config(
    name='dev',
    properties={
        'my_project': 'I_am_{env}'})

print(config)
print('my_secret:', config.resolve('dev')['my_secret'])
```

output:

```text
dev config:
{'my_project': 'I_am_dev'}

my_secret: 123456
```
There are two important aspects here. First, OS environment variable names must be prefixed with `bf_`. Second, since secret properties don't exist in Python code, they are resolved dynamically and automatically added to the master configuration.
More interestingly, you can even resolve the current configuration name from the OS environment:

```python
import os
from bigflow import Config

config = Config(
    name='dev',
    properties={
        'my_project': 'I_am_{env}'
    }).add_configuration(
    name='prod',
    properties={})

os.environ['bf_env'] = 'prod'
print(config.pretty_print(''))

os.environ['bf_env'] = 'dev'
print(config.pretty_print(''))
```

output:

```text
prod config:
{'my_project': 'I_am_prod'}

dev config:
{'my_project': 'I_am_dev'}
```
`DeploymentConfig` is a subclass of the `Config` class with one additional, optional argument: `environment_variables_prefix`. Contrary to the `Config` class, `DeploymentConfig` allows you to set a custom environment variable name prefix (overriding the default `bf_` value). This is useful on CI/CD servers. Bamboo is a good example, because it adds the `bamboo_` prefix to each environment variable.
```python
from bigflow.configuration import DeploymentConfig

deployment_config = DeploymentConfig(
    name='dev',
    environment_variables_prefix='bamboo_bf_',
    properties={
        'docker_repository': 'your_repository',
        'gcp_project_id': '{project_id}',
        'dags_bucket': '{dags_bucket}'})
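```

A minimal sketch of how this plays out on a Bamboo agent, assuming the `{project_id}` and `{dags_bucket}` placeholders fall back to the `bamboo_bf_`-prefixed variables in the same way undefined properties fall back to `bf_`-prefixed variables in a plain `Config` (variable values below are hypothetical):

```python
import os

# hypothetical values, as a Bamboo plan would export them
os.environ['bamboo_bf_project_id'] = 'my-project-dev'
os.environ['bamboo_bf_dags_bucket'] = 'europe-west1-my-1234-bucket'

config = deployment_config.resolve('dev')
print(config['gcp_project_id'])  # expected: my-project-dev
print(config['dags_bucket'])     # expected: europe-west1-my-1234-bucket
```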
BigFlow 1.3 introduces a new, class-based approach to keeping configurations. The new API is optional and there is no need to migrate existing code from `bigflow.Configuration`. However, it allows you to use type hints, which enables autocompletion in IDEs and gives you more flexibility.

Each configuration is declared as a subclass of `bigflow.konfig.Konfig`. Properties are declared as attributes, optionally with type hints. Classes may inherit from each other and share common properties; standard Python inheritance rules are used to resolve overridden properties. Custom methods may be added to classes, although this is not recommended.
```python
import bigflow.konfig

class CommonConf(bigflow.konfig.Konfig):
    project: str                    # type hint
    region: str = 'europe-west1'    # type hint + default value
    machine_type = 'n2-standard-4'  # only a value (type is `typing.Any`)
    num_workers = 4                 # not only strings are allowed

class DevConf(CommonConf):
    project = "dev-project"

class ProdConf(CommonConf):
    project = "prod-project"
    num_workers = 20
```
The config that corresponds to the current environment may be created with the `resolve_konfig` function. It takes a dictionary mapping environment names to config classes and returns an instance of the selected config. The `resolve_konfig` function resolves the configuration based on the `bf_env` environment variable, the same as the "classic" BigFlow configuration tool.
```python
config = bigflow.konfig.resolve_konfig(
    {   # mapping of all configs
        'dev': DevConf,
        'prod': ProdConf,
    },
    default="dev",  # used when no environment is provided via CLI
)

print(config.project)      # => "dev-project"
print(config.num_workers)  # => 4
```
A config instance may be used as a dict:

```python
print(config['project'])
print(config.keys())
print(dict(config))
```
A config may also be created manually, in tests or in an interactive shell:

```python
dev_config = DevConf()
print(dev_config.num_workers)
```
Attributes may be overwritten by passing parameters to the class:

```python
patched_dev_config = DevConf(num_workers=10)
```
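The override applies only to that instance; other attributes keep their class defaults:

```python
print(patched_dev_config.num_workers)  # => 10
print(DevConf().num_workers)           # => 4
```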
By default, string properties are not interpolated (this differs from `bigflow.Configuration`). You need to explicitly wrap the interpolation pattern with the `bigflow.konfig.expand()` function:
```python
import bigflow.konfig
from bigflow.konfig import expand

class MyConfig(bigflow.konfig.Konfig):
    docker_repository_project = "my-shared-docker-project-id"

    # no 'expand()' call - the string is *not* interpolated
    docker_repository_raw = "europe-west1-docker.pkg.dev/{docker_repository_project}/my-analytics"

    # interpolation is enabled
    docker_repository = expand("europe-west1-docker.pkg.dev/{docker_repository_project}/my-analytics")

config = MyConfig()
print(config.docker_repository_raw)  # => europe-west1-docker.pkg.dev/{docker_repository_project}/my-analytics
print(config.docker_repository)      # => europe-west1-docker.pkg.dev/my-shared-docker-project-id/my-analytics
```
The class-based config system does not automatically read properties from the OS environment. You must explicitly declare them with the `bigflow.konfig.fromenv` function. Note that the full variable name must be used; the `bf_` prefix is not prepended automatically.
```python
import os

import bigflow.konfig
from bigflow.konfig import fromenv

os.environ['bf_my_variable'] = '123456'

class MyConfig(bigflow.konfig.Konfig):
    # the full variable name "bf_my_variable" is used!
    my_variable = fromenv('bf_my_variable')

config = MyConfig()
print('my_variable:', config.my_variable)
```
Class-based configs allow you to define as many common classes as needed; there is no single "master config". However, it is recommended to create a common base class and declare type hints for all "public" parameters there.
```python
import bigflow.konfig
from bigflow.konfig import expand

class Conf(bigflow.konfig.Konfig):
    """Shared across all configs"""
    env: str                                  # no default value, must be overwritten
    project: str = expand("my-project-{env}")
    region: str = "europe-west1"
    machine_type: str = "n2-standard-2"
    num_machines: int = 10

class Dev(Conf):
    """Development environment"""
    env = "dev"
    dataset = "dataset-one"
    num_machines = 2

class Bench(Conf):
    """Mixin, overrides some properties to enable benchmarking"""
    # don't change 'env'
    dataset = "dataset-benchmark"
    num_machines = 15

class Stage(Conf):
    """Test environment"""
    env = "test"
    dataset = "dataset-one"
    num_machines = 20

class Prod(Stage):
    """Production environment"""
    env = "prod"

# combine several configs
# resolves attributes from `Bench` and then from `Stage`
class StageBench(Bench, Stage): pass

# or the same with a special sugar syntax
StageBench = Bench @ Stage

config: Conf = bigflow.konfig.resolve_konfig({
    'prod': Prod,
    'dev': Dev,
    'stage': Stage,
    'dev-bench': Bench @ Dev,      # combine mixin Bench and Dev
    'stage-bench': Bench @ Stage,
})

print(config.num_machines)
```
Configuration properties may be declared as the result of an expression:

```python
import bigflow.konfig
from bigflow.konfig import expand, dynamic

class MyConfig(bigflow.konfig.Konfig):
    num_machines = 10
    num_cores = 4
    machine_type = expand("n2-standard-{num_cores}")  # internally `expand` also uses `dynamic`
    total_cores = dynamic(lambda self: self.num_machines * self.num_cores)

print(dict(MyConfig()))
# => {'num_machines': 10, 'num_cores': 4, 'machine_type': 'n2-standard-4', 'total_cores': 40}

print(dict(MyConfig(num_cores=8)))
# => {'num_machines': 10, 'num_cores': 8, 'machine_type': 'n2-standard-8', 'total_cores': 80}
```
Instances of `bigflow.konfig.Konfig` are immutable after instantiation. A new, "changed" instance may be created when only some properties need to be changed:

```python
import bigflow.konfig
from bigflow.konfig import dynamic

class MyConfig(bigflow.konfig.Konfig):
    one = 1
    two = dynamic(lambda self: self.one + 1)

config = MyConfig()
print(config.two)   # => 2

config2 = config.replace(one=101)
print(config.two)   # => 2
print(config2.two)  # => 102
```
To define a secret for your BigFlow workflow, you need to use the Kubernetes Secrets mechanism (Cloud Composer is a GKE cluster under the hood). Secrets management based on Kubernetes Secrets works only if a job is scheduled in Composer. When you want to run a job from a local machine, secrets should be provided as environment variables on your computer.

Run the following commands in the Cloud Shell or a terminal on your local machine:
For Composer 1.x:

```bash
gcloud composer environments describe YOUR-COMPOSER-NAME --location <YOUR COMPOSER REGION> --format="value(config.gkeCluster)"
gcloud container clusters get-credentials <COMPOSER GKE CLUSTER NAME> --zone <GKE ZONE> --project <GKE PROJECT ID>
kubectl create secret generic <SECRET NAME> --from-literal '<SECRET_ENVIRONMENT_KEY>=<SECRET ENVIRONMENT VALUE>'
```

For Composer 2.x:

```bash
gcloud composer environments describe YOUR-COMPOSER-NAME --location <YOUR COMPOSER REGION> --format="value(config.gkeCluster)"
gcloud container clusters get-credentials <COMPOSER GKE CLUSTER NAME> --region <YOUR COMPOSER REGION> --project <GKE PROJECT ID>
kubectl create secret generic <SECRET NAME> --namespace composer-user-workloads --from-literal '<SECRET_ENVIRONMENT_KEY>=<SECRET ENVIRONMENT VALUE>'
```
If the cluster is private, you can still connect from inside your subnet. First, create a VM in your project, using the same VPC and subnet as your cluster. Then, connect to the VM and run the following commands. Remember to use the `--internal-ip` flag when getting credentials:
For Composer 1.x:

```bash
gcloud composer environments describe YOUR-COMPOSER-NAME --location <YOUR COMPOSER REGION> --format="value(config.gkeCluster)"
gcloud container clusters get-credentials --internal-ip <COMPOSER GKE CLUSTER NAME> --zone <GKE ZONE> --project <GKE PROJECT ID>
kubectl create secret generic <SECRET NAME> --from-literal '<SECRET_ENVIRONMENT_KEY>=<SECRET ENVIRONMENT VALUE>'
```

For Composer 2.x:

```bash
gcloud composer environments describe YOUR-COMPOSER-NAME --location <YOUR COMPOSER REGION> --format="value(config.gkeCluster)"
gcloud container clusters get-credentials --internal-ip <COMPOSER GKE CLUSTER NAME> --region <YOUR COMPOSER REGION> --project <GKE PROJECT ID>
kubectl create secret generic <SECRET NAME> --namespace composer-user-workloads --from-literal '<SECRET_ENVIRONMENT_KEY>=<SECRET ENVIRONMENT VALUE>'
```
⚠️ Note that even if you use a lowercase variable name for `<SECRET_ENVIRONMENT_KEY>`, e.g. `bf_my_var`, it will be exposed as uppercase: `BF_MY_VAR`. See the issue on GitHub.
Let us go through an example. First, you need to get the Composer GKE cluster name:

```bash
gcloud composer environments describe my-composer --location europe-west1 --format="value(config.gkeCluster)"
>>> projects/my-gcp-project/zones/europe-west1-d/clusters/europe-west1-my-composer-6274b78f-gke
```

It's the last element of the whole ID: `europe-west1-my-composer-6274b78f-gke`. Then, you can use the name to fetch the credentials. You also need to specify the zone and project of your Composer:

```bash
gcloud container clusters get-credentials europe-west1-my-composer-6274b78f-gke --zone europe-west1-d --project my-super-project
```
And then you can set a secret:

```bash
kubectl create secret generic bf-super-secret --from-literal bf_my_super_secret=passwd1234
```
Next, you need to specify the secrets that you want to use in a workflow:

```python
import bigflow

workflow = bigflow.Workflow(
    workflow_id='example',
    definition=[],
    secrets=['bf_my_super_secret'])
```
And finally, you can use the secret in a job:

```python
import os

import bigflow

class ExampleJob(bigflow.Job):
    id = 'example_job'

    def execute(self, context: bigflow.JobContext):
        secret = os.environ['BF_MY_SUPER_SECRET']  # secrets are exposed as uppercase environment variables

workflow = bigflow.Workflow(
    workflow_id='example',
    definition=[ExampleJob()],
    secrets=['bf_my_super_secret'])
```
We don't recommend reading secrets through `os.environ`; instead, use the `bigflow.konfig.Konfig` configuration class (it makes sure that you can't just log a secret):

```python
from bigflow.konfig import Konfig, fromenv

class DevConfig(Konfig):
    my_super_secret = fromenv('BF_MY_SUPER_SECRET', 'None')  # secrets are exposed as uppercase environment variables
```
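A minimal usage sketch, assuming the `BF_MY_SUPER_SECRET` variable is exposed to the job as described above:

```python
config = DevConfig()

# the secret read from the environment is available programmatically,
# without touching os.environ directly
my_super_secret = config.my_super_secret
```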