A collection of templates that can be used to leverage Delphix Compliance Services in Azure.
To use Delphix Compliance Services for Azure, and specifically to leverage these pipelines you will need to have the following linked services in your data factory:
- REST service for talking to DCS for Azure
- Azure SQL Database for storing metadata about the data you have discovered and rules for masking
Run docker-compose -f docker-compose.yaml up
, this will create the latest version of all templates
and put them in the releases
directory. From there, you can import the template into your data factory using the Data
Factory Studio:
- From the
Author
tab, click the+
next to the search bar for the Factory Resources - Select
Pipeline
, thenImport from pipeline template
- This will open a file explorer window, you can select the recently built template
from the
releases
folder.
- This will open a file explorer window, you can select the recently built template
from the
- After selecting the appropriate template, you will be asked to to select the linked services for various steps in the pipeline.
- Each template will be different, please refer to the
README.md
file in the template's folder to familiarize yourself with the linked services that should be selected.
This metadata store will consist of a few tables, a series of scripts for defining the tables required by the metadata store has been provided under the metadata_store_scripts directory.
Database versions are determined by date. If you're just getting started, we've combined all required SQL commands into metadata_store_scripts/bootstrap.sql. If you've already got a version of the metadata store, you may need to apply migrations for subsequent versions in order to update the metadata store to the required version in order for a new version of the templates to work - each template will contain a README that will indicate what the minimum required version of the metadata store is required. We aim to make it so that all migrations will not break existing pipelines.
The migration scripts do not specify a schema, as you are free to create these scripts in independent schemas per your
needs. However, this means before running the script, you will need to specify the schema you are working in (unless you
choose to leverage the default dbo
schema).
The metadata store consists of the following tables that all must be in the same schema:
discovered_ruleset
- This table is used to define a ruleset. The ruleset uniquely identifies columns and the algorithms that should be applied to those columns. This table is populated in one of two ways:- Automatically, via the discovery pipeline. The data in this table is populated in two stages.
- Probe the information schema or other metadata about the specified database, schema, catalog, etc. Populate
the
discovered_ruleset
table with as much of the following information as possible:dataset
(static) - determines which type of data this is, it will be something likeSNOWFLAKE
, but will be representative of the datasource; this value and will be statically populated in the appropriate discovery pipeline, and will be checked in masking pipelines to make sure that rules are applied only when they are referring to the appropriate datasetspecified_database
(parameter) - the top-level identifier for the dataset, in the case of things like AzureSQL and Snowflake this will be thedatabase
, for things like Databricks it will mean thecatalog
speficied_schema
(parameter) - the second-level identifier for the dataset, in the case of things like AzureSQL, Snowflake, and Databricks this will be theschema
, but may vary in the future for other data sources that have an analogous concept, but with a different nameidentified_table
(determined) - the third-level identifier for the dataset, in the case of things like AzureSQL, Snowflake, and Databricks this will be thetable
, but may vary in the future for other data sources that have an analogous concept, but with a different nameidentified_column
(determined) - the fourth-level identifier for the dataset, in the case of things like AzureSQL, Snowflake, and Databricks this will be thecolumn
, but may vary in the future for other data sources that have an analogous concept, but with a different nameidentified_column_type
(determined) - either the exact data type provided by the information schema or a slightly modified version of what's provided by the information schema, this modification is used in order for it to be more easily joined with theadf_type_mapping
tableordinal_position
(determined) - the ordinal position of the column in the table, which refers to the column's location when ordering within a tablerow_count
(determined) - the number of rows in the table, if this is available in metadata then it is populated in this stage, if it is not, then it is added in the next stagemetadata
(determined) - additional JSON-structured metadata whose specific structure will vary based on the dataset and in some cases may not be required- When a custom date format is required for an algorithm (besides
yyyy-MM-dd
oryyyy-MM-dd'T'HH:mm:ss'Z'
which are the default), the keydate_format
must be specified in the metadata- As an example, if you have a column
transaction_date
in yourSnowflake
instance, and each time that column appears it has a date formated asyyyyMMdd
, you should specify this format using an update statement likeSetting this value incorrectly will cause non-conformant data errors when the data violates the specified pattern. The important thing here is that the JSON value in theUPDATE discovered_ruleset SET metadata = JSON_MODIFY(coalesce(metadata,'{}'), '$.date_format', 'yyyyMMdd') WHERE dataset = 'SNOWFLAKE' AND identified_column = 'transaction_date';
metadata
column contains adate_format
key at the root, and that the value of that key is the string that represents the format.
- As an example, if you have a column
- When a custom date format is required for an algorithm (besides
- Collect data from the specified table and column combination and perform data discovery to determine if the
data is likely to be sensitive. This is done by calling the
profile
endpoint in the discovery component of DCS for Azure services, collecting the results of profiling, and persisting them to thediscovered_ruleset
metadata table, populating for key(dataset, specified_database, specified_schema, identified_table, identified_column)
, the following values:row_count
(determined) - the number of rows in the table, if this was not populated by the previous phase, it is populated in this phase using aSELECT COUNT(1)
type operationprofiled_domain
(determined) - the domain of the data that resulted when profiling determined the existence of sensitive dataprofiled_algorithm
(determined) - the algorithm that could be applied to this data for masking as determined by profiling for the existence of sensitive dataconfidence_score
(determined) - the confidence score as it applies to the profiling result, represented as a value from[0.00000, 1.0000]
that represents how closely the profiled data matches the sensitive data classifiers for the profiled domain and algorithm pairrows_profiled
- the number of rows included in the profiling requestdiscovery_complete
(determined) - bit representing whether the data in this column has had data discovery performed on itlatest_event
(determined) - unique identified that is used to refer to events in event logging table which is useful when attempting to decipher why discovery may have failed
- Probe the information schema or other metadata about the specified database, schema, catalog, etc. Populate
the
- Manually, where users enter values to create rows for fields in each of the tables to define which rules (masking algorithms) should be applied when masking data. This is not preferred as it is error-prone.
- Automatically, via the discovery pipeline. The data in this table is populated in two stages.
adf_data_mapping
- This table is used to define source to destination mappings for the masking pipeline. The table maps a sourcedataset, database, schema, table
to a corresponding destinationdataset, database, schema, table
. The rules in the ruleset (defined indiscovered_ruleset
) will be applied when performing masking and will copy data (either masked or unmasked, depending on how the ruleset is configured) from the source to the sink.- Whether this mapping has been successfully completed is tracked in
mapping_complete
andmasked_status
, and these values are automatically updated based on the results of the pipeline, and success and failure can be tracked by leveraging the event logging table andlatest_event
- Whether this mapping has been successfully completed is tracked in
adf_type_mapping
- This table is used to track how data should be treated as it flows through masking pipeline in Azure Data Factory, where applicable. The values in this table are inserted in the migration scripts.
For data to be masked, there must be a mapping in the adf_data_mapping
table, and at least one assigned_algorithm
assigned to the table. Note that the profiled_algorithm
that is populated as a result of the profiling pipeline should
always produce a valid algorithm name.
To apply different algorithms based on data assigned to a key column, it is important to set the assigned_algorithm
correctly for the masking pipeline to work.
The key column is the column whose value must be evaluated to determine which algorithm to apply. The assigned algorithm of a key column is to be a JSON array consisting of JSON objects that contain "alias" and "condition" keys. As such the key column itself cannot be masked.
The alias
key defines the alias for the filter condition, this will be used when defining conditional algorithms. All
conditions should be mutually exclusive, otherwise you can end up with the same input rows being masked with different
algorithms. The conditions must be compatible with ADF expression language.
Note that default
can be used as an alias and should not have any associated conditions, this is because
the query to determine the conditions will automatically build the conditions for the default condition by combining
with a logical and the negation of all other conditions. Also note that we support a shortcut for referencing the value
in the key column (%
by default), so that more complex filter conditions can be built.
Consider a key column with values KEY_1
, KEY_2
, KEY_3
, KEY_4
, KEY_5
, and KEY_6
. Consider a few different
key aliases that help us identify KEY_1
, KEY_2
, KEY_3
, and set a default behavior for everything else as well.
We can construct the value of assigned_algorithm
using a structure like the following:
[
{
"alias": "K1",
"condition":"% == 'KEY_1'"
},
{
"alias": "K2",
"condition":"endsWith(%, '2')"
},
{
"alias": "K3",
"condition":"startsWith(%, 'KEY_3')"
},
{
"alias": "default"
}
]
What this means is that the table will need to me masked in 4 stages (one for each condition), K1
will mask all rows
where the value in the key column is exactly equal to KEY_1
, K2
will mask all rows where the value in the key column
ends with 2
, K3
will mask all rows where the value in the key column starts with KEY_3
, and default
will mask
all rows that are not equal to KEY_1
and that don't end with 2
and that don't start with KEY_3
.
Note that the alias names must be unique per table. Further,there can be only one key column defined in each table.
Should you need to use multiple key columns in the same table, this is only supported by referring to the additional key
columns in the conditions of the primary key column. Let's suppose we want to add a secondary key column,
secondary_row_type
. To do this, in the definitions of the conditions for row_type
(our primary key column), we would
need to refer to the value of the secondary key column, which can be done by using byName('secondary_row_type')
instead of %
. It is important to consider that conditions must still be mutually exclusive, and so if possible
try to avoid the need for multiple key columns in the same table as the number of conditions and the logic for defining
them is more complicated.
Conditional algorithms are defined by specifying the key colum, using the key column's aliases and assigning an algorithm names in those conditions.
Considering the example above that describes how to specify the key column's aliases, let's suppose that column's name
is row_type
, we can conditionally define the algorithms to the conditions based on their aliases as follows:
{
"key_column":"row_type",
"conditions": [
{
"alias":"K1",
"algorithm":"dlpx-core: Phone US"
},
{
"alias":"K2",
"algorithm":"dlpx-core: Email SL"
},
{
"alias":"K3",
"algorithm":"dlpx-core: CM Digits"
},
{
"alias":"default",
"algorithm":"dlpx-core: CM Alpha-Numeric"
}
]
}
Note that we're specifying the key column and referring to the column by name, we're specifying the conditions in which we will apply masking, which are referred to by the alias defined in the key column, and we specify the algorithms to apply to each of the conditions separately.
Conditional date formats are defined by specifying the key column, using the key column's aliases and assign a date
format in those conditions in the metadata
column.
Considering the example above that describes how to specify the key column's aliases, let's suppose that column's name
is row_type
, we can conditionally define the date format to the conditions based on their aliases as follows:
{
"key_column":"row_type",
"conditions": [
{
"alias":"K1",
"date_format":"yyyyMMdd"
},
{
"alias":"K2",
"date_format":"yyyy-MM-dd"
},
{
"alias":"K3",
"date_format":"yyyy-MM-dd'T'HH:mm"
},
{
"alias":"default",
"date_format":"yyyy-MM-dd'T'HH:mm:ss'Z'"
}
]
}
Note that this requires that the assigned_algorithm
for the row_type
column in this table to define the conditions
and their aliases. Further note that this means you should not specify date_format
at the top-level of the metadata
when there is a conditional date format.
To identify the cause of a pipeline failure, you can use the "Monitor" page and consider the failed pipeline run, however that can be difficult to follow. To make this easier, select activities (discovery and masking dataflows) and their success are tracked in the metadata datastore.
To identify if a ruleset was derived and a table had discovery performed on it, a query can be made against the
discovered_ruleset
table, joining the adf_events_log
table on the event ID, for example:
SELECT
rs.*,
lg.pipeline_run_id,
lg.pipeline_success,
lg.error_message
FROM discovered_ruleset rs
JOIN adf_events_log lg
ON (rs.latest_event = lg.event_id);
To identify if a mapping was successful, a query can be made against the adf_data_mapping
table, joining the
adf_events_log
table on the event ID, for example:
SELECT
dm.*,
lg.pipeline_run_id,
lg.pipeline_success,
lg.error_message
FROM adf_data_mapping dm
JOIN adf_events_log lg
ON (dm.latest_event = lg.event_id);
In the event of conditional masking, the adf_data_mapping
table contains a column masked_status
, that will report
whether a particular alias has been successfully mapped. If a pipeline fails, there may be more than one event for a
particular table, in which case, a query using the pipeline_run_id
can help identify errors, for example:
SELECT
*
FROM adf_events_log
WHERE
pipeline_success = CAST(0 AS BIT)
AND pipeline_run_id = '<pipeline_run_id>';
Consider adding additional clauses to the WHERE
condition, such as filter_alias IS NOT NULL
or
source_table = '<source_table>'
- Dataflows documentation
- Pipeline documentation - Documentation for each pipeline is included in the released version of the template
- dcsazure_Databricks_to_Databricks_mask_pl
- dcsazure_Databricks_to_Databricks_discovery_pl
- dcsazure_Snowflake_to_Snowflake_mask_pl
- dcsazure_Snowflake_to_Snowflake_discovery_pl
- dcsazure_adls_to_adls_mask_pl
- dcsazure_adls_to_adls_discovery_pl
- dcsazure_AzureSQL_to_AzureSQL_discovery_pl
- dcsazure_AzureSQL_to_AzureSQL_mask_pl
Contributions are welcome!
If you'd like to contribute be sure you're starting with the latest templates (should they exist) before you make
changes. Once you're ready to contribute your changes back, the best way to get the templates ready to add to this repo
is by leveraging the Export template
option from the tab of the pipeline detail view in Data Factory Studio.
(Tip: before exporting your template make sure variables are set to common-sense default values.)
To make this easier to follow, let's use an example. Suppose you'd like to add sample_template_pl
to the set of
pipelines that are available in this repository. Your exported template will come from Azure with the name
sample_template_pl.zip
. Once you unzip this file, it will produce a directory with structure like:
sample_template_pl
├── manifest.json
└── sample_template_pl.json
These files, since they are meant to be imported directly into Data Factory Studio are not well-suited to version
control (as they are one very long line). To make these files more suited to version control, the following utility
script has been provided ./scripts/format.sh. To leverage the script, export the template from
the data factory studio, this will result in a zip file downloaded to your local machine. To leverage the script, move
the template from your Downloads directory to the base directory of the repo. From there run (from a terminal)
./scripts/format.sh
.
Using the example above, the script will unzip the template, re-format the files leveraging jq
and then remove the
zip file. Note that this will only unzip the pipelines that we already have support for in this repo. Manual unzip will
need to be performed for new pipelines.
New templates will need to have an associated README.md
and for them to be built into the releases
folder, edits
will have to be made to the docker-compose.yaml
. For this example, you'd have to add a line
zip sample_template_pl.zip sample_template_pl/* &&
somewhere between apt-get install -y zip &&
and mv *.zip releases/."
.
If Metadata store scripts are needed to support your new templates or new features, please update the
metadata_store_scripts
directory to include those changes, and update the README.md
for all impacted templates to
incidate the need for the new DB version. New statements should play nicely with previous statements - isolate the
changes in a particular versioned script, and follow the version naming convention. Versioned scripts follow the naming
convention V<version_number>__<comment>.sql
, where <version_number>
is YYYY.MM.DD.#
and represents the date when
this script is added (with the final digit being used to allow multiple versions to be tagged with the same date).
You can leverage the ./scripts/migrations.sh to automatically add the new versioned migration
to the ./metadata_store_scripts/bootstrap.sql file.