Lib release v1.0.0 update (#2)
* Clarify license terms, update for library v1.0.0 release and review by technical writers
Majea authored Feb 19, 2019
1 parent 9d4a770 commit 0d15178
Showing 3 changed files with 60 additions and 58 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -3,4 +3,5 @@
.idea
*iml
out
-/libs/profiler-api.jar
+/libs/collibra-profiler-*

113 changes: 57 additions & 56 deletions README.md
@@ -1,31 +1,34 @@
-# Purpose of Collibra Catalog Profiling library
+# Purpose of Collibra Catalog Profiling Library

The goal of this library is to let Collibra users run the Catalog data profiling jobs
on their own Apache Spark clusters.

-By default, Profiling jobs are executed in JobServer which is running Spark in
+By default, profiling jobs are executed in JobServer which is running Spark in
local mode (single machine). Thanks to this library, Collibra customers can
-leverage their infrastructure and scale up profiling jobs to get more from their
+leverage their infrastructure and scale up profiling jobs to get more out of their
Catalog.

Because the profiling library users control the data that is profiled, they can also
ingest and profile data sources that are not supported out-of-the-box by Collibra
Catalog. They can define their own Spark DataSet, run the profiling library and then
transfer the result to Collibra Catalog.

# Usage

## Basic usage
-The library is expected to be used directly inside Spark driver code. It holds a
+The library is designed to be used directly inside Spark driver code. It has a
similar position as libraries such as mllib or Spark SQL. The profiler jar should be
added to the dependencies of your Spark application.

The entry point to the profiling library is the
`com.collibra.catalog.profilers.api.Profilers` class. This class can be directly imported
and used in your Spark code. In its most simple form, you need to provide a DataSet and
-define what level of profiling you want between those: basic statistics, basic statistics
-and quantiles, or full profiling. Each one of those levels relies on the previous one, thus
-making it longer to process.
+define what level of profiling you want:
+1. basic statistics
+2. basic statistics and quantiles
+3. full profiling

+Each level relies on the previous one, so each successive level takes longer to process.

Example:
@@ -34,48 +37,47 @@ ColumnProfilesUpdate profileUpdate = Profilers.profileTable(dataset, ProfileFeat
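The example's body is collapsed in this diff view. As a stand-in, here is a minimal sketch of such driver code; the `ProfileFeatures` enum constant, the wildcard import of the profiler classes, and the CSV source are assumptions made for illustration, not the library's confirmed API:

```
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Profiler classes ship in the library jar; the exact class and constant
// names used below (ProfileFeatures.FULL_PROFILING) are assumed.
import com.collibra.catalog.profilers.api.*;

public class BasicProfilingExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("catalog-profiling-example")
                .getOrCreate();

        // Any Spark Dataset can be profiled; a CSV file serves as the example here.
        Dataset<Row> dataset = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("data/iris.csv");

        // Request the deepest profiling level; the lighter levels are basic
        // statistics, and basic statistics plus quantiles.
        ColumnProfilesUpdate profileUpdate =
                Profilers.profileTable(dataset, ProfileFeatures.FULL_PROFILING);

        spark.stop();
    }
}
```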

After profiling is completed, the result should be transferred to Collibra Catalog
using the Collibra Catalog profiling REST API. This API enables client applications
-to send and store profiling data in Catalog assets. A typical way to do so is
-to use Collibra Connect as middleware between Data Governance Center and the Spark
-cluster.
+to send and store profiling data in Catalog assets. Typically, Collibra Connect
+is used as middleware between Collibra Data Governance Center and the Spark cluster.

-![Profiling jobs in cluster architecture](doc/profiling_jobs_in_cluster_small.png "Example of architecture for running profiling jobs in a cluster")
+![Profiling jobs in cluster architecture](doc/profiling_jobs_in_cluster_small.png "Example of architecture for running profiling jobs in cluster")

The result of the profiling job is a `ColumnProfilesUpdate` object. This object is
-provided in a format that is close to the one used by the profiling REST API.
-Only the assets information is not included. There are 2 ways to fill the profiling
-results with the missing information:
+provided in a format that is close to the one used by the Collibra Catalog profiling REST API.
+Only the asset information is not included. There are 2 ways to add the missing information to the profiling
+results:
1. After the profiling result is received, loop over the ColumnProfile objects it
-contains and fill the AssetIdentifier object in each one of them.
-2. provide a method in the `Profilers.profileTable` call and let the profiling library
+contains and add the AssetIdentifier object in each one of them.
+2. Provide a method in the `Profilers.profileTable` call and let the profiling library
loop over the ColumnProfile objects for you.
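A minimal sketch of the first option follows; every accessor name on `ColumnProfilesUpdate`, `ColumnProfile`, and `AssetIdentifier` here is an assumption made for illustration, not the library's confirmed API:

```
import java.util.Map;
import java.util.UUID;

// Option 1, sketched: after profiling, attach to each ColumnProfile the
// identifier of its Column asset, looked up by column name.
public final class AssetIdentifierFiller {
    public static void fill(ColumnProfilesUpdate profileUpdate,
                            Map<String, UUID> columnNameToAssetId) {
        for (ColumnProfile profile : profileUpdate.getColumnProfiles()) {
            AssetIdentifier identifier = new AssetIdentifier();
            identifier.setAssetId(columnNameToAssetId.get(profile.getColumnName()));
            profile.setAssetIdentifier(identifier);
        }
    }
}
```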

-Depending on your architecture, it's possible that the information required to fill the
+Depending on your architecture, it's possible that the information required to add the
AssetIdentifiers is not available in your Spark job. In that case, only the first option
-can be used and filling the missing information must be done in another node (e.g. in a
+can be used and adding the missing information must be done in another node (e.g. in a
Collibra Connect script).

![Example of profiling in Collibra Catalog](doc/iris_class_profile_small.png "Example of profiling in Collibra Catalog")

## Tuning the profiling process

-Next to the `ProfilesFeature` enum which enables selection of different levels of profiling,
+Next to the `ProfilesFeature` enum, which allows you to select the level of profiling,
there are also some additional parameters that can be tuned to better control the profiler
behavior. Those parameters can be passed to the profiling jobs by providing a
`ProfilingConfiguration` object when calling `Profilers.profileTable`.

-Those parameters are as follows:
-* _CacheLevel_: The cache level tells the profiling jobs if and how to cache data when
+These are the available parameters:
+* _CacheLevel_: Tells the profiling jobs if and how to cache data when
they identify points where caching can improve performance. The levels
-are the same as those defined in org.apache.spark.api.java.StorageLevels.
-A level set to NONE actually prevents caching.
+are the same as those defined in `org.apache.spark.api.java.StorageLevels`.
+Set the level to NONE to prevent caching.
* _MaximumValueLength_: Defines how many characters are used by profiling jobs for handling
long text values.
* _DefaultDatePattern_: Defines the default date pattern used for date detection.
-The pattern format in use is described in java.time.format.DateTimeFormatter.
-* _DefaultTimePattern_: Defines the default time pattern used for times detection.
-The pattern format in use is described in java.time.format.DateTimeFormatter.
+The pattern format in use is described in `java.time.format.DateTimeFormatter`.
+* _DefaultTimePattern_: Defines the default time pattern used for time detection.
+The pattern format in use is described in `java.time.format.DateTimeFormatter`.
* _DefaultDateTimePattern_: Defines the default date-time pattern used for date-times detection.
-The pattern format in use is described in java.time.format.DateTimeFormatter.
+The pattern format in use is described in `java.time.format.DateTimeFormatter`.
* _MissingValuesDefinition_: A list of values that should be considered as missing or empty
when counting the number of missing values in a column.
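As a rough sketch of how such tuning could look, assuming setter-style methods on `ProfilingConfiguration` named after the parameters above (the real names and signatures may differ):

```
import java.util.Arrays;
import org.apache.spark.api.java.StorageLevels;

// Every setter below is an assumed name derived from the parameter list
// above; check the library's javadoc for the actual API.
ProfilingConfiguration configuration = new ProfilingConfiguration();
configuration.setCacheLevel(StorageLevels.MEMORY_AND_DISK);
configuration.setMaximumValueLength(1024);
configuration.setDefaultDatePattern("yyyy-MM-dd");
configuration.setDefaultTimePattern("HH:mm:ss");
configuration.setDefaultDateTimePattern("yyyy-MM-dd HH:mm:ss");
configuration.setMissingValuesDefinition(Arrays.asList("", "N/A", "null"));

ColumnProfilesUpdate profileUpdate =
        Profilers.profileTable(dataset, ProfileFeatures.FULL_PROFILING, configuration);
```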

@@ -85,47 +87,47 @@ All those parameters are initialized with sensible defaults and are therefore op

## Profiling examples

-This project showcases the use of the profiling jobs used in Collibra's
+This project showcases the use of the profiling jobs used in Collibra
Catalog.
-The profiling result is then uploaded to an instance of Catalog through
-the Collibra Catalog REST API for profiling.
+The profiling result is then uploaded to an instance of Collibra Catalog through
+the Collibra Catalog profiling REST API.

-One example covers profiling a csv file (included in the project).
+One example covers profiling a CSV file (included in the project).
A second example covers profiling a table from a database via jdbc. For the second
example, the developers are expected to adapt the code to connect to their own data
sources.
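On the Spark side, the adaptation for the JDBC case could look like the following sketch, given an existing `SparkSession` named `spark`; the URL, table, and credentials are placeholders for your own data source:

```
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Placeholder connection details; adapt them, and put the matching JDBC
// driver on the classpath, for your own database.
Dataset<Row> table = spark.read()
        .format("jdbc")
        .option("url", "jdbc:postgresql://db-host:5432/warehouse")
        .option("dbtable", "public.orders")
        .option("user", "profiler")
        .option("password", "secret")
        .load();
```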

-## Using the Catalog profiling REST API
+## Using the Collibra Catalog profiling REST API

-In order to be able to use the Catalog profiling REST API, a simple Java
-REST client is included in the project. This client is by NO means a
+In order to be able to use the Collibra Catalog profiling REST API, a simple Java
+REST client is included in the project. This client is by no means a
suggested implementation for such functionality. It is added purely for
-illustrative purposes. A more common pattern is to establish communication
+illustrative purposes. A more common strategy is to establish communication
between the Spark cluster or Hadoop environment and
Collibra Data Governance Center using Collibra Connect.
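For setups without Collibra Connect, a direct HTTP call from the JVM could look like the sketch below (using Java 11's `java.net.http`); the endpoint path, authentication, and payload handling are placeholders, so consult the Collibra Catalog profiling REST API documentation for the actual contract:

```
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class ProfilePublisher {
    // dgcBaseUrl points at your Collibra DGC instance; profileUpdateJson is
    // the serialized ColumnProfilesUpdate. "/profiling/columns" is a
    // placeholder path, not the documented endpoint.
    public static int publish(String dgcBaseUrl, String profileUpdateJson)
            throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(dgcBaseUrl + "/profiling/columns"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(profileUpdateJson))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }
}
```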

## Identification of column assets

-A key aspect of writing the profiles to Catalog is matching columns with
+A key aspect of writing the data profiles to Catalog is matching columns with
Column assets. The two examples show a way to add asset identification
information.

-Please notice the profiling REST API expects the assets to be already present
-and to only update the profiling information. Hence a common pattern is to
-first create the relevant assets using simple Catalog ingestion or using
-a Connect script and then use Connect again to send the profiling information
-to Catalog. This connect script would also be in charge of making the link
+Please note that the Collibra Catalog profiling REST API expects that the assets already exist
+and will only try to add the profiling information. Hence, a common strategy is to
+first create the relevant assets using a simple Catalog ingestion or using
+a Collibra Connect script and then use Collibra Connect again to send the profiling information
+to Collibra Catalog. This Collibra Connect script would also be in charge of making the link
between a column profile and a Column asset using the `AssetIdentifier` data
structure.

## Building and running

Since the profiling library is only distributed through the Collibra Marketplace,
-this project does not contain the library directly. The first step to run the example
-is therefore to:
-1. Download your own copy of the Collibra Catalog Profiling library at https://marketplace.collibra.com/
+this project does not contain the library directly. Therefore, the first steps to run the example
+are the following:
+1. Download your own copy of the Collibra Catalog Profiling Library at [Collibra Marketplace](https://marketplace.collibra.com/listings/collibra-catalog-profiler/)
2. Update the project classpath by either
-* storing the profiler jar file in the libs directory of this project, or
+* storing the profiler jar file in the libs directory of this project or
* adapting `build.gradle` dependencies to point to a valid location of that library.
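For the second variant, a `build.gradle` entry along these lines could point at a copy stored elsewhere (the path is only an example location):

```
dependencies {
    // Reference your downloaded copy of the profiler jar from a custom
    // location instead of the project's libs directory.
    compile files('/opt/collibra/collibra-profiler-1.0.0.jar')
}
```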

Then, depending on what example you are running, you may also need to change a
@@ -139,21 +141,17 @@ classes.

### Running with gradle

-Calling the `run` gradle command will execute the csv example:
+Calling the `run` gradle command will execute the CSV example:
`./gradlew run`
In order to execute the jdbc example, pass the `jdbc` parameter to
gradle: `./gradlew -Pjdbc run`

# Release notes

-## v1.0
-Initial release
+# Compatibility chart

-# Known issues
+| Library version             | Collibra DGC version | Apache Spark version |
+|-----------------------------|----------------------|----------------------|
+| collibra-profiler-1.0.0.jar | 5.6.1                | 2.2.3                |

+## v1.0
* Internal repartitioning in quantiles calculation may lead to out of memory errors.
Extra partitioning before calling the profiler may help with this issue.

# Contributions

@@ -162,7 +160,10 @@ We expect contributors to follow the code of conduct defined [here](CODE_OF_COND

# License

-The examples in this project are released under the following license: [LICENSE](LICENSE)
+The examples in this project are released under the following license: [LICENSE](LICENSE).

+The Collibra Catalog profiler library is available at [Collibra Marketplace](https://marketplace.collibra.com/listings/collibra-catalog-profiler/)
+to Collibra Catalog license owners under the same license terms as Collibra Catalog.

# Credits

2 changes: 1 addition & 1 deletion build.gradle
@@ -22,7 +22,7 @@ dependencies {

// This library is not distributed in this project. It needs either to be downloaded or referenced differently.
// Please check README.md for more information.
-compile files('libs/profiler-api.jar')
+compile files('libs/collibra-profiler-1.0.0.jar')
compile group: 'org.apache.spark', name: 'spark-sql_2.11', version: sparkVer
compile group: 'org.apache.spark', name: 'spark-mllib_2.11', version: sparkVer
// prevents version clashes for jackson databind
