Lib release v1.0.0 update (#2)
* Clarify license terms, update for library v1.0.0 release and review by technical writers
Majea authored Feb 19, 2019
1 parent 9d4a770 commit 0d15178
Showing 3 changed files with 60 additions and 58 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -3,4 +3,5 @@
.idea
*iml
out
-/libs/profiler-api.jar
+/libs/collibra-profiler-*

113 changes: 57 additions & 56 deletions README.md
@@ -1,31 +1,34 @@
-# Purpose of Collibra Catalog Profiling library
+# Purpose of Collibra Catalog Profiling Library

The goal of this library is to let Collibra users run the Catalog data profiling jobs
on their own Apache Spark clusters.

-By default, Profiling jobs are executed in JobServer which is running Spark in
+By default, profiling jobs are executed in JobServer which is running Spark in
local mode (single machine). Thanks to this library, Collibra customers can
-leverage their infrastructure and scale up profiling jobs to get more from their
+leverage their infrastructure and scale up profiling jobs to get more out of their
Catalog.

Because the profiling library users control the data that is profiled, they can also
ingest and profile data sources that are not supported out-of-the-box by Collibra
Catalog. They can define their own Spark DataSet, run the profiling library and then
transfer the result to Collibra Catalog.

# Usage

## Basic usage
-The library is expected to be used directly inside Spark driver code. It holds a
+The library is designed to be used directly inside Spark driver code. It has a
similar position as libraries such as mllib or Spark SQL. The profiler jar should be
added to the dependencies of your Spark application.

The entry point to the profiling library is the
`com.collibra.catalog.profilers.api.Profilers` class. This class can be directly imported
and used in your Spark code. In its most simple form, you need to provide a DataSet and
-define what level of profiling you want between those: basic statistics, basic statistics
-and quantiles, or full profiling. Each one of those levels relies on the previous one, thus
-making it longer to process.
+define what level of profiling you want:
+1. basic statistics
+2. basic statistics and quantiles
+3. full profiling

+Each level relies on the previous one, so each successive level takes longer to process.

Example:
@@ -34,48 +37,47 @@ ColumnProfilesUpdate profileUpdate = Profilers.profileTable(dataset, ProfileFeat
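The example's body is collapsed in this diff view. As a stand-in, here is a minimal sketch of such driver code; the `ProfileFeatures` enum constant, the wildcard import of the profiler classes, and the CSV source are assumptions made for illustration, not the library's confirmed API:

```
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Profiler classes ship in the library jar; the exact class and constant
// names used below (ProfileFeatures.FULL_PROFILING) are assumed.
import com.collibra.catalog.profilers.api.*;

public class BasicProfilingExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("catalog-profiling-example")
                .getOrCreate();

        // Any Spark Dataset can be profiled; a CSV file serves as the example here.
        Dataset<Row> dataset = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("data/iris.csv");

        // Request the deepest profiling level; the lighter levels are basic
        // statistics, and basic statistics plus quantiles.
        ColumnProfilesUpdate profileUpdate =
                Profilers.profileTable(dataset, ProfileFeatures.FULL_PROFILING);

        spark.stop();
    }
}
```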

After profiling is completed, the result should be transferred to Collibra Catalog
using the Collibra Catalog profiling REST API. This API enables client applications
-to send and store profiling data in Catalog assets. A typical way to do so is
-to use Collibra Connect as middleware between Data Governance Center and the Spark
-cluster.
+to send and store profiling data in Catalog assets. Typically, Collibra Connect
+is used as middleware between Collibra Data Governance Center and the Spark cluster.

-![Profiling jobs in cluster architecture](doc/profiling_jobs_in_cluster_small.png "Example of architecture for running profiling jobs in a cluster")
+![Profiling jobs in cluster architecture](doc/profiling_jobs_in_cluster_small.png "Example of architecture for running profiling jobs in cluster")

The result of the profiling job is a `ColumnProfilesUpdate` object. This object is
-provided in a format that is close to the one used by the profiling REST API.
-Only the assets information is not included. There are 2 ways to fill the profiling
-results with the missing information:
+provided in a format that is close to the one used by the Collibra Catalog profiling REST API.
+Only the asset information is not included. There are 2 ways to add the missing information to the profiling
+results:
1. After the profiling result is received, loop over the ColumnProfile objects it
-contains and fill the AssetIdentifier object in each one of them.
-2. provide a method in the `Profilers.profileTable` call and let the profiling library
+contains and add the AssetIdentifier object in each one of them.
+2. Provide a method in the `Profilers.profileTable` call and let the profiling library
loop over the ColumnProfile objects for you.
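A minimal sketch of the first option follows; every accessor name on `ColumnProfilesUpdate`, `ColumnProfile`, and `AssetIdentifier` here is an assumption made for illustration, not the library's confirmed API:

```
import java.util.Map;
import java.util.UUID;

// Option 1, sketched: after profiling, attach to each ColumnProfile the
// identifier of its Column asset, looked up by column name.
public final class AssetIdentifierFiller {
    public static void fill(ColumnProfilesUpdate profileUpdate,
                            Map<String, UUID> columnNameToAssetId) {
        for (ColumnProfile profile : profileUpdate.getColumnProfiles()) {
            AssetIdentifier identifier = new AssetIdentifier();
            identifier.setAssetId(columnNameToAssetId.get(profile.getColumnName()));
            profile.setAssetIdentifier(identifier);
        }
    }
}
```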

-Depending on your architecture, it's possible that the information required to fill the
+Depending on your architecture, it's possible that the information required to add the
AssetIdentifiers is not available in your Spark job. In that case, only the first option
-can be used and filling the missing information must be done in another node (e.g. in a
+can be used and adding the missing information must be done in another node (e.g. in a
Collibra Connect script).

![Example of profiling in Collibra Catalog](doc/iris_class_profile_small.png "Example of profiling in Collibra Catalog")

## Tuning the profiling process

-Next to the `ProfilesFeature` enum which enables selection of different levels of profiling,
+Next to the `ProfilesFeature` enum, which allows you to select the level of profiling,
there are also some additional parameters that can be tuned to better control the profiler
behavior. Those parameters can be passed to the profiling jobs by providing a
`ProfilingConfiguration` object when calling `Profilers.profileTable`.

-Those parameters are as follows:
-* _CacheLevel_: The cache level tells the profiling jobs if and how to cache data when
+These are the available parameters:
+* _CacheLevel_: Tells the profiling jobs if and how to cache data when
they identify points where caching can improve performance. The levels
-are the same as those defined in org.apache.spark.api.java.StorageLevels.
-A level set to NONE actually prevents caching.
+are the same as those defined in `org.apache.spark.api.java.StorageLevels`.
+Set the level to NONE to prevent caching.
* _MaximumValueLength_: Defines how many characters are used by profiling jobs for handling
long text values.
* _DefaultDatePattern_: Defines the default date pattern used for date detection.
-The pattern format in use is described in java.time.format.DateTimeFormatter.
-* _DefaultTimePattern_: Defines the default time pattern used for times detection.
-The pattern format in use is described in java.time.format.DateTimeFormatter.
+The pattern format in use is described in `java.time.format.DateTimeFormatter`.
+* _DefaultTimePattern_: Defines the default time pattern used for time detection.
+The pattern format in use is described in `java.time.format.DateTimeFormatter`.
* _DefaultDateTimePattern_: Defines the default date-time pattern used for date-times detection.
-The pattern format in use is described in java.time.format.DateTimeFormatter.
+The pattern format in use is described in `java.time.format.DateTimeFormatter`.
* _MissingValuesDefinition_: A list of values that should be considered as missing or empty
when counting the number of missing values in a column.
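As a rough sketch of how such tuning could look, assuming setter-style methods on `ProfilingConfiguration` named after the parameters above (the real names and signatures may differ):

```
import java.util.Arrays;
import org.apache.spark.api.java.StorageLevels;

// Every setter below is an assumed name derived from the parameter list
// above; check the library's javadoc for the actual API.
ProfilingConfiguration configuration = new ProfilingConfiguration();
configuration.setCacheLevel(StorageLevels.MEMORY_AND_DISK);
configuration.setMaximumValueLength(1024);
configuration.setDefaultDatePattern("yyyy-MM-dd");
configuration.setDefaultTimePattern("HH:mm:ss");
configuration.setDefaultDateTimePattern("yyyy-MM-dd HH:mm:ss");
configuration.setMissingValuesDefinition(Arrays.asList("", "N/A", "null"));

ColumnProfilesUpdate profileUpdate =
        Profilers.profileTable(dataset, ProfileFeatures.FULL_PROFILING, configuration);
```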

@@ -85,47 +87,47 @@ All those parameters are initialized with sensible defaults and are therefore op

## Profiling examples

-This project showcases the use of the profiling jobs used in Collibra's
+This project showcases the use of the profiling jobs used in Collibra
Catalog.
-The profiling result is then uploaded to an instance of Catalog through
-the Collibra Catalog REST API for profiling.
+The profiling result is then uploaded to an instance of Collibra Catalog through
+the Collibra Catalog profiling REST API.

-One example covers profiling a csv file (included in the project).
+One example covers profiling a CSV file (included in the project).
A second example covers profiling a table from a database via jdbc. For the second
example, the developers are expected to adapt the code to connect to their own data
sources.
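On the Spark side, the adaptation for the JDBC case could look like the following sketch, given an existing `SparkSession` named `spark`; the URL, table, and credentials are placeholders for your own data source:

```
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Placeholder connection details; adapt them, and put the matching JDBC
// driver on the classpath, for your own database.
Dataset<Row> table = spark.read()
        .format("jdbc")
        .option("url", "jdbc:postgresql://db-host:5432/warehouse")
        .option("dbtable", "public.orders")
        .option("user", "profiler")
        .option("password", "secret")
        .load();
```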

-## Using the Catalog profiling REST API
+## Using the Collibra Catalog profiling REST API

-In order to be able to use the Catalog profiling REST API, a simple Java
-REST client is included in the project. This client is by NO means a
+In order to be able to use the Collibra Catalog profiling REST API, a simple Java
+REST client is included in the project. This client is by no means a
suggested implementation for such functionality. It is added purely for
-illustrative purposes. A more common pattern is to establish communication
+illustrative purposes. A more common strategy is to establish communication
between the Spark cluster or Hadoop environment and
Collibra Data Governance Center using Collibra Connect.
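For setups without Collibra Connect, a direct HTTP call from the JVM could look like the sketch below (using Java 11's `java.net.http`); the endpoint path, authentication, and payload handling are placeholders, so consult the Collibra Catalog profiling REST API documentation for the actual contract:

```
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class ProfilePublisher {
    // dgcBaseUrl points at your Collibra DGC instance; profileUpdateJson is
    // the serialized ColumnProfilesUpdate. "/profiling/columns" is a
    // placeholder path, not the documented endpoint.
    public static int publish(String dgcBaseUrl, String profileUpdateJson)
            throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(dgcBaseUrl + "/profiling/columns"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(profileUpdateJson))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }
}
```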

## Identification of column assets

-A key aspect of writing the profiles to Catalog is matching columns with
+A key aspect of writing the data profiles to Catalog is matching columns with
Column assets. The two examples show a way to add asset identification
information.

-Please notice the profiling REST API expects the assets to be already present
-and to only update the profiling information. Hence a common pattern is to
-first create the relevant assets using simple Catalog ingestion or using
-a Connect script and then use Connect again to send the profiling information
-to Catalog. This connect script would also be in charge of making the link
+Please note that the Collibra Catalog profiling REST API expects that the assets already exist
+and will only try to add the profiling information. Hence, a common strategy is to
+first create the relevant assets using a simple Catalog ingestion or using
+a Collibra Connect script and then use Collibra Connect again to send the profiling information
+to Collibra Catalog. This Collibra Connect script would also be in charge of making the link
between a column profile and a Column asset using the `AssetIdentifier` data
structure.

## Building and running

Since the profiling library is only distributed through the Collibra Marketplace,
-this project does not contain the library directly. The first step to run the example
-is therefore to:
-1. Download your own copy of the Collibra Catalog Profiling library at https://marketplace.collibra.com/
+this project does not contain the library directly. Therefore, the first steps to run the example
+are the following:
+1. Download your own copy of the Collibra Catalog Profiling Library at [Collibra Marketplace](https://marketplace.collibra.com/listings/collibra-catalog-profiler/)
2. Update the project classpath by either
-* storing the profiler jar file in the libs directory of this project, or
+* storing the profiler jar file in the libs directory of this project or
* adapting `build.gradle` dependencies to point to a valid location of that library.
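For the second variant, a `build.gradle` entry along these lines could point at a copy stored elsewhere (the path is only an example location):

```
dependencies {
    // Reference your downloaded copy of the profiler jar from a custom
    // location instead of the project's libs directory.
    compile files('/opt/collibra/collibra-profiler-1.0.0.jar')
}
```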

Then, depending on what example you are running, you may also need to change a
@@ -139,21 +141,17 @@ classes.

### Running with gradle

-Calling the `run` gradle command will execute the csv example:
+Calling the `run` gradle command will execute the CSV example:
`./gradlew run`
In order to execute the jdbc example, pass the `jdbc` parameter to
gradle: `./gradlew -Pjdbc run`

# Release notes

-## v1.0
-Initial release
+# Compatibility chart

-# Known issues
+| Library version             | Collibra DGC version | Apache Spark version |
+|-----------------------------|----------------------|----------------------|
+| collibra-profiler-1.0.0.jar | 5.6.1                | 2.2.3                |

+## v1.0
* Internal repartitioning in quantiles calculation may lead to out of memory errors.
Extra partitioning before calling the profiler may help with this issue.

# Contributions

@@ -162,7 +160,10 @@ We expect contributors to follow the code of conduct defined [here](CODE_OF_COND

# License

-The examples in this project are released under the following license: [LICENSE](LICENSE)
+The examples in this project are released under the following license: [LICENSE](LICENSE).

+The Collibra Catalog profiler library is available at [Collibra Marketplace](https://marketplace.collibra.com/listings/collibra-catalog-profiler/)
+to Collibra Catalog license owners under the same license terms as Collibra Catalog.

# Credits

2 changes: 1 addition & 1 deletion build.gradle
@@ -22,7 +22,7 @@ dependencies {

// This library is not distributed in this project. It needs either to be downloaded or referenced differently.
// Please check README.md for more information.
-compile files('libs/profiler-api.jar')
+compile files('libs/collibra-profiler-1.0.0.jar')
compile group: 'org.apache.spark', name: 'spark-sql_2.11', version: sparkVer
compile group: 'org.apache.spark', name: 'spark-mllib_2.11', version: sparkVer
// prevents version clashes for jackson databind
