The spark-hpcc project has been moved to the hpcc4j project. All contributions to the project are now being made in the new location and this hpcc-systems/spark-hpcc repository is to be considered dormant.
⚡ Note: This project references log4j, which has been reported to include security vulnerabilities in versions prior to v2.15.0.
Spark classes for HPCC Systems / Spark interoperability
The DataAccess project contains the classes which expose distributed streaming of HPCC-based data via Spark constructs. In addition, the HPCC data is exposed as a DataFrame for the convenience of the Spark developer.
The spark-hpcc target jar does not package any of the Spark libraries it depends on. If you use a standard Spark submission pipeline such as spark-submit, these dependencies are provided as part of the Spark installation. However, if your pipeline executes a jar directly, you may need to add the Spark libraries from your $SPARK_HOME to the classpath.
See: Examples for example usage of the connector as well as API documentation for the reading and writing APIs.
"In all versions of Apache Spark, its standalone resource manager accepts code to execute on a 'master' host, that then runs that code on 'worker' hosts. The master itself does not, by design, execute user code. A specially-crafted request to the master can, however, cause the master to execute code too. Note that this does not affect standalone clusters with authentication enabled. While the master host typically has less outbound access to other resources than a worker, the execution of code on the master is nevertheless unexpected.
Mitigation
Enable authentication on any Spark standalone cluster that is not otherwise secured from unwanted access, for example by network-level restrictions. Use spark.authenticate and related security properties described at https://spark.apache.org/docs/latest/security.html"
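As an illustration of that mitigation, the sketch below appends the two authentication properties to a spark-defaults.conf file. It assumes SPARK_CONF_DIR points at the configuration directory of the standalone cluster (falling back to ./conf here purely for demonstration) and that the same secret is deployed on every node. The secret value is a placeholder to be replaced, and the linked security page remains the authority on the full set of related properties.

```shell
# Assumption: SPARK_CONF_DIR is the Spark configuration directory.
# The ./conf fallback exists only so this sketch can run anywhere.
SPARK_CONF_DIR="${SPARK_CONF_DIR:-./conf}"
mkdir -p "$SPARK_CONF_DIR"

# Append the authentication settings. The secret below is a placeholder:
# replace it with a strong shared secret and restrict the file's permissions.
cat >> "$SPARK_CONF_DIR/spark-defaults.conf" <<'EOF'
spark.authenticate        true
spark.authenticate.secret REPLACE_WITH_A_STRONG_SHARED_SECRET
EOF

# Keep the file readable by its owner only, since it now holds a secret.
chmod 600 "$SPARK_CONF_DIR/spark-defaults.conf"
```

The master and workers typically need a restart before the new settings take effect.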