[DOP-14042] Improve Hive documentation
Showing 15 changed files with 562 additions and 127 deletions.
@@ -0,0 +1,4 @@
Improve Hive documentation:
* Add "Prerequisites" page describing different aspects of connecting to Hive
* Improve "Reading from" and "Writing to" pages of Hive documentation, add more examples and recommendations.
* Improve "Executing statements in Hive" page of Hive documentation.
@@ -0,0 +1,130 @@
.. _hive-prerequisites:

Prerequisites
=============

.. note::

    onETL's Hive connection is actually a SparkSession with access to the `Hive Thrift Metastore <https://docs.cloudera.com/cdw-runtime/1.5.0/hive-hms-overview/topics/hive-hms-introduction.html>`_
    and HDFS/S3.
    All data movement is done by Spark; the Hive Metastore is used only to store table and partition metadata.

    This connector does **NOT** require a Hive server. It also does **NOT** use the Hive JDBC connector.

Version Compatibility
---------------------

* Hive Metastore versions: 0.12 - 3.1.3 (may require adding the proper .jar files explicitly)
* Spark versions: 2.3.x - 3.5.x
* Java versions: 8 - 20

See `official documentation <https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html>`_.
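
If the Metastore version differs from the Hive client bundled with Spark, the client .jar files can be supplied via Spark config.
A minimal sketch (``spark.sql.hive.metastore.version`` and ``spark.sql.hive.metastore.jars`` are standard Spark options; the version value below is just an example):

.. code-block:: python

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # use a Hive Metastore client matching the Metastore server version
        .config("spark.sql.hive.metastore.version", "2.3.9")
        # download matching client jars from Maven; "builtin" or an explicit classpath also work
        .config("spark.sql.hive.metastore.jars", "maven")
        .enableHiveSupport()
        .getOrCreate()
    )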

Installing PySpark
------------------

To use the Hive connector you should have PySpark installed (or injected to ``sys.path``)
**BEFORE** creating the connector instance.

See :ref:`install-spark` installation instruction for more details.
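
For example, a typical setup imports PySpark first, creates a SparkSession, and only then instantiates the connection.
A minimal sketch (the cluster name below is just a placeholder):

.. code-block:: python

    from pyspark.sql import SparkSession

    from onetl.connection import Hive

    # PySpark must already be importable at this point
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # "rnd-dwh" is a placeholder cluster name
    hive = Hive(cluster="rnd-dwh", spark=spark)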

Connecting to Hive Metastore
----------------------------

.. note::

    If you're using a managed Hadoop cluster, skip this step, as all Spark configs should already be present on the host.

Create ``$SPARK_CONF_DIR/hive-site.xml`` with the Hive Metastore URL:

.. code:: xml

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>hive.metastore.uris</name>
            <value>thrift://metastore.host.name:9083</value>
        </property>
    </configuration>
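
Alternatively, the same Metastore address can be passed directly to the Spark session without editing ``hive-site.xml``.
A sketch (the ``spark.hadoop.*`` prefix forwards options to the underlying Hadoop/Hive configuration):

.. code-block:: python

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # same value as hive.metastore.uris in hive-site.xml
        .config("spark.hadoop.hive.metastore.uris", "thrift://metastore.host.name:9083")
        .enableHiveSupport()
        .getOrCreate()
    )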

Create ``$SPARK_CONF_DIR/core-site.xml`` with the warehouse location, e.g. HDFS IPC port of the Hadoop namenode, or S3 bucket address & credentials:

.. tabs::

    .. code-tab:: xml HDFS

        <?xml version="1.0" encoding="UTF-8"?>
        <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
        <configuration>
            <property>
                <name>fs.defaultFS</name>
                <value>hdfs://myhadoopcluster:9820</value>
            </property>
        </configuration>

    .. code-tab:: xml S3

        <?xml version="1.0" encoding="UTF-8"?>
        <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
        <configuration>
            <!-- See https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#General_S3A_Client_configuration -->
            <property>
                <name>fs.defaultFS</name>
                <value>s3a://mybucket/</value>
            </property>
            <property>
                <name>fs.s3a.bucket.mybucket.endpoint</name>
                <value>http://s3.domain</value>
            </property>
            <property>
                <name>fs.s3a.bucket.mybucket.connection.ssl.enabled</name>
                <value>false</value>
            </property>
            <property>
                <name>fs.s3a.bucket.mybucket.path.style.access</name>
                <value>true</value>
            </property>
            <property>
                <name>fs.s3a.bucket.mybucket.aws.credentials.provider</name>
                <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
            </property>
            <property>
                <name>fs.s3a.bucket.mybucket.access.key</name>
                <value>some-user</value>
            </property>
            <property>
                <name>fs.s3a.bucket.mybucket.secret.key</name>
                <value>mysecrettoken</value>
            </property>
        </configuration>

Using Kerberos
--------------

Some managed Hadoop clusters use Kerberos authentication. In this case, you should call the `kinit <https://web.mit.edu/kerberos/krb5-1.12/doc/user/user_commands/kinit.html>`_ command
**BEFORE** starting the Spark session to generate a Kerberos ticket. See :ref:`install-kerberos`.

Sometimes it is also required to pass a keytab file to the Spark config, allowing Spark executors to generate their own Kerberos tickets:

.. tabs::

    .. code-tab:: python Spark 3

        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .config("spark.kerberos.access.hadoopFileSystems", "hdfs://namenode1.domain.com:9820,hdfs://namenode2.domain.com:9820")
            .config("spark.kerberos.principal", "user")
            .config("spark.kerberos.keytab", "/path/to/keytab")
            .getOrCreate()
        )

    .. code-tab:: python Spark 2

        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .config("spark.yarn.access.hadoopFileSystems", "hdfs://namenode1.domain.com:9820,hdfs://namenode2.domain.com:9820")
            .config("spark.yarn.principal", "user")
            .config("spark.yarn.keytab", "/path/to/keytab")
            .getOrCreate()
        )

See `Spark security documentation <https://spark.apache.org/docs/latest/security.html#kerberos>`_
for more details.
@@ -1,13 +1,100 @@
.. _hive-read:

Reading from Hive using ``DBReader``
====================================

:obj:`DBReader <onetl.db.db_reader.db_reader.DBReader>` supports :ref:`strategy` for incremental data reading,
but does not support custom queries, like ``JOIN``.

Supported DBReader features
---------------------------

* ✅︎ ``columns``
* ✅︎ ``where``
* ✅︎ ``hwm``, supported strategies:

  * ✅︎ :ref:`snapshot-strategy`
  * ✅︎ :ref:`incremental-strategy`
  * ✅︎ :ref:`snapshot-batch-strategy`
  * ✅︎ :ref:`incremental-batch-strategy`

* ❌ ``hint`` (not supported by Hive)
* ❌ ``df_schema``
* ❌ ``options`` (only Spark config params are used)

.. warning::

    Note that ``columns``, ``where`` and ``hwm.expression`` should be written using `SparkSQL <https://spark.apache.org/docs/latest/sql-ref-syntax.html#data-retrieval-statements>`_ syntax,
    not HiveQL.

Examples
--------

Snapshot strategy:

.. code-block:: python

    from onetl.connection import Hive
    from onetl.db import DBReader

    hive = Hive(...)

    reader = DBReader(
        connection=hive,
        source="schema.table",
        columns=["id", "key", "CAST(value AS text) value", "updated_dt"],
        where="key = 'something'",
    )
    df = reader.run()

Incremental strategy:

.. code-block:: python

    from onetl.connection import Hive
    from onetl.db import DBReader
    from onetl.strategy import IncrementalStrategy

    hive = Hive(...)

    reader = DBReader(
        connection=hive,
        source="schema.table",
        columns=["id", "key", "CAST(value AS text) value", "updated_dt"],
        where="key = 'something'",
        hwm=DBReader.AutoDetectHWM(name="hive_hwm", expression="updated_dt"),
    )

    with IncrementalStrategy():
        df = reader.run()
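
Batch strategies follow the same pattern, but the reader is run once per batch. A sketch assuming the same table and HWM column as above (the ``step`` value is just an example):

.. code-block:: python

    from datetime import timedelta

    from onetl.connection import Hive
    from onetl.db import DBReader
    from onetl.strategy import IncrementalBatchStrategy

    hive = Hive(...)

    reader = DBReader(
        connection=hive,
        source="schema.table",
        hwm=DBReader.AutoDetectHWM(name="hive_hwm", expression="updated_dt"),
    )

    # read new rows in 1-day chunks instead of one big dataframe
    with IncrementalBatchStrategy(step=timedelta(days=1)) as batches:
        for _ in batches:
            df = reader.run()
            # process each chunk here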

Recommendations
---------------

Use column-based write formats
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Prefer these write formats:

* `ORC <https://spark.apache.org/docs/latest/sql-data-sources-orc.html>`_
* `Parquet <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html>`_
* `Iceberg <https://iceberg.apache.org/spark-quickstart/>`_
* `Hudi <https://hudi.apache.org/docs/quick-start-guide/>`_
* `Delta <https://docs.delta.io/latest/quick-start.html#set-up-apache-spark-with-delta-lake>`_

For column-based write formats, each file contains separate sections where column data is stored. The file footer contains
the location of each column section/group. Spark can use this information to load only the sections required by a specific query, e.g. only the selected columns,
which drastically speeds up the query.

Another advantage is a high compression ratio, e.g. 10x-100x in comparison to JSON or CSV.
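
To check which format an existing table uses, you can inspect its metadata, e.g. with :obj:`Hive.sql <onetl.connection.db_connection.hive.connection.Hive.sql>` (a sketch; ``schema.table`` is a placeholder):

.. code-block:: python

    from onetl.connection import Hive

    hive = Hive(...)

    # the output includes the table's storage format (InputFormat / Serde Library rows)
    hive.sql("DESCRIBE TABLE EXTENDED schema.table").show(100, truncate=False)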

Select only required columns
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Instead of passing ``"*"`` in ``DBReader(columns=[...])``, prefer passing exact column names.
This drastically reduces the amount of data read by Spark, **if column-based file formats are used**.

Use partition columns in ``where`` clause
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Queries should include a ``WHERE`` clause with filters on Hive partitioning columns.
This allows Spark to read only a small set of files (*partition pruning*) instead of scanning the entire table, which drastically increases performance.

Supported operators are: ``=``, ``>``, ``<`` and ``BETWEEN``, and only against some **static** value.
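
A sketch of such a filter (``business_dt`` here is a hypothetical partitioning column of the table):

.. code-block:: python

    from onetl.connection import Hive
    from onetl.db import DBReader

    hive = Hive(...)

    reader = DBReader(
        connection=hive,
        source="schema.table",
        # filter on the partitioning column with a static range,
        # so Spark reads only the matching partitions
        where="business_dt BETWEEN '2025-01-01' AND '2025-01-31'",
    )
    df = reader.run()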