[DOP-9007] Rearrange installation documentation
dolfinus committed Sep 26, 2023
1 parent b8bd6dd commit 9557268
Showing 7 changed files with 111 additions and 42 deletions.
37 changes: 19 additions & 18 deletions README.rst
@@ -54,7 +54,7 @@ Requirements
* **Python 3.7 - 3.11**
* PySpark 2.3.x - 3.4.x (depends on the connector used)
* Java 8+ (required by Spark, see below)
-* Kerberos libs & GCC (required by ``Hive`` and ``HDFS`` connectors)
+* Kerberos libs & GCC (required by ``Hive``, ``HDFS`` and ``SparkHDFS`` connectors)

Supported storages
------------------
@@ -111,16 +111,16 @@ Documentation

See https://onetl.readthedocs.io/

-.. install
How to install
---------------

-.. minimal-install
+.. _install:

Minimal installation
~~~~~~~~~~~~~~~~~~~~

+.. _minimal-install:

Base ``onetl`` package contains:

* ``DBReader``, ``DBWriter`` and related classes
@@ -142,14 +142,16 @@ It can be installed via:
This method is recommended for third-party libraries which require ``onetl`` to be installed,
but do not use its connection classes.
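
For reference, a sketch of the command this paragraph describes (the concrete example is collapsed in this hunk):

.. code:: bash

    pip install onetl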

-.. _spark-install:

With DB and FileDF connections
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+.. _spark-install:

All DB connection classes (``Clickhouse``, ``Greenplum``, ``Hive`` and others)
and all FileDF connection classes (``SparkHDFS``, ``SparkLocalFS``, ``SparkS3``)
-require PySpark to be installed.
+require Spark to be installed.

+.. _java-install:

Firstly, you should install JDK. The exact installation instructions depend on your OS; here are some examples:
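
A sketch of typical commands (the package names are assumptions, not from this diff — check your distro):

.. code:: bash

    # CentOS / RHEL example
    yum install java-1.8.0-openjdk-devel

    # Debian / Ubuntu example
    apt-get install openjdk-11-jdk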

@@ -178,6 +180,8 @@ Compatibility matrix
| `3.4.x <https://spark.apache.org/docs/3.4.1/#downloading>`_ | 3.7 - 3.11 | 8u362 - 20 | 2.12 |
+--------------------------------------------------------------+-------------+-------------+-------+

+.. _pyspark-install:

Then you should install PySpark by passing ``spark`` to ``extras``:

.. code:: bash
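
    # assumed content of the collapsed block, matching the prose above:
    pip install onetl[spark]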
@@ -193,12 +197,11 @@ or install PySpark explicitly:
or inject PySpark into ``sys.path`` in some other way BEFORE creating a class instance.
**Otherwise the connection object cannot be created.**


-.. _files-install:

With File connections
~~~~~~~~~~~~~~~~~~~~~

+.. _files-install:

All File (but not *FileDF*) connection classes (``FTP``, ``SFTP``, ``HDFS`` and so on) require specific Python clients to be installed.

Each client can be installed explicitly by passing the connector name (in lowercase) to ``extras``:
@@ -216,18 +219,17 @@ To install all file connectors at once you can pass ``files`` to ``extras``:
**Otherwise class import will fail.**
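
For illustration, a sketch of the commands described above (the extras names are assumed from the connection class names):

.. code:: bash

    # a single client, e.g. for the SFTP connection class
    pip install onetl[sftp]

    # clients for all file connections at once
    pip install onetl[files]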


-.. _kerberos-install:

With Kerberos support
~~~~~~~~~~~~~~~~~~~~~

+.. _kerberos-install:

Most Hadoop instances are set up with Kerberos support,
so some connections require additional setup to work properly.

* ``HDFS``
Uses `requests-kerberos <https://pypi.org/project/requests-kerberos/>`_ and
-  `GSSApi <https://pypi.org/project/gssapi/>`_ for authentication in WebHDFS.
+  `GSSApi <https://pypi.org/project/gssapi/>`_ for authentication.
It also uses the ``kinit`` executable to generate a Kerberos ticket.

* ``Hive`` and ``SparkHDFS``
@@ -252,12 +254,11 @@ Also you should pass ``kerberos`` to ``extras`` to install required Python packa
pip install onetl[kerberos]
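
The ``kerberos`` extra covers only the Python packages; the Kerberos libraries themselves come from your OS. An illustrative sketch (the package names are assumptions, not from this diff):

.. code:: bash

    # Debian / Ubuntu example; names vary by distro
    apt-get install libkrb5-dev gcc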
-.. _full-install:

Full bundle
~~~~~~~~~~~

+.. _full-bundle:

To install all connectors and dependencies, you can pass ``all`` into ``extras``:

.. code:: bash
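
    # assumed content of the collapsed block, matching the prose above:
    pip install onetl[all]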
@@ -271,7 +272,7 @@ To install all connectors and dependencies, you can pass ``all`` into ``extras``

This method consumes a lot of disk space, and requires Java & Kerberos libraries to be installed into your OS.

-.. quick-start
+.. _quick-start:

Quick start
------------
8 changes: 8 additions & 0 deletions docs/install/files.rst
@@ -0,0 +1,8 @@
.. _install-files:

File connections
=================

.. include:: ../../README.rst
    :start-after: .. _files-install:
    :end-before: With Kerberos support
8 changes: 8 additions & 0 deletions docs/install/full.rst
@@ -0,0 +1,8 @@
.. _install-full:

Full bundle
===========

.. include:: ../../README.rst
    :start-after: .. _full-bundle:
    :end-before: .. _quick-start:
14 changes: 12 additions & 2 deletions docs/install/index.rst
@@ -3,9 +3,19 @@
How to install
==============

+.. include:: ../../README.rst
+    :start-after: .. _minimal-install:
+    :end-before: With DB and FileDF connections
+
+Installation in details
+-----------------------
+
.. toctree::
    :maxdepth: 1
    :caption: How to install

-    python_packages
-    java_packages
+    self
+    spark
+    files
+    kerberos
+    full
8 changes: 8 additions & 0 deletions docs/install/kerberos.rst
@@ -0,0 +1,8 @@
.. _install-kerberos:

Kerberos support
================

.. include:: ../../README.rst
    :start-after: .. _kerberos-install:
    :end-before: Full bundle
8 changes: 0 additions & 8 deletions docs/install/python_packages.rst

This file was deleted.

70 changes: 56 additions & 14 deletions docs/install/java_packages.rst → docs/install/spark.rst
@@ -1,17 +1,44 @@
+.. _install-spark:
+
+Spark
+=====
+
+.. include:: ../../README.rst
+    :start-after: .. _spark-install:
+    :end-before: .. _java-install:
+
+Installing Java
+---------------
+
+.. include:: ../../README.rst
+    :start-after: .. _java-install:
+    :end-before: .. _pyspark-install:
+
+Installing PySpark
+------------------
+
+.. include:: ../../README.rst
+    :start-after: .. _pyspark-install:
+    :end-before: With File connections

.. _java-packages:

-Java packages
-==============
+Injecting Java packages
+-----------------------

-``DB`` and ``FileDF`` connection classes require specific packages to be inserted to ``CLASSPATH`` of Spark session,
+Some DB and FileDF connection classes require specific packages to be inserted to ``CLASSPATH`` of Spark session,
like JDBC drivers.

This is usually done by setting the ``spark.jars.packages`` option while creating the Spark session:

.. code:: python

    # here is a list of packages to be downloaded:
-    maven_packages = Greenplum.get_packages(spark_version="3.2")
+    maven_packages = (
+        Greenplum.get_packages(spark_version="3.2")
+        + MySQL.get_packages()
+        + Teradata.get_packages()
+    )
    spark = (
        SparkSession.builder.config("spark.app.name", "onetl")
@@ -33,7 +60,7 @@ But sometimes it is required to:
There are several ways to do that.

Using ``spark.jars``
---------------------
+^^^^^^^^^^^^^^^^^^^^

The simplest solution, but it requires storing raw ``.jar`` files somewhere on the filesystem or a web server.
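
For illustration, a sketch of this approach (the path is an assumption):

.. code:: python

    # spark.jars accepts a comma-separated list of paths or URLs to .jar files
    spark = (
        SparkSession.builder.config("spark.app.name", "onetl")
        .config("spark.jars", "/path/to/some-package_1.0.0.jar")
        .getOrCreate()
    )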

@@ -67,7 +94,7 @@ The most simple solution, but this requires to store raw ``.jar`` files somewher
)

Using ``spark.jars.repositories``
----------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. note::

@@ -84,7 +111,12 @@ Can be used if you have access both to public repos (like Maven) and a private A

.. code:: python

-    maven_packages = Greenplum.get_packages(spark_version="3.2")
+    maven_packages = (
+        Greenplum.get_packages(spark_version="3.2")
+        + MySQL.get_packages()
+        + Teradata.get_packages()
+    )
    spark = (
        SparkSession.builder.config("spark.app.name", "onetl")
        .config("spark.jars.repositories", "http://nexus.mydomain.com/private-repo/")
@@ -94,7 +126,7 @@ Can be used if you have access both to public repos (like Maven) and a private A
Using ``spark.jars.ivySettings``
---------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Same as above, but can be used even if there is no network access to public repos like Maven.

@@ -194,7 +226,12 @@ Same as above, but can be used even if there is no network access to public repo
.. code-block:: python
    :caption: script.py

-    maven_packages = Greenplum.get_packages(spark_version="3.2")
+    maven_packages = (
+        Greenplum.get_packages(spark_version="3.2")
+        + MySQL.get_packages()
+        + Teradata.get_packages()
+    )
    spark = (
        SparkSession.builder.config("spark.app.name", "onetl")
        .config("spark.jars.ivySettings", "/path/to/ivysettings.xml")
@@ -203,7 +240,7 @@ Same as above, but can be used even if there is no network access to public repo
)
Place ``.jar`` file to ``~/.ivy2/jars/``
-----------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Can be used to pass an already downloaded file to Ivy, and skip resolving the package from Maven.
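
For illustration, a sketch of this approach (the file name is an assumption):

.. code:: bash

    # Ivy picks up .jar files placed in its local cache folder
    cp /path/to/some-package_1.0.0.jar ~/.ivy2/jars/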

@@ -213,15 +250,20 @@ Can be used to pass already downloaded file to Ivy, and skip resolving package f

.. code:: python

-    maven_packages = Greenplum.get_packages(spark_version="3.2")
+    maven_packages = (
+        Greenplum.get_packages(spark_version="3.2")
+        + MySQL.get_packages()
+        + Teradata.get_packages()
+    )
    spark = (
        SparkSession.builder.config("spark.app.name", "onetl")
        .config("spark.jars.packages", ",".join(maven_packages))
        .getOrCreate()
    )
Place ``.jar`` file to Spark jars folder
-----------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. note::

@@ -235,7 +277,7 @@ Place ``.jar`` file to Spark jars folder
Can be used to embed ``.jar`` files into the default Spark classpath.

* Download the ``.jar`` file (it's usually named something like ``some-package_1.0.0.jar``). The local file name does not matter, but it should be unique.
* Move it to the ``$SPARK_HOME/jars/`` folder, e.g. ``~/.local/lib/python3.7/site-packages/pyspark/jars/`` or ``/opt/spark/3.2.3/jars/``.
* Create the Spark session **WITHOUT** passing the package name to ``spark.jars.packages``:
.. code:: python
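
    # assumed content of the collapsed example: no spark.jars.packages here,
    # the .jar file is picked up from $SPARK_HOME/jars/ automatically
    spark = SparkSession.builder.config("spark.app.name", "onetl").getOrCreate()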
@@ -246,7 +288,7 @@ Can be used to embed ``.jar`` files to a default Spark classpath.
Manually adding ``.jar`` files to ``CLASSPATH``
------------------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. note::

