[DOP-9007] Rearrange installation documentation
dolfinus committed Sep 26, 2023
1 parent b8bd6dd commit 9557268
Showing 7 changed files with 111 additions and 42 deletions.
37 changes: 19 additions & 18 deletions README.rst
@@ -54,7 +54,7 @@ Requirements
* **Python 3.7 - 3.11**
* PySpark 2.3.x - 3.4.x (depends on the connector used)
* Java 8+ (required by Spark, see below)
-* Kerberos libs & GCC (required by ``Hive`` and ``HDFS`` connectors)
+* Kerberos libs & GCC (required by ``Hive``, ``HDFS`` and ``SparkHDFS`` connectors)

Supported storages
------------------
@@ -111,16 +111,16 @@ Documentation

See https://onetl.readthedocs.io/

-.. install
How to install
---------------

-.. minimal-install
+.. _install:

Minimal installation
~~~~~~~~~~~~~~~~~~~~

+.. _minimal-install:

Base ``onetl`` package contains:

* ``DBReader``, ``DBWriter`` and related classes
@@ -142,14 +142,16 @@ It can be installed via:
This method is recommended for third-party libraries which require ``onetl`` to be installed,
but do not use its connection classes.
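
For reference, a sketch of the command this paragraph describes (the concrete example is collapsed in this hunk):

.. code:: bash

    pip install onetl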

-.. _spark-install:

With DB and FileDF connections
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+.. _spark-install:

All DB connection classes (``Clickhouse``, ``Greenplum``, ``Hive`` and others)
and all FileDF connection classes (``SparkHDFS``, ``SparkLocalFS``, ``SparkS3``)
-require PySpark to be installed.
+require Spark to be installed.

+.. _java-install:

Firstly, you should install JDK. The exact installation instructions depend on your OS; here are some examples:
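
A sketch of typical commands (the package names are assumptions, not from this diff — check your distro):

.. code:: bash

    # CentOS / RHEL example
    yum install java-1.8.0-openjdk-devel

    # Debian / Ubuntu example
    apt-get install openjdk-11-jdk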

@@ -178,6 +180,8 @@ Compatibility matrix
| `3.4.x <https://spark.apache.org/docs/3.4.1/#downloading>`_ | 3.7 - 3.11 | 8u362 - 20 | 2.12 |
+--------------------------------------------------------------+-------------+-------------+-------+

+.. _pyspark-install:

Then you should install PySpark by passing ``spark`` to ``extras``:

.. code:: bash
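
    # assumed content of the collapsed block, matching the prose above:
    pip install onetl[spark]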
@@ -193,12 +197,11 @@ or install PySpark explicitly:
or inject PySpark into ``sys.path`` in some other way BEFORE creating a class instance.
**Otherwise the connection object cannot be created.**


-.. _files-install:

With File connections
~~~~~~~~~~~~~~~~~~~~~

+.. _files-install:

All File (but not *FileDF*) connection classes (``FTP``, ``SFTP``, ``HDFS`` and so on) require specific Python clients to be installed.

Each client can be installed explicitly by passing the connector name (in lowercase) to ``extras``:
@@ -216,18 +219,17 @@ To install all file connectors at once you can pass ``files`` to ``extras``:
**Otherwise class import will fail.**
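
For illustration, a sketch of the commands described above (the extras names are assumed from the connection class names):

.. code:: bash

    # a single client, e.g. for the SFTP connection class
    pip install onetl[sftp]

    # clients for all file connections at once
    pip install onetl[files]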


-.. _kerberos-install:

With Kerberos support
~~~~~~~~~~~~~~~~~~~~~

+.. _kerberos-install:

Most Hadoop instances are set up with Kerberos support,
so some connections require additional setup to work properly.

* ``HDFS``
Uses `requests-kerberos <https://pypi.org/project/requests-kerberos/>`_ and
-  `GSSApi <https://pypi.org/project/gssapi/>`_ for authentication in WebHDFS.
+  `GSSApi <https://pypi.org/project/gssapi/>`_ for authentication.
It also uses the ``kinit`` executable to generate a Kerberos ticket.

* ``Hive`` and ``SparkHDFS``
@@ -252,12 +254,11 @@ Also you should pass ``kerberos`` to ``extras`` to install required Python packa
pip install onetl[kerberos]
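
The ``kerberos`` extra covers only the Python packages; the Kerberos libraries themselves come from your OS. An illustrative sketch (the package names are assumptions, not from this diff):

.. code:: bash

    # Debian / Ubuntu example; names vary by distro
    apt-get install libkrb5-dev gcc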
-.. _full-install:

Full bundle
~~~~~~~~~~~

+.. _full-bundle:

To install all connectors and dependencies, you can pass ``all`` into ``extras``:

.. code:: bash
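
    # assumed content of the collapsed block, matching the prose above:
    pip install onetl[all]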
@@ -271,7 +272,7 @@ To install all connectors and dependencies, you can pass ``all`` into ``extras``

This method consumes a lot of disk space, and requires Java & Kerberos libraries to be installed into your OS.

-.. quick-start
+.. _quick-start:

Quick start
------------
8 changes: 8 additions & 0 deletions docs/install/files.rst
@@ -0,0 +1,8 @@
.. _install-files:

File connections
=================

.. include:: ../../README.rst
    :start-after: .. _files-install:
    :end-before: With Kerberos support
8 changes: 8 additions & 0 deletions docs/install/full.rst
@@ -0,0 +1,8 @@
.. _install-full:

Full bundle
===========

.. include:: ../../README.rst
    :start-after: .. _full-bundle:
    :end-before: .. _quick-start:
14 changes: 12 additions & 2 deletions docs/install/index.rst
@@ -3,9 +3,19 @@
How to install
==============

+.. include:: ../../README.rst
+    :start-after: .. _minimal-install:
+    :end-before: With DB and FileDF connections
+
+Installation in details
+-----------------------
+
.. toctree::
    :maxdepth: 1
    :caption: How to install

-    python_packages
-    java_packages
+    self
+    spark
+    files
+    kerberos
+    full
8 changes: 8 additions & 0 deletions docs/install/kerberos.rst
@@ -0,0 +1,8 @@
.. _install-kerberos:

Kerberos support
================

.. include:: ../../README.rst
    :start-after: .. _kerberos-install:
    :end-before: Full bundle
8 changes: 0 additions & 8 deletions docs/install/python_packages.rst

This file was deleted.

70 changes: 56 additions & 14 deletions docs/install/java_packages.rst → docs/install/spark.rst
@@ -1,17 +1,44 @@
+.. _install-spark:
+
+Spark
+=====
+
+.. include:: ../../README.rst
+    :start-after: .. _spark-install:
+    :end-before: .. _java-install:
+
+Installing Java
+---------------
+
+.. include:: ../../README.rst
+    :start-after: .. _java-install:
+    :end-before: .. _pyspark-install:
+
+Installing PySpark
+------------------
+
+.. include:: ../../README.rst
+    :start-after: .. _pyspark-install:
+    :end-before: With File connections

.. _java-packages:

-Java packages
-==============
+Injecting Java packages
+-----------------------

-``DB`` and ``FileDF`` connection classes require specific packages to be inserted to ``CLASSPATH`` of Spark session,
+Some DB and FileDF connection classes require specific packages to be inserted to ``CLASSPATH`` of Spark session,
like JDBC drivers.

This is usually done by setting the ``spark.jars.packages`` option while creating the Spark session:

.. code:: python

    # here is a list of packages to be downloaded:
-    maven_packages = Greenplum.get_packages(spark_version="3.2")
+    maven_packages = (
+        Greenplum.get_packages(spark_version="3.2")
+        + MySQL.get_packages()
+        + Teradata.get_packages()
+    )
    spark = (
        SparkSession.builder.config("spark.app.name", "onetl")
@@ -33,7 +60,7 @@ But sometimes it is required to:
There are several ways to do that.

Using ``spark.jars``
---------------------
+^^^^^^^^^^^^^^^^^^^^

The simplest solution, but it requires storing raw ``.jar`` files somewhere on the filesystem or a web server.
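
For illustration, a sketch of this approach (the path is an assumption):

.. code:: python

    # spark.jars accepts a comma-separated list of paths or URLs to .jar files
    spark = (
        SparkSession.builder.config("spark.app.name", "onetl")
        .config("spark.jars", "/path/to/some-package_1.0.0.jar")
        .getOrCreate()
    )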

@@ -67,7 +94,7 @@ The most simple solution, but this requires to store raw ``.jar`` files somewher
)

Using ``spark.jars.repositories``
----------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. note::

@@ -84,7 +111,12 @@ Can be used if you have access both to public repos (like Maven) and a private A

.. code:: python

-    maven_packages = Greenplum.get_packages(spark_version="3.2")
+    maven_packages = (
+        Greenplum.get_packages(spark_version="3.2")
+        + MySQL.get_packages()
+        + Teradata.get_packages()
+    )
    spark = (
        SparkSession.builder.config("spark.app.name", "onetl")
        .config("spark.jars.repositories", "http://nexus.mydomain.com/private-repo/")
@@ -94,7 +126,7 @@ Can be used if you have access both to public repos (like Maven) and a private A
Using ``spark.jars.ivySettings``
---------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Same as above, but can be used even if there is no network access to public repos like Maven.

@@ -194,7 +226,12 @@ Same as above, but can be used even if there is no network access to public repo
.. code-block:: python
    :caption: script.py

-    maven_packages = Greenplum.get_packages(spark_version="3.2")
+    maven_packages = (
+        Greenplum.get_packages(spark_version="3.2")
+        + MySQL.get_packages()
+        + Teradata.get_packages()
+    )
    spark = (
        SparkSession.builder.config("spark.app.name", "onetl")
        .config("spark.jars.ivySettings", "/path/to/ivysettings.xml")
@@ -203,7 +240,7 @@ Same as above, but can be used even if there is no network access to public repo
)
Place ``.jar`` file to ``~/.ivy2/jars/``
-----------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Can be used to pass an already downloaded file to Ivy, and skip resolving the package from Maven.
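
For illustration, a sketch of this approach (the file name is an assumption):

.. code:: bash

    # Ivy picks up .jar files placed in its local cache folder
    cp /path/to/some-package_1.0.0.jar ~/.ivy2/jars/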

@@ -213,15 +250,20 @@ Can be used to pass already downloaded file to Ivy, and skip resolving package f

.. code:: python

-    maven_packages = Greenplum.get_packages(spark_version="3.2")
+    maven_packages = (
+        Greenplum.get_packages(spark_version="3.2")
+        + MySQL.get_packages()
+        + Teradata.get_packages()
+    )
    spark = (
        SparkSession.builder.config("spark.app.name", "onetl")
        .config("spark.jars.packages", ",".join(maven_packages))
        .getOrCreate()
    )
Place ``.jar`` file to Spark jars folder
-----------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. note::

@@ -235,7 +277,7 @@ Place ``.jar`` file to Spark jars folder
Can be used to embed ``.jar`` files into the default Spark classpath.

* Download the ``.jar`` file (it's usually named something like ``some-package_1.0.0.jar``). The local file name does not matter, but it should be unique.
* Move it to the ``$SPARK_HOME/jars/`` folder, e.g. ``~/.local/lib/python3.7/site-packages/pyspark/jars/`` or ``/opt/spark/3.2.3/jars/``.
* Create the Spark session **WITHOUT** passing the package name to ``spark.jars.packages``:
.. code:: python
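
    # assumed content of the collapsed example: no spark.jars.packages here,
    # the .jar file is picked up from $SPARK_HOME/jars/ automatically
    spark = SparkSession.builder.config("spark.app.name", "onetl").getOrCreate()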
@@ -246,7 +288,7 @@ Can be used to embed ``.jar`` files to a default Spark classpath.
Manually adding ``.jar`` files to ``CLASSPATH``
------------------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. note::

