
Detailed Guides AmbariKave



Installation quickstart

This section contains only a quickstart guide with the most basic information. For a full discussion and a step-by-step guide, read on to the next section.

KaveToolbox: install on your single Centos6/7 or Ubuntu development machine, single-node machine, or vm to use IPython and related tools (installation can take up to 20 minutes). You then have the same tools locally as on your Ambari-driven KAVE cluster.

yum -y install wget curl tar zip unzip gzip rsync python
wget http://repos:kaverepos@repos.kave.io/noarch/KaveToolbox/3.0-Beta/kavetoolbox-installer-3.0-Beta.sh
bash kavetoolbox-installer-3.0-Beta.sh [--quiet]

(--quiet gives a quieter install.) (NB: the repository server uses a semi-private password only as a means of avoiding robots and reducing DOS attacks; this password is intended to be widely known and is used here as an extension of the URL.) (NB: yum is the standard package manager for Centos/Redhat. To install on Ubuntu the equivalent is apt-get.) Then, to browse through the examples:

cd /opt/KaveToolbox/pro/examples
ipython notebook

And/or visit http://nbviewer.ipython.org/

AmbariKave: assuming you already fulfill the basic requirements shortlist, install on the Centos6/7 or Redhat7 admin node of your cluster to configure an entire distributed system, with many optional services for you to add as and when you need them. (Installation is quick; deployment of services can take 20 minutes. Your ambari node and ipa node must have pre-configured ssh-key based access to all nodes in the cluster.)

yum -y install wget curl tar zip unzip gzip python
wget http://repos:kaverepos@repos.kave.io/noarch/AmbariKave/3.0-Beta/ambarikave-installer-3.0-Beta.sh
bash ambarikave-installer-3.0-Beta.sh

Then, to provision your cluster, go to http://YOUR_AMBARI_NODE:8080 or deploy using a blueprint; see https://cwiki.apache.org/confluence/display/AMBARI/Blueprints. (NB: the repository server uses a semi-private password only as a means of avoiding robots and reducing DOS attacks; this password is intended to be widely known and is used here as an extension of the URL.)

(If you don't have a cluster yet, the resource_wizard script can help you make an initial guess about provisioning. We have a dev suite to help with clusters on aws, and are working on a similar azure deployment.)

Relationship to Ambari

  • We don't change how Ambari works, looks or feels; we only teach it how to install a few more things.

AmbariKave extends Ambari by adding some more services. It does this by adding a stack to Ambari. Ambari is nicely extensible: adding a stack does not interfere with older stacks, nor can it interfere with already running services.

This means there are two general ways to install these services:

  • Install Ambari however you wish, and then add our patch for a new stack
  • Use our wrapper around the Ambari installer, which installs Ambari and also patches it

Once you have installed Ambari with the KAVE extensions, no other Ambari functionality is modified; you can then use Ambari normally and choose which services/software to install through it. It is not the purpose of this wiki to explain Ambari in detail, because the existing documentation for Ambari is excellent: see the Apache Ambari website.

(read more here about Kave Versioning)

What does AmbariKave actually install?

The installer we provide, used as shown above (wget the AmbariKave script and run it), actually installs only the cluster management tool called Ambari. We modify Ambari only slightly, teaching it how to install our favorite services (see the KAVE wiki). Ambari has certain prerequisites, e.g. it uses a Postgres backing database and java, but otherwise it installs very minimal software on your machine; you then use Ambari to deploy more software a-la-carte. It is not the purpose of this wiki to explain Ambari in detail, because the existing documentation for Ambari is excellent: see the Apache Ambari website.

Once Ambari is installed with our extensions, it can be used as a tool to install and administer a lot more software/services across a wide cluster.

(read more here about Kave Versioning)

Other installation options and ideas:

Most installations will probably be fresh installations over a Centos7 cluster, but it is also possible to use just our deployment components, or just add our stack within an existing cluster (see the readme). Advanced administrators may also wish to review installation on a component-by-component basis, or install some equivalent component for which they have bought a commercial licence, and build up something like a KAVE without using the Ambari installer, on the cloud, or using docker containers. We think this is fine and we encourage it, especially within a running production environment. We also advocate separation of concerns: install a data science cluster next to existing production systems, rather than within them. In the end the choice is up to you!

Basic requirements shortlist for AmbariKave

If you've decided that AmbariKave is the way to go, then you will need:

  1. Sufficient computing resources: AmbariKave is installed across a cluster of machines. These can be physical or virtual, or even one very large single machine. The exact resource requirements differ for each install and are discussed below. There is also a resource_wizard script that can help you make an initial guess at what you need. We recommend testing on virtual machines or cloud before deploying in a live system.
  2. Centos6, Centos7 or Redhat7: as of 2.2-Beta we support Centos6/7 and Redhat7. In previous releases only Centos6 was supported; in future releases Centos6 support will be removed.
  3. Fresh + nonconflicting images: Ambari expects to be installed in a fresh cluster with only the most basic software already installed. Mostly this means creating new blank virtual machines. If other software is already installed on your cluster, it is not possible to guess the conflicts which may arise. Think very carefully about whether you can start a separate fresh cluster to test for conflicts before deploying in a live system.
  4. Static internal IP: Ambari and several other cluster services cache the IPs of other machines within the cluster. If an IP changes, re-installation or manual modification of local files may be required, which can be very tricky. It is far better if the (virtual) machines have static internal IPs.
  5. Fully qualified domain names: each machine needs a valid fully qualified domain name. On each machine, the result of hostname -s + '.' + hostname -d must be equal to hostname -f. Also, the result of uname -n should show the same hostname as hostname -f. (See the check script after this list.)
  6. Correctly styled domains: KAVE also uses the domain as a Kerberos realm, and thus the domain must be correctly styled. In simple terms this entails that the domain itself needs at least one 'dot' in it, like google.com or eu-west.ec2.com. Therefore the fully-qualified domains of the machines will have at least two 'dots' in them, like machine.example.com or anode.eu-west.ec2.com.
  7. No capital letters: the fully qualified domain names of machines in the cluster cannot contain capital letters, since Ambari is not case-insensitive in all cases. To avoid problems use only lower-case letters in the names of machines.
  8. Forward and reverse lookup: forward and reverse DNS lookup must work for machines within the cluster. You can check this by running yum -y install bind-utils, then host some.fully.qualified.name (should result in the internal IP of that machine) and host some.internal.ip (should show the fully qualified domain name of the machine). In most cases this implies a correctly set-up DNS chain for the cluster, perhaps meaning a DNS server installed next to the cluster with a known IP. (See the check script after this list.)
  9. Root access: the root user on the "admin nodes" (where you will install ambari and FreeIPA) must have ssh access as root on all other machines (ideally using ssh keys), without needing to type "yes" anywhere. Usually this is done separately for the two machines, once for Ambari, once for FreeIPA.
  10. Configured internal firewall: well-configured internal firewalls must allow communication between machines on all required ports (or disable local iptables/selinux). This is difficult to get right every time; hadoop, for example, uses many different ports across the cluster, so it is often simpler to rely on a wrapping firewall around the cluster and disable local firewalls and SELinux, at least during install.
  11. Internet access during installation: open access to the internet is recommended, or else well-cached local mirrors of at least the yum, ambari and AmbariKave repos (see below).
  12. NTP server (or otherwise synchronized clocks): if the machines in the cluster do not have ntp installed to synchronize their system clocks, then ssl authentication during Ambari installation may fail. The nodes must all have very similar and preferably synchronized system clocks.
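
Items 5 and 8 can be verified per node with a few shell commands. Below is a minimal check sketch, assuming bind-utils provides the host command:

# Item 5: short hostname + '.' + domain must equal the FQDN
short=$(hostname -s); domain=$(hostname -d); fqdn=$(hostname -f)
[ "${short}.${domain}" = "${fqdn}" ] && echo "FQDN OK: ${fqdn}" || echo "FQDN MISMATCH"
[ "$(uname -n)" = "${fqdn}" ] || echo "uname -n does not match hostname -f"

# Item 8: forward lookup gives the internal IP, reverse lookup gives the FQDN again
yum -y install bind-utils
ip=$(host "${fqdn}" | awk '/has address/ {print $4; exit}')
echo "forward: ${fqdn} -> ${ip}"
host "${ip}"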

AmbariKave Installation Guide

One of the core ideas behind the KAVE is that it supports data science technology at various levels of maturity. The number of possible setups or combinations of tools you can choose is very large. While this is considered an advantage, it also places a big responsibility in your hands: you are the one choosing the tools you need and deciding where they should go. AmbariKave is there to empower you and guide you in the tool selection, but you remain in control.

What we do in this guide is:

  • Help you choose the tools you will install
  • Give resource guidelines
  • Give network layout guidelines
  • Give examples of clusters you can install

What we won't do in this guide is:

  • Show you how to provision the machines for your cluster (your infrastructure experts will do this, different for every data centre)
  • Help you setup the security of your Kave (your cluster admin will do this, different for every organization)
  • Arrange backups for your services, data, and analytics (this is outside of the scope of this installer, different for every project)

While these last points are very important and must never be overlooked, it is not feasible to give a one-size-fits-all solution for these concerns.

Who is this guide for?

  1. The beta-tester enthusiast who would like to get their hands stuck into a modern data science toolkit
  2. The systems administrator who has been asked to install AmbariKave in their cluster

What do I need to begin?

  • Access to the internet
  • A machine somewhere in the world with an ssh client, and open access to ssh out over port 22

How should the data decide what specifications I need?

The data itself, its complexity, size, variety, and format, and what you want to do with it define the specifications you need. It is not possible to imagine one concrete KAVE system able to absorb every possible data use case in the cheapest way; each use case needs to be thought about.

  • Do you have realtime data? Then think about high-bandwidth guaranteed uptime, hosted and maintained, with a storm cluster
  • Do you have a small amount of complex data? Then think along the lines of a single large machine with KaveToolbox installed
  • Do you have a large amount of static data? Then think about hadoop or a different batch system to perform whatever tasks can be done in parallel, and an environment in which you can do effective combinations.
  • Do you have a monthly or weekly influx of large amounts of data? This is a typical batch-processing scenario.

KAVE components are chosen so as to be horizontally scalable. As your data size grows, so can your cluster. Scalability is the only way to break out of traditional big data limitations.

How should what I want to do with the data decide what components I need?

It's not the case that scalability is only about data size; it's also about what you want to do with that data.

  • Do you need to output insights to some external web server? Think about using JBOSS with MongoDB
  • Do you need realtime results for your realtime data? Think about an optimized storm topology
  • Will you be developing your solution in a team? Consider the development line components, Gitlabs and Twiki
  • Will this be some production system eventually? If so, then consider continuous integration and code quality checks with Sonar and Jenkins

It is unlikely that your first data exploration requires every KAVE component, which is why KAVE is modular, take what you need, when you need it!

Resource guidelines

Now that you understand how the data itself and what you want to do with it determine the cluster size, the resource_wizard script can help you make an initial guess at the size of the cluster you need.

The AmbariKave installer can be used for one machine, a small cluster of machines, or a very large cluster of machines. The overhead associated with running so many different services drops dramatically as the number of machines in the cluster increases. For a large enough cluster, what was once pure overhead becomes an absolute necessity for stability.

A single-node setup is very possible, perhaps on your very powerful laptop or your extremely powerful large machine; however, running every tool on one machine would need a static draw of almost 16 GB of RAM on at least a quad-core machine, and does not take advantage of the main power of hadoop. In a single-node setup, consider installing a very limited set of components: do you really need hadoop there? Is the KaveToolbox with IPython notebooks not already good enough?

Most likely, then, you are going to have a large database within a big KAVE, living on the cloud or in your datacentre, and install the client software on your laptop or within a virtual machine for ease of development.

  • Thin Client Laptop: you can connect to your kave with a local thin client machine, provided that your kave has a big enough gateway server. The thin client will need at least an ssh client and a vnc-viewer, with a reasonable internet connection. For added simplicity, also install firefox with the FoxyProxy plugin. At this stage you can use your preferred operating system.
  • Development machine: the development machine would be a local laptop, desktop or virtual machine running Centos6/Centos7 or Ubuntu, on which you can install the KaveToolbox software and then perform limited analytical tasks. An 8-core machine with 1 TB of disk and 32 GB of RAM would be ideal for this; smaller machines will limit your performance. For added simplicity, also install firefox with the FoxyProxy plugin.
  • Nodes: it is up to you how you provision your nodes in the remote cluster. The better the nodes, the faster the analytics. The minimum size is 1 core, 2 GB ram and 20 GB disk per node. Only Centos6/7 is supported.
  • Gateways: the gateway machine is the center of your analysts' interaction with the kave. As such, it must be fast and provisioned with enough resources for your analysts. See more details below. Only Centos6/7 is supported.

See also: Access Security and Privacy

Network layout guidelines

Most of the tools in a KAVE are not designed with security as their first concern; most are designed to do one specific thing (storing data or calculation). The consequence is that the first and best line of data security is controlling access to your cluster through a very strict firewall.

There are two points that cannot be stressed enough:

  1. You are responsible for your own data privacy and security; limiting access and separating concerns through the use of multiple KAVEs can give you fine-grained control
  2. If your own analysts cannot access your own KAVE quickly and efficiently, the data locked away inside are useless to you.

We therefore encourage establishing a gateway node on your cluster, which is accessed through ssh (secure shell) by the analysts. This can be achieved with public-private key authentication over ssh for guaranteed security. When internal resources must be contacted, port forwarding over ssh can be used, even dynamically, to simulate a vpn or proxy.
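
For example, assuming a gateway reachable as gate001.kave.org (a hypothetical name), analysts could work like this:

# plain ssh login to the gateway, authenticating with a private key
ssh -i ~/.ssh/id_rsa analyst@gate001.kave.org

# forward one internal service (here a hypothetical notebook on node001, port 8888) to localhost
ssh -L 8888:node001.kave.org:8888 analyst@gate001.kave.org

# or open a dynamic SOCKS proxy, usable with firefox+FoxyProxy as a pseudo-vpn
ssh -D 1080 analyst@gate001.kave.org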

A typical small cluster will be composed of:

  • An ambari node
  • An ipa node
  • A gateway machine
  • Servers/service dedicated nodes
  • Slave/task dedicated nodes
  • Edge nodes

Usually:

  • only the gateway will permit access into the kave from the outside world
  • the edge nodes will expose certain services such as a web-front-end or jboss server without granting other internal access rights
  • all internal machines will be able to reach each other in an unrestricted manner, with the exception of edge nodes whose access should be limited to avoid exploits.

See also: Access Security and Privacy

Nodes

It is up to you how you provision your nodes in the remote cluster. The better the nodes, the faster the analytics. The minimum size is 1-core 2 GB-ram 20GB disk per node.

  • Centos6/7 installed
  • A domain-specific unique fully-qualified domain name with reverse-lookup capability for itself and the rest of the cluster.

The resource_wizard script can help you make an initial guess at the size of the cluster you need.

Ambari

The Ambari node is the lynchpin of your system. Common practice would be to ensure it:

  • is only accessible by a single admin user
  • is only accessible from within the KAVE network
  • has Centos6/7 installed

Your Ambari node will be in charge of the services and clients installed within the cluster and will monitor their status.

The Ambari node must have ssh-key-based passwordless access to the remainder of the machines in your KAVE in order to install software onto them. All the machines must be able to contact the Ambari server on a wide range of ports, and the Ambari server will need to speak back to them on a wide range of ports.

Usual specs:

  • disk: 20 GB OS, 10 GB /opt, 2 GB /var/log, 4 GB /usr/hdp, 2 GB /var/lib/ambari-server, 2 GB /var/lib/ambari-agent, 10 GB /var/lib/ambari-metrics-collector
  • cpu: 2 cores
  • ram: 6 GB

ipa-node

The IPA node is the security hub of your cluster. Common practice would be to ensure it:

  • is only accessible by a single admin user
  • is only accessible from within the KAVE network
  • has Centos6/7 installed

Your IPA node will be in charge of users and groups, Kerberos, DNS, NTP and LDAP.

The IPA node must also have ssh-key-based passwordless access to the remainder of the machines in your KAVE in order to control their users. If there is one machine in the cluster which needs guaranteed uptime, it is your freeipa node.

Usual specs:

  • disk: 20 GB OS, 2 GB /var/log, 4 GB /usr/hdp, 2 GB /var/lib/ambari-agent
  • cpu: 1 core
  • ram: 3.5 GB

Gateway machine

The gateway machine is the single point of ingress for your data science team. It is the only machine with an ssh daemon reachable from the outside world. You may consider limiting access to this machine to a restricted list of IP addresses (your organization's). At least one member of your data science team should be a sudoer on the gateway node; otherwise it is very difficult to work fluidly.

  • (2 cores + 4 GB RAM) + (1 core + 2 GB RAM) × (number of simultaneous users); see the worked example after this list
  • (100 GB)*(number of all-time users) home directory
  • 20 GB "/" free on top of system size, or direct mount of 10 GB as /opt/
  • 100 GB "/tmp" size
  • 2 GB /var/log, 4 GB /usr/hdp, 2 GB /var/lib/ambari-agent
  • GB Ethernet with high upload bandwidth for VNC connections
  • We recommend that any servers/services requiring 100% uptime (e.g. Hue/Ganglia/nagios/ldap) are not run on the analysis workstation, since analysis users will have erratic usage with very high peaks; run such services on dedicated servers in the network.
  • Centos6/7 installed
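
As a worked example of the sizing rule above: a gateway serving 5 simultaneous users needs roughly (2 + 1×5) = 7 cores and (4 + 2×5) = 14 GB of RAM, and with 20 all-time users around 20 × 100 GB = 2 TB of home-directory space.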

Hadoop

Hadoop is divided into different server/slave components which can be placed on different machines. If this is your first time installing hadoop, simply attempt to separate all of the server components from the client/slave components; perhaps contact an expert or read up on hadoop first to understand what the components are doing.

Ideally your hadoop nodes will be separated from any of your edge nodes, requiring a hop through an intermediate database to ensure data privacy. Your hadoop nodes will need a very high file I/O capacity.

  • 2 GB /var/log, 4 GB /usr/hdp, 2 GB /var/lib/ambari-agent
  • Centos6/7 installed

The resource_wizard script can help you make an initial guess at the size of the cluster you need.

Storm

Your storm machines will need a very high network bandwidth and a lot of RAM. These nodes will be continuously in very fast contact with each other, and this means they should be physically close with as high a network bandwidth to each other as possible, for example within the same rack, or virtualized on the same host.

  • Centos6/7 installed
  • 2 GB /var/log, 2 GB /var/lib/ambari-agent

Edge nodes

Have a clear idea about what your edge nodes are, what they are for, and whom they can talk to. Make this a part of your network design.

  • Centos6/7 installed

Disk usage guidelines

What you notice from the nodes above is that there is a very common set of directories to consider as mountpoints in the file system. Allocating specific disks to certain mountpoints is a good idea, to prevent runaway logfiles from interfering with other running services.

  • Typical disk layout for all machines:
    /var/log (2 GB): log files; almost all services write somewhere in this directory.
    /var/lib/ambari-agent (2 GB): the ambari-agents contain caches of services, their own logs and the commands they were issued; this directory can get quite heavily used on a busy machine with lots of services.
    /tmp (>10 GB): tmp should be separately mounted to a volatile disk (not backed up).
  • Typical additional disk for the ambari node:
    /var/lib/ambari-server (2 GB): the ambari-server contains many different pieces of java software and also some logs and data directories.
  • Typical additional disk for any nodes running hadoop services or clients:
    /usr/hdp (4 GB): some HDP clients are a very large install and the latest HDP versions install into this directory.
  • Typical additional disk for the ambari-metrics server:
    /var/lib/ambari-metrics-collector (>= 10 GB): in embedded mode ambari metrics writes a lot of service status reports to disk. Usually this is the ambari node in the standard configuration, but it could very well be elsewhere in the cluster.
  • Typical additional disk for all nodes except ambari:
    /opt (10 GB): KaveToolbox, with potentially several versions of python, needs disk space in the /opt directory to install. This is configurable, but by default /opt is used.
  • Typical additional disks for the gateway machine (edge node):
    /home (100 GB per user): the users on your cluster will need some space to play around with the data. More users need more space. This disk needs to be backed up conscientiously.
    /data (input data size): when transferring data into and out of your cluster it is very likely that some intermediate copy will be needed, so you will probably need to mount a single large enough disk as /data on at least one edge node. If only used for intermediate data, no backup of this drive is required.
    /tmp (>100 GB): users on the gateway will usually be running software which needs to write intermediate files to tmp; again, this can be a volatile disk with no backup.
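
As an illustration of dedicating disks to mountpoints, a hypothetical /etc/fstab fragment for a gateway machine could look as follows; the device names and filesystem types are assumptions and will differ in every environment:

# hypothetical devices, adjust to your own volumes
/dev/sdb1   /home   ext4   defaults   0 2   # 100 GB per user, backed up conscientiously
/dev/sdc1   /data   ext4   defaults   0 2   # sized to your input data
/dev/sdd1   /tmp    ext4   defaults   0 0   # volatile, no backup needed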

You can find many examples of working layouts in the standard tests directory for our software, for example: examplelambda.aws.json. More about these json files is described below and in the deployment readme.

Getting started

With the previous concerns in mind we can get started. Building a cluster follows an ABC principle.

A: Acquire & Ambari

A plan and machines are needed to start building your Kave. First decide what tools you need, determine what machines you need based on that, and provision them. Here provisioning means dragging them into your domain, creating them within your virtual environment, or buying the hardware.

  • Decide on your firewall/security structure and implement it; see also: Access Security and Privacy
  • Install the Centos6/7 OS onto those machines
  • Designate a machine to be the ambari node
  • Designate a machine to be the ipa node
  • Distribute ssh keys, such that the root user on the ambari machine and the root user on the ipa node both have passwordless access as the root user on all other nodes in the cluster (more info)
  • Ensure you have registered the cluster nodes' host-keys on both the ambari node and the ipa node, to avoid needing to type "yes" at the ssh prompt for every first-time connection (see the sketch after this list)
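
A minimal sketch of those last two steps, run as root on the ambari node and repeated on the ipa node (the node names are placeholders for your own FQDNs):

# generate a key pair once, if the root user does not yet have one
ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa

for node in node001.kave.org node002.kave.org gate001.kave.org; do
    ssh-keyscan ${node} >> /root/.ssh/known_hosts      # register the host-key, avoids the "yes" prompt
    ssh-copy-id -i /root/.ssh/id_rsa.pub root@${node}  # push the public key (asks for the root password once)
done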

Then, get onto your ambari node (probably over ssh), and call:

yum -y install wget curl tar zip unzip gzip python
wget http://repos:kaverepos@repos.kave.io/noarch/AmbariKave/3.0-Beta/ambarikave-installer-3.0-Beta.sh
bash ambarikave-installer-3.0-Beta.sh

(where X.Y-Beta is the KAVE version you wish to install; 3.0-Beta is the latest as of October 2016) ( NB: yum is the standard package manager for Centos/Redhat. To install on Ubuntu the equivalent is apt-get )

Using Ambari to distribute servers is in principle described on the Ambari wiki itself, however the hints below may be useful for you.

You can either navigate to the website of your ambari node to http://YOUR_AMBARI_NODE:8080/ and provision your entire cluster through the web interface, or you can use blueprints.

For more about blueprints see: https://cwiki.apache.org/confluence/display/AMBARI/Blueprints

B: Blueprint

Create a blueprint for your desired toolstack. For more about blueprints see: https://cwiki.apache.org/confluence/display/AMBARI/Blueprints. We also have plenty of example blueprints in: https://github.com/KaveIO/AmbariKave/tree/master/tests/integration/blueprints

C: Cluster

Assign your provisioned nodes to the host groups in your blueprint, using a "cluster file". For more about blueprints see: https://cwiki.apache.org/confluence/display/AMBARI/Blueprints. A sketch of registering both files over REST follows below.
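
A minimal sketch of that REST registration, using the standard Ambari blueprint API; the blueprint name, admin credentials and file names are placeholders, and the cluster file POSTed here must be in Ambari's cluster-creation-template format (mapping host groups to FQDNs):

# register the blueprint, then create a cluster from it plus the cluster file
curl -u admin:admin -H "X-Requested-By: ambari" -X POST -d @my_blueprint.json http://YOUR_AMBARI_NODE:8080/api/v1/blueprints/mykave
curl -u admin:admin -H "X-Requested-By: ambari" -X POST -d @my_cluster_file.json http://YOUR_AMBARI_NODE:8080/api/v1/clusters/mykave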

Examples

As seen from the previous sections, a lot should be taken into consideration. Following are a couple of examples which show sample clusters that can be deployed. In these examples we imagine a fictitious domain name, kave.org, and a fictitious set of machines with their own names.

Example 1 - A basic multi node setup (KaveManagement)

In this example we will be installing something suitable for a management-like kave environment, where your analysts use their own laptops to develop and work with small datasets, or where you collect together the code from other KAVEs.

In this example, the firewall would allow ssh communication on port 22 into the gitlab node only to specific corporate IPs, the gateway would also be contactable by registered users, perhaps requiring ssh-key based access for those users. All internal network communication would be allowed.

In this example a typical provisioning would be:

  • Gateway: 4-core, 8 GB-ram, 70 GB disk
  • Gitlab: 2-core, 4 GB-ram, 200 GB disk with a very regular backup policy
  • ambari: 2-core, 6 GB-ram, 50 GB disk space
  • ipa: 1-core, 3.5 GB-ram, 20 GB disk space

See the example blueprint and the example clusterfile that match this layout.

Example 2 - A basic hadoop node setup (KaveBasic)

In this example we will be installing a basic hadoop configuration. A shared gateway with KaveToolbox allows your small team of four or so analysts to work entirely on this cluster if required, with no data or code being transferred outside the cluster.

In this example, the firewall would allow ssh communication on ports 22 and 443 into the gateway node only, and all communication internally would be permitted. Analysts will use vnc or firefox to access internal resources over the secure ssh connection.

In this example a typical provisioning would be:

  • Gateway: 8-core, 32 GB-ram, 200 GB disk, very good internet connection for upload and download
  • Gitlab, ipa and ambari: same as in example 1
  • All other nodes: 8-core, 36 GB-ram, with nominally 50 GB of disk space
  • Hadoop nodes: each with 1 TB of additional JBOD disk optimized/mounted for hadoop use

See the example blueprint and the example clusterfile that match this layout.

Example 3 - An advanced cluster setup (KaveLambda)

In this example we have a complete lambda stack.

In this example, we extend example 2 with extra capabilities. The firewall configuration here is similar, but now an extra edge node for JBOSS is configured. In this example, jboss can only make requests of the mongodb, but is otherwise accessible from within the cluster and from a certain external application server, for example a pre-existing tableau instance.

In this example a typical provisioning would be the same as example 2. With an example addition of:

  • 200GB of disk for the mongodb node (consider using SSD here)
  • Excellent backup provisioning of the mongodb and JBOSS nodes

See the example blueprint and the example clusterfile that match this layout.

Internet during installation, firewalls and nearside cache/mirror options

Ideally all of your nodes will have access to the internet during installation in order to download software.

If this is not the case, you may be able to implement a near-side cache/mirror of all required software. This is not very easy, but once it is done, you can keep it for later.

Setting up a local near-side cache for the KAVE tool stack is quite easy. First either copy the entire repository website to your own internal apache server, or copy the contents of the directories to your own shared directory visible from every node.

mkdir -p /my/shared/dir
cd  /my/shared/dir
wget -r -np http://repos.kave.io/

Then create a /etc/kave/mirror file on each node containing the new top-level directory or URL to try first before looking for our website, e.g. one of:

echo "/my/shared/dir" > /etc/kave/mirror
echo "http://my/local/apache/mirror" > /etc/kave/mirror

So long as the directory structure of the near-side cache is identical to our website's, you can drop, remove or replace any local packages you will never install, and update the mirror as our repo server updates.

Security during the installation process

The installation of services may require the nodes to have (at least temporary) internet access. It may also involve the root user on your machines executing commands which transfer passwords or key files. If your cluster is already exploited while the installation is taking place, your entire cluster should be considered compromised. During install you may consider downtime from production mode, disconnection from other services or systems, or temporarily revoking user access.

You may also consider not modifying a production system, but instead creating a pre-production shadow system with any new configurations and services, cloning the data as a first step into the new system, and then transferring rights and privileges across before switching the new system out for the old.

Troubleshooting the install

See the Troubleshooting Guide.

User management and integration, FreeIPA

Once your cluster is deployed, it might be that you wish to connect your services to an existing LDAP or Kerberos provider. Each of the tools we deploy nominally allows for such connection with sufficient configuration.

If you wish to use an integrated solution where users and authentication are also managed within the KAVE (on the ipa node, by an administrator), the solution we use for that is FreeIPA.

FreeIPA has the advantage that it is the one-stop-shop for user management of all the components we use in the KAVE and can provide a single interface and database of users, groups, keyfiles, kerberos, ldap, CA, applied across your entire KAVE.

The KAVE offers several levels of integration with FreeIPA.

  • Centralized user management for LDAP enabled services
  • Centralized user management for terminal access
  • Fully kerberized HDP setup

The required integration depends on the environment in which you are deploying the KAVE.

(In previous KAVE releases <=2.1 we required you to install the FreeIPA Server on the admin node alongside Ambari, but we no longer require that.)

For LDAP support, the FreeIPA Server component needs to be installed, and the services you want to enjoy centralized user management need to be configured to use this machine as their ldap server.

For terminal access, the FreeIPA Client component needs to be installed on every node in the cluster, next to the FreeIPA Server component. This registers the nodes with the server and allows for SSH access.

NOTE! By default the hbac rule allow_all is installed. This means that every FreeIPA user can log in on every machine that has a FreeIPA Client component installed.

The fully kerberized HDP setup requires three things: the FreeIPA Server component, the FreeIPA Client components, and hadoop kerberization. Once the server and clients are installed and your cluster is working properly, the hadoop kerberization can be enabled through the Ambari interface. For KAVE version 2.X there is a complete tutorial for this as a youtube video below:

Kerberizing Cluster

Obsolete: for KAVE versions 1.X, the FreeIPA client was supposed to take care of generating the keytabs for you, but this did not work in most cases. Most cases can be resolved by manually installing some keytabs on specific machines, as in the troubleshooting section below.

Fixing problems with keytabs

As stated earlier, it is possible that some keytabs aren't generated correctly during the installation process (1.X). If so, see: https://github.com/KaveIO/AmbariKave/wiki/Detailed-Guides-Troubleshooting#fixing-problems-with-freeipa-keytabs

Recommended setup

  • We recommend using all aspects of FreeIPA together: DNS, LDAP, Kerberos, NTP, sssd. However, this can cause problems for installation within certain networks.
  • Use groups wisely to configure user privileges for different services
  • Grant sudo rights to at least one member of each data science team in order that they can be productive in their work.

See the example blueprints above for example initial configurations of FreeIPA

Configuration of initial users/passwords with blueprints or web interface

We can set several things during installation with FreeIPA including:

  • Initial users and initial passwords (must be changed on first login)
  • Initial groups for those users (e.g. the hadoop group or the admins group)
  • Initial sudo rules: sudoers and their rights can be configured during install

Everything so configured can be edited later with the FreeIPA admin web interface or command-line tools. It is a very good idea to create at least one initial admin user in both the admins and hadoop groups, with sudoer access.

To understand how to configure these parameters, the Ambari web interface will guide you through the installation process; or, if using a blueprint, take a look at the blueprint examples above, which should be self-explanatory. If in doubt, read the FreeIPA configuration xml, which details the format and use of each parameter.

Lists and/or json-formatted dictionaries are sent to the FreeIPA installer script, and from there, the rest is configured.
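
As a purely hypothetical illustration of the kind of json-formatted values involved (the real parameter names and format are defined in the FreeIPA configuration xml, not here):

# hypothetical parameter names, for illustration only
initial_users='[{"username": "anna", "firstname": "Anna", "lastname": "Admin"}]'
initial_groups='{"admins": ["anna"], "hadoop": ["anna"]}'
initial_sudoers='["anna"]'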

Management on the command line

As a user with admin privileges, first call 'kinit' to authenticate those admin privileges; you can then perform FreeIPA user admin tasks on the command line. See "ipa help" for more details, or read the ipa documentation online.
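
For instance, creating a user and adding her to the hadoop group could look like this (the user and group names are examples; the ipa subcommands are standard FreeIPA):

kinit admin                                    # authenticate with admin privileges first
ipa user-add anna --first=Anna --last=Analyst  # create a new user
ipa group-add-member hadoop --users=anna       # add her to the hadoop group
ipa passwd anna                                # set an initial password (changed on first login)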

Management with the web interface

The FreeIPA web interface is a bit complex at first, but once you are used to it, it gives a very powerful way to interact with FreeIPA and configure all sorts of user and host features/policies. Connecting to the FreeIPA web interface from within your KAVE is easy: if your KAVE has KaveLanding, look for the user admin link. If your KAVE does not have a gateway or KaveLanding, you will need to navigate to port 80 on the machine where FreeIPA is installed.

Caveats:

  • SSL: the web interface uses a certificate for hard SSL encryption. In order to connect, you need to accept this certificate into your browser's trust chain.
  • DNS-redirect: the FreeIPA web interface redirects to the dns-name of the machine. In order to connect, the FQDN of the machine where FreeIPA is running must therefore be known to the machine you are connecting from. This is no problem when connecting from within the KAVE or within the same network, and work-arounds are possible within your machine's hostsfile (see the example below).
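
For example, such a hostsfile work-around could look like this, where the IP and hostname are hypothetical stand-ins for your own ipa node:

echo "10.0.0.12  ipa001.kave.org" >> /etc/hosts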
