
Commit

Merge pull request #33 from rug-cit-hpc/fix/only-add-to-group-when-group-exists

Fix/only add to group when group exists.
pneerincx authored Jan 22, 2019
2 parents c2a8a21 + 34329bc commit 1f5c2fb
Showing 57 changed files with 884 additions and 623 deletions.
47 changes: 42 additions & 5 deletions README.md
@@ -14,8 +14,19 @@ The main ingredients for (deploying) these clusters:
* [CentOS 7](https://www.centos.org/) as OS for the virtual machines.
* [Slurm](https://slurm.schedmd.com/) as workload/resource manager to orchestrate jobs.

#### Protected branches
#### Branches and Releases
The master and develop branches of this repo are protected; updates can only be merged into these branches using reviewed pull requests.
Every once in a while we create releases, which are versioned using the format ```YY.MM.v``` where:

* ```YY``` is the year of release
* ```MM``` is the month of release
* ```v``` is the sequence number of the release within that month and year; hence it is not the day of the month.

E.g. ```19.01.1``` is the first release in January 2019.

#### Code style and naming conventions

We follow the [Python PEP8 naming conventions](https://www.python.org/dev/peps/pep-0008/#naming-conventions) for variable names, function names, etc.

## Clusters

@@ -70,18 +81,30 @@ The steps below describe how to get from machines with a bare ubuntu 16.04 insta

---

0. Clone this repo.
```bash
mkdir -p ${HOME}/git/
cd ${HOME}/git/
git clone https://github.com/rug-cit-hpc/league-of-robots.git
```

1. First import the required roles into this playbook:

```bash
ansible-galaxy install -r requirements.yml --force -p roles
ansible-galaxy install -r galaxy-requirements.yml
```
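As a quick sanity check (illustrative, not taken from this diff), you can list the roles that were installed; the `-p roles` option below assumes the variant above that installs into `./roles`:
```bash
# List the roles installed from the requirements file(s).
ansible-galaxy list -p roles
```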

2. Generate an ansible vault password and put it in `.vault_pass.txt`. This could be done by running the following oneliner:

2. Create `.vault_pass.txt`.
* To generate a new Ansible vault password and put it in `.vault_pass.txt`, use the following oneliner:
```bash
tr -cd '[:alnum:]' < /dev/urandom | fold -w30 | head -n1 > .vault_pass.txt
```
* Or, to use an existing Ansible vault password, create `.vault_pass.txt` and add the password with a text editor.
Make sure `.vault_pass.txt` is private:
```bash
chmod go-rwx .vault_pass.txt
```

3. Configure Ansible settings including the vault.
* To create (a new) secrets.yml:
@@ -103,7 +126,21 @@ The steps below describe how to get from machines with a bare ubuntu 16.04 insta
remote_user = your_local_account_not_from_the_LDAP
```
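The elided part of this step covers creating `secrets.yml`; as an illustrative sketch only (the exact path is an assumption, not shown in this diff), an encrypted secrets file can be created and edited with `ansible-vault`, which picks up the password from `.vault_pass.txt` via the `vault_password_file` setting above:
```bash
# Create a new encrypted secrets file (path is an assumed example).
ansible-vault create group_vars/all/secrets.yml
# Edit it again later.
ansible-vault edit group_vars/all/secrets.yml
```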

4. Running playbooks. Some examples:
4. Build the Prometheus Node Exporter.
* Make sure you are a member of the `docker` group.
Otherwise you will get this error:
```
ERRO[0000] failed to dial gRPC: cannot connect to the Docker daemon.
Is 'docker daemon' running on this host?: dial unix /var/run/docker.sock: connect:
permission denied
context canceled
```
* Execute:
```bash
cd promtools
./build.sh
```
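As a rough smoke test after the build (an assumption-laden sketch: the exact output location of `build.sh` is not shown in this diff, and 9100 is node_exporter's default port), you could start the freshly built exporter and scrape its metrics endpoint:
```bash
# Adjust the path to wherever build.sh places the binary.
./node_exporter &
curl -s http://localhost:9100/metrics | head
```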

5. Running playbooks. Some examples:
* Install the OpenStack cluster.
```bash
ansible-playbook site.yml
@@ -113,7 +150,7 @@ The steps below describe how to get from machines with a bare ubuntu 16.04 insta
ansible-playbook site.yml -i talos_hosts slurm.yml
```
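Both invocations accept the usual `ansible-playbook` options; for example, a dry run that shows what would change (standard Ansible flags, not specific to this repo):
```bash
# Preview pending changes without applying them.
ansible-playbook site.yml -i talos_hosts slurm.yml --check --diff
```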

5. verify operation.
6. Verify operation.
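A hedged sketch of what verification could look like from the User Interface node, using standard Slurm commands and assuming the Slurm client tools are on the `PATH`:
```bash
sinfo                       # list partitions and node states
srun --ntasks=1 hostname    # run a trivial test job on a compute node
```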

#### Steps to upgrade the OpenStack cluster.

4 changes: 4 additions & 0 deletions ansible.cfg
@@ -1,3 +1,7 @@
[defaults]
stdout_callback = debug
vault_password_file = .vault_pass.txt

[ssh_connection]
pipelining = True
ssh_args = -C -o ControlMaster=auto -o ControlPersist=60s -o ForwardAgent=yes
24 changes: 13 additions & 11 deletions cluster.yml
@@ -9,10 +9,21 @@
- node_exporter
- cluster

- name: Install ansible on admin interfaces (DAI & SAI).
hosts:
- imperator
- sugarsnax
become: True
tasks:
- name: install Ansible
yum:
name: ansible-2.6.6-1.el7.umcg

- name: Install roles needed for jumphosts.
hosts: jumphost
become: true
roles:
- geerlingguy.repo-epel
- ldap
- cluster
- geerlingguy.security
@@ -43,8 +54,8 @@
- datahandling
- slurm-client

- name: Install user interface
hosts: interface
- name: Install User Interface (UI)
hosts: user-interface
become: true
tasks:
roles:
@@ -54,15 +65,6 @@
- isilon
- slurm-client

- name: Install ansible on admin interfaces (DAI & SAI).
hosts:
- imperator
- sugarsnax
become: True
tasks:
- name: install Ansible
yum:
name: ansible-2.6.6-1.el7.umcg

- name: export /home
hosts: user-interface:&talos-cluster
3 changes: 2 additions & 1 deletion galaxy-requirements.yml
@@ -1,6 +1,7 @@
---
- src: chrisgavin.ansible-ssh-host-signer
- src: geerlingguy.firewall
version: 2.4.0
- src: geerlingguy.postfix
- src: chrisgavin.ansible-ssh-host-signer
- src: geerlingguy.repo-epel
- src: geerlingguy.security
File renamed without changes.
4 changes: 3 additions & 1 deletion group_vars/all/vars.yml
@@ -1,3 +1,5 @@
---
admin_ranges: "129.125.249.0/24,172.23.40.1/24"
ssh_host_signer_hostnames: "{{ ansible_fqdn }},{{ ansible_hostname }},airlock+{{ ansible_hostname }}"
ssh_host_signer_hostnames: "{{ ansible_fqdn }},{{ ansible_hostname }},{% for host in groups['jumphost'] %}{{ host }}+{{ ansible_hostname }}{% endfor %}"
spacewalk_server_url: 'http://spacewalk.hpc.rug.nl/XMLRPC'
...
50 changes: 24 additions & 26 deletions group_vars/gearshift/vars.yml
@@ -1,27 +1,25 @@
---
slurm_cluster_name: gearshift
sockets: 2
CoresPerSocket: 14
ThreadsPerCore: 1
RealMemory: 240000
Feature: centos7
nodes: |
#
# Partitions
#
#
# Configure MaxNodes=1 for all nodes of all partitions,
# because we hardly use MPI, and when we do, it never spans nodes,
# but uses at most the number of cores of a single node.
# Therefore we don't have fast network interconnects between nodes.
# (We use the fast network interconnects only between nodes and the large shared storage devices.)
#
EnforcePartLimits=YES
PartitionName=DEFAULT State=UP DefMemPerCPU=2048 MaxNodes=1 MaxTime=7-00:00:01
PartitionName=regular Default=YES Nodes=gs-vcompute[01-11] MaxNodes=1 MaxCPUsPerNode=26 MaxMemPerNode=235520 TRESBillingWeights="CPU=1.0,Mem=0.125G" DenyQos=ds-short,ds-medium,ds-long
PartitionName=ds Default=No Nodes=gearshift MaxNodes=1 MaxCPUsPerNode=1 MaxMemPerNode=1024 TRESBillingWeights="CPU=1.0,Mem=1.0G" AllowQos=ds-short,ds-medium,ds-long
#
# COMPUTE NODES
#
NodeName=gs-vcompute[01-11] Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 State=UNKNOWN RealMemory=240000 TmpDisk=1063742 Feature=tmp01
NodeName=gearshift Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN RealMemory=387557 TmpDisk=0 Feature=tmp01,prm01
slurm_cluster_name: 'gearshift'
slurm_cluster_domain: 'hpc.rug.nl'
stack_prefix: 'gs'
vcompute_hostnames: "{{ stack_prefix }}-vcompute[01-11]"
vcompute_sockets: 2
vcompute_cores_per_socket: 14
vcompute_real_memory: 245760
vcompute_max_cpus_per_node: "{{ vcompute_sockets * vcompute_cores_per_socket - 2 }}"
vcompute_max_mem_per_node: "{{ vcompute_real_memory - vcompute_sockets * vcompute_cores_per_socket * 512 }}"
vcompute_local_disk: 2900
vcompute_features: 'tmp01'
ui_hostnames: "{{ slurm_cluster_name }}"
ui_sockets: 2
ui_cores_per_socket: 2
ui_real_memory: 8192
ui_local_disk: 0
ui_features: 'prm01,tmp01'
uri_ldap: 172.23.40.249
uri_ldaps: comanage-in.id.rug.nl
ldap_port: 389
ldaps_port: 636
ldap_base: ou=umcg,o=asds
ldap_binddn: cn=clusteradminumcg,o=asds
...
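
As a worked check of the Jinja arithmetic above (an illustrative calculation, not part of the original files): with 2 sockets × 14 cores per socket and 245760 MB of real memory, the templates evaluate to 26 usable CPUs and 231424 MB per compute node.
```bash
# vcompute_max_cpus_per_node = sockets * cores_per_socket - 2
echo $(( 2 * 14 - 2 ))             # 26
# vcompute_max_mem_per_node = real_memory - sockets * cores_per_socket * 512
echo $(( 245760 - 2 * 14 * 512 ))  # 231424
```
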
44 changes: 18 additions & 26 deletions group_vars/hyperchicken/vars.yml
@@ -1,6 +1,21 @@
---
slurm_cluster_name: hyperchicken
stack_prefix: hc
slurm_cluster_name: 'hyperchicken'
#slurm_cluster_domain: ''
stack_prefix: 'hc'
vcompute_hostnames: "{{ stack_prefix }}-vcompute[01-05]"
vcompute_sockets: 1
vcompute_cores_per_socket: 9
vcompute_real_memory: 20000
vcompute_max_cpus_per_node: "{{ vcompute_sockets * vcompute_cores_per_socket - 2 }}"
vcompute_max_mem_per_node: "{{ vcompute_real_memory - vcompute_sockets * vcompute_cores_per_socket * 512 }}"
vcompute_local_disk: 0
vcompute_features: 'tmp07'
ui_hostnames: "{{ slurm_cluster_name }}"
ui_sockets: 1
ui_cores_per_socket: 1
ui_real_memory: 3000
ui_local_disk: 0
ui_features: 'prm07,tmp07'
key_name: Gerben
image_cirros: cirros-0.3.4-x86_64-disk.img
image_centos7: centos7
@@ -13,30 +28,7 @@ private_subnet_id: Solve-RD_subnet
private_storage_net_id: net_provider_vlan3126
private_storage_subnet_id: subnet3126
security_group_id: SSH-and-ping-2
server_url: 'http://spacewalk.hpc.rug.nl/XMLRPC'
slurm_ldap: false
availability_zone: AZ_1
local_volume_size: 1
sockets: 1
CoresPerSocket: 9
ThreadsPerCore: 1
RealMemory: 20000
Feature: centos7
nodes: |
#
# Partitions
#
EnforcePartLimits=YES
PartitionName=DEFAULT State=UP DefMemPerCPU=2000
PartitionName=hc-ds Nodes=hc-headnode MaxTime=10-00:00:00 DefaultTime=00:30:00 DenyQos=regular,regularlong SelectTypeParameters=CR_Core_Memory TRESBillingWeights="CPU=1.0,Mem=0.1875G" Default=NO
PartitionName=regular Nodes=hc-vcompute01,hc-vcompute02,hc-vcompute03,hc-vcompute04 MaxTime=10-00:00:00 DefaultTime=00:30:00 AllowQOS=regular,regularlong SelectTypeParameters=CR_Core_Memory TRESBillingWeights="CPU=1.0,Mem=0.1875G" Default=YES
#
# COMPUTE NODES
#
GresTypes=gpu
NodeName=hc-headnode Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=IDLE RealMemory=3000 Feature=centos7
NodeName=hc-vcompute01 Sockets=1 CoresPerSocket=9 ThreadsPerCore=1 State=IDLE RealMemory=20000 Feature=centos7
NodeName=hc-vcompute02 Sockets=1 CoresPerSocket=9 ThreadsPerCore=1 State=IDLE RealMemory=20000 Feature=centos7
NodeName=hc-vcompute03 Sockets=1 CoresPerSocket=9 ThreadsPerCore=1 State=IDLE RealMemory=20000 Feature=centos7
NodeName=hc-vcompute04 Sockets=1 CoresPerSocket=9 ThreadsPerCore=1 State=IDLE RealMemory=20000 Feature=centos7
...
3 changes: 2 additions & 1 deletion host_vars/airlock.hpc.rug.nl.yml → group_vars/jumphost.yml
@@ -4,4 +4,5 @@ firewall_allowed_tcp_ports:
- "80"
firewall_additional_rules:
- "iptables -t nat -A PREROUTING -i eth1 -p tcp --dport 80 -j REDIRECT --to-port 22"
ssh_host_signer_hostnames: "airlock.hpc.rug.nl,{{ ansible_hostname }}"
ssh_host_signer_hostnames: "{{ ansible_hostname }}.{{ slurm_cluster_domain }},{{ ansible_hostname }}"
...
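
The `PREROUTING` rule above redirects incoming TCP port 80 on the jumphost to the SSH daemon, so a client stuck behind a firewall that only allows web ports could, hypothetically, reach it like this (the account name is a placeholder, and this assumes `airlock.hpc.rug.nl` resolves to the address on `eth1`):
```bash
# Port 80 is redirected to sshd by the iptables rule above.
ssh -p 80 your_account@airlock.hpc.rug.nl
```
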
43 changes: 18 additions & 25 deletions group_vars/talos/vars.yml
@@ -1,26 +1,19 @@
---
slurm_cluster_name: talos
sockets: 1
CoresPerSocket: 9
ThreadsPerCore: 1
RealMemory: 20000
Feature: centos7
nodes: |
#
# Partitions
#
EnforcePartLimits=YES
PartitionName=DEFAULT State=UP DefMemPerCPU=2000
PartitionName=tl-ds Nodes=talos MaxTime=10-00:00:00 DefaultTime=00:30:00 DenyQos=regular,regularlong SelectTypeParameters=CR_Core_Memory TRESBillingWeights="CPU=1.0,Mem=0.1875G" Default=NO
PartitionName=regular Nodes=tl-vcompute01,tl-vcompute02,tl-vcompute03 MaxTime=10-00:00:00 DefaultTime=00:30:00 AllowQOS=regular,regularlong SelectTypeParameters=CR_Core_Memory TRESBillingWeights="CPU=1.0,Mem=0.1875G" Default=YES
#
# COMPUTE NODES
#
GresTypes=gpu
NodeName=talos Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=IDLE RealMemory=3000 Feature=centos7
NodeName=tl-vcompute01 Sockets=1 CoresPerSocket=9 ThreadsPerCore=1 State=IDLE RealMemory=20000 Feature=centos7
NodeName=tl-vcompute02 Sockets=1 CoresPerSocket=9 ThreadsPerCore=1 State=IDLE RealMemory=20000 Feature=centos7
NodeName=tl-vcompute03 Sockets=1 CoresPerSocket=9 ThreadsPerCore=1 State=IDLE RealMemory=20000 Feature=centos7
slurm_cluster_name: 'talos'
slurm_cluster_domain: 'hpc.rug.nl'
stack_prefix: 'tl'
vcompute_hostnames: "{{ stack_prefix }}-vcompute[01-03]"
vcompute_sockets: 2
vcompute_cores_per_socket: 2
vcompute_real_memory: 8192
vcompute_max_cpus_per_node: "{{ vcompute_sockets * vcompute_cores_per_socket - 2 }}"
vcompute_max_mem_per_node: "{{ vcompute_real_memory - vcompute_sockets * vcompute_cores_per_socket * 512 }}"
vcompute_local_disk: 0
vcompute_features: 'tmp08'
ui_hostnames: "{{ slurm_cluster_name }}"
ui_sockets: 2
ui_cores_per_socket: 2
ui_real_memory: 8192
ui_local_disk: 0
ui_features: 'prm08,tmp08'
...
12 changes: 6 additions & 6 deletions hc-cluster.yml
Original file line number Diff line number Diff line change
@@ -16,7 +16,7 @@
- docker
- cluster
- node_exporter
- geerlingguy.security
# - geerlingguy.security
tasks:
- cron:
name: Reboot to load new kernel.
@@ -41,17 +41,17 @@
roles:
- compute-vm
# - isilon
- datahandling
# - datahandling
- slurm-client

- name: Install user interface
hosts: interface
- name: Install User Interface (UI)
hosts: user-interface
become: true
tasks:
roles:
- slurm_exporter
- user-interface
- datahandling
# - datahandling
# - isilon
- slurm-client

@@ -65,5 +65,5 @@
# yum:
# name: ansible-2.6.6-1.el7.umcg

- import_playbook: users.yml
- import_playbook: hc-users.yml
#- import_playbook: ssh-host-signer.yml
