Add optional support for Azure ADLS Gen2 #304

Merged
merged 2 commits into from
Jan 8, 2020
3 changes: 3 additions & 0 deletions README.md
@@ -164,6 +164,9 @@ Under the `azure` section, edit following values as per your configuration
* `numnodes` to change the cluster size in terms of number of nodes deployed
* `vm_sku` to specify the VM size to use. You can choose from the
[available VM sizes](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes-general).
* `use_adlsg2` to use Azure Data Lake Storage (ADLS) Gen2 as the datastore for Accumulo
[ADLS Gen2 Doc](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction).
[Setup ADLS Gen2 as datastore for Accumulo](https://accumulo.apache.org/blog/2019/10/15/accumulo-adlsgen2-notes.html).

Within Azure the `nodes` section is auto populated with the hostnames and their default roles.

10 changes: 10 additions & 0 deletions ansible/accumulo.yml
@@ -27,6 +27,16 @@
- import_tasks: roles/accumulo/tasks/init-accumulo.yml
handlers:
- import_tasks: roles/accumulo/handlers/init-accumulo.yml
- hosts: all:!{{ azure_proxy_host }}
tasks:
- import_tasks: roles/accumulo/tasks/add-adlsgen2.yml
when: accumulo_major_version == '2' and use_adlsg2 == True
- hosts: accumulomaster[0]
tasks:
- import_tasks: roles/accumulo/tasks/init-adlsgen2.yml
when: accumulo_major_version == '2' and use_adlsg2 == True
handlers:
- import_tasks: roles/accumulo/handlers/init-adlsgen2.yml
- hosts: accumulo
tasks:
- name: "start accumulo 1.0"
19 changes: 19 additions & 0 deletions ansible/roles/accumulo/handlers/init-adlsgen2.yml
@@ -0,0 +1,19 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

- name: "Initialize Apache Accumulo on ADLS Gen2 volume"
command: "{{ accumulo_home }}/bin/accumulo init --add-volumes"
21 changes: 21 additions & 0 deletions ansible/roles/accumulo/tasks/add-adlsgen2.yml
@@ -0,0 +1,21 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
- name: Add ADLS Gen2 volume
lineinfile:
path: "{{ accumulo_home }}/conf/accumulo.properties"
regexp: '^instance.volumes='
line: "instance.volumes={{ hdfs_root }}/accumulo,{{ instance_volumes_preferred }}"
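For readers less familiar with `lineinfile`, the task above behaves roughly like the following Python sketch: the last line matching `regexp` is replaced with `line`, and `line` is appended when nothing matches. The function name `set_instance_volumes` and the sample values are hypothetical and only illustrate the behavior; this is not Ansible's implementation.

```python
import re


def set_instance_volumes(properties_text, hdfs_root, preferred_volumes):
    """Mimic the lineinfile task: replace the last matching line, else append."""
    new_line = f"instance.volumes={hdfs_root}/accumulo,{preferred_volumes}"
    # Same regexp as the task; the "." is unescaped there, so it matches any character.
    pattern = re.compile(r"^instance.volumes=")
    lines = properties_text.splitlines()
    matches = [i for i, text in enumerate(lines) if pattern.search(text)]
    if matches:
        lines[matches[-1]] = new_line  # replace the last match in place
    else:
        lines.append(new_line)  # no match: append at end of file
    return "\n".join(lines) + "\n"
```

Because the regexp anchors on the property name, re-running the task rewrites the same line rather than adding a second `instance.volumes` entry.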
22 changes: 22 additions & 0 deletions ansible/roles/accumulo/tasks/init-adlsgen2.yml
@@ -0,0 +1,22 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
- name: "determine if accumulo needs to be initialized on adlsgen2"
command: "{{ hadoop_home }}/bin/hdfs dfs -stat {{ instance_volumes_preferred[0] }}"
register: adlsgen2_stat
changed_when: adlsgen2_stat.rc != 0
failed_when: adlsgen2_stat.rc != 0 and 'No such file or directory' not in adlsgen2_stat.stderr
notify: Initialize Apache Accumulo on ADLS Gen2 volume
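The `changed_when` / `failed_when` pair above encodes a small decision table. A minimal Python sketch of that logic, assuming `rc` and `stderr` are the return code and standard error of the `hdfs dfs -stat` command (the function name `classify_stat` is hypothetical):

```python
def classify_stat(rc, stderr):
    """Return (changed, failed) the way the task's changed_when/failed_when do."""
    changed = rc != 0
    failed = rc != 0 and "No such file or directory" not in stderr
    return changed, failed
```

A zero return code yields `(False, False)`, so the handler is not notified. A non-zero code whose stderr contains `No such file or directory` yields `(True, False)`, which notifies the handler that runs `accumulo init --add-volumes`. Any other non-zero result yields `(True, True)` and the play fails.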
7 changes: 7 additions & 0 deletions ansible/roles/accumulo/templates/accumulo-env.sh
@@ -41,6 +41,10 @@ export HADOOP_HOME={{ hadoop_home }}
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"

CLASSPATH="${conf}:${lib}/*:${HADOOP_CONF_DIR}:${ZOOKEEPER_HOME}/*:${HADOOP_HOME}/share/hadoop/client/*"
{% if use_adlsg2 == True %}
CLASSPATH="${CLASSPATH}:${HADOOP_HOME}/share/hadoop/tools/lib/*"
CLASSPATH="${CLASSPATH}:${HADOOP_HOME}/share/hadoop/common/lib/*"
{% endif %}
export CLASSPATH

JAVA_OPTS=("${ACCUMULO_JAVA_OPTS[@]}"
@@ -50,6 +54,9 @@ JAVA_OPTS=("${ACCUMULO_JAVA_OPTS[@]}"
'-XX:OnOutOfMemoryError=kill -9 %p'
'-XX:-OmitStackTraceInFastThrow'
'-Djava.net.preferIPv4Stack=true'
{% if use_adlsg2 == True %}
'-Dorg.wildfly.openssl.path=/usr/lib64'
{% endif %}
"-Daccumulo.native.lib.path=${lib}/native")

case "$cmd" in
6 changes: 6 additions & 0 deletions ansible/roles/accumulo/templates/accumulo.properties
@@ -42,3 +42,9 @@ tserver.server.threads.minimum=64

## The maximum size for each write-ahead log
tserver.walog.max.size=512M

{% if use_adlsg2 == True %}
general.volume.chooser=org.apache.accumulo.server.fs.PreferredVolumeChooser
general.custom.volume.preferred.default={{ instance_volumes_preferred }}
general.custom.volume.preferred.logger={{ hdfs_root }}/accumulo
{% endif %}
235 changes: 235 additions & 0 deletions ansible/roles/azure/tasks/create_adlsgen2.yml
@@ -0,0 +1,235 @@
---
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
#
# These Ansible tasks only run on the client machine where Muchos runs
# At a high level, the various sections in this file do the following:
# 1. Create an Azure ADLS Gen2 storage account.
# 2. Create User Assigned Identity.
# 3. Assign roles to storage accounts.
# 4. Create filesystem/container in storage accounts.
# 5. Update tenant_id, client_id and instance_volumes_preferred in muchos.props.
# 6. Assign User Assigned Identity to VMSS.

- name: Generate MD5 checksum based on resource_group, vmss_name and location
shell: echo -n {{ resource_group + vmss_name + location }}|md5sum|tr -cd "[:alnum:]"|cut -c 1-16|tr '[:upper:]' '[:lower:]'
register: StorageAccountMD5

- name: Generate random names for storage account names
set_fact:
StorageAccountName: "{{ StorageAccountMD5.stdout + 99|random(seed=resource_group)|string + 99|random(seed=vmss_name)|string + 9|random(seed=location)|string }}"
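As a cross-check of the shell pipeline above, the 16-character prefix can be reproduced in Python. This is a sketch under the assumption that the three values contain no characters the shell would alter; the function name `storage_account_prefix` is hypothetical. The seeded `random` digits appended by the `set_fact` task are deliberately left out, since this sketch does not attempt to reproduce the output of Ansible's `random(seed=...)` filter.

```python
import hashlib


def storage_account_prefix(resource_group, vmss_name, location):
    """First 16 hex characters of the MD5 of the concatenated inputs."""
    joined = resource_group + vmss_name + location
    digest = hashlib.md5(joined.encode("utf-8")).hexdigest()
    # md5sum prints lowercase hex; the final tr in the pipeline is a safeguard.
    return digest[:16].lower()
```

The prefix is deterministic for a given resource group, VMSS name, and location, so repeated runs target the same storage accounts. With at most five random digits and a sequence number appended, the name stays within Azure's limit of 3 to 24 lowercase alphanumeric characters for small sequence numbers.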

- name: Initialize instance variables
set_fact:
InstanceVolumesAuto: []
InstanceVolumesManual: []

- name: Validate instance_volumes_input
fail: msg="Variable instance_volumes_input incorrectly specified, Both Manual and Auto cannot be specified at same time"
when: instance_volumes_input.split('|')[0].split(',') != [''] and instance_volumes_input.split('|')[1].split(',') != ['']

- name: Assign manual or autogenerated volumes
set_fact:
InstanceVolumesTemp: "{{ instance_volumes_input.split('|')[0].split(',')|list if instance_volumes_input.split('|')[0].split(',') != [''] else instance_volumes_input.split('|')[1].split(',')|list }}"

- name: Retrieve sequence end number to get the number of storage accounts
set_fact:
InstanceVolumesEndSequence: "{{ '1' if instance_volumes_input.split('|')[0].split(',') == [''] else InstanceVolumesTemp[0]|int }}"

- name: Generate names for Storage Accounts
set_fact:
InstanceVolumesAuto: "{{ InstanceVolumesAuto + ['abfss://'+'accumulodata'+'@'+StorageAccountName+item+'.'+InstanceVolumesTemp[1]+'/accumulo'] }}"
with_sequence: start=1 end={{ InstanceVolumesEndSequence|int }}
when: InstanceVolumesTemp[0]|int != 0

- name: Retrieve ABFSS values when specified manually
set_fact:
InstanceVolumesManual: "{{ InstanceVolumesManual + [ item ] }}"
loop:
"{{ InstanceVolumesTemp }}"
when: item.split('://')[0] == 'abfss' and instance_volumes_input.split('|')[0].split(',') == ['']

# This is the final list of instance volumes
- name: Assign variables for autogeneration or manual for storage account creation
set_fact:
InstanceVolumes: "{{ InstanceVolumesManual if instance_volumes_input.split('|')[0].split(',') == [''] else InstanceVolumesAuto }}"
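The chain of `set_fact` tasks above is easier to follow as one function. The Python sketch below mirrors the same splits, assuming `instance_volumes_input` has the form `<count>,<endpoint suffix>|<comma-separated abfss URIs>` with exactly one side filled in, as the validation task implies. The names `jinja_int` and `resolve_instance_volumes`, and the sample inputs in the note that follows, are illustrative assumptions.

```python
def jinja_int(value):
    """Approximate Jinja2's `int` filter: fall back to 0 on non-numeric input."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


def resolve_instance_volumes(instance_volumes_input, storage_account_name):
    """Return the final volume list for an 'auto|manual' input string."""
    parts = instance_volumes_input.split("|")
    auto_part = parts[0].split(",")
    manual_part = parts[1].split(",")
    if auto_part != [""] and manual_part != [""]:
        raise ValueError("manual and auto volumes cannot both be specified")
    if auto_part == [""]:
        # Manual mode: keep only entries that use the abfss scheme.
        return [v for v in manual_part if v.split("://")[0] == "abfss"]
    # Auto mode: "<count>,<suffix>" expands to numbered storage accounts.
    count, suffix = jinja_int(auto_part[0]), auto_part[1]
    return [
        f"abfss://accumulodata@{storage_account_name}{n}.{suffix}/accumulo"
        for n in range(1, count + 1)
    ]
```

For example, `resolve_instance_volumes("2,dfs.core.windows.net|", "acct")` produces two URIs for accounts `acct1` and `acct2`, while an input that starts with `|` passes the listed `abfss://` URIs through unchanged.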

- name: Update instance_volumes_preferred in muchos.props
lineinfile:
path: "{{ deploy_path }}/conf/muchos.props"
regexp: '^instance_volumes_preferred\s*=\s*|^[#]instance_volumes_preferred\s*=\s*'
line: "instance_volumes_preferred = {{ InstanceVolumes|join(',') }}"

# Not registering variable because storage values are not visible immediately
- name: Create ADLS Gen2 storage account using REST API
azure_rm_resource:
resource_group: "{{ resource_group }}"
provider: Storage
resource_type: storageAccounts
resource_name: "{{ item.split('@')[1].split('.')[0] }}"
api_version: '2019-04-01'
idempotency: yes
state: present
body:
sku:
name: "{{ adls_storage_type }}"
kind: StorageV2
properties:
isHnsEnabled: yes
location: "{{ location }}"
loop:
"{{ InstanceVolumes }}"

# Creating User Assigned identity with vmss_name suffixed by ua-msi if not specified in muchos.props
# Not registering variable because user identity values are not visible immediately
- name: Create User Assigned Identity
azure_rm_resource:
resource_group: "{{ resource_group }}"
provider: ManagedIdentity
resource_type: userAssignedIdentities
resource_name: "{{ user_assigned_identity if user_assigned_identity !='' else vmss_name + '-ua-msi' }}"
api_version: '2018-11-30'
idempotency: yes
state: present
body:
location: "{{ location }}"

# Retrieving facts about User Assigned Identity
- name: Get facts for User Assigned Identity
azure_rm_resource_facts:
resource_group: "{{ resource_group }}"
provider: ManagedIdentity
resource_type: userAssignedIdentities
resource_name: "{{ user_assigned_identity if user_assigned_identity !='' else vmss_name + '-ua-msi' }}"
api_version: '2018-11-30'
register: UserAssignedIdentityInfo
retries: 20
delay: 15
until: UserAssignedIdentityInfo.response|map(attribute='properties')|map(attribute='principalId')|join('') is defined

- name: Update principal_id in muchos.props
lineinfile:
path: "{{ deploy_path }}/conf/muchos.props"
regexp: '^principal_id\s*=\s*|^[#]principal_id\s*=\s*'
line: "principal_id = {{ UserAssignedIdentityInfo.response|map(attribute='properties')|map(attribute='principalId')|join('') }}"

# This will be used to assign the MSI for VMSS
- name: Format User Assigned Identity for API
set_fact:
UserAssignedIdentityArr: "{{ UserAssignedIdentityInfo.response|default({})|map(attribute='id')|map('regex_replace','^(.*)$','{\"\\1\":{}}')|list}}"
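The `regex_replace` step above wraps each identity resource ID in a small JSON object so it can later be joined and passed as `userAssignedIdentities`. A Python sketch of that transformation follows; the function name `identity_payload` and the placeholder resource ID used in the note are hypothetical.

```python
import json
import re


def identity_payload(identity_ids):
    """Wrap each identity resource ID as '{"<id>":{}}', then join the pieces."""
    wrapped = [re.sub(r"^(.*)$", r'{"\1":{}}', rid) for rid in identity_ids]
    return "".join(wrapped)


def identity_mapping(identity_ids):
    """Parse the joined payload; this only works for a single identity."""
    return json.loads(identity_payload(identity_ids))
```

With one ID such as `/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<name>`, the result parses to a mapping from that ID to an empty object. With more than one ID the joined string would no longer be a single JSON object, so this sketch suggests the approach relies on exactly one user assigned identity.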

# Retrieve facts about role assignment
- name: Get role definition id for "Storage Blob Data Contributor"
azure_rm_resource_facts:
resource_group: "{{ resource_group }}"
provider: Authorization
resource_type: roleDefinitions
resource_name: ba92f5b4-2d11-453d-a403-e96b0029c9fe
api_version: '2015-07-01'
register: RoleDefinitionInfo

# Retrieve storage account information.
- name: Check if the storage accounts are visible
azure_rm_storageaccount_facts:
resource_group: "{{ resource_group }}"
name: "{{ item.split('@')[1].split('.')[0] }}"
register: StorageAccountsInfo
retries: 20
delay: 15
until: StorageAccountsInfo.storageaccounts|sum(start=[])|map(attribute='id')|join('') is defined
loop:
"{{ InstanceVolumes }}"

# Retrieve the ids of the storage accounts created -- used for role assignments
- name: Get the id of storage accounts created
set_fact:
StorageAccountsId: "{{StorageAccountsInfo.results|map(attribute='ansible_facts')|map(attribute='azure_storageaccounts')|sum(start=[])|map(attribute='id')|list|unique }}"

# Adding this module since role assignment fails if it already exists.
- name: Get facts about role assignment
azure_rm_roleassignment_facts:
scope: "{{ item }}"
assignee: "{{ UserAssignedIdentityInfo.response|map(attribute='properties')|map(attribute='principalId')|list|join('') }}"
role_definition_id: "{{ RoleDefinitionInfo.response|map(attribute='id')|list|join('') }}"
register: RoleAssignmentResults
retries: 20
delay: 15
until: UserAssignedIdentityInfo.response|map(attribute='properties')|map(attribute='principalId')|join('') is defined and RoleDefinitionInfo.response|map(attribute='id')|join('') is defined
loop:
"{{ StorageAccountsId }}"

- name: Set fact for getting storage accounts that have assigned roles
set_fact:
StorageAccountRoles: "{{ item|map(attribute='scope')|list|unique }}"
no_log: True
loop:
"{{RoleAssignmentResults.results|map(attribute='roleassignments')|list }}"

# This retry logic is needed due to a race condition between storage account creation completing and role assignment
- name: Create a role assignment
azure_rm_roleassignment:
scope: "{{ item }}"
assignee_object_id: "{{ UserAssignedIdentityInfo.response|map(attribute='properties')|map(attribute='principalId')|list|join('') }}"
role_definition_id: "{{ RoleDefinitionInfo.response|map(attribute='id')|list|join('') }}"
state: present
retries: 30
delay: 15
register: roleassignresult
until: roleassignresult is succeeded
loop:
"{{ StorageAccountsId }}"
when: item not in StorageAccountRoles

# This retry logic is needed due to a race condition between storage account creation and filesystem creation
- name: Create container/Filesystem on ADLS Gen2
azure_rm_storageblob:
resource_group: "{{ resource_group }}"
storage_account_name: "{{ item.split('@')[1].split('.')[0] }}"
container: "{{ item.split('@')[0].split('://')[1] }}"
retries: 30
delay: 15
register: createfsresult
until: createfsresult is succeeded and (createfsresult.changed == False or (createfsresult.changed == True and createfsresult.container|length > 0))
loop:
"{{ InstanceVolumes }}"
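Several tasks in this file recover the container and storage account names from each volume URI with chained `split` calls. The Python sketch below mirrors those expressions, assuming URIs of the form `abfss://<container>@<account>.<suffix>/<path>`; the function name `split_abfss_uri` is hypothetical.

```python
def split_abfss_uri(uri):
    """Return (container, storage_account) using the same splits as the tasks."""
    container = uri.split("@")[0].split("://")[1]
    storage_account = uri.split("@")[1].split(".")[0]
    return container, storage_account
```

For the autogenerated volumes the container is always `accumulodata`, and the account is the generated name plus its sequence number.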

# Retrieve tenantId for core-site.xml
- name: Update tenantId in muchos.props
lineinfile:
path: "{{ deploy_path }}/conf/muchos.props"
regexp: '^azure_tenant_id\s*=\s*|^[#]azure_tenant_id\s*=\s*'
line: "azure_tenant_id = {{ UserAssignedIdentityInfo.response|map(attribute='properties')|map(attribute='tenantId')|list|join('') }}"

# Retrieve clientId for core-site.xml
- name: Update clientid in muchos.props
lineinfile:
path: "{{ deploy_path }}/conf/muchos.props"
regexp: '^azure_client_id\s*=\s*|^[#]azure_client_id\s*=\s*'
line: "azure_client_id = {{ UserAssignedIdentityInfo.response|map(attribute='properties')|map(attribute='clientId')|list|join('') }}"

- name: Assign User Assigned Identity to VMSS
azure_rm_resource:
resource_group: "{{ resource_group }}"
provider: Compute
resource_type: virtualMachineScaleSets
resource_name: "{{ vmss_name }}"
api_version: '2019-03-01'
body:
location: "{{ location }}"
identity:
type: UserAssigned
userAssignedIdentities: "{{ UserAssignedIdentityArr|join('') }}"
2 changes: 2 additions & 0 deletions ansible/roles/azure/tasks/main.yml
@@ -19,3 +19,5 @@

# tasks file for azure
- import_tasks: create_vmss.yml
- import_tasks: create_adlsgen2.yml
when: use_adlsg2 == True
8 changes: 8 additions & 0 deletions ansible/roles/hadoop-ha/tasks/main.yml
@@ -54,3 +54,11 @@
replace: "export HADOOP_LOG_DIR={{ worker_data_dirs[0] }}/logs/hadoop"
- name: "Create hadoop log dir"
file: path={{ worker_data_dirs[0] }}/logs/hadoop state=directory
- name: Insert HADOOP_OPTIONAL_TOOLS & HADOOP_OPTS in hadoop-env.sh
blockinfile:
path: "{{ hadoop_home }}/etc/hadoop/hadoop-env.sh"
insertafter: EOF
block: |
export HADOOP_OPTIONAL_TOOLS=hadoop-azure
export HADOOP_OPTS="-Dorg.wildfly.openssl.path=/usr/lib64 ${HADOOP_OPTS}"
when: hadoop_major_version == '3' and use_adlsg2 == True