You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The Ansible install fails and stops at following point:
RUNNING HANDLER [grafana : restart grafana] *************************************************************************************************
Thursday 15 June 2023 13:46:30 -0400 (0:00:04.459) 0:01:31.890 *********
fatal: [grafana]: FAILED! => {"changed": false, "msg": "Unable to restart service grafana-server: Job for grafana-server.service failed because the control process exited with error code. See \"systemctl status grafana-server.service\" and \"journalctl -xe\" for details.\n"}
The grafana.log has the following error:
logger=sqlstore t=2023-06-15T17:46:34.789927255Z level=error msg="failed to insert organization with provided id" org_id=1 err="UNIQUE constraint failed: org.id"
Deleting grafana.db on the Grafana VM will allow me to manually start the Grafana service but it fails when trying to re-run ./install.sh
manually created the playbooks/grafana.ok to ignore it and move on.
Steps to Reproduce the Problem
I started from a fresh install using the following config.yml file:
---
# yaml-language-server: $schema=config.schema.json
# name of the cluster
project_name: az-hop
# azure location name as returned by the command : az account list-locations -o table
location: eastus
# Name of the resource group to create all resources
resource_group: jm-azhop-eus-wkg-rg
# If using an existing resource group set to true. Default is false
# When using an existing resource group make sure the location match the one of the existing resource group
use_existing_rg: true
# If set to true, will disable telemetry for azhop. See https://azure.github.io/az-hop/deploy/telemetry.html.
#optout_telemetry: true
# A log analytics workspace can be created for monitoring or alerting
# Alternatively, you can use an existing workspace.
# To use an existing workspace set create to false and specify the resource group, name and subscription the target workspace lives in
log_analytics:
create: false
# An existing log analytics workspace can be used instead. The resource group, name and subscription id of the workspace will need to be specified.
# Grant the role "Log Analytics Contributor" on the target Log Analytics Workspace for the identity used to deploy az-hop
#resource_group:
#name:
#subscription_id: # Optional, if not specified the current subscription will be used
# Option to install the monitoring agent on static infra VMs. Can be disabled if the agent is installed by policy.
monitoring:
install_agent: false
#If set to true, it will create alert rules associated with az-hop. Enablement of alerting will require the specification of an admin email to send alerts to.
alerting:
enabled: false
admin_email: admin.mail@contoso.com
local_volume_threshold: 80
# Additional tags to be added on the Resource Group
tags:
env: dev
project: azhop
owner: jmorey
azhop_ver: 1.0.36
# Define an ANF account, single pool and volume
# If not present assume that there is an existing NFS share for the users home directory
#anf:
# create: false
# Size of the ANF pool and unique volume (min: 4TB, max: 100TB)
# homefs_size_tb: 4
# Service level of the ANF volume, can be: Standard, Premium, Ultra
# homefs_service_level: Standard
# dual protocol
# dual_protocol: false # true to enable SMB support. false by default
# If alerting is enabled, this value will be used to determine when to trigger alerts
# alert_threshold: 80 # alert when ANF volume reaches this threshold
# For small deployments you can use Azure Files instead of ANF for the home directory
azurefiles:
create: true
size_gb: 1024
# These mounts will be listed in the Files menu of the OnDemand portal and automatically mounted on all compute nodes and remote desktop nodes
mounts:
# mount settings for the user home directory
home: # This home name can't be changed
# type of mount : existing, anf or azurefiles, default to existing. One of the three should be defined in order to mount the home directory
# When using existing, the mountpoint, server, export and options should be defined, for other cases leave the values as defined with the curly braces
type: azurefiles
mountpoint: /sharedhome # /sharedhome for example
server: '{{anf_home_ip}}' # Specify an existing NFS server name or IP, when using the ANF built in use '{{anf_home_ip}}'
export: '{{anf_home_path}}' # Specify an existing NFS export directory, when using the ANF built in use '{{anf_home_path}}'
options: '{{anf_home_opts}}' # Specify the mount options. Default to rw,hard,rsize=262144,wsize=262144,vers=3,tcp
# mount1:
# mountpoint: /mount1
# server: a.b.c.d # Specify an existing NFS server name or IP
# export: myexport1 # Specify an existing NFS export name
# options: my_options # Specify the mount options.
# mount2:
# mountpoint: /mount2
# server: a.b.c.d # Specify an existing NFS server name or IP
# export: myexport2 # Specify an existing NFS export name
# options: my_options # Specify the mount options.
# name of the admin account
admin_user: hpcadmin
# List of identities (object ids) to grant read access to az-hop key vault (optional)
# Object ID to grant key vault read access
#key_vault_readers:
# Network
network:
# Create Network and Application Security Rules, true by default, false when using an existing VNET if not specified
create_nsg: false
vnet:
# name: jm-azhop-vnet # Optional - default to hpcvnet
id: /subscriptions/<sub>/resourceGroups/jm-azhop-rg-eus/providers/Microsoft.Network/virtualNetworks/hpcvnet # If a vnet id is set then no network will be created and the provided vnet will be used
# address_space: "10.201.0.0/16" # Optional - default to "10.0.0.0/16"
# Special VNET Tags
# tags:
# key1: value1
# When using an existing VNET, only the subnet names will be used and not the adress_prefixes
subnets: # all subnets are optionals
# name values can be used to rename the default to specific names, address_prefixes to change the IP ranges to be used
# All values below are the default values
frontend:
name: frontend
# address_prefixes: "10.201.0.0/24"
# create: true # create the subnet if true. default to true when not specified, default to false if using an existing VNET when not specified
admin:
name: admin
# address_prefixes: "10.201.1.0/24"
# create: true
netapp:
name: netapp
# address_prefixes: "10.201.2.0/24"
# create: true
ad:
name: ad
# address_prefixes: "10.201.3.0/28"
# create: true
# Bastion and Gateway subnets are optional and can be added if a Bastion or a VPN need to be created in the environment
# bastion: # Bastion subnet name is always fixed to AzureBastionSubnet
# address_prefixes: "10.0.4.0/27" # CIDR minimal range must be /27
# create: true
# gateway: # Gateway subnet name is always fixed to GatewaySubnet
# address_prefixes: "10.0.4.32/27" # Recommendation is to use /27 or /28 network
# create: true
compute:
name: compute
# address_prefixes: "10.201.16.0/20"
# create: true
# Specify the Application Security Groups mapping if already existing
asg:
resource_group: jm-azhop-rg-eus # name of the resource group containing the ASG. Default to the resource group containing azhop resources
names: # list of ASG names mapping to the one defined in az-hop
asg-ssh: asg-ssh
asg-rdp: asg-rdp
asg-jumpbox: asg-jumpbox
asg-ad: asg-ad
asg-ad-client: asg-ad-client
asg-lustre: asg-lustre
asg-lustre-client: asg-lustre-client
asg-pbs: asg-pbs
asg-pbs-client: asg-pbs-client
asg-cyclecloud: asg-cyclecloud
asg-cyclecloud-client: asg-cyclecloud-client
asg-nfs-client: asg-nfs-client
asg-telegraf: asg-telegraf
asg-grafana: asg-grafana
asg-robinhood: asg-robinhood
asg-ondemand: asg-ondemand
asg-deployer: asg-deployer
asg-guacamole: asg-guacamole
asg-mariadb-client: asg-mariadb-client
# peering: # This list is optional, and can be used to create a VNet Peering in the same subscription.
# - vnet_name: #"VNET Name to Peer to"
# vnet_resource_group: #"Resource Group of the VNET to peer to"
# vnet_allow_gateway: false # optional: allow gateway transit (default: true)
# Specify DNS forwarders available in the network
#dns:
# forwarders:
# - { name: jmazhop.eus.net, ips: "208.67.222.222, 208.67.222.220" }
# - { name: foo.com, ips: "10.2.0.4, 10.2.0.5" }
# When working in a locked down network, uncomment and fill out this section
locked_down_network:
enforce: false
# grant_access_from: [ "192.168.2.0/24","10.200.0.0/24" ] # Array of CIDR to grant access from, see https://docs.microsoft.com/en-us/azure/storage/common/storage-network-security?tabs=azure-portal#grant-access-from-an-internet-ip-range
public_ip: true # Enable public IP creation for Jumpbox, OnDemand and create images. Default to true
# Base image configuration. Can be either an image reference or an image_id from the image registry or a custom managed image
linux_base_image: "OpenLogic:CentOS:7_9-gen2:latest" # publisher:offer:sku:version or image_id
# linux image plan if required, format is publisher:product:name
#linux_base_plan: # linux image plan if required, format is publisher:product:name
windows_base_image: "MicrosoftWindowsServer:WindowsServer:2019-Datacenter-smalldisk:latest" # publisher:offer:sku:version or image_id
lustre_base_image: "azhpc:azurehpc-lustre:azurehpc-lustre-2_12:latest"
# The lustre plan to use. Only needed when using the default lustre image from the marketplace. use "::" for an empty plan
lustre_base_plan: "azhpc:azurehpc-lustre:azurehpc-lustre-2_12" # publisher:product:name
domain:
name: "jmorey.net"
domain_join_ou: "OU=azhop" # OU to set the machine in. Make sure the OU exists in the domain as it won't be created for you
use_existing_dc: true # Set to true if you want to join a domain with existing DC
domain_join_user:
username: themorey
password_key_vault_name: jm-gbb-eus-kv # name_for_the_key_vault_with_the_domain_join_password
password_key_vault_resource_group_name: jm-gbb-eus-kv # resource_group_name_for_the_key_vault_with_the_domain_join_password
password_key_vault_secret_name: 'themorey-password' # key_vault_secret_name_for_the_domain_join_password
# additional settings when using an existing DC
existing_dc_details:
domain_controller_names: ["jmorey-novo-ad"]
domain_controller_ip_addresses: ["10.100.0.31"]
private_dns_servers: ["10.100.0.31"]
# Jumpbox VM configuration
jumpbox:
vm_size: Standard_B2ms
ssh_port: 2222 # SSH port used on the public IP, default to 22
# Active directory VM configuration
#ad:
# vm_size: Standard_B2ms
# hybrid_benefit: false # Enable hybrid benefit for AD, default to false
# high_availability: false # Build AD in High Availability mode (2 Domain Controlers) - default to false
# On demand VM configuration
ondemand:
vm_size: Standard_D4as_v5
#fqdn: azhop.foo.com # When provided it will be used for the certificate server name
generate_certificate: true # Generate an SSL certificate for the OnDemand portal. Default to true
# Grafana VM configuration
grafana:
vm_size: Standard_B2ms
# Guacamole VM configuration
guacamole:
vm_size: Standard_B2ms
# Scheduler VM configuration
scheduler:
vm_size: Standard_B2ms
# CycleCloud VM configuration
cyclecloud:
vm_size: Standard_B2ms
version: 8.3.0-3062 # to specify a specific version, see https://packages.microsoft.com/yumrepos/cyclecloud/
# Lustre cluster configuration
#lustre:
# rbh_sku: "Standard_D2d_v5"
# mds_sku: "Standard_D2d_v5"
# oss_sku: "Standard_D8d_v5"
# oss_count: 2
# hsm_max_requests: 8
# mdt_device: "/dev/sdb"
# ost_device: "/dev/sdb"
# hsm:
# optional to use existing storage for the archive
# if not included it will use the azhop storage account that is created
# storage_account: #existing_storage_account_name
# storage_container: #only_used_with_existing_storage_account
# List of users to be created on this environment
users:
# name: username - must be less than 20 characters
# uid: uniqueid
# shell: /bin/bash # default to /bin/bash
# home: /anfhome/<user_name> # default to /homedir_mountpoint/user_name
# groups: list of groups the user belongs to
- { name: clusteradmin, uid: 10001, groups: [5001, 6000, 6001] }
- { name: clusteruser, uid: 10002 }
- { name: jmorey, uid: 10003, groups: [5001, 6000, 6001] }
# - { name: jemorey, uid: 10006, groups: [5001, 5002, 6000, 6001] }
# - { name: user1, uid: 10004, groups: [6000] }
# - { name: user2, uid: 10005, groups: [6001] }
usergroups:
# These groups can’t be changed
- name: Domain Users # All users will be added to this one by default
gid: 5000
- name: az-hop-admins
gid: 5001
description: "For users with azhop admin privileges"
- name: az-hop-localadmins
gid: 5002
description: "For users with sudo right or local admin right on nodes"
# For custom groups use gid >= 6000
- name: project1 # For project1 users
gid: 6000
- name: project2 # For project2 users
gid: 6001
# Enable cvmfs-eessi - disabled by default
#cvmfs_eessi:
# enabled: true
# scheduler to be installed and configured (openpbs, slurm)
queue_manager: slurm
# Specific SLURM configuration
slurm:
# Enable SLURM accounting, this will create a SLURM accounting database in a managed MySQL server instance
accounting_enabled: true
# SLURM version to install. Currently supported: only 20.11.9 and 22.05.3.
# Other versions can be installed by building from source (See build_rpms setting in the slurmserver role)
slurm_version: 20.11.9
# Name of the SLURM cluster for accounting (optional, default to 'slurm')
# WARNING: changing this value on a running cluster will cause slurmctld to fail to start. This is a
# safety check to prevent accounting errors. To override, remove /var/spool/slurmd/clustername
cluster_name: jm_azhop
enroot:
enroot_version: 3.4.1
# If using an existing Managed MariaDB instance for SLURM accounting and/or Guacamole, specify these values
#database:
# Admin user of the database for which the password will be retrieved from the azhop keyvault
# user: sqladmin
# FQDN of the managed instance
# fqdn:
# IP of the managed private endpoint if the FQDN is not registered in a private DNS
# ip:
# Create a Bastion in the bastion subnet when defined
bastion:
create: false
# Create a VPN Gateway in the gateway subnet when specified
vpn_gateway:
create: false
# Authentication configuration for accessing the az-hop portal
# Default is basic authentication. For oidc authentication you have to specify the following values
# The OIDCClient secret need to be stored as a secret named <oidc-client-id>-password in the keyvault used by az-hop
authentication:
httpd_auth: oidc # oidc or basic
# User mapping https://osc.github.io/ood-documentation/latest/reference/files/ood-portal-yml.html#ood-portal-generator-user-map-match
# Domain users are mapped to az-hop users with the same name and without the domain name
user_map_match: '^([^@]+)@microsoft.com$'
ood_auth_openidc:
OIDCProviderMetadataURL: 'https://sts.windows.net/{{tenant_id}}/.well-known/openid-configuration' # for AAD use 'https://sts.windows.net/{{tenant_id}}/.well-known/openid-configuration'
OIDCClientID: 'redacted'
OIDCRemoteUserClaim: 'upn' # for AAD use 'upn'
OIDCScope: 'openid profile email groups' # for AAD use 'openid profile email groups'
OIDCPassIDTokenAs: 'serialized' # for AAD use 'serialized'
OIDCPassRefreshToken: 'On' # for AAD use 'On'
OIDCPassClaimsAs: 'environment' # for AAD use 'environment'
#image_gallery:
# create: false # Create the shared image gallery to store custom images
# List of images to be defined
images:
# - name: image_definition_name # Should match the packer configuration file name, one per packer file
# publisher: azhop
# offer: CentOS
# sku: 7_9-gen2
# hyper_v: V2 # V1 or V2 (V1 is the default)
# os_type: Linux # Linux or Windows
# version: 7.9 # Version of the image to create the image definition in SIG
# Pre-defined images
- name: azhop-compute-almalinux-8_7
publisher: azhop
offer: almalinux
sku: 8_7-hpc-gen2
hyper_v: V2
os_type: Linux
version: 8.7
- name: azhop-centos79-v2-rdma-gpgpu
publisher: azhop
offer: CentOS
sku: 7.9-gen2
hyper_v: V2
os_type: Linux
version: 7.9
# Image definition when using a custom image to build compute nodes images
- name: azhop-centos79-v2-rdma-ci
publisher: azhop
offer: CentOS
sku: 7.9-gen2-ci
hyper_v: V2
os_type: Linux
version: 7.9
# Image definition when using a custom image to build remote viz nodes images
- name: azhop-centos79-desktop3d-ci
publisher: azhop
offer: CentOS
sku: 7.9-gen2-desktop3d-ci
hyper_v: V2
os_type: Linux
version: 7.9
- name: azhop-centos79-desktop3d
publisher: azhop
offer: CentOS
sku: 7.9-gen2-desktop3d
hyper_v: V2
os_type: Linux
version: 7.9
- name: azhop-compute-centos-7_9
publisher: azhpc
offer: azhop-compute
sku: centos-7_9
hyper_v: V2
os_type: Linux
version: 7.9
- name: azhop-desktop-centos-7_9
publisher: azhpc
offer: azhop-desktop
sku: centos-7_9
hyper_v: V2
os_type: Linux
version: 7.9
- name: azhop-compute-ubuntu-1804
publisher: azhpc
offer: azhop-compute
sku: ubuntu-1804
hyper_v: V2
os_type: Linux
version: 18.04
- name: azhop-compute-ubuntu-2004
publisher: azhpc
offer: azhop-compute
sku: ubuntu-2004
hyper_v: V2
os_type: Linux
version: 20.04
- name: azhop-win10
publisher: azhop
offer: Windows-10
sku: 21h1-pron
hyper_v: V1
os_type: Windows
version: 10.19043
# Base image when building your own HPC image and not using the HPC marketplace images
- name: base-centos79-v2-rdma
publisher: azhop
offer: CentOS
sku: 7.9-gen2-rdma-nogpu
hyper_v: V2
os_type: Linux
version: 7.9
# Autoscale default settings for all queues, can be overriden on each queue depending on the VM type if needed
autoscale:
idle_timeout: 1800 # Idle time in seconds before shutting down VMs - default to 1800 like in CycleCloud
# List of queues (node arrays in Cycle) to be defined
# don't use queue names longer than 8 characters in order to leave space for node suffix, as hostnames are limited to 15 chars due to domain join and NETBIOS constraints.
queues:
- name: execute # name of the Cycle Cloud node array
# Azure VM Instance type
vm_size: Standard_F2s_v2
# maximum number of cores that can be instanciated
max_core_count: 20
# marketplace image name or custom image id
image: azhpc:azhop-compute:ubuntu-20_04:latest
#image: /subscriptions/<sub>/resourceGroups/jm-azhop-eus-wkg-rg/providers/Microsoft.Compute/galleries/azhop_kgasc3me/images/azhop-compute-ubuntu-2004/versions/2023.0305.1748
# Image plan specification (when needed for the image). Terms must be accepted prior to deployment
# plan: publisher:product:name
# Set to true if AccelNet need to be enabled. false is the default value
EnableAcceleratedNetworking: false
# spot instance support. Default is false
spot: false
# Set to false to disable creation of placement groups (for SLURM only). Default is true
ColocateNodes: false
# Specific idle time in seconds before shutting down VMs, make sure it's lower than autoscale.idle_timeout
idle_timeout: 300
# Set the max number of vm's in a VMSS; requires additional limit raise through support ticket for >100;
# 100 is default value; lower numbers will improve scaling for single node jobs or jobs with small number of nodes
MaxScaleSetSize: 100
- name: hc44rs
vm_size: Standard_HC44rs
max_core_count: 200
image: azhpc:azhop-compute:centos-7_9:latest
# image: /subscriptions/{{subscription_id}}/resourceGroups/{{resource_group}}/providers/Microsoft.Compute/galleries/{{sig_name}}/images/azhop-centos79-v2-rdma-gpgpu/latest
spot: true
EnableAcceleratedNetworking: true
- name: hb120v2
vm_size: Standard_HB120rs_v2
max_core_count: 480
image: azhpc:azhop-compute:centos-7_9:latest
# image: /subscriptions/{{subscription_id}}/resourceGroups/{{resource_group}}/providers/Microsoft.Compute/galleries/{{sig_name}}/images/azhop-centos79-v2-rdma-gpgpu/latest
spot: true
EnableAcceleratedNetworking: true
- name: hb120v3
vm_size: Standard_HB120rs_v3
max_core_count: 480
image: azhpc:azhop-compute:centos-7_9:latest
# image: /subscriptions/{{subscription_id}}/resourceGroups/{{resource_group}}/providers/Microsoft.Compute/galleries/{{sig_name}}/images/azhop-centos79-v2-rdma-gpgpu/latest
spot: true
EnableAcceleratedNetworking: true
# Queue dedicated to GPU remote viz nodes. This name is fixed and can't be changed
- name: viz3d
vm_size: Standard_NV12s_v3
max_core_count: 24
# Use the pre-built azhop image from the marketplace
image: azhpc:azhop-desktop:centos-7_9:latest
# Use this image ID when building your own custom images
#image: /subscriptions/<sub>/resourceGroups/jm-azhop-eus-wkg-rg/providers/Microsoft.Compute/galleries/azhop_kgasc3me/images/azhop-compute-ubuntu-2004/versions/2023.0305.1748
ColocateNodes: false
spot: false
EnableAcceleratedNetworking: true
max_hours: 12 # Maximum session duration
min_hours: 1 # Minimum session duration - 0 is infinite
# Queue dedicated to share GPU remote viz nodes. This name is fixed and can't be changed
- name: largeviz3d
vm_size: Standard_NV48s_v3
max_core_count: 96
image: azhpc:azhop-desktop:centos-7_9:latest
# image: /subscriptions/{{subscription_id}}/resourceGroups/{{resource_group}}/providers/Microsoft.Compute/galleries/{{sig_name}}/images/azhop-centos79-desktop3d/latest
ColocateNodes: false
EnableAcceleratedNetworking: true
spot: false
max_hours: 12
min_hours: 1
# Queue dedicated to non GPU remote viz nodes. This name is fixed and can't be changed
- name: viz
vm_size: Standard_D4s_v5
max_core_count: 16
image: azhpc:azhop-desktop:centos-7_9:latest
#image: /subscriptions/<sub>/resourceGroups/jm-azhop-eus-wkg-rg/providers/Microsoft.Compute/galleries/azhop_kgasc3me/images/azhop-compute-ubuntu-2004/versions/2023.0305.1748
ColocateNodes: false
spot: false
EnableAcceleratedNetworking: true
max_hours: 12
min_hours: 1
# Remote Visualization definitions
enable_remote_winviz: false # Set to true to enable windows remote visualization
remoteviz:
- name: winviz # This name is fixed and can't be changed
vm_size: Standard_NV8as_v4 # Standard_NV12s_v3 Only NVsv3 and NVsV4 are supported
max_core_count: 24
image: "MicrosoftWindowsDesktop:Windows-10:21h1-pron:latest"
ColocateNodes: false
spot: false
EnableAcceleratedNetworking: true
# Application settings
applications:
bc_codeserver:
enabled: true
bc_jupyter:
enabled: true
bc_amlsdk:
enabled: true
bc_rstudio:
enabled: true
bc_ansys_workbench:
enabled: true
bc_vmd:
enabled: true
bc_paraview:
enabled: true
bc_vizer:
enabled: true
The text was updated successfully, but these errors were encountered:
Version
1.0.36
In what area(s)?
/area configuration
Expected Behavior
Grafana installs and the Ansible config continues
Actual Behavior
The Ansible install fails and stops at following point:
The
grafana.log
has the following error:Deleting
grafana.db
on the Grafana VM will allow me to manually start the Grafana service but it fails when trying to re-run./install.sh
manually created the
playbooks/grafana.ok
to ignore it and move on.Steps to Reproduce the Problem
I started from a fresh install using the following
config.yml
file:The text was updated successfully, but these errors were encountered: