
WMAgent deployment

todor-ivanov edited this page Dec 3, 2019 · 56 revisions

Pre-requisites

  • A condor_schedd daemon must be deployed and running on your node.
  • The node needs to be added to the glideinWMS pool (if it is not there yet).
  • Create an environment setup file at /data/admin/wmagent/env.sh (check other agents to see its content). This file needs to be sourced each time you want to operate WMAgent.
  • Create a secrets file with the service URLs and database credentials at /data/admin/wmagent/WMAgent.secrets (check other agents to see its content). This file is used during WMAgent deployment to override some default configuration.
  • NOTE: be very careful with this file, especially if you are copying it from another agent. Make sure to:
      • overwrite the Oracle settings or replace them with MySQL credentials. Otherwise, you may delete a production Oracle database!
      • update COUCH_HOST with the proper node IP;
      • update the service URLs in case you are using cmsweb-testbed or your own private virtual machine.
  • Copy the service certificate files (service{cert,key}.pem from vocms0230) to the /data/certs/ directory. Their permissions must be 600 or stricter.
  • Copy the short-term proxy (myproxy.pem from vocms0230) to the /data/certs/ directory.
  • Finally, this script will be used for the deployment: https://github.com/dmwm/WMCore/blob/master/deploy/deploy-wmagent.sh
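The file checks in the list above can be scripted and run before each deployment. The following is a minimal sketch, not part of WMCore: the helper name `check_prereqs` is hypothetical, it assumes GNU `stat`, it takes `service{cert,key}.pem` to mean `servicecert.pem` and `servicekey.pem`, and it reads the permission requirement as "no access for group or others".

```shell
# Hypothetical pre-flight helper (not part of WMCore): verify that the files
# named in the pre-requisites exist and that the credentials are private.
check_prereqs() {
    admin_dir=$1    # e.g. /data/admin/wmagent
    certs_dir=$2    # e.g. /data/certs
    rc=0

    for f in "$admin_dir/env.sh" "$admin_dir/WMAgent.secrets" \
             "$certs_dir/servicecert.pem" "$certs_dir/servicekey.pem" \
             "$certs_dir/myproxy.pem"; do
        if [ ! -f "$f" ]; then
            echo "MISSING: $f"
            rc=1
        fi
    done

    # Assumption: the permission rule means "no bits set for group or others".
    for f in "$certs_dir/servicecert.pem" "$certs_dir/servicekey.pem"; do
        [ -f "$f" ] || continue
        mode=$(stat -c %a "$f")          # GNU stat, octal mode such as 600
        if [ $(( 0$mode & 077 )) -ne 0 ]; then
            echo "TOO OPEN ($mode): $f"
            rc=1
        fi
    done
    return $rc
}

# Usage (commented out so that sourcing this snippet checks nothing):
# check_prereqs /data/admin/wmagent /data/certs && echo "pre-requisites look fine"
```

The helper only reports problems; it does not look inside WMAgent.secrets, so the Oracle, COUCH_HOST and service URL settings still need to be reviewed by hand.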

Deployment procedure

1. Initial setup (example for CERN agents)

At this point, you should have gone through the pre-requisites, especially the changes required to WMAgent.secrets (if not, go back there!). From lxplus or aiadm, access the node with your own account and then switch to cmst1.

$ ssh vocmsXXX

$ sudo -u cmst1 /bin/bash --init-file ~cmst1/.bashrc

Download the deployment script:

$ cd /data/srv

$ wget -nv https://raw.githubusercontent.com/dmwm/WMCore/[wmcore_tag]/deploy/deploy-wmagent.sh
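The `[wmcore_tag]` placeholder in the URL has to be replaced by a WMCore tag or branch name. As a small sketch, a helper (the name `deploy_script_url` is hypothetical) can build the URL so that the tag is typed only once:

```shell
# Hypothetical helper: build the raw GitHub URL of deploy-wmagent.sh for a
# given WMCore tag or branch (the [wmcore_tag] placeholder above).
deploy_script_url() {
    echo "https://raw.githubusercontent.com/dmwm/WMCore/$1/deploy/deploy-wmagent.sh"
}

# Usage (commented out so that sourcing this snippet downloads nothing):
# cd /data/srv && wget -nv "$(deploy_script_url master)"
```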

2. Deploying the agent

First, read the help/usage of the script by running:

$ sh deploy-wmagent.sh

You need to provide several arguments on the command line; again, read the script help printed by the command above. The following is an example of a WMAgent deployment:

$ sh deploy-wmagent.sh -w 1.1.18.patch2 -c HG1811b -t testbed-dev -p "6986" -r comp=comp

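The command above would deploy WMAgent version 1.1.18.patch2, using the HG1811b dmwm/deployment tag, setting the team name to testbed-dev, applying the official pull request 6986 from the WMCore repo and, finally, retrieving the wmagent RPM from the comp repository (-r comp=comp). Option letters differ between versions of the script (the production example further down passes the deployment tag with -d and the central services URL with -c), so rely on the help output of the script you actually downloaded.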

3. Final check and starting services

Once you finish the deployment of the agent, it is worth checking whether config.py contains the correct configuration (according to the command-line arguments and the secrets file). Run:

$ source /data/admin/wmagent/env.sh

$ less config/wmagent/config.py

If everything is OK, you just need to start the components, since the services (couchdb and mysql) are started during the deployment procedure. To start all the components (the agent itself), run:

$ $manage start-agent

4. Additional commands

If you made some changes to the code and want to restart the agent (all components), type:

$ $manage stop-agent

$ $manage start-agent

If you want to restart only specific components, type:

$ $manage execute-agent wmcoreD --restart --components=DBS3Upload
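When more than one component needs a restart, the single-component command above can be repeated in a loop. This is a sketch under assumptions: the helper name `restart_components` is hypothetical, `$manage` is expected to be set by env.sh, and `JobSubmitter` in the usage line is only a second example name.

```shell
# Hypothetical helper: restart several components one at a time, reusing
# exactly the single-component command shown above.
# Assumes $manage has been set by sourcing env.sh.
restart_components() {
    for comp in "$@"; do
        echo "Restarting $comp"
        "$manage" execute-agent wmcoreD --restart --components="$comp" || return 1
    done
}

# Usage (commented out so that sourcing this snippet restarts nothing):
# restart_components DBS3Upload JobSubmitter
```

The loop stops at the first component whose restart command fails, so a problem is not hidden by later output.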

Deployment of a new agent in production

CERN agents

1. New machine

Ask the VOC for a new machine configured by puppet. The machine needs to be registered as a proper schedd in the CERN HTCondor global pool. Then follow the procedure explained above.

2. Upgrading an existing agent

Log in to the machine and set up the environment:

$ ssh vocmsXXX

$ sudo -u cmst1 /bin/bash --init-file ~cmst1/.bashrc

$ agentenv

Check which version of the agent is currently installed and make sure that the agent is fully drained.

!!! DO NOT START !!! any further actions if the agent is not completely drained.

$ condor_q

You should see an empty queue:

-- Schedd: vocms0283.cern.ch : <137.138.153.30:4080?... @ 11/15/19 16:21:15
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 
Total for all users: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
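Reading the summary line can be automated so that a script refuses to continue while jobs are still queued. The sketch below uses a hypothetical helper, `queue_is_empty`, which only inspects the "Total for query" line in the format shown above; it does not replace the other drain checks.

```shell
# Hypothetical helper: succeed only if the condor_q summary read from stdin
# reports zero jobs for the query, as in the sample output above.
queue_is_empty() {
    grep -q '^Total for query: 0 jobs;'
}

# Usage (commented out so that sourcing this snippet runs nothing):
# condor_q | queue_is_empty && echo "queue is empty" || echo "jobs still queued, do not continue"
```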

$ runningagent

cmst1    1376194  0.0  0.0 112712   948 pts/1    S+   16:22   0:00 grep -E couch|wmcore|mysql|beam
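In the output above the only match is the grep command itself, which means nothing agent-related is running. Judging from that output, `runningagent` appears to filter the process list with `grep -E 'couch|wmcore|mysql|beam'`; this is an inference, not something the page states. Under that assumption, a sketch of the same filter that hides the grep line (the helper name `agent_processes` is hypothetical) could look like this:

```shell
# Hypothetical helper: reduce `ps aux` output on stdin to agent-related
# processes and drop the grep command line itself.
# Exits non-zero when no matching process is left.
agent_processes() {
    grep -E 'couch|wmcore|mysql|beam' | grep -v 'grep -E'
}

# Usage (commented out so that sourcing this snippet runs nothing):
# ps aux | agent_processes || echo "nothing agent-related is running"
```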

Check the status of the agent:

$ $manage status

And in case there is something still running:

$ $manage stop-agent

$ $manage stop-services

Clean the database

$ $manage execute-agent clean-oracle

Executing clean-oracle  ...
Are you sure you want to wipe out CMS_WMBS_PROD13 oracle database (yes/no): yes
Alright, dropping and purging everything

SQL*Plus: Release 11.2.0.4.0 Production on Fri Nov 15 16:26:10 2019

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options

SQL> 

SQL> Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
Done!

Copy the old config:

$ cp /data/srv/wmagent/current/config/wmagent/config.py /data/srv/config.py.$(date -I)
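Because the old agent directory is removed in the next step, it is worth confirming that the backup really exists. The following sketch wraps the copy in a hypothetical helper, `backup_config`, which prints the backup path and refuses to overwrite a backup made earlier the same day. It assumes GNU `date`, as the command above already does.

```shell
# Hypothetical helper: copy a config file to <dest_dir>/config.py.<YYYY-MM-DD>
# and refuse to overwrite a backup that already exists.
backup_config() {
    src=$1
    dest="$2/config.py.$(date -I)"
    if [ -e "$dest" ]; then
        echo "backup already exists: $dest" >&2
        return 1
    fi
    # Print the path of the backup only if the copy succeeded.
    cp -p "$src" "$dest" && echo "$dest"
}

# Usage (commented out so that sourcing this snippet copies nothing):
# backup_config /data/srv/wmagent/current/config/wmagent/config.py /data/srv
```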

Remove the old agent (use the version directory that is actually installed on the node)

$ rm -fr /data/srv/wmagent/v1.2.4.patch2/

Restart the whole node

Logout from cmst1 account and reboot

$ exit
$ sudo reboot

Run puppet manually

Once the machine is up again, log in and run puppet manually. Even though the machines run puppet on startup, sometimes more than a single run is needed to apply a new change:

[lxplus** ]$ ssh vocms**.cern.ch
$ sudo -s 
$ puppet agent -tv
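Since more than one puppet run may be needed, the repetition can be scripted. This sketch relies on the fact that `puppet agent -t` enables detailed exit codes, where 0 means the run succeeded with no changes and 2 means changes were applied. The helper name `run_until_clean` and the limit of three runs in the usage line are assumptions.

```shell
# Hypothetical helper: rerun a command until it exits 0, at most <max> times.
# With `puppet agent -t`, exit code 0 means "no changes were needed" and
# exit code 2 means "changes were applied", so 0 signals a settled node.
run_until_clean() {
    max=$1; shift
    i=1
    while [ "$i" -le "$max" ]; do
        rc=0
        "$@" || rc=$?
        echo "run $i exited with $rc"
        [ "$rc" -eq 0 ] && return 0
        i=$((i + 1))
    done
    return 1
}

# Usage, as root (commented out so that sourcing this snippet runs nothing):
# run_until_clean 3 puppet agent -tv
```

If the helper returns non-zero, the last runs either kept changing resources or failed, and the puppet output needs to be read by hand.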

Delete any leftovers from previous deployments

$ sudo -u cmst1 /bin/bash --init-file ~cmst1/.bashrc
$ cd /data/srv
$ rm -rf deploy*

Check the WMAgent.secrets file:

$ vi /data/admin/wmagent/WMAgent.secrets

Watch for 'ORACLE_TNS' and 'RUCIO_ACCOUNT'
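To look at just these settings without opening the whole file (and without putting passwords on screen), a small filter can help. This is a sketch: the helper name `show_secrets_keys` is hypothetical, and it assumes the secrets file holds one KEY=value pair per line.

```shell
# Hypothetical helper: print only the named settings from a secrets file.
# Assumption: the file holds one KEY=value pair per line.
show_secrets_keys() {
    file=$1; shift
    for key in "$@"; do
        grep -E "^${key}=" "$file" || echo "${key} is not set"
    done
}

# Usage (commented out so that sourcing this snippet prints nothing):
# show_secrets_keys /data/admin/wmagent/WMAgent.secrets ORACLE_TNS RUCIO_ACCOUNT COUCH_HOST
```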

Download the wmagent deploy script from master this time.

$ wget -nv https://raw.githubusercontent.com/dmwm/WMCore/master/deploy/deploy-wmagent.sh

Run the wmagent deployment script

Before executing the command check for the correct versions of:

  • agent tag: example "1.2.8"
  • deployment script tag: "HG1911c"
  • team name: "production"
  • agent number: "13"

Take those from the previous wmagent config file, and run:

$ sh deploy-wmagent.sh -w 1.2.8 -d HG1911c -t production -c cmsweb.cern.ch -n 13 | tee -a /data/srv/deployment.log.$(date -I)

Or in case we need a patched deployment:

$ sh deploy-wmagent.sh -w 1.2.8 -d HG1911c -t production -c cmsweb.cern.ch -n 13 -p "9439" | tee -a /data/srv/deployment.log.$(date -I)
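To reduce typing mistakes, the command line can be assembled from the values collected from the previous config and reviewed before it is run. The sketch below only prints the command; the helper name `print_deploy_cmd` is hypothetical, and it uses only the options that appear in the two examples above.

```shell
# Hypothetical helper: print (not run) the deployment command for the given
# values, using only the options that appear in the two examples above.
print_deploy_cmd() {
    version=$1
    deploy_tag=$2
    team=$3
    services=$4
    number=$5
    patches=${6:-}
    cmd="sh deploy-wmagent.sh -w $version -d $deploy_tag -t $team -c $services -n $number"
    # The patch list is optional; append -p only when one was given.
    if [ -n "$patches" ]; then
        cmd="$cmd -p \"$patches\""
    fi
    echo "$cmd"
}

# Usage: print_deploy_cmd 1.2.8 HG1911c production cmsweb.cern.ch 13 9439
```

Printing first and running second keeps a human check between the collected values and the actual deployment.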

Watch out for errors. Go through every step of the installation output and confirm that it finished without errors, especially the parts related to CouchDB.
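Since the output is also written to a log file by `tee`, a keyword search can serve as a first pass before reading the log in full. This is a sketch: the helper name `scan_deploy_log` is hypothetical and the keyword list is an assumption, not something the deployment script defines.

```shell
# Hypothetical helper: list log lines that look like failures, with their
# line numbers. The keyword list is an assumption; extend it as needed.
# Exits non-zero when no line matches.
scan_deploy_log() {
    grep -inE 'error|fail|traceback|exception' "$1"
}

# Usage (commented out so that sourcing this snippet reads nothing):
# scan_deploy_log "/data/srv/deployment.log.$(date -I)" || echo "no suspicious lines found"
```

A clean scan is not proof of a clean deployment; it only narrows down where to look first.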

Check status

Check the status of the agent in its local couchdb by visiting the following (change the machine name):

https://cmsweb.cern.ch/couchdb/_utils/document.html?reqmgr_auxiliary/WMAGENT_CONFIG_vocms0283.cern.ch
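The only part of this URL that changes between agents is the host name at the end. As a small sketch, a hypothetical helper `agent_status_url` can build it for any node:

```shell
# Hypothetical helper: build the status document URL for a given agent host,
# following the pattern of the URL shown above.
agent_status_url() {
    echo "https://cmsweb.cern.ch/couchdb/_utils/document.html?reqmgr_auxiliary/WMAGENT_CONFIG_$1"
}

# Usage on the agent node itself: agent_status_url "$(hostname -f)"
```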

Run the agent

$ agentenv
$ $manage start-agent

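Once the agent is validated, you no longer need the deployment output and the old config, so clean them up. Note that $(date -I) expands to today's date, so the command below only matches files created on the same day; if you validate on a later day, type the deployment date instead: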

$ rm /data/srv/*$(date -I)
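When the cleanup happens on a different day than the deployment, the date has to be given explicitly. The sketch below does that with a hypothetical helper, `cleanup_deploy_files`, which validates the date format before expanding the glob so that an empty argument cannot widen the match.

```shell
# Hypothetical helper: remove the files in <dir> whose names end in
# .<YYYY-MM-DD>, e.g. config.py.2019-11-15 and deployment.log.2019-11-15.
cleanup_deploy_files() {
    dir=$1
    day=$2     # the day the deployment was run, e.g. 2019-11-15
    case $day in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
        *) echo "expected a date like YYYY-MM-DD" >&2; return 1 ;;
    esac
    # -v lists every file that is removed.
    rm -v -- "$dir"/*."$day"
}

# Usage (commented out so that sourcing this snippet deletes nothing):
# cleanup_deploy_files /data/srv 2019-11-15
```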

Change status in Trello

Move the relevant card in Trello from 'Drained' to 'Ready to start': https://trello.com
