WMAgent deployment
Pre-requisites
- A condor_schedd daemon must be deployed and running in your node.
- The schedd needs to be added to the glideinWMS pool, if it has not been added already.
- Create an environment setup file under /data/admin/wmagent/env.sh (check other agents to see its content). This file needs to be sourced each time you want to operate WMAgent.
- Create a secrets file with service URLs and database credentials under /data/admin/wmagent/WMAgent.secrets (check other agents to see its content). This file is used during WMAgent deployment to override some default configuration.
  - NOTE: be very careful with this file, especially if you are copying it from another agent. Make sure to:
    - overwrite the Oracle settings or replace them with MySQL credentials. Otherwise, you may delete a production Oracle database!
    - update COUCH_HOST with the proper node IP
    - update the service URLs if you are using cmsweb-testbed or your own private virtual machine
- Copy the service certificate files (service{cert,key}.pem from vocms0230) to the /data/certs/ directory. Note that their permissions must be 600 or stricter.
- Copy the short-term proxy (myproxy.pem from vocms0230) to the /data/certs/ directory.
- Finally, this script will be used for the deployment: https://github.com/dmwm/WMCore/blob/master/deploy/deploy-wmagent.sh
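Before deploying, the file-related pre-requisites above can be checked with a small script. This is a minimal sketch, not part of WMCore: the function names `is_private` and `check_prerequisites` are hypothetical, it assumes GNU `stat`, and it reads "at least 600" as "no access for group or others".

```shell
#!/bin/bash
# Sketch: verify that the files named in the pre-requisites exist and that
# they are not accessible by group or others.
# Assumes GNU coreutils `stat` (the -c option), as found on Linux nodes.

# Return 0 if the file exists and its mode grants nothing to group/others.
is_private() {
    local file=$1 mode
    [ -f "$file" ] || return 1
    mode=$(stat -c '%a' "$file") || return 1
    # Mask the group and other permission bits (octal 077).
    [ $(( 8#$mode & 8#077 )) -eq 0 ]
}

# Report the state of each file given as an argument.
check_prerequisites() {
    local rc=0 f
    for f in "$@"; do
        if is_private "$f"; then
            echo "OK       $f"
        else
            echo "PROBLEM  $f (missing, or accessible by group/others)"
            rc=1
        fi
    done
    return $rc
}
```

It could be called, for example, as `check_prerequisites /data/certs/servicecert.pem /data/certs/servicekey.pem /data/certs/myproxy.pem /data/admin/wmagent/WMAgent.secrets`.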
At this point, you should have gone through the pre-requisites, especially the changes required to WMAgent.secrets (if not, go back and do so now). From lxplus or aiadm, access the node with your own account and then switch to cmst1:
$ ssh vocmsXXX
$ sudo -u cmst1 /bin/bash --init-file ~cmst1/.bashrc
Download the deployment script:
$ cd /data/srv
$ wget -nv https://raw.githubusercontent.com/dmwm/WMCore/[wmcore_tag]/deploy/deploy-wmagent.sh
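The `[wmcore_tag]` placeholder in the URL above has to be replaced with a real WMCore tag or branch. A minimal sketch of doing this with a helper (the function name `deploy_script_url` is hypothetical, and the tag in the usage comment is only an illustration):

```shell
#!/bin/bash
# Sketch: build the raw GitHub URL of deploy-wmagent.sh for a given
# WMCore tag or branch name.
deploy_script_url() {
    local tag=${1:?usage: deploy_script_url <wmcore_tag>}
    echo "https://raw.githubusercontent.com/dmwm/WMCore/${tag}/deploy/deploy-wmagent.sh"
}

# Example (illustrative tag value; check that the tag exists in the repo):
#   wget -nv "$(deploy_script_url 1.1.18.patch2)"
```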
First, read the help/usage of the script by:
$ sh deploy-wmagent.sh
The script requires several command-line arguments; read the help output from the command above for the options supported by your version of the script. As an example, a WMAgent deployment could look like this:
$ sh deploy-wmagent.sh -w 1.1.18.patch2 -c HG1811b -t testbed-dev -p "6986" -r comp=comp
The command above would deploy WMAgent version 1.1.18.patch2, using the HG1811b dmwm/deployment tag, setting the team name to testbed-dev, applying the official pull request 6986 from the WMCore repo and, finally, retrieving the wmagent RPM from the comp repository (comp=comp).
Once you finish the deployment of the agent, it is worth checking whether config.py contains the correct configuration (according to the command-line arguments and the secrets file). Run:
$ source /data/admin/wmagent/env.sh
$ less config/wmagent/config.py
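As a complement to reading the file by eye, a quick check can confirm that the values passed on the command line appear somewhere in config.py. This is only a sketch: the function name `config_mentions` is hypothetical, and finding a string in the file does not prove it sits under the right setting.

```shell
#!/bin/bash
# Sketch: report which of the expected literal values do not appear
# anywhere in a configuration file.
config_mentions() {
    local file=$1 value missing=0
    shift
    for value in "$@"; do
        # -F matches the value literally, -q suppresses output.
        if ! grep -qF -- "$value" "$file"; then
            echo "not found in $file: $value"
            missing=1
        fi
    done
    return $missing
}

# Example, using the team name from the deployment command above:
#   config_mentions config/wmagent/config.py testbed-dev
```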
If everything is OK, you just need to start the components, since the services (couchdb and mysql) are started during the deployment procedure. To start all the components (the agent itself), run:
$ $manage start-agent
If you made some changes to the code and want to restart the agent (all components), type:
$ $manage stop-agent
$ $manage start-agent
If you want to restart only specific components, type:
$ $manage execute-agent wmcoreD --restart --components=DBS3Upload
Ask the VOC for a new machine configured by puppet. The machine needs to be registered as a proper schedd in the CERN HTCondor global pool. Then follow the procedure explained above.
$ ssh vocmsXXX
$ sudo -u cmst1 /bin/bash --init-file ~cmst1/.bashrc
$ agentenv
!!! DO NOT START !!! any further actions if the agent is not completely drained.
$ condor_q
You should see an empty queue:
-- Schedd: vocms0283.cern.ch : <137.138.153.30:4080?... @ 11/15/19 16:21:15
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
$ runningagent
cmst1 1376194 0.0 0.0 112712 948 pts/1 S+ 16:22 0:00 grep -E couch|wmcore|mysql|beam
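The empty-queue check above can also be scripted, so that a drain is confirmed before any destructive step. A minimal sketch, assuming condor_q prints a "Total for all users" summary line in the format of the sample output above; the function name `queue_is_empty` is hypothetical:

```shell
#!/bin/bash
# Sketch: decide from condor_q output (read on stdin) whether the schedd
# queue is empty, based on the "Total for all users" summary line.
queue_is_empty() {
    awk '
        # Field 5 of the summary line is the total number of jobs.
        /^Total for all users:/ { found = 1; jobs = $5 }
        # Succeed only if the summary line was seen and reports 0 jobs.
        END { exit !(found && jobs == 0) }
    '
}

# Example:
#   condor_q | queue_is_empty || echo "agent not drained: do not proceed"
```

Treating a missing summary line as "not empty" is a deliberate choice here, so that an unexpected output format stops the procedure instead of letting it continue.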
Check the status of the agent:
$ $manage status
And in case there is something still running:
$ $manage stop-agent
$ $manage stop-services
Unregister the agent from WMStats (this cleans the agent's document from the WMStats database):
$ $manage execute-agent wmagent-unregister-wmstats `hostname -f`
$ $manage execute-agent clean-oracle
Executing clean-oracle ...
Are you sure you want to wipe out CMS_WMBS_PROD13 oracle database (yes/no): yes
Alright, dropping and purging everything
SQL*Plus: Release 11.2.0.4.0 Production on Fri Nov 15 16:26:10 2019
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
SQL>
SQL> Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
Done!
$ cp -av /data/srv/wmagent/current/config/wmagent/config.py /data/srv/config.py.$(date -I)
$ rm -fr /data/srv/wmagent/v1.2.4.patch2/
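Since the old release directory is removed right after the configuration is backed up, it may be worth confirming that the backup is intact first. A minimal sketch (the function name `backup_dated` is hypothetical; it relies on GNU `date -I`, as the commands above do):

```shell
#!/bin/bash
# Sketch: copy a file to <dest_dir>/<name>.<YYYY-MM-DD>, verify the copy
# byte for byte, and print the backup path. Refuses to overwrite an
# existing backup from the same day.
backup_dated() {
    local src=$1 dest_dir=$2 target
    target="$dest_dir/$(basename "$src").$(date -I)"
    if [ -e "$target" ]; then
        echo "backup already exists: $target" >&2
        return 1
    fi
    cp -a "$src" "$target" && cmp -s "$src" "$target" && echo "$target"
}

# Example:
#   backup_dated /data/srv/wmagent/current/config/wmagent/config.py /data/srv
```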
Log out of the cmst1 account and reboot:
$ exit
$ sudo reboot
Once the machine is up again, log in and run puppet manually. Even though the machines run puppet on startup, sometimes more than one run is needed to apply a new change:
[lxplus** ]$ ssh vocms**.cern.ch
$ sudo -s
$ sudo /opt/puppetlabs/bin/puppet agent -tv
$ sudo -u cmst1 /bin/bash --init-file ~cmst1/.bashrc
$ cd /data/srv
$ rm -rf deploy*
$ vi /data/admin/wmagent/WMAgent.secrets
Watch for 'ORACLE_TNS' and 'RUCIO_ACCOUNT'
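To review these settings without opening the editor, the relevant lines can be printed directly. This is a sketch that assumes the secrets file uses one `KEY=value` entry per line, which is not confirmed here; the function name `show_secret_keys` is hypothetical. Avoid using it for password entries, since the values are printed to the terminal.

```shell
#!/bin/bash
# Sketch: print the lines that set the given keys in a KEY=value file,
# and flag any key that is not set at all.
show_secret_keys() {
    local file=$1 key
    shift
    for key in "$@"; do
        grep -E "^[[:space:]]*${key}[[:space:]]*=" "$file" || echo "${key}: NOT SET"
    done
}

# Example:
#   show_secret_keys /data/admin/wmagent/WMAgent.secrets ORACLE_TNS RUCIO_ACCOUNT COUCH_HOST
```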
$ wget -nv https://raw.githubusercontent.com/dmwm/WMCore/master/deploy/deploy-wmagent.sh
Before executing the command check for the correct versions of:
- agent tag: example "1.2.8"
- deployment script tag: "HG1911c"
- team name: "production"
- agent number: "13"
Take those from the previous wmagent config file, and run:
$ sh deploy-wmagent.sh -w 1.2.8 -d HG1911c -t production -c cmsweb.cern.ch -n 13 |tee -a /data/srv/deployment.log.$(date -I)
Or in case we need a patched deployment:
$ sh deploy-wmagent.sh -w 1.2.8 -d HG1911c -t production -c cmsweb.cern.ch -n 13 -p "9439" |tee -a /data/srv/deployment.log.$(date -I)
Watch out for errors. Go through every step of the installation output and confirm that it finished without errors, especially the parts related to CouchDB.
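Because the deployment output is saved with tee, the log can be scanned afterwards as a first pass before reading it in full. A minimal sketch: the function name `scan_deploy_log` is hypothetical, and the search pattern is a guess at what failures look like, not something taken from deploy-wmagent.sh.

```shell
#!/bin/bash
# Sketch: list suspicious lines in a deployment log, with line numbers.
# Returns non-zero if any line matches the (assumed) failure pattern.
scan_deploy_log() {
    local log=$1
    if grep -inE 'error|fail|traceback|permission denied' "$log"; then
        return 1    # at least one suspicious line was printed
    fi
    echo "no suspicious lines in $log"
}

# Example:
#   scan_deploy_log /data/srv/deployment.log.$(date -I)
```

Note also that when a command is piped into tee, the exit status of the pipeline is that of tee. In bash, `set -o pipefail` or `${PIPESTATUS[0]}` exposes the status of the first command instead; whether the deployment script itself returns a non-zero status on failure is not confirmed here.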
Check the status of the agent in its local couchdb by visiting the following (change the machine name):
$ agentenv
$ $manage start-agent
$ rm /data/srv/*$(date -I)
https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsWorkflowTeamWmAgentRealeases
Move the relevant card in Trello from 'Drained' to 'Ready to start': https://trello.com.