HTTP server management

Table of Contents

Acquiring server nodes

Server requirements

TO FIX: this section contains old requirements for machines in the old infrastructure

The front-end nodes should be virtual machines with a minimum of 2 cores, 8 GB RAM, 1 Gbps network and 250 GB usable local disk space, including the operating system and AFS cache. In practice only 15 GB of local disk space is needed; the rest is spare capacity for unusual conditions. There should be a minimum of two production front-end nodes on critical power and on physically separate network routes, racks, machines and hypervisors. In practice one front-end should be in the CERN on-site data centre and another in a critical hosting facility in an outside data centre. In addition there need to be two pre-production front-ends; a failure-tolerant configuration is preferred but not absolutely required. There should be one non-dedicated machine available on stand-by to replace a failing production server on short notice, ~3 days.

General back-end nodes should be virtual machines with a minimum of 2 cores, 8 GB RAM, 1 Gbps network and 250 GB usable disk space, including the operating system and the AFS cache. There should be three general production back-end nodes, at least two of which are on physically different network routes, racks, machines and hypervisors. The network separation should be sufficient to ensure at least one back-end/front-end pair remains connected at all times. In addition there need to be two pre-production back-ends and one combined development front-end+back-end; again a failure-tolerant configuration is preferred but not required. There should be one non-dedicated machine available on stand-by to replace a failing production server on short notice, ~3 days.

In addition there need to be three physical servers with a minimum of 8 cores, 24 GB RAM, 1 Gbps network and 8 TB usable disk space each. The disk space should typically be configured as a single high-performance partition with redundancy, normally either RAID6 or RAID60. Both the operating system and the data storage area should be carved from the physical array as logical volumes, either at the RAID controller level or using LVM. These physical servers host FileMover, CouchDB, DQM GUI, and anything else requiring high performance and/or large data storage. The physical servers should be configured to be backed up, either backing up each other or backing up to TSM.

Projected future needs

It is expected that virtual machines are just fine for the front-ends and general back-ends. These should be uniformly 2 cores, 8 GB RAM, 1 Gbps network and 250 GB disk space each. The main constraints are that there should be a reasonable number of them, there should be reasonably low latency among front-ends and back-ends, each server should be capable of reaching 1 Gbps on its own, and the servers need to remain physically separate from each other in terms of network, power, servers, racks and hypervisors. It should be possible to remove any two nodes at any time without ill effects on the cluster.

We expect to continue to require a separate storage cluster made of physical machines, with a minimum of three servers totalling 50 TB raw storage and ~36 TB effective storage after RAID redundancy. After RAID redundancy the storage should be capable of medium performance, about 300 MB/s reads and writes. This sub-cluster is expected to be set up for mutual back-up, but not in a high-availability fail-over configuration.

Requesting a physical machine

To acquire a server node, request a machine from the CMS VOC, or write to the VOBox-VOC-CMS e-group if you don’t know who the VOC is. The VOC will try to reallocate a spare VOBox to you if one is available, otherwise they will make the request to CERN/IT. The former is preferable as the machine is available immediately and its hardware usually matches that of the other cluster machines. The latter may take from one week to a month depending on the server requirements, and it is possible that only new flavours of hardware are delivered. In both cases, the following requirements must be met:

Creating a VM

Login as yourself to aiadm. Never create the machines directly in foreman or in openstack. Instead, use the ai-bs-vm and ai-kill-vm commands on aiadm.

Source the Openstack tenant project you’d like to create the VM in: use CMS Webtools Critical or CMS Webtools Critical SSD for VMs on critical-power hypervisors, without and with an SSD cache respectively. Otherwise use CMS Webtools for non-critical VMs, or your personal project for test machines.

  eval $(ai-rc "CMS Webtools")

Find the hostgroup and other configuration to use and create it with ai-bs-vm. For example:

  openstack image list             # find the image version to use (the example below uses the latest CC7 x86_64 image)
  openstack flavor list            # find out which flavor of VM to use

  ai-bs-vm -g "vocmsweb/backend_plus_frontend" -i "CC7 - x86_64 [2017-04-06]" --foreman-environment production vocmsNNNN --nova-flavor m2.large --landb-mainuser cms-http-group --landb-responsible cms-http-group

Check in LanDB which hypervisor it was created on. Make sure that VMs from the same group (e.g. general-purpose production back-ends) don’t end up on the same hypervisor. If possible, try getting the VM on a hypervisor in a different rack. Redundant VMs from the same server group should be distributed as much as possible. Also verify that the hypervisor is not busy (e.g. hypervisors running batch nodes).

The machine is ready once the puppet reports in foreman have converged to the same final configuration and you have rebooted it.

If the VM has ephemeral storage, however, you should expand the root partition with it first. Then proceed with the VM tuning instructions.

Finally, if a volume is foreseen for use with the created VM, see the volume creation/attaching instructions.

Commissioning a node

Preparing the machine

In order to prepare a new cmsweb host, please do as follows. Instructions tagged FEND apply to front-ends only, and BEND to back-ends only. For a combined server which runs both front- and back-ends, run both sets of instructions in the order given, unless there’s a separate FEND+BEND section. Image server parts are labelled IMAGE, and build machine parts BUILD.

  1. Preliminary tasks:
    • Verify all the alarms for the node are disabled (roger show vocmsNNN in aiadm);
    • Verify that CMS-HTTP-GROUP owns the machine in LanDB;
    • Verify the server meets capacity requirements;
    • (Physical machines) Verify it came from the correct CERN network block (see this) and isn’t part of the CASTOR_SERVICES set in LanDB;
    • Verify the server has not landed on a busy hypervisor (e.g. hypervisors running batch nodes);
    • If it is a critical server, verify it is on critical power and has its ceph volumes on critical power.
    • Verify the server is using the right environment in foreman. Production machines should use the production (master) environment; other machines (testbed, dev, image, etc.) should use cmswebdev.
  2. (Physical machines) Set the appropriate target hostgroup.

    Make sure load-balancing is off: the machine host name should not be a member of the (pre)prod DNS load balancing in the frontend.pp file.

    For physical machines, you need to set the hostgroup in foreman (e.g. https://judy.cern.ch/hosts/vocms0NNN.cern.ch/edit) if not yet done. For VMs the host should have been created with the right hostgroup already, but it is worth double-checking in foreman as well.

    IMAGE: Hostgroup vocmsweb

    FEND: Hostgroup vocmsweb/frontend

    BEND: Hostgroup vocmsweb/backend

    FEND+BEND: Hostgroup vocmsweb/backend_plus_frontend

    BUILD: Hostgroup vocmsweb/dmwmbld

    If you are setting these hostgroups in foreman for the first time (e.g. for physical machines), then do a full server re-installation before proceeding.

  3. Log in as cmsweb to the machine in question, and execute:

    mkdir -p /tmp/foo
    cd /tmp/foo
    git clone git://github.com/dmwm/deployment.git cfg
    kinit
    cfg/Deploy -t dummy -s post $PWD system/image     # [IMAGE]
    cfg/Deploy -t dummy -s post $PWD system/frontend  # [FEND, FEND+BEND]
    cfg/Deploy -t dummy -s post $PWD system/backend   # [BEND, FEND+BEND]
    # FIX ME: need to add the system/dmwmbld variant and run it for the _BUILD_ flavor
    # Note that for a backend_plus_frontend the order is: first frontend, then backend
    less -SX /tmp/foo/.deploy/*                       # review output
    rm -rf /tmp/foo
    
  4. FEND: Log in as cmsweb to the host and do the following:

    Create log archive directory on AFS.

    kinit # get a cmsweb kerberos token
    mkdir -p /afs/cern.ch/cms/cmsweb/log-archive/$(hostname -s)
    

    Find the acronjob that generates statistics and include the new node in the list:

    acrontab -e
    37 */6 * * * vocms022 /afs/cern.ch/cms/cmsweb/sw/slc5_amd64_gcc461/cms/cmsweb-analytics/2.1-comp/bin/genstats /afs/cern.ch/cms/cmsweb/analytics/geodata /afs/cern.ch/cms/cmsweb/analytics/stats {/afs/cern.ch/cms/cmsweb/log-archive,/build/cmsweb/logs}/vocms{42,43,65,104,105,160,162,0160,0162,0164,NNN}
    
  5. If the ssh host key has changed due to a reinstall, log into the image server as cmsweb and use the following commands to clean the old host key before inserting the new one:

    kinit
    
    # _FEND_
    for host in vocms0{13{4,5},16{0,2,4}}; do
      ssh $host perl -p -i -e 's/^vocmsNNN.*\\n$//g' /home/cmsweb/.ssh/known_hosts
    done
    
    # _BEND, FEND+BEND_
    for host in vocms0{12{6,7},13{1,2,6,8,9},14{0,1},16{1,3,5},306,307,318}; do
      ssh $host perl -p -i -e 's/^vocmsNNN.*\\n$//g' /home/cmsweb/.ssh/known_hosts
    done
    
    # _FEND, FEND+BEND, BEND, IMAGE_
    perl -p -i -e 's/^vocmsNNN.*\n$//g' /home/cmsweb/.ssh/known_hosts
    sudo perl -p -i -e 's/^vocmsNNN.*\n$//g' /root/.ssh/known_hosts
    
  6. Force each machine to be in the known_hosts of all machines from the same group. You will be asked to confirm “yes” for adding ssh host keys for the various combinations.

    # _FEND, FEND+BEND, BEND, IMAGE:_ Login to the image server as _cmsweb_
    # and make sure it can ssh everywhere, including the newly added vocmsNNN
    cd /data
    kinit
    sudo cfg/admin/ImageKey start
    for h in vocms0{12{6,7},13{1,2,4,5,6,8,9},14{0,1},16{0,1,2,3,4,5},306,307,318,NNN}{,.cern.ch}; do
       echo $h
       sudo cfg/admin/ImageKey run ssh cmsweb@$h uptime
       ssh cmsweb@$h uptime
    done
    sudo cfg/admin/ImageKey stop
    
    # _BEND, FEND+BEND:_ From cmsweb@vocmsNNN (not the image server!), run
    for host in vocms0{12{6,7},13{1,2,6,8,9},14{0,1},16{1,3,5},306,307,318} $(hostname -s); do
      for h in $host $host.cern.ch; do
        ssh $h uptime
        ssh -t $h ssh -t $(hostname -s) uptime
        ssh -t $h ssh -t $(hostname -f) uptime
      done
    done
    
    # _FEND:_ From cmsweb@vocmsNNN (not the image server!), run
    for host in vocms0{13{4,5},16{0,2,4}} $(hostname -s); do
      for h in $host $host.cern.ch; do
        ssh $h uptime
        ssh -t $h ssh -t $(hostname -s) uptime
        ssh -t $h ssh -t $(hostname -f) uptime
      done
    done
    
  7. BEND, FEND: Update the SyncLogs and the CmswebReport scripts:
    • Login as cmsweb to the build server (currently, vocms22) where those scripts’ cronjobs are running, edit the /home/cmsweb/{SyncLogs,CmswebReport} scripts and update them accordingly.
    • If you have reinstalled the machine, clean up the ssh host key from known_hosts:

      perl -p -i -e 's/^vocmsNNN.*\n$//g' /home/cmsweb/.ssh/known_hosts
      
    • Re-run the cronjobs by hand, answering yes to add the ssh key when asked:

      /home/cmsweb/SyncLogs
      /home/cmsweb/CmswebReport prod
      /home/cmsweb/CmswebReport preprod
      
    • Check all is fine, then commit your changes upstream to the deployment repository in github. The scripts are under the admin folder.
  8. Reboot the server.

  9. Host registration
    • Request registration to myproxy
    • FEND: Add the front-end in the firewall rules of the back-ends here.
    • FEND: Request online reverse proxy for DQM.
    • FEND (but not for FEND+BEND): Request the external firewall openings.
    • FEND (but not for FEND+BEND): Make sure the machine state in roger is NOT production: roger update --appstate disabled. Then add the host to the DNS alias as defined in the frontend.pp puppet configuration. Log in to the machine and run sudo puppet agent -t to flush the new configuration. Please note the machine state must not be production, otherwise it will join the DNS alias and start receiving requests!
    • BEND, FEND+BEND, IMAGE, BUILD: set the machine state to production in roger: roger update --appstate production vocmsNNN
    • BEND: Add the metrics/alarms to the host in backend.pp
    • BEND: If it is going to run DQMGUI or need to use stager_* castor tools for any other reason, make sure the castor client ports are open for it in the firewall.pp.
  10. Host monitoring: add a plot to the host under the monitoring page.

  11. Run full machine validation.

Verifying basic system correctness

Run through the following check list to verify basic system correctness once the server has been reinstalled.
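
The original check list is not reproduced here. As a minimal illustrative sketch (not the official list), the following quick checks catch the most common problems; adjust them to the node in question:

  hostname -f                      # correct host name and domain
  df -h                            # expected partitions are mounted with enough free space
  ping -c 1 cmsweb.cern.ch         # network and DNS work
  sudo puppet agent -t             # puppet converges without errors
  # on aiadm: roger show vocmsNNN  # appstate and alarm mask are as intended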

Updating Deployment scripts

When adding a new machine to the cluster, there are a number of deployment scripts that need to be updated.

First, check the list of service instances under the activity page to understand what really needs to be enabled and where. If you are substituting a machine, please note that some cronjobs must run as one and only one instance, so make sure you only make the change once the previous instance has been disabled.

Any server flavor:

Front-ends:

Back-ends:

General-purpose back-ends: you may need to add the host to:

Purpose-specific back-ends: workqueue/monitoring.ini and das/manage need to be updated.

DQMGUI back-ends: apart from dqmgui/deploy, you will definitely, and very carefully, need to update the dqmgui/manage script and the various dqmgui/monitoring*ini files. Also ask the DQM team to review/double-check that the deploy and manage scripts are correct before you deploy the server with that configuration.

If possible, double-check that no other files under deployment need to be updated. Also double-check these configurations match what has been declared in the activity page. You can grep for vocms in all the deploy/manage/monitoring scripts and service config files:

cd deployment
for f in $(grep vocms */* 2>/dev/null| cut -d: -f1|sort -u); do vim $f; done

Activating node

Once all the deployment script changes for enabling services on the new node are done, install the latest stable (production) release on it through the image server as usual.

For front-ends, removing the /etc/iss.nologin file makes the machine join the DNS alias and start receiving user requests. Only do that once you’re absolutely sure the machine is ready to receive requests. Also double-check the host has been added to the update-keys cronjob of the master of that front-end group. If it is the master itself, make sure no other front-end in that group is still running this cron.
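
For example (the log path is the same one used in the deactivation instructions below):

  sudo rm /etc/iss.nologin                  # lets the node join the DNS alias within a few minutes
  tail -f /data/srv/logs/frontend/*.txt     # watch user requests start arriving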

For back-ends, it is enough to add the node to the back-ends map file in the cmsweb front-ends. Some service instances, however, must run on one, and only one, server at a time. This is the case for synchronization threads or cronjobs. Before activating such an instance, make sure the old one has been disabled first. For instance, if you mistakenly leave the workqueue cronjobs running on two servers at the same time, that would cause a lot of problems for the workflow team!

Finally, don’t forget to enable the alarms for the node.
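
For example, on aiadm:

  roger update --all_alarms true vocmsNNNN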

Updating the documentation

Once the new machine is in production, you should update at least the following documentation pages:

Decommissioning a node

Deactivating a node temporarily

If you need to remove a server from the cluster, first disable all the alarms for it on aiadm with roger update --all_alarms false vocmsNNNN.

If the host is a front-end, then sudo touch /etc/iss.nologin on the node. This will automatically remove the machine from the DNS load-balance alias within the next ~five minutes. Then tail the /data/srv/logs/frontend/*.txt logs until the traffic quiets down. There are always clients which cache DNS query results for some time, so you may need to wait as long as 10-15 minutes for the server to quiet down.

If the host is a production or pre-production back-end, remove it from every front-end host’s mapping files /data/srv/current/config/frontend/backends-prod.txt and backends-preprod.txt. The change takes effect instantly upon saving the file, so check that you only save what you mean to save. Also tail some key back-end service log file to make sure incoming requests there have stopped. Note that depending on the back-end node you are deactivating temporarily, you might need to update the map file on all cmsweb front-ends, including testbed and cmsweb-dev.
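
As an illustrative check (using the production front-end host list from the commissioning section; adapt it to the front-ends actually in service), verify the node is no longer referenced anywhere and that traffic has stopped:

  # On the image server as cmsweb (it has ssh access to all nodes):
  for h in vocms0{13{4,5},16{0,2,4}}; do
    ssh cmsweb@$h "grep -H vocmsNNN /data/srv/current/config/frontend/backends-*.txt"
  done
  # On the back-end node itself, pick a key service log and watch it go quiet,
  # e.g. (SERVICE is a hypothetical placeholder for the service name):
  tail -f /data/srv/logs/SERVICE/*.txt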

Deactivating a node permanently

  1. Follow the deactivating a node temporarily instructions first.

  2. Double-check the alarms are disabled with roger show vocmsNNN on aiadm. If it is a front-end, double-check it is out of the DNS alias.

  3. (Front-ends) Remove the host from the frontend.pp and backend.pp puppet configuration. Log in to the machine and run sudo puppet agent -t to flush the new configuration. You need to do this even if you’re just reinstalling the server! Otherwise, once reinstalled, the host would automatically join the DNS alias.

  4. (Front-ends) Wait for the DNS load balancer to be updated. This should happen every hour, but open a SNOW ticket to CERN/IT dns-load-balancing in urgent cases. Then check it has disappeared from the load-balancing cmsweb log or testbed log.

  5. (Front-ends) If some other front-end is synchronizing keys to it, log in to that front-end and disable the /data/srv/current/config/frontend/update-keys /data/srv/state/frontend vocmsNNN cronjob. If update-keys is being run on the machine being decommissioned itself, disable it and make sure it is deployed on some other front-end so that the keys keep being updated among the remaining front-ends.

  6. Log in as cmsweb to the image server and stop the services.

    ADMIN=diego A=/data/cfg/admin
    HOST="vocmsNNN"  # set here the host name
    VARIANT="dev_or_preprod_or_ prod" # set here to "dev" or "(pre)prod" accordingly
    kinit
    cd /data
    $A/InstallProd -x $VARIANT -t "$HOST" -s state:maintenance -u $ADMIN
    $A/InstallProd -x $VARIANT -t "$HOST" -s shutdown
    $A/InstallProd -x $VARIANT -t "$HOST" -s checkprocs
    # if a front-end, check: $A/InstallProd -x $VARIANT -t "$HOST" -s dnswait:cmsweb
    
  7. Log into the image server as cmsweb, do kinit, and then remove the acron jobs for the host (acrontab -e). However, do not remove the host from the genstats acronjob if it is there.

  8. (Front-ends) Log into the machine itself as cmsweb and back up the logs.

    # Force archiving of logs first
    /data/srv/current/config/admin/LogArchive /data/srv
    
    # Force saving log archives to AFS
    kinit
    /data/srv/current/config/admin/LogSave /data/srv /afs/cern.ch/cms/cmsweb/log-archive/$(hostname -s)
    
    # Generate a backup, we don't want to lose logs upon machine reinstallation
    cp -rp /afs/cern.ch/cms/cmsweb/log-archive/$(hostname -s){,.backup}
    
    unlog
    kdestroy
    
  9. (Front-ends) Create a ticket for the HTTP Group to eventually stage out to castor the files under /afs/cern.ch/cms/cmsweb/log-archive/vocmsNNNN.backup. If the front-end is to be re-installed, be careful to first merge the backed-up files with any ongoing month’s logs generated after the machine re-installation. That is, once the month is over and can be archived, merge the log archives of both the old and the new server installation into a single archive before staging it out to castor.

  10. On the machine itself as cmsweb, disable puppet: sudo puppet agent --disable

  11. Remove the host from the SyncLogs and CmswebReport scripts. Also clean it up from the /home/cmsweb/.ssh/known_hosts file on the machine where those scripts are running (currently, vocms022).

  12. Log into the machine as cmsweb and nuke the installation. WARNING: you should NOT run this if you have an attached ceph volume you need to preserve.

    # BE CAREFUL!! MAKE SURE YOU TYPE THIS IN THE RIGHT WINDOW.
    # echo is for safety, remove when you are sure.
    
    # nuke the world.
    cd /data
    echo sudo rm -fr [^aceu]* .??* current enabled
    
    ls -laF /data
    
    # nuke the cmsweb account files
    echo sudo rm -fr /home/cmsweb/* /home/cmsweb/.??*
    ls -laF /home/cmsweb
    
    # nuke rc.d scripts
    echo sudo rm -f /etc/rc.d/*/*httpd
    
  13. Shred the data, if the data is not permanent: if the machine has a separate partition for /data, or has a volume mounted there, log in to it as root and use the following instructions to shred it. WARNING: you should NOT run this if you have an attached ceph volume you need to preserve (i.e. one that you would later attach to some other machine).

    # BE CAREFUL!! MAKE SURE YOU TYPE THIS IN THE RIGHT WINDOW.
    # echo is for safety, remove when you are sure.
    
    PARTITION=$(mount|egrep '/dev/[a-z0-9]* on /data type' | awk '{print $1}')
    umount $PARTITION
    echo shred -v -z -n 7 $PARTITION
    
  14. If the VM had an attached ceph volume, remove the mounting point from the cmsweb.yaml file in hiera. Then go to openstack and detach the volume from the VM being decommissioned. Finally, make sure the device (e.g. /dev/vdc) disappeared from the VM. In some cases you might need to reboot the machine.

  15. Shred the VM disks: go to openstack, select the VM and grab the console. Send a ctrl-alt-del and watch it rebooting. Hit Ctrl-B when that is offered as an option in order to boot through the network. At the prompt, type “autoboot” (enter). At the following prompt, simply hit enter. When a menu with various installation options (e.g. SLC5, SLC6, etc.) appears, select “rescue tools”. Finally, select “parted x86_64” in the menu that follows. Once the “parted” distribution has been loaded, open a terminal and shred the main partition of the machine:

    # BE CAREFUL!! MAKE SURE YOU TYPE THIS IN THE RIGHT WINDOW.
    # echo is for safety, remove when you are sure.
    echo shred -v -z -n 7 /dev/vda
        
    # If the VM had ephemeral storage, shred it too (MAKE SURE /dev/vdb is NOT the attached volume, if any!)
    echo shred -v -z -n 7 /dev/vdb
    
  16. (Physical machine) Change the machine hostgroup in foreman to the default vocms (spare template). Then reinstall the physical host. Once it is completed, make sure the machine was reinstalled and there are no more CMSWEB files there.

  17. Request de-registration from myproxy.

  18. Make a request to the configuration management team and ask for removal of the host from the MAIL REGISTERED_INTRANET set in LanDB. After the request has been handled, double-check it is no longer listed there. For puppet-managed machines, the host will be removed from the CMS WEBTOOLS set automatically after it has been reinstalled with the spare template and 24 hours have passed for the LanDB set synchronization to run.

  19. (Front-ends) Request the removal of the host from the DQM online reverse proxy by opening a ticket to online operations. After the request has been handled, check that you cannot access cmsdaq anymore:

    for url in online online-playback online-test gui-test ecal lumi; do
       echo "Accessing /dqm/$url/"
       curl -s -X GET http://cmsdaq.cern.ch/dqm/$url/
    done
    
  20. (Front-ends) Remove the firewall bypass in LanDB if it was requested specifically for the machine. You can find that by going to the machine record in LanDB. Puppet-managed front-ends get the firewall openings by joining the CMS WEBTOOLS FRONTENDS set in LanDB. Once the machine is reinstalled, it will be removed from the set automatically after 24 hours have passed.

  21. (Physical machine) Set the responsible and main user to VOBOX-RESPONSIBLE-CMS/E-GROUP/IT/PES, if that’s not yet the case.

  22. If it is an openstack VM, log in to aiadm and run the ai-kill-vm command. Otherwise, tell the VOC that the machine can be retired/handed back. In some special cases, you might need to submit a request to Service Consolidation in SNOW to have the machine properly removed.

Deactivating DNS load-balance

FIXME

Handing servers back to CERN/IT

FIXME: deactivate permanently, clear disks of everything sensitive (e.g. /var/log, /var/mail), and ask CERN/IT to retire the machine with full system wipe (there’s a standard procedure for it).

Requesting reinstallation of a physical machine

Before proceeding, make sure the node has been properly de-activated from the cluster. In particular, back-ends should be out of the back-ends map file. For front-ends, make sure they’ve been removed from the frontend.pp file in puppet, otherwise the machine would join the DNS alias automatically as soon as the /etc/iss.nologin file is gone. Also double-check there aren’t any incoming requests, and that the machine has all the alarms disabled.

The following instructions shall be used when (re)installing a puppet-managed physical machine. Please also check the official CERN/IT documentation for physical nodes for more options, like renaming a machine, or in case the following instructions don’t work.

In case of installation problems, the Managing a physical node documentation provides instructions on how to get the console and debug problems.

Requesting reinstallation of a VM

Before proceeding, make sure the node has been properly de-activated from the cluster. In particular, back-ends should be out of the back-ends map file. For front-ends, make sure they’ve been removed from the frontend.pp file in puppet, otherwise the machine would join the DNS alias automatically as soon as the /etc/iss.nologin file is gone. Also double-check there aren’t any incoming requests, and that the machine has all the alarms disabled.

The machine is then ready for (re)commissioning.

Expand the root partition with ephemeral storage
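
A minimal illustrative sketch only, assuming the ephemeral disk shows up as /dev/vdb and the root filesystem lives on the VolGroup00-LogVol00 logical volume (as in the tune2fs examples below); verify the device names with lsblk before running anything:

  lsblk                                    # confirm which device is the ephemeral disk
  sudo pvcreate /dev/vdb                   # initialise it for LVM
  sudo vgextend VolGroup00 /dev/vdb        # add it to the root volume group
  sudo lvextend -l +100%FREE /dev/mapper/VolGroup00-LogVol00
  sudo resize2fs /dev/mapper/VolGroup00-LogVol00   # grow the ext4 filesystem online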

Tuning I/O for VMs

Scientific Linux changes the default I/O scheduler to deadline in order to maximize I/O throughput, but to the detriment of interactivity. This is fine for purpose-specific, dedicated machines running I/O performance-critical tasks, but it kills any I/O fairness effort. When running multiple services on the same machine, or whenever you have a background I/O-intensive cronjob to run, the deadline scheduler becomes inappropriate. Reasonably recent kernels/distributions have already switched the default I/O scheduler to CFQ. To check which one is in use:

  $ cat /sys/block/vd*/queue/scheduler 
  noop anticipatory [deadline] cfq 
  noop anticipatory [deadline] cfq

Then change it back to CFQ by setting the ktune symbolic link:

  $ sudo ln -f -s /etc/tune-profiles/default/ktune.sysconfig /etc/sysconfig/ktune

On next machine reboot, cat /sys/block/vd*/queue/scheduler shall show [cfq] instead.
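
If you do not want to wait for a reboot, the scheduler can also be switched at runtime; this does not persist across reboots, so the ktune change above is still needed:

  for q in /sys/block/vd*/queue/scheduler; do
    echo cfq | sudo tee $q       # [cfq] should now appear when reading the file back
  done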

Changing the I/O scheduler alone is not enough, however, to avoid long server I/O hangs during which you cannot even log in to it. The default ext4 journaling mode, ordered, does not commit any metadata until the corresponding data has been written by the application to the disk. That means a second application or process needs to wait for long journal operations to finish before its simple writes get processed. Its I/O priority does not matter, because in the end everything must pass through (wait for) the journal, and the journal is a single I/O priority shared by everybody.

To avoid this painful problem, we switch the journal to writeback mode:

  $ sudo tune2fs -l /dev/mapper/VolGroup00-LogVol00
  $ sudo tune2fs -o journal_data_writeback /dev/mapper/VolGroup00-LogVol00

Then reboot the server.

In “writeback” journal mode, only the metadata is journaled, guaranteeing filesystem consistency in the event of a crash. It does not guarantee that the data of very recently written files is correct, though. Considering that we don’t store anything critical on the local disk used by the VMs (anything requiring long-term persistence is either on physical machines or in mounted Ceph volumes), using writeback is a good compromise between performance and fault tolerance. Perhaps the only important data are the front-end logs, but losing the last few seconds’ worth of requests would not be a big deal. Also, these VMs are on critical power anyway, meaning they should not go down even in the case of a long power cut. Therefore, the risk is very low. In fact, we could even disable the journal entirely; the only problem is that fsck at boot would occasionally take a very long time to complete, possibly rendering an intervention unacceptably long.
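
After the reboot, you can verify the setting took effect; the “Default mount options” line reported by tune2fs should now include journal_data_writeback:

  sudo tune2fs -l /dev/mapper/VolGroup00-LogVol00 | grep -i 'mount options'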

Creating a ceph volume and attaching to a VM

For CC7 VMs:

For SLC6 VMs:

Login as yourself to aiadm, source the corresponding project openrc.sh file (e.g. CMS Webtools Critical-openrc.sh), then run the following commands:

$ nova list
+--------------------------------------+-----------+--------+------------+-------------+-------------------------------------------------------+
| ID                                   | Name      | Status | Task State | Power State | Networks                                              |
+--------------------------------------+-----------+--------+------------+-------------+-------------------------------------------------------+
| b66d8f51-4661-4c39-8712-2b4a9e24c6ff | vocms0128 | ACTIVE | -          | Running     | CERN_NETWORK=188.184.66.6, 2001:1458:201:a4::100:200  |
| e1e3ab6c-a375-4b65-a0f7-2368db47811e | vocms0136 | ACTIVE | -          | Running     | CERN_NETWORK=188.184.65.71, 2001:1458:201:a4::100:141 |
| e3a65b03-da1a-415f-9ab0-d6e0de9d65db | vocms0140 | ACTIVE | -          | Running     | CERN_NETWORK=188.184.64.223, 2001:1458:201:a4::100:d9 |
| c55b0f5e-6456-4a4d-9cd0-915cd8716b6e | vocms0162 | ACTIVE | -          | Running     | CERN_NETWORK=188.184.65.18, 2001:1458:201:a4::100:10c |
| a48eaddd-e9ea-4bbf-9e1d-9f5edacc9154 | vocms034  | ACTIVE | -          | Running     | CERN_NETWORK=188.184.64.204, 2001:1458:201:a4::100:c6 |
+--------------------------------------+-----------+--------+------------+-------------+-------------------------------------------------------+
$ cinder list
+----+--------+--------------+------+-------------+----------+-------------+
| ID | Status | Display Name | Size | Volume Type | Bootable | Attached to |
+----+--------+--------------+------+-------------+----------+-------------+
+----+--------+--------------+------+-------------+----------+-------------+
# use "--volume_type standard" below if critical power is not needed
$ cinder create --volume_type cp1 --display_name cmswebvolNN 4000
+---------------------+--------------------------------------+
|       Property      |                Value                 |
+---------------------+--------------------------------------+
|     attachments     |                  []                  |
|  availability_zone  |                 nova                 |
|       bootable      |                false                 |
|      created_at     |      2014-10-28T15:52:38.361368      |
| display_description |                 None                 |
|     display_name    |             cmswebvolNN              |
|      encrypted      |                False                 |
|          id         | 0dc8860d-57be-4e7b-bd6f-2f1d56b2d800 |
|       metadata      |                  {}                  |
|         size        |                 4000                 |
|     snapshot_id     |                 None                 |
|     source_volid    |                 None                 |
|        status       |               creating               |
|     volume_type     |               cp1                    |
+---------------------+--------------------------------------+
$ nova volume-attach vocmsNNN 0dc8860d-57be-4e7b-bd6f-2f1d56b2d800 auto
+----------+--------------------------------------+
| Property | Value                                |
+----------+--------------------------------------+
| device   | /dev/vdc                             |
| id       | 0dc8860d-57be-4e7b-bd6f-2f1d56b2d800 |
| serverId | b66d8f51-4661-4c39-8712-2b4a9e24c6ff |
| volumeId | 0dc8860d-57be-4e7b-bd6f-2f1d56b2d800 |
+----------+--------------------------------------+

Login as cmsweb to the node where the volume was attached and format it with the following commands. Note that the current SLC6 version of e2fsprogs does not support extN for volumes bigger than 16 TB, and other system tools may not be prepared for that either. Therefore, use XFS for volumes larger than that. You also need to add the xfsprogs package for the corresponding host in the rpms.pp file.

# For volumes < 16TB in size, you can use ext4
$ sudo mkfs -t ext4 -L cmswebvolNN /dev/vdX
mke2fs 1.41.12 (17-May-2010)
Filesystem label=cmswebvol01
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
262144000 inodes, 1048576000 blocks
52428800 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
32000 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
	102400000, 214990848, 512000000, 550731776, 644972544

Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 28 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

# For volumes > 16TB, use xfs.
$ sudo mkfs -t xfs -L cmswebvolNN /dev/vdc
meta-data=/dev/vdc               isize=256    agcount=30, agsize=268435455 blks
	 =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=7864320000, imaxpct=5
	 =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=521728, version=2
	 =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Mount it by hand under /data, force puppet to create the directory structure, and check everything is fine:

$ sudo mount -L cmswebvolNN /data
$ sudo puppet agent -t
$ df -h /data
# do any other checks you might find needed
$ sudo umount /data

Include an entry for it in the cmsweb::volumes configuration in cmsweb hiera. If not specified, the default fstype is ext4.

    cmsweb::volumes:
        vocmsNNN: {voldev: "/dev/vdX", where: "/data", fstype: "xfs"}

Then re-run sudo puppet agent -t to make sure the puppet configuration is ok.

Finally, reboot the machine and double-check the volume gets automatically mounted on /data.

If the volume requires special IOPS/bandwidth settings, open a request to the cloud infrastructure and ask for it. The default volume configuration is 100 IOPS and 80 MB/s, which is not enough for CMSWEB critical services in production. For instance, DQMGUI volumes use 300 IOPS and 110 MB/s bandwidth. See this request as an example.

Setting up DNS load-balance in puppet

The puppet managed LBD alias log is here for cmsweb and here for cmsweb-testbed.

Puppet-managed front-ends leave or join the LBD alias depending on whether the /etc/iss.nologin file is present on the machine or not, but it may take a few minutes until the LBD is updated once this file is created or removed. Another switch is the appstate of the machine in roger: it must be production, otherwise the machine will not join the alias. Check it with roger show vocmsNNN on aiadm.

The official CERN/IT documentation for LB DNS is here. From there you can find the lbweb page for creating new aliases.

Front-end flavour machines in cmsweb join (or try to join) LBD aliases by default. Their configuration is done in the frontend.pp file.
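
To see which nodes are currently behind an alias, you can simply query the DNS from any machine, for example:

  host cmsweb.cern.ch             # lists the addresses currently served for the production alias
  host cmsweb-testbed.cern.ch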

Creating a host alias

To create a simple host alias for a physical machine, go to LanDB and set it.

For an Openstack VM, you do that from aiadm:

$ nova list
+--------------------------------------+---------------+--------+------------+-------------+-------------------------------------------------------+
| ID                                   | Name          | Status | Task State | Power State | Networks                                              |
+--------------------------------------+---------------+--------+------------+-------------+-------------------------------------------------------+
| d62bea4f-3c56-412d-b120-d07de13ac491 | vocms0127     | ACTIVE | -          | Running     | CERN_NETWORK=188.184.44.101, 2001:1458:201:8f::100:57 |
...

# Add a new alias
$ nova meta d62bea4f-3c56-412d-b120-d07de13ac491 set landb-alias="cmsweb-dev" 

# Remove all host aliases
$ nova meta d62bea4f-3c56-412d-b120-d07de13ac491 delete landb-alias

In both cases, the change may take several hours to propagate.
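
To verify the metadata was set, and later that the alias has propagated, you can for example run:

  nova show d62bea4f-3c56-412d-b120-d07de13ac491 | grep landb-alias
  host cmsweb-dev.cern.ch         # once propagated, resolves to the VM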

Preparing the machine for the CERN/IT webscan

Front-ends can be prepared for the CERN/IT webscan using the following setup. It assumes there is a spare back-end machine for deploying the services.

  1. Login to the image server as cmsweb and get a copy of the deployment script.

  2. Disable the cronjob that syncs keys to/from the other (pre)production front-ends. That is, the FHOST below should not synchronize keys with any other existing front-end.

  3. Deploy from the image server:

    VER=YYMMx REPO="-r comp=comp.pre" A=/data/cfg/admin
    AUTH=/afs/cern.ch/user/c/cmsweb/private/conf
    
    FHOST="vocmsYYY"  # set here the front-end host name
    BHOST="vocmsZZZ"  # set here the spare back-end host name
    FPKGS="admin frontend/dev"
    BPKGS="admin couchdb bigcouch das dbs/dev dqmgui"
    BPKGS="$BPKGS mongodb phedex sitedb"
    BPKGS="$BPKGS reqmgr/dev workqueue/dev reqmon alertscollector"
    BPKGS="$BPKGS crabserver/dev crabcache dmwmmon asyncstageout"
    BPKGS="$BPKGS t0wmadatasvc dbsmigration/dev t0_reqmon"
    BPKGS="$BPKGS acdcserver/dev reqmgr2/dev gitweb"
    ADMIN=diego
    
    cd /data
    kinit
    $A/InstallProd -x dev -R comp@$VER -s image -v fe$VER -a $AUTH $REPO -p "$FPKGS"
    $A/InstallProd -x dev -R comp@$VER -s image -v be$VER -a $AUTH $REPO -p "$BPKGS" -u $ADMIN
    $A/InstallProd -x dev -s kstart
    # Check the machines are in maintenance mode 
    $A/InstallProd -x dev -t "$FHOST $BHOST" -s shutdown
    $A/InstallProd -x dev -t "$FHOST $BHOST" -s checkprocs
    $A/InstallProd -x dev -t "$FHOST" -s sync -v fe$VER
    $A/InstallProd -x dev -t "$BHOST" -s sync -v be$VER
    $A/InstallProd -x dev -t "$FHOST" -s post -v fe$VER -p "$FPKGS"
    ssh $FHOST /data/srv/current/config/frontend/update-keys /data/srv/state/frontend
    # Disable the update-keys cronjob on FHOST and make sure it is
    # disabled on the other front-end as well
    ssh $FHOST sudo -H -u _sw bashs -lc \'perl -p -i -e \"s/localhost/$BHOST.cern.ch/g\" /data/srv/current/config/frontend/backends-dev.txt\'
    # Also edit /data/srv/current/config/frontend/backends-dev.txt on 
    # FHOST and point other hostnames, in case of any, to nowhere
    # (i.e vocms138nowhere.cern.ch)
    $A/InstallProd -x dev -t "$BHOST" -s post -v be$VER -p "$BPKGS"
    $A/InstallProd -x dev -t "$FHOST $BHOST" -s reboot
    $A/InstallProd -x dev -t "$FHOST $BHOST" -s status
    $A/InstallProd -x dev -s kstop
    
  4. The proxy renewal might fail on the chosen back-end if the host does not yet have permissions in myproxy. Make sure you have a valid proxy when CERN/IT runs the scan.

  5. Test the services by accessing the new front-end directly at https://vocmsYYY.cern.ch/ (see the quick check after this list).

  6. Stop puppet and iptables.

    sudo puppet agent --disable
    sudo /sbin/service iptables stop
    sudo /sbin/service iptables status
    
  7. Ask IT to run the scan.
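
As a quick version of the direct test in step 5 above, you can check that the front-end answers at all (the exact response code depends on the service and on authentication; the point is simply that the host responds):

  curl -sk -o /dev/null -w '%{http_code}\n' https://vocmsYYY.cern.ch/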

Requesting external firewall bypass

Production and pre-production front-end nodes need an external firewall bypass for ports 80 (http) and 443 (https). The bypass request should be created by somebody in the VOBox-VOC-CMS e-group. To make the firewall bypass request, the person in question needs to be marked as the owner of the servers in LanDB.

The request process should be started as soon as the new server’s network identity is known. However, the security scans required for the bypass can only be completed after the server is fully deployed. Obviously this must happen before the node can be added to the DNS alias, so schedule the steps carefully.

To deploy the services, we use slightly different instructions in order to lower the risk of disrupting production data. Please follow the preparing the machine for the CERN/IT webscan instructions before requesting the firewall openings.

It sometimes happens that the security scan identifies new things to fix in the servers. This invariably requires updating the front-end server configuration, so if this happens get the HTTP group responsible to fix the problems and roll out an update. It is best to look for these proactively, especially when planning server software updates, so the issues do not come up as a surprise during operation.

Development front-end and all back-end servers should be entirely behind the firewall, with no bypass exceptions at all.

As of today, the procedure to request the firewall opening is explained in detail below. However, we advise checking the official firewall request instructions first to make sure the procedure hasn’t changed.

To make the bypass request, fill in a request form in the machine’s LanDB configuration. Please note that due to a bug in LanDB, you might not be able to access the host update page, so use https://network.cern.ch/sc/fcgi/sc.fcgi?Action=FirewallRequest&InterfaceName=VOCMSNNNN to access the form directly (change VOCMSNNNN to the machine name accordingly).

When filling the form, put the following:

Once submitted, the request will be initiated. Provided you have already prepared the machine for the security scan, when asked whether the firewall has been stopped you have to send a message to Computer.Security and confirm that is the case. This creates a SNOW ticket to follow up in case of any critical security issues.

Requesting myproxy renewal access

Make a request to the myproxy service to include the DNs of all front-end and back-end server host certificates in the myproxy.cern.ch trusted_retrievers, authorized_retrievers and authorized_renewers policies.

Subject: myproxy registration request for HOSTNAMEA HOSTNAMEB

Could you please add the following host certificates to myproxy.cern.ch
trusted_retrievers, authorized_retrievers, authorized_renewers policies?
This is a production server for CMS web services and requires use of
grid proxy certificates.

/DC=ch/DC=cern/OU=computers/CN=HOSTNAMEA.cern.ch
/DC=ch/DC=cern/OU=computers/CN=HOSTNAMEB.cern.ch

Would you please also remove the following DN's from old servers that
have been retired?

/DC=ch/DC=cern/OU=computers/CN=HOSTNAMEY.cern.ch
/DC=ch/DC=cern/OU=computers/CN=HOSTNAMEZ.cern.ch

After the request has been handled, check the list of servers on myproxy profile.
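
To obtain the DNs to put in such a request, you can read them off the host certificates, assuming they are in the usual grid location on each host:

  sudo openssl x509 -noout -subject -in /etc/grid-security/hostcert.pem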

Requesting inclusion in cernmx trusted machines

Send a request to service-desk@cern.ch to include all cmsweb machines in the MAIL REGISTERED_INTRANET set in LanDB. This frees the machines from the limitations on the maximum number of e-mails imposed by the cernmx policies. See this for details.

Subject: add trusted machines to the MAIL REGISTERED_INTRANET set in landb

Could you please add the following machines to the set of machines trusted
by the cernmx for anonymous smtp posting ?

vocms127 vocms13[23456] vocms13[89] vocms14[01] vocms16[012345]

According to https://espace.cern.ch/mmmservices-help/ManagingYourMailbox/Security/Pages/AnonymousPosting.aspx,
they need to be made members of the "MAIL REGISTERED_INTRANET" set
(https://network.cern.ch/sc/fcgi/sc.fcgi?Action=DisplaySet&setId=1906) in LanDB.

Requesting pass-through for online reverse proxy

When adding or removing front-end nodes, create a request in online operations to modify the allowed-hosts access list for <Location /dqm> in the cmsdaq.cern.ch reverse proxy rules. The ACL should permit the cmsweb.cern.ch front-end nodes (production, pre-production and development) by their host names, and no other hosts. Allow several days for the change to happen.

Project: CMS Online Networks and Systems
Issue Type: Task
Assignee: Automatic
Summary: Update list of cmsweb hosts for proxy pass at cmsdaq.cern.ch
Priority: Major
Components: Other
Other fields: leave blank or default value
Description:

Hi,

Could you please update to the following list the cmsweb hosts
which can access the reverse proxy on cmsdaq for the /dqm area?

vocmsAAA
vocmsBBB
vocmsZZZ

That is, add vocmsBBB, remove vocmsXXX and vocmsYYY.

The change should happen somewhere on cmsdaq.cern.ch apache httpd configuration
where you find statements like:
<Location /dqm>
    Order Deny,Allow
    Deny from all
    Allow from vocmsAAA.cern.ch vocmsXXX.cern.ch vocmsYYY.cern.ch vocmsZZZ.cern.ch
</Location>

See old requests CMSONS-6825 and CMSONS-7300 as references.

Regards,
Your Name for the CMS HTTP Group.

Debugging disk problems and changing failed disks in the big disk servers

Use the following instructions when suspecting disk problems on the big disk servers. All the commands should be run on the corresponding machine.

From the Dell mega raid cheat sheet, use the following to quickly find if there is a failed disk:

$ sudo /usr/sbin/megacli -AdpAllInfo -aALL
...
            Device Present
            ================
Virtual Drives    : 2 
  Degraded        : 2 
  Offline         : 0 
Physical Devices  : 9 
  Disks           : 8   <--- total physical disks
  Critical Disks  : 0 
  Failed Disks    : 1   <--- number of failed disks
...

To find which disk has failed use:

$ sudo /usr/sbin/megacli -PDList -aALL
...
Enclosure Device ID: 32
Slot Number: 5                                                  <--- disk slot number
Device Id: 5
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Raw Size: 1907729MB [0xe8e088b0 Sectors]
Non Coerced Size: 1907217MB [0xe8d088b0 Sectors]
Coerced Size: 1907200MB [0xe8d00000 Sectors]
Firmware state: Failed                                          <--- failed state
SAS Address(0): 0x1221000005000000
Connected Port Number: 5(path0) 
Inquiry Data: ATA     Hitachi HUA72202A25C      JK11D1YAKGDVNZ  <--- hard disk serial number

Open a ticket to the IT Procurement Service asking them to change the failed disk, providing the slot number and the serial number. If possible, request that they exchange the disk with the machine turned off, so that there is no RAID rebuild if they pull out the wrong disk. See this ticket as a request example. Note that “Slot 5” may mean this is the 6th disk, since slot numbers start from zero.

Since the servers use a RAID6 configuration, we would still survive if the wrong disk were pulled out, but not if it is done wrong twice in a row. It is not rare for these mistakes to happen, and we use RAID6 exactly because of that. It has happened various times that the wrong disk was changed, so all such interventions are delicate and must be properly phrased in the ticket opened to the procurement team.

After the disk has been changed, run the last command above to find the new Firmware state. It should change from Failed to Offline to Rebuild. After the rebuild completes, its status will finally change to Online like the other disks.
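
While the rebuild is ongoing, you can follow its progress with a command along these lines (adjust the enclosure:slot pair, e.g. 32:5 from the listing above, to the disk that was replaced):

$ sudo /usr/sbin/megacli -PDRbld -ShowProg -PhysDrv [32:5] -aALL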

Finally, double-check all the services are running correctly as expected.

FIXME: we need updated instructions on how to do that for physical machines in the new infrastructure.

For virtual machines, grab the console directly through the openstack web interface.

Once connected to the console, use it to boot into single user mode, for example to repair a system partition after a hard disk problem. To edit the boot time parameters, use the following.

The init=/bin/bash parameter tells grub to enter a shell directly without asking for the “root password or Ctrl-D to reboot”. Alternatively, when using single fastboot, the single tells it to go into single user mode without asking for the root password, while fastboot makes it skip running fsck. The main difference is that the former brings you to a bare shell, without any partitions mounted, while the latter brings you to a system state much closer to the production state. Usually single fastboot is preferable, but depending on the problem it may not succeed in booting.

If all you need is to run fsck interactively on some partition, boot with single fastboot, run fsck manually on the offending partition, and finally request the machine to reboot with /sbin/shutdown -r -n now.
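
As a minimal sketch, reusing the device-detection trick from the shredding instructions above and assuming the offending partition is the one mounted on /data:

  PARTITION=$(mount | egrep '/dev/[a-z0-9]* on /data type' | awk '{print $1}')
  umount $PARTITION        # never fsck a mounted filesystem
  fsck -y $PARTITION
  /sbin/shutdown -r -n now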

Deploying services

Adding a new service

At least the following steps are known to be needed:

  1. Allocate a port range and update the list here (this can be skipped if the service doesn’t open any ports, like the couch applications).
  2. Add the group/user accounts for the new service in the deploy scripts: under system/deploy;
  3. Add any secrets it needs to /afs/cern.ch/user/c/cmsweb/private/conf/newservice, if any. Also make sure the file/directory permissions are correct.
  4. Add deploy, manage, configuration and monitoring scripts for the new service.
  5. Add the frontend redirection rules for the new service. That means creating app_newservice_{no,}ssl.conf files in the frontend area.
  6. Some services may require special rules in the backends-*.txt files under the frontend area. For instance, the couch apps and services not running on the general purpose back-ends.
  7. Register the lemon metrics/exceptions and the corresponding alarming instructions.
  8. Update the audit rules in the {pre,}prod/customization/cms/webtools/backend/audit template. Some special treatment for matching audit entries may be needed in the cmsweb-audit script too.
  9. Update the cmsweb-analytics regexps to include the new service. See this commit as an example.

Removing a service

At least the following steps are known to be needed:

  1. Remove any state files from the cmsweb machines;
  2. Set the state of the metrics/exceptions in the metrics manager to “Draft” so that they can later be reused by another cmsweb service (it is not possible to remove them).
  3. Do not remove the service from the cmsweb-analytics regexps as it is needed to reprocess old logs.
  4. Check the Adding a new service section to find the places where the service was registered and remove it from there.

Updating the alarming instructions

The following steps are needed in order to have the alarming instructions properly in place to be used by the CC operators in reaction to lemon alarms. They may be used as well by our VOC or by the CRCs.

  1. Update the alarms page on the HTTP group documentation;
  2. Go to the service-portal.cern.ch, sign-in, search for “CMSWEB Service Procedures”, then click on the corresponding KB article, which is currently KB0002246;
  3. Click “View in tool”.
  4. If you are going to make small changes (e.g. fixing a typo), make them directly and then click “Update”.
  5. Otherwise, click “Review”, make the changes to the document, and then click “Submit for Functional Review”. When pasting the changes from the alarms page, please note that due to a bug in service-now you must change any occurrences of <pre><code> ... </code></pre> to <code><pre> ... </pre></code>;
  6. People from the “CMS VOBox Support” Functional Element will receive a notification to approve it. They shall do it by clicking on “Publish” or “Re-publish”.
  7. Once approved, open KB0002246 and double check things are correct.

When creating the alarming instructions for the first time, you need to submit the KB for review and make it available for the CC operators. The KB0002101 describes how to do that. See SUB14717 as a reference.

Finally, to test the updated instructions, force the generation of a lemon alarm by stopping the corresponding service(s) on cmsweb-dev. After about half an hour or so, the CC operators should come in and take the needed action following the just-updated instructions. You may want to trigger this test on the following day so that the instructions get propagated properly (e.g. to OPM backup pages).

In case something goes wrong and CC operators claim there is no corresponding instruction to the alarm, ask Fabio Trevisani Fabio.Trevisani@cern.ch for help.

Registering lemon metrics and exceptions

The registration of new lemon metrics/exceptions has been painful and error-prone. Be careful when doing the following steps so that this gets done in the minimum amount of time and with no disastrous consequences.

  1. Go to the metrics manager, log in, and click on Metrics in the top menu. In the search box, type cmsweb. Check whether you can reuse some of the existing metrics/exceptions from a retired service. Reusing is preferable because it keeps the IDs in the ID range previously assigned to us, therefore making it easier to match them in scripts/documentation. The retired service metrics are usually put in the “Draft” state, so filter by that state to see if some are available. To get the list of existing exceptions as well, click Notifications in the top menu;
  2. Edit the metrics/exceptions or add new ones. Take note of their IDs and names. These metrics will only be visible to lemon servers once their state has been changed to “Pending” and eventually “Production”. In any case, we test them first.
  3. In the machine-specific yaml configuration in hiera, define the metric/exception you’d like to test:

      # example for a metric
      lemon_13170:
          metricname: cmsweb_frontend_process_count
          sensor: linux
          metricclass: system.processCount
          timing: 60
          params:
              cmdregex: "bin/httpd -f .*/frontend/server.conf"
              uid: "_frontend"
    
      # example for an exception
      lemon_33576:
          metricname: exception.cmsweb_data_full
          sensor: exception
          metricclass: alarm.exception
          timing: 60
          params:
              Correlation: "(((9104:1 eq '/data') || (9104:1 eq '/')) && (9104:5 > 16))"
              MinOccurs: 2
    
      # Lemon notifications to cms-http-group
      lemon_all:
          notifications:
             all:
                tags:
                   egroup_name: "cms-http-group@cern.ch"
    
  4. On the manifest of that machine or its hostgroup, enable the metrics/exceptions you created:

     # Testing metrics/exceptions
     if $hostname == "vocmsNNNN" {
        lemon::metric{'33576':; '13170':;}
     }
    
  5. Push the puppet changes and force puppet to run on the corresponding machine with sudo puppet agent -t. Check the metric/exception appears in /etc/lemon/agent/metrics/.

  6. After a few minutes, check it is reporting correctly locally by running lemon-cli -n vocmsNNNN -m MMMMM -l, where MMMMM is the ID of the metric. I.e. lemon-cli -m "33576" --start 1417185457 -l.
  7. Make sure the alarms are enabled for the machine in roger: roger update --all_alarms true vocmsNNNN.
  8. If you’re creating a new alarm, update alarm instructions.
  9. Force the alarming condition. In the above example, it would be enough to create a file that makes the /data partition usage go above 16%. Test everything is fine.
  10. Change the state of the metrics/exceptions on the metrics manager to “Pending”;
  11. Once approved by the lemon team, they should change the state of the metrics/exceptions to “Production” in the metrics manager. Double-check that’s the case. This step can take ~2 working days to receive action from CERN/IT. As of today, you will not receive any notification when it is done, so you have to check it yourself.
  12. When ready, remove the yaml configuration you set for the machine, but keep the one to include it in the manifest. Force puppet to run and check the metric is still reporting properly.
  13. On any backend node where you activated them, double-check all the metrics and exceptions are reporting centrally with lemon-cli -n vocmsNN -m NNNN -s. Note the -s option.
  14. Finally, test the generation of an alarm by forcing the alarming condition (i.e. stopping the related service on the host where the exception is enabled), and wait to see if the alarm gets triggered to CC operators properly.
  15. If all is fine, take note on the to-do to activate the exceptions on production machines during the production upgrade.

In case of problems receiving the alarms, open a ticket to the “Monitoring Tools” Functional Element, “Monitoring tools 2nd Line Support” group.

Deploying Analytics

The analytics runs an acronjob (every six hours, see below) to update the stats shown on the analytics page. To deploy it, log in as cmsweb to vocms22 and follow the instructions below. Note you must change the version “2.1-comp” everywhere to match the one you are deploying.

(. /afs/cern.ch/cms/cmsweb/sw/slc5_amd64_gcc461/external/apt/*/etc/profile.d/init.sh
 apt-get update
 # use 'apt-cache search cmsweb-analytics' to find the latest version
 apt-get install cms+cmsweb-analytics+2.1-comp)

# Check it works
/afs/cern.ch/cms/cmsweb/sw/slc5_amd64_gcc461/cms/cmsweb-analytics/2.1-comp/bin/genstats \
   /afs/cern.ch/cms/cmsweb/analytics/geodata /afs/cern.ch/cms/cmsweb/analytics/stats \
   /afs/cern.ch/cms/cmsweb/log-archive/vocms{42,43,65,104,105,160,162}

Finally, set the acronjob:

37 */6 * * * vocms22 /afs/cern.ch/cms/cmsweb/sw/slc5_amd64_gcc461/cms/cmsweb-analytics/2.1-comp/bin/genstats /afs/cern.ch/cms/cmsweb/analytics/geodata /afs/cern.ch/cms/cmsweb/analytics/stats /afs/cern.ch/cms/cmsweb/log-archive/vocms{42,43,65,104,105,160,162}

Deploying cmssls Script

(git clone git://github.com/dmwm/deployment.git cfg && cd cfg && git reset --hard HG180xy)
(VER=HG180xy REPO="-r comp=comp" A=/home/cmsweb/cfg/admin; PKGS="cmssls"; cd /home/cmsweb/; $A/InstallDev -R comp@$VER -s image -v $VER $REPO -p "$PKGS" -S)

Deploying the UpdateApiDocs script

Drop the deployment/admin/UpdateApiDocs script into /afs/cern.ch/cms/cmsweb/update-api-docs/ and set the acronjob:

 0 7,19 * * * vocms100 /afs/cern.ch/cms/cmsweb/update-api-docs/UpdateApiDocs

This script deploys the *-webdoc RPMs of the cmsweb services under /afs/cern.ch/cms/cmsweb/sw/ and updates the links to them in /afs/cern.ch/user/c/cmsweb/www/apidoc. Such docs are then browsable under the apidoc web area. The web area dir must be prepared with the following /afs/cern.ch/user/c/cmsweb/www/apidoc/.htaccess file:

 # Disable authentication as it hangs for seconds to minutes when directory listing is enabled. See INC0841474.
 ShibRequireAll Off
 ShibRequireSession Off
 ShibExportAssertion Off
 Satisfy Any
 Allow from all
 Options +Indexes

If the acronjob fails with error messages like the one below, you need to clean up the lock files left around because this apt repository sits in AFS. The apt tool relies on filesystem locking semantics for its DB locking that are not available in AFS.

 The cron command
      /afs/cern.ch/cms/cmsweb/update-api-docs/UpdateApiDocs
 produced the following output:
 E: Could not get lock /afs/cern.ch/cms/cmsweb/sw/slc5_amd64_gcc434/var/lib/apt/lists/lock - open (11 Resource temporarily unavailable)
 E: Unable to lock the list directory

You can clean up the locks by doing:

 killall -9 apt-get
 ls -laF /afs/cern.ch/cms/cmsweb/sw/slc5_amd64_gcc*/var/lib/rpm/
 fs flush /afs/cern.ch/cms/cmsweb/sw/slc5_amd64_gcc*/var/lib/rpm/
 rm /afs/cern.ch/cms/cmsweb/sw/slc5_amd64_gcc*/var/lib/rpm/.__afs*
 rm /afs/cern.ch/cms/cmsweb/sw/slc5_amd64_gcc*/var/lib/rpm/__db.*

Deploying the SyncLogs script

The SyncLogs script is typically deployed on a node with reasonable I/O performance, normally a physical machine. Access to this machine should be restricted to members of the cms-comp-devs e-group, which is the case for the COMP build machines like vocms22.

Login to the machine (i.e. vocms22) as cmsweb and do the following:

 kinit
 cp /afs/cern.ch/user/c/cmsweb/private/admin/id_dsa_frontend /home/cmsweb/.ssh/id_dsa
 cp /afs/cern.ch/user/c/cmsweb/private/admin/id_dsa_frontend.pub /home/cmsweb/.ssh/id_dsa.pub
 cp /afs/cern.ch/user/c/cmsweb/private/admin/id_dsa_backend /home/cmsweb/.ssh/
 cp /afs/cern.ch/user/c/cmsweb/private/admin/id_dsa_backend.pub /home/cmsweb/.ssh/
 chmod 600 /home/cmsweb/.ssh/id_dsa*
 kdestroy; unlog

Clone the dmwm/deployment repository, copy the admin/SyncLogs script to /home/cmsweb/SyncLogs, and set the following cronjob (not acronjob):

 */15 * * * * /home/cmsweb/SyncLogs 2> /dev/null

After the script has run, check the files are correct under /build/srv-logs.
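
For reference, the clone-and-copy step above might look like the following sketch (the temporary clone location is arbitrary; the repository URL is the one used elsewhere in this document):

 (git clone git://github.com/dmwm/deployment.git /tmp/cfg-synclogs &&
  cp /tmp/cfg-synclogs/admin/SyncLogs /home/cmsweb/SyncLogs &&
  chmod +x /home/cmsweb/SyncLogs &&
  rm -rf /tmp/cfg-synclogs)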

Deploying the report scripts

Copy the admin/{Log,Cmsweb,Release}Report and admin/SyncBranches scripts to /home/cmsweb/.

Clone cmsdist:

 cd /home/cmsweb
 git clone https://github.com/cms-sw/cmsdist

Then set the following cronjobs:

 0 5 * * 1 /home/cmsweb/CmswebReport prod | mail -s "CMSWEB report on $(date)" cms-service-webtools@cern.ch
 0 4 * * 1 /home/cmsweb/CmswebReport preprod | mail -s "CMSWEB report (testbed and dev) on $(date)" cms-http-group@cern.ch
 0 3 * * 1 cd /home/cmsweb/cmsdist && git fetch -q && /home/cmsweb/ReleaseReport -s comp | mail -s "HTTP Group release report on $(date)" cms-http-group@cern.ch

Deploying the Build Agents

Login as dmwmbld to the build machines of each supported operating system (currently SL5, SL6 and OSX) and deploy the dmwmbld agent with the following instructions. On OSX the build area is normally under /build1 (not /build), so adapt this path in the commands accordingly.

Check if the agent was previously installed, disable it and wait for any running builds to finish:

 crontab -l; acrontab -l
 # remove the (a)cronjobs for the machine in question
 ps axuwwwf | grep dmwmbld  # wait for builds to finish, if any

If re-deploying the same build-agent version, remove the previous release (note it does not remove the state nor the logs):

 rm -rf /build/dmwmbld/srv/HGXXYYz

Deploy the agent using the correct OS variant:

 cd /build/dmwmbld
 kinit
 # Use PKGS="dmwmbld/sl6" or PKGS="dmwmbld/osx" instead if deploying, respectively, on SLC6 or OSX
 (git clone git://github.com/dmwm/deployment.git cfg && cd cfg && git reset --hard HGXXYYz)
 (VER=HGXXYYz REPO="-r comp=comp.pre" A=/build/dmwmbld/cfg/admin;  AUTH=/afs/cern.ch/user/d/dmwmbld/private/conf; PKGS="dmwmbld/sl5"; cd /build/dmwmbld;  $A/InstallDev -R comp@$VER -s image -v $VER -a $AUTH $REPO -p "$PKGS" -S)

Pay attention to any NOTE lines. They remind you about other (one-time) actions to be taken, but should not be a problem when redeploying. The currently configured web area is cms-dmwm-builds.

Preparing a new CMSWEB release

  1. Get updates for the cmsdist configuration, build RPMs and upload to private repository
    • Connect to lxvoadm and then ssh to the build machine (i.e. vocms106 for SLC5) as yourself
    • Create /build/yourself, and there clone cmsdist, pkgtools and deployment using github SSH authentication URLs
    • Work in the cmsdist updates, under /build/yourself/cmsdist/
    • Get pull requests for cmsdist done so far using the GetPulls script
    • At the end of your stg (stacked git) patch stack, create a ‘Bump HGXXYYz’ patch and update comp.spec
    • rm -rf w, in case you had previous builds
    • Trigger the build and then check that the RPMs are correct in w/RPMS/slc5_*
    • Upload the RPMs to your private repository (i.e. comp.pre.yourself)
    • Check on cmsrep they are there: i.e. http://cmsrep.cern.ch/cmssw/comp.pre.yourself/
  2. Get updates for the deployment configuration. Also update the rsync-keep-releases rules
    • Still on the build machine, work on the deployment updates under /build/yourself/deployment
    • Create a new patch to update the rsync-rules (all current and latest previous)
    • Get pull requests for deployment using the GetPulls script
  3. Deploy on a private devvm
    • Get a VM by following the setup-vm instructions
    • Login as yourself@your-devvm
    • Rsync your deployment configuration to yourself@your-devvm:/data/cfg
    • Follow the trac ticket instructions to deploy in devvm, using private REPO comp.pre.yourself
    • Check everything is deployable. Test services that can start without having the secrets/certificates set up.
    • Stop the services on your devvm once you are satisfied with it
  4. Deploy on cmsweb-dev
    • Login to cmsweb@cmsweb-image
    • Move cfg to cfg2
    • rsync the deployment configuration to cmsweb@cmsweb-image:/data/cfg
    • Compare cfg2 to cfg, remove cfg2
    • Remove old releases /data/srv-dev/<something> (but not those from current HG release cycle nor the latest from previous cycle), if not yet done
    • Follow trac ticket instructions to redeploy cmsweb-dev, using the private REPO comp.pre.yourself
    • Double-check there were no errors in messages you get when services start at boot time
    • Once done, double check all services on cmsweb-dev are working fine as expected
  5. Create tags and push the changes to cmsdist and deployment
    • Commit the stg patches on cmsdist, review with git log, tag HGXXYYz, then push it upstream (push the comp branch, not master)
    • Commit the stg patches on deployment, review with git log, tag HGXXYYz, then push it upstream (master branch). A sketch of these commands appears after this list.
    • Wait for the build-agent to release the RPMs to comp.pre repository automatically. Double check the RPMs in comp.pre are ready.
  6. Compile the release change summary, put it on the validation ticket and announce the tag

  7. Deploy cmsweb-testbed using the pre-released RPMs
    • Login to cmsweb@cmsweb-image
    • Move cfg to cfg2
    • Clone the deployment configuration to cmsweb@cmsweb-image:/data/cfg, reset it to the tag you want to deploy
    • Compare cfg2 to cfg, remove cfg2
    • Remove old releases /data/srv-dev/<something> (but not those from current HG release cycle nor the latest from previous cycle), if not yet done
    • Follow trac ticket instructions to redeploy cmsweb-testbed, using the comp.pre for the REPO
    • Double-check there were no errors in messages you get when services start at boot time
    • Once done, double check all services on cmsweb-testbed are working fine as expected
  8. Send a reminder about the deadline to include new material in the release a few days before the start of the validation phase.

  9. Release official RPMs to comp
    • Login as dmwmbld to the build machine (i.e. vocms106 for SLC5)
    • cd /build/dmwmbld/comp; rm -rf w cmsdist; (git clone https://github.com/cms-sw/cmsdist.git && cd cmsdist && git reset --hard HGXXYYz)
    • Build it, check produced RPMs in w/RPMS/slc*/, then upload with --sync-back
  10. Redeploy cmsweb-testbed with RPMs from comp

  11. Send announcement saying the validation phase started and asking for FULL tests

  12. A few days before the production upgrade, remind people about missing validation results

  13. On the production upgrade’s eve, send an announcement, report at the CompOps meeting, and put an entry in the Scheduled interventions twiki

  14. Prepare/review the production deployment instructions

  15. Deploy cmsweb production

  16. Once finished, reply to the prior announcement saying it is done and pointing to the link for opening a GGUS ticket in case of problems
    • Mark it fully complete in the Scheduled interventions page
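
A minimal sketch of the tag-and-push commands in step 5, assuming the stg patches are already applied in /build/yourself/cmsdist and /build/yourself/deployment (branch and remote names follow the text above; adjust if yours differ):

 cd /build/yourself/cmsdist
 stg commit -a                   # fold the applied stg patches into regular commits
 git log --oneline -n 10         # review what will be pushed
 git tag HGXXYYz
 git push origin comp HGXXYYz    # comp branch, not master

 cd /build/yourself/deployment
 stg commit -a
 git log --oneline -n 10
 git tag HGXXYYz
 git push origin master HGXXYYz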

Routine maintenance

Running the monthly upgrade cycle

FIXME. This needs to be reviewed and completed.

  1. Once every three months or so, request management feedback on the proposed deployment dates.
  2. Create the monthly tickets, attach the monthly schedule, import tasks that could not be accomplished in the previous month.
  3. Prepare the initial CMSDIST tag, send the announcement about the open cycle schedule, pointing to the tickets and telling about needed changes (i.e. PKGTOOLS revision to use and CMSDIST tags).
  4. Deploy the initial tag in the cmsweb-dev and testbed as needed.
  5. Send a notice a few days before the closure of the release.
  6. When the release is closed, redeploy testbed and send the announcement asking for validation.
  7. Send a notice a few days before the validation phase finishes reminding people to post their test results.
  8. On the production upgrade’s eve, send the announcement. Also report on the ops weekly meeting and set the upgrade in the scheduled interventions twiki.

Daily log report analysis

All operators get a summary of yesterday’s logs every morning. Check the logs for:

Daily cron, acron log analysis

All operators get mails if cron jobs generate any output. Any output at all indicates a problem, except that CRL updates are known to be noisy. Investigate all unexpected output.

Currently acron job output, including job failure, only goes to Lassi as the cmsweb account owner; this needs to be corrected, but it’s a messy process requiring recreation of the entire account.

Daily verify server status in lemon, sls

Verify lemon monitoring links. For each server check lemon alarm and exception links. While there, check for any unusual CPU activity patterns, any paging activity, or exceptional network bandwidth use.

Also check SLS for the monitoring cases lemon does not cover. The information in SLS should be consistent with the lemon metrics as it is just a mapping from lemon.

Analyze the data generated from analytics

Every once in a while, preferably on the first days following a production upgrade, or upon suspicions on increased load to the cluster, check the analytics plots.

Investigate any greedy users/hosts, sending them or their responsibles a message asking for an explanation of the increased request load. We have already found cases where a single user was responsible for 1/3 of the total cluster activity because he was running a very badly written set of scripts that fired a bunch of requests every few seconds. Buggy scripts/services are often the cause of unnecessary load, especially when they get stuck in an infinite loop.

Every browser window left open on the online DQMGUI is also known to produce a significant amount of traffic, as those pages get updated every 30 seconds to meet some minimum real-time requirements. You may therefore instruct DQMGUI users to always close the browser window when not in front of the computer or outside working hours, or alternatively tell them to click on “View details” in the top right corner and set a filter to grab and show only the needed histograms.

Finally, the analytics might also show evidence of attackers probing the cluster. Check for suspicious DNs, user-agent strings, abnormally crafted URLs, flood attempts and everything else your intuition suggests.

You may also follow up and find more detailed information by analyzing the data produced by analytics directly under /afs/cern.ch/cms/cmsweb/analytics/stats. That is also needed when analytics fails to plot its data or there is suspicion it is not plotting correctly.
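
If you need to dig into the raw frontend access logs directly, a minimal sketch for listing the most active clients is below; it assumes Apache-style access logs with the client address in the first field, and should be run on a front-end node (or adapted to the log-archive copies).

 awk '{print $1}' /data/srv/logs/frontend/access_log_*.txt | sort | uniq -c | sort -rn | head -20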

Verify pre-production RPM updates

The pre-production servers use the pre-production RPM repository. Pay attention to CERN linux support notifications on RPM upgrades, and proactively check the pre-production servers (vocms132-135.cern.ch) if you suspect the RPMs might destabilise something; the most likely causes of problems are perl updates (as they can affect mod_perl) and CA upgrades. If you find problems, contact linux support immediately and suggest delaying the update, and notify the HTTP group leader and facilities operations managers.
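
A minimal sketch for checking which RPM updates are pending on the pre-production nodes (hostnames as above; this assumes you can ssh to them and run yum via sudo):

 for h in vocms132 vocms133 vocms134 vocms135; do
    echo "== $h =="
    ssh $h sudo yum check-update -q
 done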

Verify CA upgrades

There have been a number of issues with CA RPM updates on front-ends in the past. If you see a CA upgrade coming through RPM updates, immediately verify everything works correctly on the pre-production cluster. Note that you may need to run spma_wrapper.sh manually to force the upgrade to happen. If you do find problems, alert the HTTP group leader and facilities operations managers immediately.

OOM kill message response

FIXME

Renew proxy delegation

If you get a short-term proxy expiration warning from the renewal cron job, log in as yourself to lxvoadm.cern.ch and run the following.

mkdir -p /tmp/foo
cd /tmp/foo
git clone git://github.com/dmwm/deployment.git x
x/admin/ProxySeed -t prod -d cmsweb@vocmsNNN.cern.ch:/data/srv/current/auth/proxy
rm -rf /tmp/foo

Archive server logs into CASTOR

When you get an acron error message about the rsync job failing to write to the AFS volume, it is likely the AFS quota is full. Log in to cmsweb-dev as cmsweb, get an AFS token and follow the instructions below in order to archive the logs to Castor and free up some AFS space. Be extremely careful during the entire process and double-check every step you take; we really do not want to lose the log files.

  1. Select hosts and months to stage

    HOSTS="vocms0126 vocms0127 vocms0134 vocms0135 vocms0760 vocms0162 vocms0164"
    MONTHS="201501 201502"
    
  2. Check which logs from /afs/cern.ch/cms/cmsweb/log-archive/$HOST were not yet copied to /castor/cern.ch/cms/archive/cmsweb/log-archive/$HOST.

    # Make sure the files under Castor, AFS and those on the machines are exactly the same
    kinit
    for h in $HOSTS; do
      echo "Host $h:"
      for m in $MONTHS; do
        ls -alF /afs/cern.ch/cms/cmsweb/log-archive/$h/{access,error}_log_$m.zip
        ssh $h ls -alF /data/srv/logs/frontend/{access,error}_log_$m.zip
        nsls -l /castor/cern.ch/cms/archive/cmsweb/log-archive/$h/{access,error}_log_$m.zip
      done
    done
    
  3. Copy previous months’ log files from AFS to /castor/cern.ch/cms/archive/cmsweb/log-archive/$HOST, do not copy logs of the ongoing month. You need to export a variable in order to use xrdcp.

    export STAGE_HOST=castorcms
    for h in $HOSTS; do
      echo "Host $h:"
      for m in $MONTHS; do
        for l in access error; do
          xrdcp -s /afs/cern.ch/cms/cmsweb/log-archive/$h/${l}_log_${m}.zip "root://${STAGE_HOST}//castor/cern.ch/cms/archive/cmsweb/log-archive/$h/${l}_log_${m}.zip?svcClass=archive" &
        done
      done
    done
    jobs # wait for background jobs to complete
    
  4. Wait for castor to migrate the files to tape. This will take about a day. Files that have been migrated show the ‘m’ flag in the nsls output:

    for h in $HOSTS; do
      for m in $MONTHS; do
        nsls -l /castor/cern.ch/cms/archive/cmsweb/log-archive/$h/{access,error}_log_$m.zip
      done
    done
    
    # i.e.
    mrwxr--r--   1 cmsweb   zh                294958060 Nov 19 12:08 access_log_201410.zip   <-- on tape
    ...
    -rwxr--r--   1 cmsweb   zh                452213456 Mar 09 10:15 access_log_201502.zip   <-- not yet on tape
    
  5. Rerun step 2 to double check the logs were copied properly.

  6. When files are finally on tape, delete the corresponding files on each machine (not in the AFS area).

    # Remove the echo below when you are really sure the files can be deleted
    for h in $HOSTS; do
      echo "Host $h:"
      for m in $MONTHS; do
        ssh $h echo rm /data/srv/logs/frontend/{access,error}_log_$m.zip
      done
    done
    
  7. On the next run of the LogArchive, it will delete the files from the AFS area too. The following will force the LogArchive to run.

    for h in $HOSTS; do
      echo "Host $h:"
      ssh $h /data/srv/current/config/admin/LogSave /data/srv /afs/cern.ch/cms/cmsweb/log-archive/$h
    done
    

Split CouchDB server logs

If you get error messages from the LogArchive cronjob telling you some CouchDB log files are too big to zip, run the following procedure to split them. The LogArchive script will then eventually archive them into a zip file and remove the files from the log directory.

It is also common practice to run this during every production redeployment, to keep the log files at a reasonable size and avoid the errors mentioned above.

Note the procedure requires CouchDB to be restarted, so be careful to avoid interrupting anything critical that may be running. The service unavailability will probably not be noticed as it is really short, but if possible announce it on cms-service-webtools a few hours in advance. If you are running this procedure during a scheduled upgrade, note that some steps need to be skipped (see the comments below).

  1. Login as cmsweb in the CouchDB server and first rotate the couch.log file:

    cd /data/srv/logs/couchdb
    sudo -H -u _couchdb bashs -l -c '/data/srv/current/config/couchdb/manage stop "I did read documentation"'
    mv couch.log couch.old.log
    touch couch.log
    sudo chown _couchdb:_couchdb couch.log
    # ATTENTION: do not start couch as in the command below if you are running this procedure during the cluster upgrade.
    sudo -H -u _couchdb bashs -l -c '/data/srv/current/config/couchdb/manage start "I did read documentation"'
    
  2. Split the couch.old.log into 1 GB chunks:

    split -d --line-bytes=1024m /data/srv/logs/couchdb/couch.old.log /data/srv/logs/couchdb/couch.old.log. &
    # wait for it to finish, may take a few minutes
    sudo chown _couchdb:_couchdb couch.old.log.?*
    rm couch.old.log
    
  3. Also split any big .stdout files. However, do not touch the one in use by the running couch instance:

    # Set TSTAMP to the file you wish to rotate. The following will grab
    # the latest, but adjust it accordingly if rotating some older
    # .stdout file.
    TSTAMP=$(find /data/srv/logs/couchdb/ -name "*stdout"|awk -F"/" '{ print $NF; }' |cut -d. -f1 | sort -r|head -n1)
    
    split -d --line-bytes=1024m /data/srv/logs/couchdb/$TSTAMP.stdout /data/srv/logs/couchdb/$TSTAMP.stdout. &
    # rename them because only '.stdout' suffixed files are taken from log archive
    for f in $TSTAMP.stdout.?*; do mv $f ${f%stdout.*}${f##*.}.stdout; done
    sudo chown _couchdb:_couchdb $TSTAMP.?*.stdout
    rm $TSTAMP.stdout
    
  4. Login as cmsweb to the machine running the SyncLogs script (i.e. vocms22) and move the couch.log there, as otherwise it won’t sync anymore:

    mv /build/srv-logs/vocmsNNN/couchdb/{couch.log,couch-$(date +%Y%m%d).log}
    rm -f /build/srv-logs/vocmsNNN/couchdb/couch.old.log # stale .old file created during the rotation
    

Rotating Couch databases (wmstats)

The main purpose of rotating a couch database is to clean it up and reduce its size. This is often needed for applications that make improper use of it and write too many temporary documents that leave tombstones behind after being deleted. In other cases, due to application changes, some document types may not be needed anymore and it may be easier/better to start a database from scratch while keeping the documents that are still needed.

The following instructions use wmstats as an example. Please note that some steps are application specific, for instance the definition of the replication filter that tells which permanent documents to keep. So do not blindly execute every step without understanding what it is supposed to do.

This procedure should be run at least 2 days before the scheduled intervention for the database rotation. This is needed to give couchdb enough time to replicate the permanent documents.

As a preliminary step, disable all the alarms for the corresponding machine.
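
For example, with roger this is the inverse of the command used earlier to enable the alarms:

 roger update --all_alarms false vocmsNNNN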

Login as cmsweb to the corresponding couch back-end machine and do the following steps:

  1. Disable the backup and compaction (a)cronjobs well in advance (i.e. 24 hours before) and make sure you only proceed when all such jobs have finished. The compaction is often a heavy process and we may otherwise run out of resources for the heavy operations that follow, so we disable it for safety. Use the couch status command to see whether compactions are still running:

     $ sudo -H -u _couchdb bashs -l -c '/data/srv/current/config/couchdb/manage status "I did read documentation"'
     Apache CouchDB is running as process 216131, time to relax.
     [{"type":"Database Compaction","task":"wmstats","status":"Copied 64918001 of 75535165 changes (85%)","pid":"<0.7549.41>"}]       # < --- that's what we want to avoid
    

    The backup procedure generates high I/O activity that could eventually make couchdb disk I/O get stuck. Check with ps auxwwwf that none of the backup processes are still running.

  2. Create the new database by configuring the staging area to push any needed couchapps. Note the commands below are only an example; you should actually edit the /data/srv/state/couchdb/stagingarea/<app> file and set it to push all the needed couchapps to the new database being created.

     $ echo 'couchapp push /data/srv/current/apps/reqmon/data/couchapps/WMStats http://localhost:5984/newwmstats' >> /data/srv/state/couchdb/stagingarea/reqmon
     $ echo 'couchapp push /data/srv/current/apps/reqmon/data/couchapps/WMStatsErl http://localhost:5984/newwmstats' >> /data/srv/state/couchdb/stagingarea/reqmon
     $ sudo -H -u _couchdb bashs -l -c '/data/srv/current/config/couchdb/manage restart "I did read documentation"'   # will create and push newwmstats
    
  3. Prepare a replication filter that will be used to copy only the permanent docs you want to keep. Such a filter is application specific.

    Note that a filter written in javascript may take days to finish on a huge couch database. So preferably write it natively in erlang, significantly reducing the time needed to complete the replication (from days to a few hours). For wmstats we wrote it in erlang:

    $ cat > /tmp/tmp.json << EOF
    {
    "_id" : "_design/stupiddoc",
    "language" : "erlang",
    "filters" : { "permanentdocs" : "fun({Doc}, {Req}) ->\n        DocType = couch_util:get_value(<<\"type\">>, Doc),\n        case DocType of\n                undefined -> false;\n                <<\"reqmgr_request\">> -> true;\n                _ -> false\n        end\nend." }
    }
    EOF
    

    Its corresponding javascript version would have looked like:

     # this is FYI only. Do not use it:
     '{"_id" : "_design/stupiddoc", "filters" : { "permanentdocs" : "function(doc, req) {  return (doc.type && doc.type == \"reqmgr_request\"); }" } }'
    
  4. Put the new design document holding the replication filter in the database to be rotated (not on the newly created one):

     $ curl -ks -X PUT http://localhost:5984/wmstats/_design/stupiddoc -d @/tmp/tmp.json
     {"ok":true,"id":"_design/stupiddoc","rev":"1-67e007871c8ca5a510669293e6b3624d"}
    
  5. Tell couch to replicate the old database to the new one, using the just defined replication filter

     $ curl -ks -X POST http://localhost:5984/_replicate -H 'Content-Type:application/json' -d '{"source":"wmstats","target":"newwmstats", "filter":"stupiddoc/permanentdocs", "continuous": true}'
     {"ok":true,"_local_id":"d46abf99e9593d14215b796a2c1fbd84"}
    

    Note that this is a one-time request for replication. It will stop whenever you stop couch or if the internal couch replication engine crashes. You must redo this step whenever that happens.

    Watch the couch.log to make sure it is working without any problems, and check it appears in the couch status:

     $ sudo -H -u _couchdb bashs -l -c '/data/srv/current/config/couchdb/manage status "I did read documentation"'
     Apache CouchDB is running as process 1130139, time to relax.
     [{"type":"Replication","task":"4765b7: wmstats -> newwmstats","status":"W Processed source update #1245297","pid":"<0.1125.0>"}]
    
  6. Wait for the replication to catch up with the latest update sequence. Note this can take hours or days to finish. For that, compare the source update #NNN from the previous status command with the update_seq from:

     $ curl -ks -X GET http://localhost:5984/wmstats/
     {"db_name":"wmstats","doc_count":304372,"doc_del_count":1023333,"update_seq":2622622,"purge_seq":0,"compact_running":false,"disk_size":1236660327,"instance_start_time":"1370002687363539","disk_format_version":5,"committed_update_seq":2622622}
    

    So in this example you would compare #1245297 with 2622622, and repeat these commands until both numbers are in sync.

  7. Once the replication is up-to-date you’re ready for the intervention. Send the announcement at least half a day before the scheduled downtime.

On the day of the scheduled intervention, run the following steps to do the rotation itself.

  1. Point accesses to the corresponding couch database to nowhere in the back-ends map file on all cmsweb frontends. For example:

     $ sudo vi /data/srv/current/config/frontend/backends-prod.txt
     ...
     ^/auth/complete/(?:couchdb/wmstats|wmstats)(?:/|$) vocms0307nowhere.cern.ch :affinity
    
  2. Check the couch logs to make sure there are no new incoming requests. Give it a few minutes and double-check the update_seq is up to date. Sometimes it might never show the very latest update number, however.

  3. Query a view from both the old and new databases that returns the list of permanent documents. For wmstats it could be the requestHistory view. Double-check the returned outputs match.

  4. Stop couch. It will stop the replication as well.

    $ sudo -H -u _couchdb bashs -l -c '/data/srv/current/config/couchdb/manage stop "I did read documentation"'
    
  5. Set couch not to create the temporary database we used for the rotation on the next restart:

    $ perl -p -i -e 's,^.*newwmstats$,,g'  /data/srv/state/couchdb/stagingarea/reqmon
    # Double check /data/srv/state/couchdb/stagingarea/reqmon is correct
    
  6. Do the database rotation. We do not delete the database, but move it elsewhere as a guarantee in case something goes wrong.

    $ mkdir -p /data/user/wmstatsold
    $ sudo mv /data/srv/state/couchdb/database/{wmstats.couch,.wmstats_design} /data/user/wmstatsold/
    
    $ sudo mv /data/srv/state/couchdb/database/{newwmstats.couch,wmstats.couch}
    $ sudo mv /data/srv/state/couchdb/database/{.newwmstats_design,.wmstats_design}
    
  7. Start the couchdb and watch couch.log to make sure nothing bad happened.

    $ sudo -H -u _couchdb bashs -l -c '/data/srv/current/config/couchdb/manage start "I did read documentation"'
    
  8. Make sure the views are up-to-date. Just query any of the views for every design document:

    $ curl -ks -X GET localhost:5984/wmstats/_design/WMStats/_view/requestHistory
    $ curl -ks -X GET localhost:5984/wmstats/_design/WMStatsErl/_view/requestHistory
    

    Then check with couch status command that there are no active tasks going on.

  9. Restore the mapping in the back-ends map file on all cmsweb frontends. Requests should then start coming in; watch for them in the couch logs.

  10. Send the announcement telling people the intervention is finished and that wmstats will show stale information while the wmagents are replicating documents back. This usually takes < 12 hours to complete.

  11. Keep an eye on couch until the replications are done and all views are up-to-date. When that is the case, send a message to the workflow team to double-check all replications from agents are fine and the data in wmstats is correct.

  12. Move the old database file to some other machine with lots of free space. Keep it in quarantine there, not on the production couch back-end. It is particularly important not to leave these backup files on the SSD machines, as we may easily run out of space there because of them.

  13. Re-enable the couch (a)cronjobs and reactivate alarms. The intervention is considered done.

Rotating Couch databases (asynctransfer)

This intervention normally requires two working days. Make sure it does not overlap with the weekend. Some statistics from the rotation on 2015-08-17:

Statistics from the rotation on 2015-12-01:

IMPORTANT: before proceeding, keep in mind that the following factors have a very significant impact on the time needed for the intervention, to the point of it failing completely if not given attention:

This is the procedure to follow:

Actually, you should filter out design docs too, so the right function would be:

  fun({Doc}, {Req}) ->
    DocId = couch_util:get_value(<<"_id">>, Doc),
    case DocId of
      undefined -> true;
      <<"_design",Rest/binary>> -> false;
      _ ->
        DocType = couch_util:get_value(<<"_deleted">>, Doc),
        case DocType of
          undefined -> true;
          true -> false;
          _ -> true
        end
    end
  end.

At this point things are ready for the rotation. Send an announcement for an up to 30min downtime and during that scheduled period do this:

Exchanging service secrets

Use one of the procedures below to exchange the secrets between the service responsible(s) and the HTTP group member(s) that will put them under the secret area.

If you need to exchange it with a group of people:

  1. Login to a machine where only you have access and that has access to AFS, or use lxvoadm if you do not have such a machine at hand. Do not use lxplus or other shared machines.

    mkdir ~/public/forSomeoneOrGroup
    fs setacl -dir ~/public/forSomeoneOrGroup -acl system:anyuser "" username1 "rl" username2 "rl" # add more people here if needed
    read -s newsecret
    (type the new secret or paste the connect string, then hit enter when done) 
    echo $newsecret | openssl enc -des3 -e -a -out ~/public/forSomeoneOrGroup/newsecret.txt
    (type an encryption password twice when asked, e.g. use the previous secret as the password here)
    unset newsecret
    
  2. Inform the people involved that you made the file available in AFS, readable only by them. Note the file is password protected to avoid any exposure in case it leaks from the AFS-protected area, for example when it resides in a disk cache. You can use any password known by you and the people involved; most commonly the previous password is used, otherwise set a new one and tell them by phone or chat. The privacy of the encryption password here is not critical, as it is just to avoid the new secret temporarily living unencrypted on a disk block.

  3. To read the content of the file, the informed people should use the following command. Note the use of less is obligatory, to avoid the secret being saved in terminal history caches:

    openssl enc -des3 -d -a -in ~/public/forSomeoneOrGroup/newsecret.txt < /dev/null | less
    
  4. Once everybody has taken note of it in some other safe area (i.e. Apple Keychain Access), delete the newsecret.txt file from AFS.

Note it is important to exchange the file through a protected shared area even though it is password protected. This is needed because the method used to communicate the encryption password is deliberately weak so that its communication stays easy.

If you only need to send the password to one person, you can alternatively use the person’s public key with simpler instructions:

  1. On any safe machine (i.e. your laptop), do the following. You do not need access to AFS if the person has sent you their certificate by e-mail.

    read -s newsecret
    echo $newsecret | openssl smime -encrypt -aes128 /afs/cern.ch/user/s/someone/public/usercert.pem > newsecret.txt
    unset newsecret
    
  2. Send newsecret.txt to the person by e-mail or paste its content into the chat window.

  3. The person saves the newsecret.txt file locally and then reads its content with the following command. Note the use of less is obligatory, to avoid the secret being saved in terminal history caches:

    openssl smime -decrypt -inkey ~/private/userkey.pem < newsecret.txt | less
    

The asymmetric encryption method is stronger, as the keys are longer and the private key is kept in a safe place. That way we can be lax about leaving the encrypted file on disk or exchanging it through unsafe communication channels like e-mail or instant messengers.

Synchronize the new secrets of a service across the cluster

  1. Login as cmsweb to the cmsweb-image and set the service secrets on /afs/cern.ch/user/c/cmsweb/private/conf.

  2. Synchronize it to all back-end machines, including dev, preprod and prod.

    # Check first
    for h in vocms0{126,127,131,132,136,138,139,140,141,161,163,165,306,307,318}; do
       ssh $h ls -alF /data/srv/current/auth/svcname/;
    done
    
    # Sync it (note that the src and dst secret filenames may be different)
    for h in vocms0{126,127,131,132,136,138,139,140,141,161,163,165,306,307,318}; do
       cat /afs/cern.ch/user/c/cmsweb/private/conf/svcname/secret-file-name-src | ssh $h sudo sh -c \"'cat - > /data/srv/current/auth/svcname/secret-file-name-dst'\";
    done
    
  3. Change the password the secret refers to, if not yet done, i.e. change the service DB account secret on account.cern.ch. You need to coordinate this change with the service responsible that owns the service secret. Note that once it has been changed, you must run the following step asap. If delays are involved, you may want to put the machines in maintenance mode and announce a short downtime before asking for this action.

  4. Restart the service only where the service is supposed to be running.

    for h in vocms0{126,127,131,132,136,161,163,165,138,139,140,141,306,307,318}; do
       ssh $h sudo -H -u _svcname bashs -l -c \""hostname -f; /data/srv/current/config/svcname/manage start 'I did read documentation'"\";
    done
    
  5. Double check everything went fine. If possible exercise some service URLs to make sure the new secrets are working.

    for h in vocms0{126,127,132,136,161,163,165,138,139,140,141,306,307,318}; do
       ssh $h sudo -H -u _svcname bashs -l -c \""hostname -f; /data/srv/current/config/svcname/manage status 'I did read documentation'"\";
    done
    
  6. In case you had announced a downtime for the password change, reply telling people the intervention is over and put the machines back into production mode.

Synchronize SSH keys across the cluster

Login as cmsweb to cmsweb-image and run the following to sync the appropriate keys from AFS to $HOME/.ssh on the back-end/front-end machines (remove the echo when you are sure about the commands). There is no need to copy them to the cmsweb-image machine itself, as only password authentication is accepted there.

for h in vocms0{161,163,165,126,127,136,131,132,138,139,140,141,306,307,318}; do (set -x; echo scp -p /afs/cern.ch/user/c/cmsweb/private/admin/ssh-keys-backend $h:.ssh/authorized_keys); done
for h in vocms0{161,163,165,126,127,136,131,132,138,139,140,141,306,307,318}; do (set -x; ssh $h md5sum .ssh/authorized_keys); done
for h in vocms0{160,162,164,134,135}; do (set -x; echo scp -p /afs/cern.ch/user/c/cmsweb/private/admin/ssh-keys-frontend $h:.ssh/authorized_keys); done
for h in vocms0{160,162,164,134,135}; do (set -x; ssh $h md5sum .ssh/authorized_keys); done

Synchronize oracle configuration files across the cluster

Sometimes CERN/IT makes changes to the Oracle instances (devdb11, cmsr or any other) that require corresponding changes to the tnsnames.ora and sqlnet.ora files we use on the client side. Since we ship a static copy of those files within the distributed RPMs, run the following special procedure if the changes must be flushed straightaway and cannot wait until the next release.

Login as cmsweb to cmsweb-image and run:

 kinit
 # Set the path below accordingly to the RPM version in use.
 ORACLE_ENV_PATH="/data/srv/current/sw.pre/slc6_amd64_gcc481/cms/oracle-env/30/etc/"

 # Flush to cmsweb-dev and testbed machines first
 for h in vocms0{126,127,131,132,134,135}; do ssh cmsweb@$h sudo cp /afs/cern.ch/project/oracle/admin/{tnsnames,sqlnet}.ora $ORACLE_ENV_PATH/; done

 # Check it worked for all services on testbed, then do the same for production
 for h in vocms0{136,161,163,165,138,139,140,141,306,307,318,160,162,164}; do ssh cmsweb@$h sudo cp /afs/cern.ch/project/oracle/admin/{tnsnames,sqlnet}.ora $ORACLE_ENV_PATH/; done

There is no need to restart any service, as they read those files on the fly, but if you suspect something is not OK, try restarting the services before declaring they do not work with the new configuration.
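
Such a restart can reuse the manage pattern used elsewhere in this document; a minimal sketch, where svcname and the host are placeholders:

 ssh cmsweb@vocms0NNN sudo -H -u _svcname bashs -l -c \""hostname -f; /data/srv/current/config/svcname/manage restart 'I did read documentation'"\"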

Finally, update the files in cmsdist if not yet done, otherwise newer releases will not contain the changes.

Bootstrapping a new architecture on the RPM repository (from CMSSW)

** WARNING ** : the procedure here is for your information only and is not workable for COMP. You can, however, use it as a basis for finding the right instructions in case of major changes by the SDT team to the CMSSW build infrastructure. For officially bootstrapping a new architecture for COMP, use the COMP bootstrap instructions instead.

The procedure below is based on the official instructions from SDTHowToSetupANewArchitecture and on feedback from the SDT team (Shazhad & co.). Please have a look there before running anything from here. Also double-check with the SDT team that these instructions are still OK, and ask for the appropriate cmsdist and pkgtools tags to use.

Note that even sticking to specific tags, this procedure is often unreproducible due to automatic system updates (i.e. a new glibc version after updating from SLC 5.8 to 5.9) and to the hardcoded (but dynamic) configuration in various places (i.e. in the build_rpm.sh script).

Login as dmwmbld to the target build machine (i.e. vocms022 for a SLC6 architecture) and use the following recipe:

ROOT=/build/dmwmbld/bootstrap_gcc481; ARCH=slc6_amd64_gcc481; REPO=comp.pre
mkdir -p $ROOT/rpm{build,inst}; cd $ROOT
git clone -b V00-22-XX https://github.com/cms-sw/pkgtools
(cd $ROOT/rpmbuild; sh -e $ROOT/pkgtools/build_rpm.sh -j 8 --arch $ARCH --prefix $ROOT/rpminst > build_rpm.log 2>&1)
git clone -b IB/CMSSW_7_0_X/stable https://github.com/cms-sw/cmsdist
(export PATH=$ROOT/rpminst/bin:$PATH; \
 export LD_LIBRARY_PATH=$ROOT/rpminst/lib:$LD_LIBRARY_PATH; \
 pkgtools/cmsBuild -c cmsdist --no-bootstrap --repository $REPO --arch $ARCH --work-dir w --builders 8 -j 5 upload bootstrap-driver cms-common lcg-dummy)
# On OSX, use DYLD_FALLBACK_LIBRARY_PATH instead of LD_LIBRARY_PATH.

Then check the produced RPMs under w/RPMS/$ARCH and the distribution snapshot under w/$ARCH. Make any needed adjustments/cleanups/updates to cmsdist (i.e. update the openssl bundle patch) and repeat the process until you are happy with it.

The above procedure uploads the minimum RPMs for the new architecture to the private comp.pre.dmwmbld repository. On a separate shell/machine/user, try to bootstrap/install the RPMs from there as usual, for testing purposes.

It is recommended to also try to build the whole comp software stack using the new architecture from comp.pre.dmwmbld. If during the build of the comp package, which is a meta package depending on everything else, you notice some bootstrap-related problem, go back, fix it in the bootstrap and repeat the process above.

Once the testing results are satisfactory, repeat the upload command above but add the --sync-back option to it. The new architecture RPMs will then appear in comp.pre. Browse the files under cmsrep and double-check everything is fine.
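
For reference, that is the same cmsBuild invocation as above with --sync-back added:

(export PATH=$ROOT/rpminst/bin:$PATH; \
 export LD_LIBRARY_PATH=$ROOT/rpminst/lib:$LD_LIBRARY_PATH; \
 pkgtools/cmsBuild -c cmsdist --no-bootstrap --repository $REPO --arch $ARCH --work-dir w --builders 8 -j 5 --sync-back upload bootstrap-driver cms-common lcg-dummy)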

You may want to contribute the fixes you made for the new architecture back to the SDT team, so that they can include them and you do not need to port them to any other new architecture later.

Finally, switch to the cmsdist comp_bootstrap branch, port your bootstrap changes there, push and tag it. That way we can later re-bootstrap from the configuration you used to create the architecture. Note the comp_bootstrap branch only contains the minimum set of files needed to build the bootstrap.

The next step is to try to build all the comp projects with it, fixing any problems that appear and re-bootstrapping if needed to include anything missing. Note you must update key dependencies like gcc and openssl to match what was bootstrapped.

Once you manage to build the whole comp software stack, commit your changes to the comp cmsdist branch, or create a new comp_gccXYZ branch to avoid messing with the configuration used for the production architecture. Then release all the comp RPMs to comp.pre as usual. Once the new software stack has been validated, you must release the new architecture to the main comp RPM repository as well.

Bootstrapping a new architecture on the RPM repository (from COMP)

** WARNING: ** please read, and only read, the CMSSW bootstrap section to make sure you understand how things are supposed to work. Then proceed with the procedure below, as this is what has been used for bootstrapping new architectures for COMP.

Login as dmwmbld to the target build machine (i.e. vocms022 for a SLC6 architecture) and use the following recipe:

kinit
ROOT=/build/dmwmbld/bootstrap_gcc493; ARCH=slc6_amd64_gcc493; REPO=comp
mkdir -p $ROOT/rpm{build,inst}; cd $ROOT

# P.S. geneguvo's clone contains changes not yet merged to main pkgtools
git clone -b V00-22-XX https://github.com/geneguvo/pkgtools
# Do any updates to pkgtools/build_rpm.sh if possible or needed. Then
# proceed to build the rpm tool.
# Since you need a compiler to build it, use the gcc from the
# current production architecture.
(cd $ROOT/rpmbuild; source /build/dmwmbld/srv/state/dmwmbld/builds/comp_gcc481/w/slc6_amd64_gcc481/external/gcc/4.8.1/etc/profile.d/init.sh; sh -e $ROOT/pkgtools/build_rpm.sh -j 8 --arch $ARCH --prefix $ROOT/rpminst > build_rpm.log 2>&1)

# Build the bootstrap kit using the just built RPM tool
git clone -b comp_bootstrap https://github.com/cms-sw/cmsdist
# Update gcc version on cmsdist (take files from main CMSSW builds),
# update any other bootstrap libraries whenever possible (i.e. bootstrap-openssl)
(cd $ROOT; \
 source /build/dmwmbld/srv/state/dmwmbld/builds/comp_gcc481/w/slc6_amd64_gcc481/external/gcc/4.8.1/etc/profile.d/init.sh; \
 export PATH=$ROOT/rpminst/bin:$PATH; \
 export LD_LIBRARY_PATH=$ROOT/rpminst/lib:$LD_LIBRARY_PATH; \
 pkgtools/cmsBuild -c cmsdist --no-bootstrap --repository $REPO --arch $ARCH --work-dir w --builders 8 -j 5 build comp-bootstrap)
# On OSX, use DYLD_FALLBACK_LIBRARY_PATH instead of LD_LIBRARY_PATH above.

Check things are correct in w/, then redo the last command with upload comp-bootstrap instead of build comp-bootstrap. At this point the new architecture will be available in the comp.dmwmbld repository.

On a separate shell, prepare a new comp_gccXYZ cmsdist branch with the same gcc and base dependency versions. Then try rebuilding the whole comp software stack with it, using the comp.dmwmbld repo.
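
A minimal sketch of that rebuild, reusing the cmsBuild pattern from above (the work area, branch name and architecture are placeholders; it assumes pkgtools and the new cmsdist branch are cloned in this separate area):

ARCH=slc6_amd64_gcc493
git clone -b V00-22-XX https://github.com/geneguvo/pkgtools
git clone -b comp_gccXYZ https://github.com/cms-sw/cmsdist
pkgtools/cmsBuild -c cmsdist --repository comp.dmwmbld --arch $ARCH \
--work-dir w --builders 8 -j 5 build comp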

Go back and fix things if needed. Otherwise, redo the last upload command with the --sync-back option. Commit your changes to comp_bootstrap and tag it comp_bootstrap_gccXYZa.

On the other shell, push the new comp_gccXYZ branch with all the cmsdist changes you needed to build the full comp software stack. Then set the build-agent to track this new branch. On the next build-agent run, it should release the comp RPMs based on the new architecture. You are then ready to announce it.

Finally, commit your changes to pkgtools/build_rpm.sh, push them upstream and possibly contribute them back to the SDT.

Re-bootstrapping an existing architecture

To re-bootstrap, get the cmsdist configuration used to bootstrap it (i.e. the comp_bootstrap branch) and the corresponding pkgtools branch (i.e. V00-22-XX for gcc481, V00-21-XX for gcc461). Then do the needed changes to cmsdist (i.e. update openssl to newest version).

Login as dmwmbld to the target build machine (i.e. vocms22 for SLC6 architecture) and use the following recipe:

mkdir /build/dmwmbld/bstrap$RANDOM
cd /build/dmwmbld/bstrap*
git clone -b V00-22-XX https://github.com/geneguvo/pkgtools
git clone -b comp_bootstrap https://github.com/cms-sw/cmsdist
# optionally reset cmsdist to one specific cmsdist tag
pkgtools/cmsBuild -c cmsdist --repository comp.pre \
--arch slc6_amd64_gcc481 --builders 8 -j 4 --work-dir w \
build comp-bootstrap
pkgtools/cmsBuild -c cmsdist --repository comp.pre \
--arch slc6_amd64_gcc481 --builders 8 -j 4 --work-dir w \
upload comp-bootstrap

Note that you run it in a randomly named directory, which is different from the one used for the official builds. This is needed because, due to bugs and improper relocations, further builds in that very same directory may fail.

Note also that a meta-package (comp-bootstrap) was created in order to avoid hitting a bug in cmsBuild, which happens when you try to upload RPMs that were not actually built (e.g. lcg-dummy). This meta-package requires everything needed for re-bootstrapping (i.e. “bootstrap-driver”, “cms-common” and lcg-dummy).

The commands above upload the new bootstrap RPMs to comp.pre.dmwmbld. Before proceeding, you must copy the new driver file into place. Use the following command:

scp w/slc6_amd64_gcc481/external/bootstrap-driver/`ls -rt w/slc6_amd64_gcc481/external/bootstrap-driver/ | tail -1`/slc6_amd64_gcc481-driver.txt cmsbuild@cmsrep.cern.ch:/data/cmssw/comp.pre.dmwmbld/

Note the driver file name suffix is -driver.txt, not -driver-comp.txt. The latter is no longer in use and can therefore be ignored.

Double-check in the cmsrep that the repository structure is correct, and that the right driver file was copied.

To test it, move /build/dmwmbld/bstrap* to some other name to make sure any hardcoded paths to it will fail, then try to install/build RPMs using comp.pre.dmwmbld. If all went fine, move the directory back to its original name and re-run the upload command for the new bootstrap RPMs with the --sync-back option (see the sketch below). If you have preserved the build area correctly, you do not need to copy the driver file again; nevertheless, check the driver file in comp.pre is the correct one and copy it otherwise.
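
A minimal sketch of this test-and-release sequence (bstrapNNNNN stands for whatever directory name $RANDOM produced above):

mv /build/dmwmbld/bstrapNNNNN /build/dmwmbld/bstrapNNNNN.away
# ... test installing/building RPMs from comp.pre.dmwmbld in a separate area ...
mv /build/dmwmbld/bstrapNNNNN.away /build/dmwmbld/bstrapNNNNN
cd /build/dmwmbld/bstrapNNNNN
pkgtools/cmsBuild -c cmsdist --repository comp.pre \
--arch slc6_amd64_gcc481 --builders 8 -j 4 --work-dir w \
--sync-back upload comp-bootstrap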

Later, after the RPMs in comp.pre have been validated, repeat the whole process for the main comp repository.

Finally, commit your changes to the corresponding cmsdist branch (comp_bootstrap in this case) and tag it. Whenever applicable, contribute back your fixes to the SDT team.

Add a new web tool developer

FIXME: this should become a DMWM tutorial and be put together with the other tutorials:

Remove a web tool developer

FIXME

Add a new HTTP Group technical member

Remove a HTTP Group technical member

Follow the Remove a web tool developer instructions.

Then at least the following actions are known to be needed:

Ban a user or host

To ban a user or host IP from authenticating against CMSWEB protected services, it is enough to add the DN or the host IP into revoked-users.txt on the cmsweb front-ends:

You can confirm later that the DN or host got banned by checking the /data/srv/logs/frontend/error_log_*.txt logs. Once the user or host accesses the service, you will see log entries like:

[Tue Jan 29 11:52:17 2013] [warn] [client 137.138.52.75] [cookie-auth/2901437] authorisation revoked for dn /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=diego/CN=695658/CN=Diego Da Silva Gomes; forbidding /dqm/relval
[Tue Jan 29 11:56:35 2013] [warn] [client 137.138.52.75] [cookie-auth/2902661] authorisation revoked for host 137.138.52.75; forbidding /sitedb/, referer: https://cmsweb-dev.cern.ch/