Troubleshooting on CMSWEB cluster

Reporting an incident

Note that lemon alarms exist for all CMSWEB services and are raised on common events such as a service crash or unresponsiveness. CERN/CC Operators and Sysadmins react and attempt recovery procedures within 5 to 30 minutes of the event, even outside office working hours. See the separate document for the lemon alarm procedures. If there are any ongoing alarms, the CMSWEB availability will reflect them accordingly.

Before reporting an incident, please also check the CMSWEB interventions page first.

Otherwise, follow this procedure to report an incident with a service:

  1. Make sure a problem has really occurred. Repeat the action twice to verify the failure is repeatable.
  2. Report the problem by opening a GGUS ticket and assigning it to the “CMS Web Tools” CMS Support Unit. Provide as much detailed information as you can.
  3. Incidents will be handled within a few hours during working hours, and on a best-effort basis during non-office hours.
  4. If the issue is urgent, you can contact the Computing Run Coordinator (CRC) on duty after opening the ticket. In some cases the CRC may phone the needed experts (not necessarily from the CMS HTTP Group) if they judge it necessary.

Should you need contact names for urgent CMSWEB issues, they are:

Phone numbers, instant messaging handles, etc. of each contact can be found in the CERN phonebook and in SiteDB.

Order of response

The order in which people are normally expected to respond to CMSWEB incidents:

  1. CERN/CC Operators and CERN/CC Sysadmins (if this is a lemon alarm)
  2. CMS HTTP Group (cms-http-group e-group members)
  3. Service contacts. All service responsibles are in the cms-service-webtools e-group.
  4. CRCs (cms-crc-on-duty e-group members)
  5. VOC (VOBox-VOC-CMS e-group members)

Recovering a service

IMPORTANT

If someone has reported, or you have found, a problem with a service you operate or are responsible for, please follow the troubleshooting and diagnosis steps below. It is important that you keep updating the GGUS ticket so others know you are active and which actions you have taken.

Troubleshooting and diagnosis

  1. Calm down and make sure you are fully awake.

  2. Communicate. File a GGUS ticket and assign it to the “CMS Web Tools” CMS Support Unit if this has not been done yet. Add a comment to the ticket that you are working on the issue.

  3. Make a plan. Update ticket as you make progress; this is important so others know you are active.

  4. If you suspect security compromise, involve CERN security team and the CMS security board. This is unlikely, but could happen.

  5. Collect facts and paste some or all of them into the GGUS ticket. Attach very large output as files; small output can be included directly.

  6. For servers which run but do not respond, when pystack from the fact collection did not tell you why (a combined command sketch for steps 6-8 follows this list):

    • Log on to the same back-end host and try curl -4vL http://localhost:PORT/URL to see if the server responds locally.

    • If you can, log on to the appropriate front-end and try the curl from there, to see if basic network connectivity works. The backend firewalls restrict connections to the servers to the front-ends and localhost only, so failure to access the server from elsewhere at CERN or from outside is not an error.

    • From any system at CERN, try curl -4kvL http://cmsweb.cern.ch/service; for dqm, add --key ~/.globus/userkey.pem --cert ~/.globus/usercert.pem so the request carries your certificate.

    • In all of the above attempts, check the backend application logs for any messages, and check the frontend logs /data/srv/logs/frontend/*.txt (access_log_yyyymmdd and error_log_yyyymmdd) for interesting messages: certificate rejected, proxy access failed, …

    • If all of those work, the problem is in the client’s connectivity to cmsweb and it is highly unlikely to be your problem; there have been situations where cmsweb was at fault, but debugging them requires rather a lot of skill, so it is probably best to leave that to others.

    • If some of those do not work, you have identified the link that is not operational; focus on getting that solved.

    • If the server uses a database and every attempt to access the server just hangs, it may have a problem connecting to the database. Try connecting to the database yourself with the parameters from /data/srv/current/auth by manually running sqlplus user@tns-name; do not do anything, just try to connect and then exit. Also look for files like sqlnet.log for Oracle errors.

  7. If the server has died, i.e. it is simply not running, and you cannot find why it was killed, there does not seem to be attack activity, the disks are not full, etc., try to consult the developer, preferably via IM chat if you can. If you do not get an answer, or it is the middle of the night or some other similar situation, just restart the server with sudo -H -u _foo bash -lc "/data/srv/current/config/foo/manage restart 'I did read documentation'". Do the same if the server is stuck and you cannot get the developers to investigate. However, do not continually restart a server; if you have to, something else must be wrong. More than 3 restarts a day per service is “continually.”

  8. If the logs filled the disk, try to find where and why, and get rid of the oversized files. Try not to destroy the logs since they might contain something useful; compress them with bzip2 somewhere else if you can. If it is just simple blind repetition and you cannot save the log without deleting something, you may delete some of the repeated log entries in between, but leave the leading and trailing parts of the log to assist later analysis.
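
As a reference, the following is a minimal command sketch for the checks in steps 6-8 above. It is not authoritative: PORT, URL, SERVICE, BACKEND-HOST, USER, TNS-NAME, BIGLOG and /other/disk are placeholders to adapt to your service, and the exact log file names and paths may differ on your deployment.

    # Step 6: does the server respond locally on the back-end host?
    curl -4vL http://localhost:PORT/URL

    # Step 6: does the server respond from a front-end host? (The back-end
    # firewall only accepts connections from the front-ends and localhost.)
    curl -4vL http://BACKEND-HOST:PORT/URL

    # Step 6: does the full chain work from any system at CERN?
    # For dqm, add your grid certificate.
    curl -4kvL http://cmsweb.cern.ch/SERVICE \
         --key ~/.globus/userkey.pem --cert ~/.globus/usercert.pem

    # Step 6: look for interesting messages in the front-end logs while
    # repeating the requests (file names follow the yyyymmdd convention).
    grep -i -e certificate -e proxy /data/srv/logs/frontend/error_log_$(date +%Y%m%d)*

    # Step 6: check database connectivity with the parameters from
    # /data/srv/current/auth; just connect, then type "exit".
    sqlplus USER@TNS-NAME

    # Step 7: restart a dead or stuck server (foo is the service name);
    # do not do this more than a few times a day.
    sudo -H -u _foo bash -lc "/data/srv/current/config/foo/manage restart 'I did read documentation'"

    # Step 8: if the logs filled the disk, find the biggest offenders and
    # compress them to another disk instead of deleting them outright.
    du -sh /data/srv/logs/* | sort -h | tail
    bzip2 -c /data/srv/logs/SERVICE/BIGLOG.log > /other/disk/BIGLOG.log.bz2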

In case of a real emergency, such as a security breach or a complete cluster break-down, call the HTTP group leader (Diego) or the L2 managers. Note that although real emergencies can and do happen, most people vastly overestimate the significance of events. Do not confuse a few hours of inconvenience with an emergency! At the very minimum, the call must be preceded by a genuine major failure event, significant pressure to act, and a GGUS ticket copied to the cms-service-webtools list several hours earlier, filled with complete technical details and without other follow-up on that ticket.

Collecting facts

Use wassh -h 'vocms[132,133]' "command" (but check the up-to-date machine list from here) with the commands below to automate data gathering across systems.
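
For example, a minimal sketch of gathering basic facts across the cluster; the host list is taken from the example above and must be adapted to the current machine list, and SERVICE and the log path are placeholders.

    # Basic health of every host: load, disk space, memory.
    wassh -h 'vocms[132,133]' "uptime; df -h /data; free -m"

    # Which service processes are running, and since when.
    wassh -h 'vocms[132,133]' "ps auxwf | grep -i '[S]ERVICE'"

    # Most recent lines of the service logs (path is a placeholder).
    wassh -h 'vocms[132,133]' "tail -n 50 /data/srv/logs/SERVICE/*.log"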

See also How to Report Bugs Effectively and How To Ask Questions The Smart Way.