Troubleshooting on CMSWEB cluster

Reporting an incident

Note that lemon alarms exist for all CMSWEB services and are raised on common events such as a service crash or unresponsiveness. CERN/CC Operators and Sysadmins react and attempt recovery procedures within 5 to 30 minutes of the event, even outside office working hours. See the separate document for the lemon alarm procedures. If there are any ongoing alarms, the CMSWEB availability will reflect them accordingly.

Before reporting an incident, please also check the CMSWEB interventions page first.

Otherwise, follow this procedure to report an incident with a service:

  1. Make sure a problem has really occurred. Repeat the action twice to verify the failure is repeatable.
  2. Report the problem by opening a GGUS ticket and assigning it to the “CMS Web Tools” CMS Support Unit. Provide as much detailed information as you can.
  3. Incidents will be handled within a few hours during working hours, and on a best-effort basis during non-office hours.
  4. If the issue is urgent, you can contact the Computing Run Coordinator (CRC) on duty after opening the ticket. In some cases the CRC may phone the needed experts (not necessarily from the CMS HTTP Group) if they judge it necessary.

Should you need contact names for urgent CMSWEB issues, they are:

Phone numbers, instant messaging handles, etc. of each contact can be found in the CERN phonebook and in SiteDB.

Order of response

The order in which people are normally expected to respond to CMSWEB incidents:

  1. CERN/CC Operators and CERN/CC Sysadmins (if this is a lemon alarm)
  2. CMS HTTP Group (cms-http-group e-group members)
  3. Service contacts. All service responsibles are in the cms-service-webtools e-group.
  4. CRCs (cms-crc-on-duty e-group members)
  5. VOC (VOBox-VOC-CMS e-group members)

Recovering a service

IMPORTANT

If someone has reported, or you have found, a problem with a service you operate or are responsible for, please follow the troubleshooting and diagnosis steps below. It is important that you keep updating the GGUS ticket so others know you are active and which actions you have taken.

Troubleshooting and diagnosis

  1. Calm down and make sure you are fully awake.

  2. Communicate. File a GGUS ticket and assign it to the “CMS Web Tools” CMS Support Unit if this has not been done yet. Add a comment to the ticket that you are working on the issue.

  3. Make a plan. Update ticket as you make progress; this is important so others know you are active.

  4. If you suspect security compromise, involve CERN security team and the CMS security board. This is unlikely, but could happen.

  5. Collect facts and paste some or all of them into the GGUS ticket. Attach very large output as files; small output can be included directly.

  6. For servers which run but do not respond, when pystack from the fact collection did not tell you why (a combined command sketch for steps 6-8 follows this list):

    • Log on to the same back-end host and try curl -4vL http://localhost:PORT/URL to see if the server responds locally.

    • If you can, log on to the appropriate front-end and try the curl from there, to see if basic network connectivity works. The backend firewalls restrict connections to the servers to the front-ends and localhost only, so failure to access the server from elsewhere at CERN or from outside is not an error.

    • From any system at CERN, try curl -4kvL http://cmsweb.cern.ch/service; for dqm, add --key ~/.globus/userkey.pem --cert ~/.globus/usercert.pem so the request carries your certificate.

    • In all of the above attempts, check the backend application logs for any messages, and check the frontend logs /data/srv/logs/frontend/*.txt (access_log_yyyymmdd and error_log_yyyymmdd) for interesting messages: certificate rejected, proxy access failed, …

    • If all of those work, the problem is in the client’s connectivity to cmsweb and it is highly unlikely to be your problem; there have been situations where cmsweb was at fault, but debugging them requires rather a lot of skill, so it is probably best to leave that to others.

    • If some of those do not work, you have identified the link that is not operational; focus on getting that solved.

    • If the server uses a database and every attempt to access the server just hangs, it may have a problem connecting to the database. Try connecting to the database yourself with the parameters from /data/srv/current/auth by manually running sqlplus user@tns-name; do not do anything, just try to connect and then exit. Also look for files like sqlnet.log for Oracle errors.

  7. If the server has died, i.e. it is simply not running, and you cannot find why it was killed, there does not seem to be attack activity, the disks are not full, etc., try to consult the developer, preferably via IM chat if you can. If you do not get an answer, or it is the middle of the night or some other similar situation, just restart the server with sudo -H -u _foo bash -lc "/data/srv/current/config/foo/manage restart 'I did read documentation'". Do the same if the server is stuck and you cannot get the developers to investigate. However, do not continually restart a server; if you have to, something else must be wrong. More than 3 restarts a day per service is “continually.”

  8. If the logs filled the disk, try to find where and why, and get rid of the oversized files. Try not to destroy the logs since they might contain something useful; compress them with bzip2 somewhere else if you can. If it is just simple blind repetition and you cannot save the log without deleting something, you may delete some of the repeated log entries in between, but leave the leading and trailing parts of the log to assist later analysis.
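
As a reference, the following is a minimal command sketch for the checks in steps 6-8 above. It is not authoritative: PORT, URL, SERVICE, BACKEND-HOST, USER, TNS-NAME, BIGLOG and /other/disk are placeholders to adapt to your service, and the exact log file names and paths may differ on your deployment.

    # Step 6: does the server respond locally on the back-end host?
    curl -4vL http://localhost:PORT/URL

    # Step 6: does the server respond from a front-end host? (The back-end
    # firewall only accepts connections from the front-ends and localhost.)
    curl -4vL http://BACKEND-HOST:PORT/URL

    # Step 6: does the full chain work from any system at CERN?
    # For dqm, add your grid certificate.
    curl -4kvL http://cmsweb.cern.ch/SERVICE \
         --key ~/.globus/userkey.pem --cert ~/.globus/usercert.pem

    # Step 6: look for interesting messages in the front-end logs while
    # repeating the requests (file names follow the yyyymmdd convention).
    grep -i -e certificate -e proxy /data/srv/logs/frontend/error_log_$(date +%Y%m%d)*

    # Step 6: check database connectivity with the parameters from
    # /data/srv/current/auth; just connect, then type "exit".
    sqlplus USER@TNS-NAME

    # Step 7: restart a dead or stuck server (foo is the service name);
    # do not do this more than a few times a day.
    sudo -H -u _foo bash -lc "/data/srv/current/config/foo/manage restart 'I did read documentation'"

    # Step 8: if the logs filled the disk, find the biggest offenders and
    # compress them to another disk instead of deleting them outright.
    du -sh /data/srv/logs/* | sort -h | tail
    bzip2 -c /data/srv/logs/SERVICE/BIGLOG.log > /other/disk/BIGLOG.log.bz2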

In case of a real emergency, such as a security breach or a complete cluster break-down, call the HTTP group leader (Diego) or the L2 managers. Note that although real emergencies can and do happen, most people vastly overestimate the significance of events. Do not confuse a few hours of inconvenience with an emergency! At the very minimum, the call must be preceded by a genuine major failure event, significant pressure to act, and a GGUS ticket copied to the cms-service-webtools list several hours earlier, filled with complete technical details and without other follow-up on that ticket.

Collecting facts

Use wassh -h 'vocms[132,133]' "command" (but check the up-to-date machine list from here) with the commands below to automate data gathering across systems.
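
For example, a minimal sketch of gathering basic facts across the cluster; the host list is taken from the example above and must be adapted to the current machine list, and SERVICE and the log path are placeholders.

    # Basic health of every host: load, disk space, memory.
    wassh -h 'vocms[132,133]' "uptime; df -h /data; free -m"

    # Which service processes are running, and since when.
    wassh -h 'vocms[132,133]' "ps auxwf | grep -i '[S]ERVICE'"

    # Most recent lines of the service logs (path is a placeholder).
    wassh -h 'vocms[132,133]' "tail -n 50 /data/srv/logs/SERVICE/*.log"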

See also How to Report Bugs Effectively and How To Ask Questions The Smart Way.