DRAFT

ATLAS Rucio authentication service incident, Aug 20, 2021

ATLAS ADC Report

Timeline (CEST)

  • Friday 22:15 multiple Rucio nodes were taken out of the LANDB set ATLAS RUCIO, blocking all access to the Rucio AUTH service
  • 23:00 noticed HammerCloud mass auto-exclusion
  • DDM ops had no permission to add these nodes back to the LANDB set, except via puppet runs (which was not clear at the time)
  • 23:30 Snow ticket INC2888026 and GGUS ALARM ticket GGUS:153548 opened
  • DC operator tried to contact DB experts by phone, without success (and they were not the appropriate team)
  • 00:46 Configuration Mgmt issue confirmed, with a suggestion to contact those experts; some hours later it was clear the issue wouldn't be solved, as there was no contact with Configuration Mgmt experts (only DB for physics has a piquet service)
  • discussion of options via Mattermost (Petr, Martin)
  • 03:55 workaround implemented by moving auth services to a different, accessible node
  • 10:00 Configuration Mgmt expert says the root cause was disabled puppet: if puppet has not run for 30 days, the node is dropped from the LANDB set. Puppet had been disabled because of an unrelated issue with manifests on a small number of machines (including the authentication machines)

Impact

  • ATLAS distributed computing offline for 6hrs (monit link)

Lessons

  • LANDB sets were previously managed manually; it was not clear that an active puppet run is required to keep a node in the LANDB set

Suggestions

  • warning to be given prior to automated LANDB set changes
  • automated LANDB set changes only during working hours

Actions

  • Clarify with CERN-IT expectations surrounding out-of-hours support from experts
  • Discuss with CERN-IT options to triage an ALARM ticket during vs. out of hours
  • ADC CSOPS to review ATLAS nodes with disabled puppet - DONE
  • Check if lack of puppet run has other side effects
  • Thanks to Martin, Petr, and IT-CM colleagues for exceptional efforts throughout the night.

CERN IT-CM additions

  • GGUS alarm procedure is https://cern.service-now.com/service-portal?id=kb_article&n=KB0002299
    • Operator called (tbc) IT-DB team, possibly pattern match on "LAN[DB]"?
    • No SMS was sent to the config team despite this being in the procedure, but the need to call them was identified only after 2 hours of debugging.
    • Procedure should be improved to ask operator to circle back on procedure later on, in case the involvement of another team (Configuration Management in this case) is needed after initial ticket treatment. This would have sent an early SMS to IT-CM-LCS (who run Configuration Management).
  • Two tickets muddled it slightly (original Network ops + GGUS alarm): IB from the config team, looking at 08.30, debugged and tried to update the ticket owned by netops (but couldn't), so when I looked at 09.30 I didn't know he had.
  • The behaviour of the Puppet/LANDB set updater, though documented, is somewhat non-obvious. A warning email or such on imminent drop might be helpful, and possibly in-working-hours only updates (though this causes other issues with out-of-hours interventions).
  • Dropping out of PuppetDB can cause other issues too (DNS LB drop, for example). Generally, the message should be disseminated more widely that leaving Puppet running is usually better than disabling it (even if it's in an error state).

Actions

  • Review alarm operator procedure to ensure operator circles back on procedure after new information is available
  • Disseminate more widely the behaviour of PuppetDB, and that leaving Puppet disabled is rarely a good idea
  • (as above) Review ATLAS suggestions (LANDB set automated pre-warning and changes only in working hours)
  • (as above) Clarify expectations on support level together with ATLAS and others
  • Even noisier alarms than already exist about puppet not running - https://its.cern.ch/jira/browse/AI-6143 currently contains a blockage to make it impossible.

Useful commands

  • Nodes at risk of removal from PuppetDB, e.g. currently 10 hosts are close to removal.

ai-pdb raw /pdb/query/v4/nodes --query '["<=","facts_timestamp","2021-07-29T08:30:22.298Z"]' | jq -r '.[].certname'
kc-lb-qa-9f3c2911d7.cern.ch
kc-lb-qa-9a967b16f2.cern.ch
p06036580r67456.cern.ch
....
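
The timestamp in the query above is typed by hand. As a minimal sketch, the cutoff could instead be computed from a number of days. This assumes GNU date (for the -d option); the function name pdb_stale_query and the 25-day threshold are hypothetical, chosen only because 25 is below the 30-day limit described in the timeline.

```shell
#!/bin/sh
# Sketch: build the PuppetDB query string with a computed cutoff instead of
# a hand-typed timestamp. Assumes GNU date (the -d option is not POSIX).

# pdb_stale_query DAYS  (hypothetical helper, not part of any CERN tool)
# Prints a query matching nodes whose facts_timestamp is at least DAYS days old.
pdb_stale_query() {
    days="$1"
    # UTC timestamp DAYS days in the past, in the same layout as the
    # hand-written example (milliseconds fixed to .000).
    cutoff="$(date -u -d "${days} days ago" '+%Y-%m-%dT%H:%M:%S.000Z')"
    printf '["<=","facts_timestamp","%s"]\n' "$cutoff"
}

# Example: 25 days is an assumed warning threshold, not a documented value.
pdb_stale_query 25

# The result can be passed to the same command as above:
#   ai-pdb raw /pdb/query/v4/nodes --query "$(pdb_stale_query 25)" | jq -r '.[].certname'
```

Keeping the helper limited to building the query string means it can be checked without access to PuppetDB; only the final commented command needs the ai-pdb tool.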
Topic revision: r11 - 2021-09-08 - SteveTraylen
 