Week of 120109

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site, Cooldown Status, News


Monday:

Attendance: local(Massimo, Alessandro, Cedric, Ian, Kate, Eddy, MariaD, MariaG, AndreaV, Jamie, Steve, Gavin, Simone, Stefan, Maarten); remote(Tore, Stephane, Michael, Gonzalo, Onno, Jhen-Wei, Paolo, Tiju, Burt, Pavel, Rob)

Experiments round table:

  • ATLAS reports -
    • GGUS 77923: cvmfs issue in TAIWAN. Solved
    • FTS pilot 2.2.8 : No problems observed. The migration of the FTS production machines at CERN should now be prepared.
    • GGUS 77965 : Significant error rate when accessing the LFC at IN2P3-CC. Deletion at IN2P3-CC almost stopped and the success rate of production jobs was degraded. Possibly related to one of the front-ends.

  • CMS reports -
    • LHC / CMS detector
      • Shutdown
    • CERN / central services
      • Nothing to report
    • T0
      • LHE MadGraph production at 8 TeV continues on the Tier-0 farm
    • T1 sites:
      • Problem reported with the job robot at CNAF over the weekend. Since recovered
    • T2 sites:
      • NTR

  • LHCb reports -
    • MC11: Monte Carlo productions
      • T0
        • Preparing for downtime tomorrow. Will jobs in the system be killed or suspended?
      • T1

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: Some transfer problem noticed last week now solved (no tickets)
  • FNAL: ntr
  • IN2P3: ntr (CMS will provide the needed contacts for the US network)
  • KIT: ntr
  • NDGF: ntr
  • NLT1: ATLAS/LHCb agreed to discuss the possibility of running some tests on the settings of the CERN-SARA channel. They will keep SARA in the loop
  • PIC: ntr
  • RAL: ntr
  • OSG: ntr

  • CASTOR/EOS: Reminder of future interventions:
    • Tomorrow 10:00-12:00 upgrade of CASTOR PUBLIC (important improvements on the tape part); transparent (except for some OPS tests)
    • Wednesday 9:00-12:00 upgrade of the CASTOR Name Server: NOT transparent, ALL instances (ALICE, ATLAS, CMS, LHCb) affected; this is to migrate the DB to Oracle 11g and move it to new hardware
  • Central Services:
    • A SIR for the LSF problem in December will be provided (Platform has not yet come back with a full solution)
    • ATLAS is discussing the deployment of FTS 2.2.8 with IT. CMS has not observed problems so far. LHCb will also be kept in the loop. A staged roll-out should be possible
    • The LSF master will most probably be replaced/upgraded this Wednesday (at the same time as the CASTOR Name Server upgrade)
    • Experiments should communicate how they want their queues handled during the CASTOR upgrade
  • Data bases: See the CASTOR line
  • Dashboard:

AOB: (MariaDZ) A file with drills of all real ALARM tickets since the last WLCG MB (6 weeks) is attached at the end of this twiki page.

Tuesday:

Attendance: local(Massimo, Cedric, Gavin, Eddy, Eva, Maarten, MariaD); remote(Gonzalo, Michael, Paolo, Tiju, Burt, Rolf, Ronald, Rob).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • ntr
    • T1s
      • ntr
    • T2s
      • SE issue on UKI-SOUTHGRID-BHAM-HEP GGUS:78007. Solved.
      • Incorrect computation of space token sizes for 2 StoRM sites : TECHNION-HEP GGUS:77963 and WEIZMANN-LCG2 GGUS:77962

  • CMS reports -
    • Nothing to report from CMS
    • Markus (CRC) will not be able to attend the daily chat today (and Thursday and Friday)

  • LHCb reports -
    • Experiment activities
      • MC11: Monte Carlo productions
    • T0
      • For Wednesday downtime no special treatment needed for queues by PES
    • T1

Sites / Services round table:

  • ASGC:
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT:
  • NDGF:
  • NLT1:
  • PIC: ntr
  • RAL: Connectivity problem under investigation (probably DNS related). Site in downtime until 5pm
  • OSG: Discovered bug affecting sites using GUMS for auth. Users with more than one certificate in VOMS Admin 2.6.1 end up with no certificate in GUMS. 600 ATLAS and CMS users affected. Investigating together with VOMS developers. A ticket will be posted soon.

  • CASTOR/EOS: CASTORPUBLIC upgrade successful.
  • Central Services: LSF master upgrade taking place tomorrow. It is assumed that no special arrangements for the LSF queues are needed during the CASTOR intervention (otherwise, experiments should contact Gavin before 5pm today).
  • Data bases: CASTOR DB intervention (Name Server) tomorrow morning (9-12) for the upgrade (11g included). The ATLAS on-line DB will be upgraded to 11g in the afternoon (2-6pm)
  • Dashboard: ntr

AOB:

Wednesday:

Attendance: local(Massimo, Cedric, Eddie, Edoardo, Gavin, Alessandro, Lola, Maarten, Eva, Giuseppe, Maria Dimou); remote(Michael, John, Giovanni, Burt, Markus, Rolf, Rob, Ron).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • CASTOR problems after NameServer database upgrade.
    • T1s
      • RAL : DNS problem finally fixed yesterday evening.
    • T2s
      • UKI-NORTHGRID-MAN-HEP : storage problem GGUS:78073
      • RO-14-ITIM : missing library GGUS:78071
      • DESY-HH : network problem

  • CMS reports -
    • The CASTOR nameserver DB is being upgraded to Oracle 11g today. The system is still not back up. Can we get a time estimate?

  • ALICE reports -
    • Yesterday at CERN almost all ALICE jobs failed. GGUS:78064 was opened. The root cause was the presence of an unexpected mount point /opt/alien on lxplus (also on lxbatch) containing an old AliEn version and thereby affecting the build of the AliEn version used on the WN. The same build worked at a few other sites, for reasons that have not been understood, which confused the investigations. Around midnight a new build with a workaround was put into production.

  • LHCb reports -
    • Experiment activities
      • MC11: Monte Carlo productions
    • T0
      • Problem with access to CASTOR after the downtime today; it was swiftly fixed by the CASTOR team (GGUS:78103)
    • T1

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: Preliminary announcement: on Feb 7 the site will be down for a series of interventions
  • KIT:
  • NDGF:
  • NLT1: CVMFS problem at NIKHEF. Due to config changes, mounts were not possible on the WNs. This was solved by clearing all the caches and restarting the service on the WNs. The problem was solved within one and a half hours. LHCb submitted GGUS:78085 for this.
  • PIC:
  • RAL: DNS problem: back at 9pm. Investigating the root cause.
  • OSG: Working to solve the problem reported yesterday (GUMS/VOMS Admin problem with multiple certificates). We should make sure VOMRS is not retired before a tested solution is available

AOB:

  • ATLAS asked about the plans for the 11g upgrade at the Tier1s. We will touch base at the next T1coord meeting.

Thursday:

Attendance: local(Massimo, Maria Dimou, Cedric, Eva); remote(Michael, Gonzalo, Tore, Jhen-Wei, Ronald, Xavier, Gareth, Rob, Burt, Rolf, Paolo).

Experiments round table:

  • ALICE reports -
    • T0/Central Services
      • There was a configuration problem in the torrent build. Experts are working on it.
    • T1s
      • FZK: no space left on the xrootd disk-only instance made jobs write their output to other sites. This situation strained the GridKa firewall and affected all experiments using the site. The limit on running ALICE jobs has been lowered to 1000.

Sites / Services round table:

  • ASGC: Next Wed (2am-10am UTC) the site will be in downtime (power)
  • BNL: Next Tue in the shadow of the ATLAS central DB intervention, they will configure hot backup for their dCache storage system (chimera postgres db).
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: CE problem (publication problem)
  • NDGF: ntr
  • NLT1: ntr
  • PIC: Next Tue (11-12 UTC) job submission will be unavailable due to an intervention (running and queued jobs won't be affected)
  • RAL: ntr
  • OSG: ntr

  • CASTOR/EOS:
    • ATLAS 2.1.11-9 upgrade done (transparent). The upgrade of CASTOR ALICE was cancelled (CDB incident) and it will be rescheduled.
  • Central Services:
    • A new FTS 2.2.8 release candidate was installed on the pilot fts-pilot-service (Oracle 11g) this morning. It fixes the "process not found in /proc" problem that CMS found and correctly discovers the gsiftp endpoints on EOS. At least one more release candidate is expected, to address a deployment problem.
  • Data bases: PVSS ATLAS replication (online-offline) is down (investigating). Conditions replication had problems and is catching up.
  • Dashboard: ntr

AOB:

Friday:

Attendance: local(Massimo, Cedric, Eva, Maarten, Stefan, Gavin, Eddie); remote(Michael, Gonzalo, Onno, Kyle, Burt, Rolf, Paolo, Andreas, John).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • ntr
    • T1s
      • In order to finish high priority tasks, T1s are asked to reduce the share for analysis from 20% to 5% of the pledged computing resources. The sooner, the better.
    • T2s
      • GOEGRID : Low transfer efficiency GGUS:78212
      • LRZ-LMU : Too little space left on local disk to run jobs GGUS:78204

  • CMS reports -
    • Nothing to report (not able to connect)

  • ALICE reports -
    • NTR
  • LHCb reports -
    • Experiment activities
      • MC11: Monte Carlo productions
    • T0
      • SAM jobs failing when accessing CEs (GGUS:78185)
    • T1
      • IN2P3: SAM issues when testing the shared software area, fixed (GGUS:78054)

Sites / Services round table:

  • ASGC:
  • BNL: ntr (ATLAS shares already done)
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: ntr
  • NDGF: ntr
  • NLT1: Future interventions: 18-Jan outage at NIKHEF; 1-Feb outage at SARA (FTS and LFC LHCb). Details on GOCDB
  • PIC: ntr
  • RAL: Future interventions: 16-19 Jan transparent SRM interventions on ATLAS,...,LHCb (one per day). Details on GOCDB
  • OSG: ntr

  • CASTOR/EOS: ntr
  • Central Services: Will look into the CE (LCG-CE) problem at CERN (monitoring problem)
  • Data bases: ATLAS replication problem solved
  • Dashboard: ntr

AOB:

-- JamieShiers - 29-Nov-2011

Topic attachments
ggus-data.ppt (PowerPoint, r1, 2336.0 K, 2012-01-09 11:03, MariaDimou): Final ALARM drills for the 2012/01/10 MB covering 6 weeks
Topic revision: r22 - 2012-01-13 - MaartenLitmaath