Week of 120903
Daily WLCG Operations Call details
To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
- The SCOD rota for the next few weeks is at ScodRota
WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments
General Information
Monday
Attendance: local(Massimo, Doug, Maarten, Philippe, Marcin, Ivan);remote(Michael, Saverio, Elisabeth, Onno, Stephane, Jhen-Wei, Thierry, Tiju, Gonzalo, Vladimir, Dmitri, Rolf, Ian).
Experiments round table:
- ATLAS reports -
- CERN/T0
- GGUS:85704 (ALARM), Friday night, 2 issues:
- Files produced by T0 were migrated to tape too slowly, causing a backlog. The tape buffer was increased (100 TB) during the night and, on Saturday, CERN-IT throttled down tape allocations to ALICE, dedicating drives to the other VOs. The backlog was cleared in a few hours. Solved.
- One disk server went down at the same time, so 161 files could not be accessed for tape migration or data export. Files provided by SFO were recovered. For merged RAW and derived data produced by T0, ATLAS is waiting for the disk server to come back (no very urgent processing was requested by ATLAS for this weekend). Not solved yet.
- T1
- FZK: GGUS:85721: ATLAS files could not be staged in from tape starting Saturday. Should have been solved this morning.
- ALICE reports -
- CERN: migrating conditions data to the corresponding name space in EOS.
- CNAF: debugging Torrent method for SW installation.
- LHCb reports -
- Running user analysis, prompt reconstruction and stripping at T0 and T1s
- Simulation at T2s
- New GGUS (or RT) tickets
- T0:
- Constant rate of aborted pilots observed during last week (GGUS:85385); Solved
- T1 :
- IN2P3: Job failures, cannot load shared library, site was banned for production (GGUS:85644); Under investigation
- GRIDKA: Pilots failed at two CEs (GGUS:85716)
Sites / Services round table:
- ASGC: ntr
- CNAF: ntr
- IN2P3: ntr
- KIT: downtime "at risk" (17-SEP-2012 5:00-7:30) to intervene on the LAN. Short interruptions possible.
- NDGF: Same problem as Friday happened a second time (power). Fixed this morning
- NLT1: ntr
- PIC: ntr
- RAL: ntr
- OSG: ntr
- CASTOR/EOS: Working on recovering the node blocking ATLAS files. The "ALARM" part of the same ticket (slow migration to tape) is actually solved. Root cause under investigation.
- Central Services: NTR
- Data bases: The roll-out of security patches has started today and will continue instance-by-instance. In general this should be transparent unless stated otherwise.
- Dashboard: Problem in making the SAM test results available. The system should be back tomorrow (it seems connected to a change of execution plan in the database backend).
AOB:
Tuesday
Attendance: local(Jamie, Steve, Stephen, Kate, Doug, Ivan, Maarten);remote(Michael, Elizabeth, Rolf, Gonzalo, Saverio, JT, Xavier, Lisa, Thierry, Jhen-Wei, Tiju, Salvatore).
Experiments round table:
- ATLAS reports -
- CERN/T0 Nothing to report
- T1
- FZK - Timeout issues staging to tape - GGUS:85721 - Few errors (3) in the last 12 hours
- Atlas Internal
- RAL - file deletion changed to speed up clean up of SCRATCH disk
- TRIUMF LFC migration to CERN / Central deletion stopped
- CMS reports -
- LHC / CMS
- CERN / central services and T0
- Transfers to Tier-1s interrupted at the weekend due to heavy user load, GGUS:85720. Should improve once users have moved to Tier-2 facilities.
- Tier-1/2:
- ALICE reports -
- CNAF: running OK with Torrent since yesterday evening.
- LHCb reports -
- Running user analysis, prompt reconstruction and stripping at T0 and T1s
- Simulation at T2s
- T0:
- T1 :
- IN2P3: Job failures, cannot load shared library, site was banned for production (GGUS:85644); Wrong version of tcmalloc with link to AFS and wrong version of gcc
- GRIDKA: Pilots failed at two CEs (GGUS:85716); Still failing
- PIC: Pilots aborted (GGUS:85746); Medium queues were removed, but they are still in BDII
Sites / Services round table:
- ASGC: NTR
- BNL: NTR
- CNAF: NTR
- FNAL: NTR
- IN2P3: NTR
- KIT: NTA
- NDGF: Still problems in Denmark; same electricity problem; electricians on site; will probably have to take some pools down during the night
- NLT1: ntr
- PIC: ntr
- RAL: ntr
- OSG: ntr
- CASTOR/EOS:
- Central Services: ntr
- Data bases: Applying the latest Oracle security patch to integration DBs, all of which should be done today; production in 2-3 weeks
- Dashboard: ntr
AOB:
Wednesday
Attendance: local(Massimo, Steve, Marcin, Julia, Doug, Ivan, Maarten, Maria);remote(Michael, Elizabeth, Rolf, Gonzalo, Saverio, Dmitri, Lisa, Thierry, Jhen-Wei, Tiju, Ron, Vladimir).
Experiments round table:
- ATLAS reports -
- CERN/T0 Nothing to report
- T1
- CMS reports -
- LHC / CMS
- CERN / central services and T0
- Tier-1/2:
- Security Challenge/Training launched yesterday. Many sites informed of two "malicious" DNs.
- LHCb reports -
- Running user analysis, prompt reconstruction and stripping at T0 and T1s
- Simulation at T2s
- New GGUS (or RT) tickets
- T0:
- T1 :
- IN2P3: Job failures, cannot load shared library, site was banned for production (GGUS:85644); Wrong version of tcmalloc with link to AFS and wrong version of gcc; Fixed
- GRIDKA: Pilots failed at two CEs (GGUS:85716); Probably fixed
- PIC: Pilots aborted (GGUS:85746); Medium queues were removed, but they are still in BDII; Fixed
Sites / Services round table:
- ASGC: ntr
- BNL: ntr
- CNAF: ntr
- FNAL: ntr
- IN2P3: ntr
- KIT: The problem seen by LHCb was more on the WN side. It looks like it has been fixed.
- NDGF: Power problems (DK) seem to be fixed after the electricians' intervention
- NLT1: ntr
- PIC: ntr; Maarten commented that when a CE is removed it stays in state UNKNOWN in BDII for 12h: software should take this behaviour into account
- RAL: ntr
- OSG: ntr
- CASTOR/EOS: vobox rebooted on request of CMS
- Central Services: 10% of the batch nodes have been upgraded to the latest gLite 3.2. Next week the remaining 90% will be upgraded.
- Data bases: ntr
- Dashboard: ntr
AOB:
- On Tuesday 18 September the daily operations meeting will be filmed as part of a TV documentary.
Thursday
No meeting (Jeune Genevois)
Experiments summaries:
Friday
Attendance: local(Massimo, Steve, Marcin, Ivan, Stephen);remote(Michael, Rolf, Gonzalo, Saverio, Lisa, Jhen-Wei, John, Alexandre, Vladimir, Kyle, Xavier).
Experiments round table:
- ATLAS reports -
- CERN/T0
- Problems with EOS overnight (GGUS:85919); after a headnode restart things seem fine
- T1
- CMS reports -
- LHC / CMS
- CERN / central services and T0
- Tier-1/2:
- LHCb reports -
- Running user analysis, prompt reconstruction and stripping at T0 and T1s
- Simulation at T2s
- Validation of reprocessing at T1s and selected T2s
- New GGUS (or RT) tickets
- T0:
- T1 :
Sites / Services round table:
- ASGC: Tape library problem. Vendor call opened.
- BNL: ntr
- CNAF: ntr
- FNAL: ntr
- IN2P3: ntr
- KIT: 40k files overdue to tape. Expected to be solved at the beginning of next week.
- NLT1: ntr
- PIC: ntr
- RAL:
- Switch problem yesterday. Some jobs (and some transfers) were lost
- New tapes for LHCb
- CASTOR LHCb will be upgraded next week (2.1.12)
- OSG: ntr
- CASTOR/EOS: ATLAS requested the status of missing files on EOS: investigation ongoing (ticket GGUS:85933 et al.)
- Central Services:
- Data bases:
- Dashboard:
AOB:
--
JamieShiers - 02-Jul-2012