Week of 140113
WLCG Operations Call details
- At CERN the meeting room is 513 R-068.
- For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
- Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
- To have the system call you, click here
- In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.
General Information
- The SCOD rota for the next few weeks is at ScodRota
- General information about the WLCG Service can be accessed from the Operations Web
Monday
Attendance:
- local: Felix, Stefan, Alberto, Alessandro, Belinda, Andrej, Maarten, Eddie
- remote: Rolf, Michael, Lisa, David, Xavier, Kyle, Sang-Un, Tiju, Pepe, Salvatore, Christian
Experiments round table:
- ATLAS
- Central services/T0
- GGUS:100110 FTS3 pilot service will be put back (most probably tomorrow) for ATLAS Functional Tests
- Andrej Filipcic has joined the ADC operations team
- Asking CERN IT/PES whether Frontier squids can be run as a service
- Maarten: do you still see OpenSSL-related errors? Alessandro: for data transfers no; otherwise will get back to you
- CMS
- Running 2011 legacy MC rereco and 13 TeV MC at T1 sites
- GGUS:100198 Requesting access for FNAL to replicate CMS VOMS DB
- A few other GGUS tickets open to T2s and T3s
- LHCb reports
- Heavy ion reprocessing completed except for 2 files (events too complex, jobs timing out)
- MC and user jobs only: no tape recalls required, only disk access for user jobs and uploads from MC and MC-merging jobs
- T0:
- T1:
- SRM instability at IN2P3. The downtime is over, but the SE is still under scrutiny
- T2:
- Problems with CBPF storage: the number of concurrent transfers was limited to 4 (see the sketch below)
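For illustration, a minimal sketch of what such a cap amounts to on the submitting side: a semaphore with 4 slots queues any surplus transfers. The transfer function and file names are hypothetical stand-ins, not LHCb/DIRAC code.

```python
import threading
import time
import random

MAX_CONCURRENT = 4                        # cap agreed for the CBPF endpoint
slots = threading.Semaphore(MAX_CONCURRENT)

def transfer(name):
    """Hypothetical stand-in for one file transfer to the site."""
    with slots:                           # blocks until one of the 4 slots is free
        time.sleep(random.uniform(1, 3))  # simulate the copy itself
        print(f"{name} transferred")

threads = [threading.Thread(target=transfer, args=(f"file{i:03d}",))
           for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```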
Sites / Services round table:
- ASGC: NTR
- IN2P3: problems with dCache/SRM; the cause was found at the kernel level and a relevant parameter was changed, which should fix the problem. The recommendation is to upgrade to another kernel, which will be done ASAP. It would be useful to be able to limit FTS3 activity during such problems. Stefan: will contact the FTS3 team
- BNL: NTR
- FNAL: NTR
- KIT: NTR
- OSG: was there a BDII problem on Saturday at 7.30 am CT? Stefan: not aware of a problem. Maarten: could you open a ticket a posteriori? Kyle: yes, will do (a basic BDII availability probe is sketched after this list). OSG will not attend next Monday (local holiday)
- KISTI: NTR
- RAL: NTR
- PIC: next Thursday the FTS2 instance will be stopped for 3 hours because of a move to a new machine; already inserted in GOCDB and the experiments were informed. Alessandro: shall we take this opportunity to move to FTS3? Pepe: will get back to you
- CNAF: NTR. Alessandro: is there a plan to upgrade StoRM, and when? Salvatore: it was foreseen for tomorrow but has been postponed; no date is available yet
- NDGF: today's downtime finished fine; Wednesday morning some fixes on pool nodes, with a dCache upgrade at the same time
- Storage: restarted EOS ATLAS namespace this morning to help Rucio renaming
- Grid Services: WMS, GridSite and L&B to be updated soon to EMI3 update 12. Announcement coming soon.
- Monitoring: NTR
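Regarding the OSG BDII question above, a basic availability probe is an anonymous LDAP query against the top-level BDII. The sketch below uses the third-party ldap3 package (an assumption; any LDAP client works); the host, port 2170 and base o=grid follow the usual BDII conventions of the time.

```python
from ldap3 import Server, Connection, SUBTREE

# Standard top-level BDII endpoint and base (assumed; adjust for your infrastructure).
server = Server("lcg-bdii.cern.ch", port=2170)
conn = Connection(server, auto_bind=True)    # anonymous bind, as BDII allows

# Ask for a handful of service entries; an empty or failed result
# suggests the BDII was unreachable or not publishing.
ok = conn.search(search_base="o=grid",
                 search_filter="(objectClass=GlueService)",
                 search_scope=SUBTREE,
                 attributes=["GlueServiceEndpoint"],
                 size_limit=5)
print("BDII responding:", ok, "-", len(conn.entries), "entries")
conn.unbind()
```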
AOB:
Thursday
Attendance:
- local: Simone (SCOD), Maarten (Alice), Belinda (CERN Storage), Felix (ASGC), Alberto (CERN Grid Services), Eddie (CERN Dashboards)
- remote: Philippe (LHCb), Sang-Un (KISTI), Lisa (FNAL), Michael (BNL), Kyle (OSG), Matteo (CNAF), Jeremy (GridPP), Christian (NDGF), Tommaso (CMS), Rolf (IN2P3-CC), Xavier (KIT)
Experiments round table:
- ATLAS reports
- Central services/T0
- No problems seen with the FTS3 functional tests; today the ES cloud was included to use FTS3 for all ATLAS transfers
- Again observed high load on the ADCR database (central catalogue and PanDA queries), causing many site services daemons to time out (Wednesday after 17:15 CET). We were told that high load on the underlying Oracle storage was also observed; could this be the root cause? (A generic timeout/backoff sketch follows this report.)
- T1
- Staging errors at FZK-LCG2 solved after vendor intervention on tape robots
- Input file staging failures continue at INFN-T1 (GGUS:99982)
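As an aside on the timeout symptom above: a common client-side mitigation is bounded retries with exponential backoff and jitter, so stalled daemons do not re-query an overloaded database in lockstep. A generic sketch, not ATLAS site-services code; the names and limits are invented.

```python
import random
import time

def query_with_backoff(query, attempts=5, base_delay=2.0):
    """Run query(); on TimeoutError wait base_delay * 2**i plus jitter
    before retrying, so retry bursts do not pile onto a loaded DB."""
    for i in range(attempts):
        try:
            return query()
        except TimeoutError:
            if i == attempts - 1:
                raise                     # give up after the last attempt
            time.sleep(base_delay * 2 ** i + random.random())

# Hypothetical usage with a stand-in catalogue call:
# result = query_with_backoff(lambda: catalogue.list_datasets("mc12_8TeV"))
```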
- CMS reports
- Running 2011 legacy MC rereco and 13 TeV MC at T1 sites
- Not much to report, smooth operations.
- GGUS:100198 Requesting access for FNAL to replicate CMS VOMS DB [SOLVED]
- GGUS:100315: T2_UK_London_IC, problems with Proxy lifetime [NEW]
- As a note, CMS is starting to send jobs via CRAB also to those T1s that have completed DISK/TAPE separation.
- ALICE
- central services moved to new HW on Jan 14-15
- also more efficient power/UPS distribution and cooling
- Torrent no longer available
- CERN
- high error rates after central services restart, to be debugged
- LHCb reports
- Only MC and user jobs
- T0: NTR
- T1: NTR
- T2:
- problems of transfers to CBPF (FTS3) partially understood (gridftp not returning performance markers, so they were disabled), but transfers longer than ~3650 seconds still fail, although the timeout is 7200 s (see the watchdog sketch below)
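A plausible reading of the ~3650 s failures: with gridftp performance markers disabled, nothing refreshes the transfer's activity timestamp, so an inactivity watchdog (often around 3600 s, shorter than the 7200 s global timeout) fires first. A minimal sketch of that mechanism; the class, names and values are illustrative assumptions, not the FTS3 implementation.

```python
import time

GLOBAL_TIMEOUT = 7200   # overall transfer timeout from the report (s)
MARKER_TIMEOUT = 3600   # assumed no-activity cutoff once markers stop (s)

class MarkerWatchdog:
    """Abort a transfer when no performance marker has arrived for
    MARKER_TIMEOUT seconds, even if GLOBAL_TIMEOUT has not expired."""

    def __init__(self):
        self.start = time.monotonic()
        self.last_marker = self.start

    def on_marker(self):
        self.last_marker = time.monotonic()   # progress reported by the server

    def should_abort(self):
        now = time.monotonic()
        stalled = now - self.last_marker > MARKER_TIMEOUT
        too_long = now - self.start > GLOBAL_TIMEOUT
        return stalled or too_long

# With markers disabled, on_marker() is never called, so should_abort()
# becomes true after MARKER_TIMEOUT (~3600 s), matching the observed failures.
```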
Sites / Services round table:
- OSG: the BDII ticket discussed on Monday was opened (GGUS:100263), but the issue may be hard to debug because the problem fixed itself (disappeared)
- CERN Storage: on Tuesday EOSATLAS had to be put in R/O mode for 1 hour following an accidental reboot. There is also a current EOS LHCb SRM problem which is being looked at.
- Grid Services: CERN's WMS to be updated to EMI3 update 12 on 21 January. ITSSB entry
AOB:
--
SimoneCampana - 16 Dec 2013