Week of 150406
WLCG Operations Call details
- At CERN the meeting room is 513 R-068.
- For remote participation we use the Vidyo system. Instructions can be found here.
General Information
- The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
- The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
- The SCOD rota for the next few weeks is at ScodRota
- General information about the WLCG Service can be accessed from the Operations Web
- Whenever a topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.
Monday: Easter Monday holiday
- The meeting will be held on Tuesday instead.
Tuesday
Attendance:
- local: Maarten (SCOD + ALICE)
- remote: Alessandro (ATLAS), Christoph (CMS), Di (TRIUMF), Jeremy (GridPP), Kyle (OSG), Lisa (FNAL), Onno (NLT1), Pepe (PIC), Rolf (IN2P3), Tiju (RAL), Ulf (NDGF)
Experiments round table:
- ATLAS reports (raw view) -
- Central Services
- CERN-PROD GGUS:112835: one or more files were not staged within the timeout (which is now 2 days). These files should be checked; such a long wait to prestage files is not normal.
- CMS -
- Sorry, nobody could join the call today.
- Nothing to report
- ALICE -
- High activity, except between Sun ~19:00 and Mon ~07:00 CEST
Sites / Services round table:
- ASGC:
- BNL:
- CNAF:
- FNAL:
- downtime this Wed 06:30-19:00 local time for power maintenance affecting the building hosting the WN
- GridPP: ntr
- IN2P3: ntr
- JINR:
- KISTI:
- KIT:
- NDGF:
- received a lot of ATLAS data OK over the weekend
- NL-T1:
- congratulations on the first beams in the LHC!
- NRC-KI:
- OSG: ntr
- PIC:
- last week there was a 2-day downtime for a UPS upgrade, which went NOT OK:
- a major electrical fault occurred towards the end of the maintenance, which had to be extended by 4h
- a DDN system hosting ~100 TB then suffered a RAID controller misconfiguration, which took ~10h to solve
- a broken air-conditioning unit affected two thirds of the CPU capacity, which had to be kept switched off over Easter
- at the moment the T1 is running with 50% of the CPU
- more blades will be moved to the main computer room to recover part of the missing capacity
- the new UPS is suspected to have been installed incorrectly; if so, that will have to be fixed in the near future
- RAL:
- quiet Easter
- TRIUMF-RAL network issue under investigation
- CASTOR downtime for upgrade tomorrow
- RRC-KI:
- TRIUMF: ntr
- CERN batch and grid services:
- CERN storage services:
- Databases:
- GGUS:
- Grid Monitoring:
- MW Officer:
AOB:
Thursday
Attendance:
- local: Maarten (SCOD + ALICE), Steve (grid services)
- remote: Alessandro (ATLAS), Dennis (NLT1), Di (TRIUMF), Felix (ASGC), John (RAL), Matteo (CNAF), Rob (OSG), Rolf (IN2P3), Sang-Un (KISTI), Ulf (NDGF), Zoltan (LHCb)
Experiments round table:
- ATLAS reports (raw view) -
- Central Services
- Any news from RAL? According to GOCDB the downtime is over; is everything OK?
- Alessandro: can we use the FTS-3 instances again?
- John: AFAIK it was proposed that ATLAS wait until Monday; will be followed up offline
- Alessandro: OK, we keep using CERN and BNL for now
- LHCb reports (raw view) -
- User and MC jobs are running in the system. The failure rate of the jobs is very low.
- No issues to report
Sites / Services round table:
- ASGC: ntr
- BNL:
- CNAF: ntr
- FNAL:
- GridPP:
- IN2P3: ntr
- JINR:
- KISTI: ntr
- KIT:
- NDGF:
- Downtime for storage 11-12 UTC on Monday due to head node reboots to get new kernels.
- We got an ATLAS ticket complaining about lack of space: the space token for MCTAPE had been filled. We increased it.
- NL-T1: ntr
- NRC-KI:
- OSG:
- can ATLAS have a look at GGUS:112470 about SRM failures?
- there appears to be an issue with the latest monthly accounting report, possibly related to the introduction of multi-core records; this matter is being looked into
- PIC:
- RAL:
- yesterday a downtime was scheduled for a CASTOR upgrade
- it was a good opportunity to reboot a switch as well
- that unexpectedly caused a lot of network problems, which took until this morning to resolve
- the CASTOR upgrade had to be canceled; a new date will be agreed
- TRIUMF: ntr
- CERN batch and grid services: ntr
- CERN storage services:
- Databases:
- GGUS:
- Grid Monitoring:
- MW Officer:
AOB:
--
AndreaSciaba - 2015-02-27