Week of 170424

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the operations call, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Tier-1 downtimes

Experiments may experience problems if two or more of their Tier-1 sites are inaccessible at the same time. Therefore Tier-1 sites should do their best to avoid scheduling a downtime classified as "outage" in a time slot overlapping with an "outage" downtime already declared by another Tier-1 site supporting the same VO(s). The following procedure is recommended:

  1. A Tier-1 should check the downtimes calendar to see if another Tier-1 already has an "outage" downtime in the desired time slot.
  2. If there is a conflict, another time slot should be chosen.
  3. If stronger constraints make it impossible to choose another time slot, the Tier-1 will point out the existence of the conflict to the SCOD mailing list and at the next WLCG operations call, to discuss it with the representatives of the experiments involved and the other Tier-1.

As an additional precaution, the SCOD will check the downtimes calendar for Tier-1 "outage" downtime conflicts at least once during his/her shift, for the current and the following two weeks; in case a conflict is found, it will be discussed at the next operations call, or offline if at least one relevant experiment or site contact is absent.
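The conflict check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool used by the SCOD or the downtimes calendar: the `Downtime` record, the `find_conflicts` function, the severity labels and the three-week horizon (standing in for "the current and the following two weeks") are all hypothetical names and choices.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import combinations


@dataclass(frozen=True)
class Downtime:
    """Hypothetical record for one declared downtime."""
    site: str
    vos: frozenset      # VOs supported by the site
    severity: str       # assumed labels: "outage" or "warning"
    start: datetime
    end: datetime


def find_conflicts(downtimes, now, horizon=timedelta(weeks=3)):
    """Return pairs of overlapping "outage" downtimes declared by
    different sites that support at least one common VO."""
    window_end = now + horizon
    # Keep only outages that intersect the look-ahead window.
    outages = [d for d in downtimes
               if d.severity == "outage" and d.end > now and d.start < window_end]
    conflicts = []
    for a, b in combinations(outages, 2):
        overlap = a.start < b.end and b.start < a.end
        if a.site != b.site and overlap and (a.vos & b.vos):
            conflicts.append((a, b))
    return conflicts


# Made-up calendar entries, for illustration only.
now = datetime(2017, 4, 24)
calendar = [
    Downtime("SITE-A", frozenset({"atlas"}), "outage",
             datetime(2017, 5, 1, 8), datetime(2017, 5, 2, 18)),
    Downtime("SITE-B", frozenset({"atlas", "cms"}), "outage",
             datetime(2017, 5, 2, 9), datetime(2017, 5, 2, 15)),
    Downtime("SITE-C", frozenset({"cms"}), "warning",
             datetime(2017, 5, 2, 9), datetime(2017, 5, 2, 15)),
]
for a, b in find_conflicts(calendar, now):
    print(f"Conflict: {a.site} and {b.site} share {sorted(a.vos & b.vos)}")
```

Downtimes that are not classified as "outage", or that involve sites with no VO in common, are ignored, which matches the scope of the recommendation above.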

Links to Tier-1 downtimes

ALICE | ATLAS | CMS  | LHCB
      | BNL   | FNAL |

Monday

Attendance:

  • local: Maarten (Alice, WLCG), Kate (chair, WLCG, DB), Vincent (security), Andrea M (middleware, fts), Mark (LHCb), Gavin (computing), Ignacio (DB), Herve (storage), Paul (storage), Marian (network, monitoring)
  • remote: Andrew (NL-T1), Dario (Atlas), Christoph (CMS), Di Qing (TRIUMF), Elena (CNAF), Gareth (RAL), Kyle (OSG), Sang Un (KISTI), Xin Zhao (BNL), Dave M (FNAL), Dimitri (KIT)

Experiments round table:

  • ATLAS reports ( raw view) -
    • Activities:
      • data16 reprocessing will be completed by the end of this week. data15 reprocessing starting. 100k cores used for reprocessing out of a total 250k-300k available (rest for MC production and reconstruction, group derivations and analysis).
      • (Only) CERN Frontier servers are under stress with reprocessing and user jobs. An additional server is installed, reserved for interactive users. Frontier loads from reprocessing are still under investigation, more tests will be performed using the Tier-0 farm.
    • Problems:
      • DDM proxies expired last Saturday morning on FTS servers despite having been renewed a week ago. Promptly fixed thanks to a few people who were alert during the week-end (including shifters). Under investigation.

  • CMS reports ( raw view)
    • CMS detector is being commissioned with cosmics
    • Moderate overall CPU utilization
    • Re-reco of 2016 data
      • Still in progress with pre-staging RAW data
      • First pilot processings have started
      • Launch of full campaign expected soon
    • Brought down CMS EOS instance at CERN
      • Likely caused by many IO-intense workflows running on all CMS CERN resources (Openstack Tier-0, HLT, lxbatch via CEs)
      • GGUS:127784
Herve commented that the issue is related to GSI authentication, which has a 200 Hz limit - otherwise all users are blocked. The issue is under investigation. Christoph commented that the load also contributed, as the issue disappeared once the load was gone.

  • ALICE -
    • very high activity on average
    • CERN: emergency disk server additions to EOS on Sunday - many thanks!
      • cleanup of "old" data could not keep up with the activity levels

  • LHCb reports ( raw view) -
    • Activity
      • MC Simulation, Data Stripping and user analysis
    • Site Issues
      • T0: Some ongoing problems with Condor CEs (GGUS:127553)
      • T1:
        • RAL: Still running with a limit on the number of Merge jobs to avoid problems with storage (GGUS:127617). Hoping these problems will be fixed by the CASTOR upgrade a week on Wednesday

Sites / Services round table:

  • ASGC: The APCN2 submarine cable suffered a fiber cut on April 22nd, 2017. The network to Taiwan (ASGC) is still functional, but has lost its fail-over capability. Recovery will be after May.
  • BNL: SE(dCache) is IPv6 enabled
Maarten asked about the FTS. FTS is still IPv4 due to issues at one of the sites. Andrea added CERN FTS is IPv6 and no issues were noticed. The issues are still under investigation.
  • CNAF: ntr
Elena asked where the page listing the supported versions of software components can be found. Maarten commented that there is the Baseline page, kept up to date by the MW officer. A subsection on additional software such as Java will be added there. Action on the SCOD team.
  • EGI: nc
  • FNAL: ntr
  • IN2P3: nc
  • JINR: SAM dropped on 17-04-2017 due to a LAN problem, fixed by rebooting one of our two CEs.
  • KISTI: ntr
  • KIT: ntr
  • NDGF: Downtime on Thursday 9-15. Router will be decommissioned. Several short outages as cables are moved one by one to new router. Some Atlas and Alice data affected. Vidyo connection hanging. Anyone else have trouble connecting?
  • NL-T1:
    • Downtime in Sara 1st/2nd of May.
    • Data transfers were failing (GGUS:127552)
  • NRC-KI: nc
  • OSG: Accounting will be transitioned from FNAL to Nebraska
  • PIC: nc
  • RAL: We are aware of the Castor problems affecting LHCb. We are working to test the Castor 2.1.16 upgrade.
  • TRIUMF: NTR

  • CERN computing services:
    • CREAM instabilities over Easter weekend - issue with Java configuration on some CEs resolved. GGUS:127727
    • Some issues on HTCondor CE:
      • Fix for delegation bug (8.7.1) will be available soon - will schedule deployment once we have it.
        • Andrea asked if the development branch will be used. Gavin confirmed, noting that it is usually more actively supported and that the stable branch can be way behind.
      • Spooldir cleanup for CE (not being done) - will set to 3 days.
        • Maarten: issue is related to LHCb not cleaning up the jobs (which should be ok, but not with current default settings). Default grace period will now be changed to 3 days. Alice approach is to fetch pilot stdout and stderr automatically and delete on client side. Mark will follow up with LHCb. For CREAM CEs the default grace period of 10 days typically can pose a similar problem and sites may thus need to shorten it.
  • CERN storage services:
    • FTS: configuration change in order to use STRICT_RFC2818 for globus-gsi host name verification. Done in the pilot. One node was configured in production and we found 2 SEs that are now failing the verification (GGUS:127875 and GGUS:127874). We will complete the change when those issues are fixed.
    • EOSAlice was running out of diskspace - more machines were added.
  • CERN databases:
    • Proxy used by CNAF for Phedex connections had to be restarted
    • CMSR firewall will be redeployed on Wednesday; we expect the intervention to be rolling, but the service is at risk (OTG:0037052)
  • GGUS: NTR
  • Monitoring: Reports have been sent.
  • MW Officer: NTR
  • Networks: PerfSonar4 was released last week. 40 instances are left to update.
  • Security: NTR

AOB:

  • ATTENTION: next meeting on Tuesday May 2 !
Topic revision: r20 - 2017-04-24 - MaartenLitmaath