Week of 130916

WLCG Operations Call details

To join the call, at 15.00 CE(S)T, by default on Monday and Thursday (at CERN in 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The scod rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

VO Summaries of Site Usability | SIRs | Broadcasts | Operations Web
ALICE ATLAS CMS LHCb | WLCG Service Incident Reports | Broadcast archive | Operations Web

General Information

General Information | GGUS Information | LHC Machine Information
CERN IT status board, WLCG Baseline Versions, WLCG Blogs | GgusInformation | Sharepoint site - LHC Page 1


Monday

Attendance:

  • local: Doug, Stefan, Manuel, Zbigniew, Maria
  • remote: Onno, Xavier, Rolf, Michael, David, Joel, Kyle, Tiju, Thomas, Wei-Jen, Lisa, Lucia
  • apologies: Sang Un, Pepe (report by mail)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
      • CERN-PROD EOS/CASTOR (GGUS:97307) atlpilo2 new Robotic Certificate DN needed to be added for EOS and CASTOR - Solved very quickly (Thanks)
    • T0
      • NTR
    • T1
      • NDGF-T1 (GGUS:97306) Incomplete files on disk, likely caused by transfers during the downtime. The site recommends marking the files as lost and resending them
      • NDGF-T1 (GGUS:97322) Source file/user checksum mismatch - likely caused by transfer during downtime, site investigating further
      • Dip in production jobs over the weekend: the Robotic DN switch-over affected the ATLAS Data Transfer Request Interface (DaTRI) - solved on Sunday night

  • CMS reports (raw view) -
    • GGUS rather quiet, some CERN incidents however:
    • INC:384013 INC:384404 -- alarms due to vocms140 not responding. Both likely happened while CouchDB was being compacted or views were being generated -- following up.
    • INC:383274 CMS Savannah was briefly unavailable both on and off site on Friday evening.
    • Older tickets:
      • GGUS:96912 and GGUS:96913 are related -- Warsaw removed a CE from CMS SiteDB but not from the BDII, leading to troubles with SAM tests, etc. CMS site support followed up on Friday.

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Main activity is MC production
    • T0:
      • Info in the BDII for ce301 and ce302 was wrong (now corrected).
      • FTS3 used the instance at CERN after the correction of the time.
    • T1:
      • GRIDKA: one tape library is not behaving properly.
      • PIC: downtime today

Sites / Services round table:

  • NL-T1: Warning downtime next Monday (23rd) for the mass storage system; files on tape will not be available in the morning. The DT will be combined with HW maintenance / OS upgrade and will probably turn into an outage.
  • KIT: NTA
  • IN2P3-CC: All-day downtime next Tuesday (24th) for the batch system (it will already slow down on Monday evening). On Tuesday dCache and MSS will be down too. After the DT, 50 % of the WNs will have been migrated to SL6.
  • BNL: NTR
  • OSG: NTR
  • RAL: NTR
  • NDGF: NTR
  • ASGC: At risk downtime coming Sunday 5 am to 9 am UTC for power maintenance
  • FNAL: NTR
  • CNAF: NTR
  • KISTI:
    • scheduled downtime: from 23rd September 00:00 (UTC) to 6th October 00:00 (UTC)
      • To replace the top switches and reconfigure the internal network to fit 10 Gbps connectivity from outside (we are planning to upgrade the network connectivity between CERN and KISTI to 10 Gbps next year)
      • All network equipment will be replaced by new 10 Gbps switches and Ethernet cards
      • Heavy rearrangement of service nodes and cabling work foreseen
      • This downtime will be entered in GOCDB soon.
    • public holiday called "Chu-Seok" on Thursday. Nobody will be connected from KISTI on 19th September (next WLCG operations meeting)
  • PIC: The planned downtime to upgrade dCache and the firmware of one of the tape libraries at PIC was carried out successfully today. Everything went smoothly and the site is now fully operational.
  • Databases: NTR
  • Grid Services: Network intervention on Wed 18th, 5am - 10pm, during which routers will be upgraded. Most services are likely to be affected by a short interruption (both GPN and LCG network) (see also the entry in the status board)
  • GGUS: NB! GGUS Release with routine ALARM tests will take place on 2013/09/25. All details on dev. items entering production on page https://ggus.eu/pages/news_detail.php?ID=491 . The GGUS slides for tomorrow's MB are attached to this page. The weekly GGUS traffic file ggus-tickets.xls is attached, as always, to page WLCGOperationsMeetings.

AOB:

Thursday

Attendance:

  • local: Doug, Stefan, Ken, Manuel, Jan, Maarten
  • remote: Dennis, Xavier, Kyle, Gareth, Pepe, Michael, Wei-Jen, Rolf, Thomas, Lucia, Lisa
  • apologies: Joel

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
      • Ongoing FT3-Pilot problems (GGUS:97359 and GGUS:97419) affect Functional tests to all sites, since the Functional tests are served by FT3-Pilot
      • The network interventions on Wednesday produced many unexpected problems for ATLAS. We request that a postmortem be done. We also request that a risk analysis be published ahead of other such interventions so we can better prepare ourselves
        • Doug: request post mortem of router upgrade. Manuel: we are preparing a SIR. What happened was that not only GPN and WLCG switches were upgraded but also others (affecting hypervisor access to storage) at the same time. This resulted in hypervisors losing connection and it took a while to bring all back up again. The upgrade campaign was run both in the CC and at the SafeHost facility.
    • T0
      • NTR
    • T1
      • TRIUMF - (GGUS:97393) - Stage in Errors - Site reports problem during storage system migration. Large number of lost files (SIR)

  • ALICE -
    • KIT
      • Successfully switched to CVMFS yesterday morning!
    • RAL
      • Switched to CVMFS this morning, not clear yet if all is OK.
      • Stefan: Are these the first sites that were switched? Maarten: No, several smaller T2 sites were already done. These are the first bigger sites that were switched.
    • Russian sites
      • Mon evening Sep 16 the Russian GEANT link was again limited to 100 Mbps
      • Most T2 sites are still closed for ALICE jobs
        • JINR, PNPI, Troitsk, ITEP, MEPHI
      • 3 sites are operational
        • RRC-KI, SPbSU, IHEP (much reduced)

  • LHCb reports (raw view) -
    • Main activity is MC production
    • T0:
      • network intervention yesterday in the computer center.
    • T1:
      • RAL : Castor Database rebooted yesterday to fix an ORACLE error
      • PIC: configuration was lost after the dCache intervention and it took some time to recover.
        • Pepe: We confirm that the configuration at our site was changed to avoid displaying the file: protocol. In the long term it will also be possible to provide this protocol. Stefan: very good, LHCb will be happy to try it out.

Sites / Services round table:

  • NL-T1:
    • Issue Mon-Tue with dCache at SARA: new nodes were activated without certificates. This caused problems for transfers and the nodes were turned off Tue afternoon.
    • There will be a DT next Monday
  • KIT: NTR
  • OSG: NTR
  • RAL:
    • issue with batch jobs for ALICE being followed
    • The CASTOR storage system was stopped yesterday because of an Oracle reboot - a parameter in the DB configuration needed to be changed, which required a restart
  • PIC: The dCache upgrade was carried out last Monday. Afterwards we experienced problems with the info provider, which was not publishing space information correctly. Some pools were unstable; putting them in read-only mode and rebuilding them was enough.
  • BNL: NTR
  • ASGC: CASTOR disk HW failure observed today. This may affect CMS transfers. We are currently fixing it.
  • IN2P3: NTR
  • NDGF: NTR
  • CNAF: NTR
  • FNAL: NTR
  • Storage: Castor/EOS were affected by the router upgrade intervention yesterday. Castor was affected by a database disconnection. EOS/ALICE triggered an internal failover and was down for 2 hours beyond the network intervention; EOS/CMS had a similar problem but recovered earlier. No data was lost.
  • Grid Service: Nothing to add (link to SIR mentioned above)

AOB:

Topic attachments
Attachment | History | Size | Date | Who | Comment
ggus-data.ppt | r1 | 2348.0 K | 2013-09-16 - 10:32 | MariaDimou | Final GGUS Slides for the 2013/09/17 WLCG MB
Topic revision: r16 - 2013-09-25 - StefanRoiser