Week of 130304

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

| VO Summaries of Site Usability | SIRs | Broadcasts | Operations Web |
| ALICE ATLAS CMS LHCb | WLCG Service Incident Reports | Broadcast archive | Operations Web |

General Information

| General Information | GGUS Information | LHC Machine Information |
| CERN IT status board, WLCG Baseline Versions, WLCG Blogs | GgusInformation | Sharepoint site - LHC Page 1 |


Monday

Attendance:

  • local: (Luc - ATLAS, Massimo, Ignacio, Maarten - ALICE, MariaD)
  • remote: (Stefano - LHCb, Wei-Jen - ASGC, Xavier - KIT, Michael - BNL, Pepe - PIC, Tiju - RAL, Thomas - NDGF, Rolf - IN2P3, Rob - OSG)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
      • NTR
    • T1s
      • TAIWAN-LCG2, one disk server unavailable since Friday due to hardware failure. Data on disks was not accessible. The storage was rebuilt and back in production with the data access restored on Sunday.
      • CA cloud: starting Saturday, many CA sites were failing transfers due to an expired CRL. The investigation was delayed by a scheduled power outage at TRIUMF-LCG2. CRL files were copied manually and services restarted; the issue looks fixed now, GGUS:91893.

  • CMS reports (raw view) -
    • No one connected, no report on the CMS twiki since mid-February.

  • ALICE -
    • KISTI: main disk SE looks OK again
    • KIT: disk SE started failing Fri afternoon, fixed by restart Sat afternoon. Xavier (KIT) commented on this issue that ALICE jobs still strain their firewall. Maarten will get the site, middleware and experiment experts to check in a coordinated fashion via an existing mail thread.

  • LHCb reports (raw view) -
    • Ongoing activity as before: mainly MC and user jobs, some prompt processing (CERN + T2s), reprocessing activity almost done.
    • T0:
      • NTR
    • T1:
      • PIC: Problem reading header of some files using Xrd client talking with server dcgftp07.pic.es:1094 ("XrdClientMessage::ReadRaw: Failed to read header (8 bytes)"). Investigation ongoing.
      • NL-T1: GGUS:92048 - Problem accessing files with Xrd client. The problem seems related to the authorization token ("Xrd: CheckErrorStatus: Server [srm.grid.sara.nl] declared: Authorization check failed: No authorization token found in open request, access denied.(error code: 3010)"). From NL-T1 contact: "seems usual TURL resolution problem that we have over and over again, each time dCache is reset".
      • RAL: Problem with the network card on some diskservers; intervention was ongoing. Tiju (RAL) confirmed the intervention is complete and the service is back in operation.

Sites / Services round table:

  • ASGC: ntr
  • KIT: ntr
  • BNL: ntr
  • PIC: Investigation on-going of LHCb problem reported above.
  • RAL: Scheduled intervention on Tue 2013/03/12 for a core network infrastructure switch replacement.
  • NDGF: Tomorrow 2013/03/05 pm there will be OS patching activity. GOCDB is updated about the service interruption. On Wed 2013/03/06 there will be an intervention at one of the NDGF sites. No more info at this point. MariaD noted that, since the meeting is now held only twice a week and announcements can no longer be made daily, sites should open a GGUS ticket and try notifying the unavailable site. The GOCDB info about this site will be displayed immediately.
  • IN2P3: ntr
  • OSG: ntr

  • CASTOR: Planned intervention on Mon 2013/03/11 with nameserver intervention required between 11am and 1pm CET. Experiments present think it is OK but Massimo will also send email.
  • Grid services: ntr
  • GGUS: ntr

AOB: Next meeting is on THURSDAY!

Thursday

Attendance:

  • local: (Luc - ATLAS, Massimo, Ivan, Jérôme, Maarten - ALICE, MariaD)
  • remote: (Stefano - LHCb, Wei-Jen - ASGC, Xavier - KIT, Pepe - PIC, John Kelly - RAL, Roger - NDGF, Rolf - IN2P3, Kyle - OSG, Lisa - FNAL, Lucia - CNAF, David Mason - CMS, Ronald - NL-T1)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
      • CERN-PROD: file transfer failures from T2 sites due to SECURITY_ERROR, GGUS:92166 & INC:252496. The problem disappeared (final comments needed). Ignacio or other Grid service managers, please update the ticket. Submitter or other ATLAS shifters, please do not 'verify' the ticket, even if it gets 'solved', so that comments can still be added.
      • DDM Dashboard receiving a lot of 'file-copied' callbacks with duration='None'. Fix applied (D. Tuckett), problem disappeared.
    • T1s
      • RAL: failing transfers with SRM errors, GGUS:92239. Problem with firewall after network outage. UDT ongoing to understand.
    • Calibration-T2s
    • ATLAS internal - not for WLCG Operations Meetings
      • PRODDISKS at T2s getting full due to reprocessing. Space needed under evaluation.
      • CNAF has merged the GROUPDISK token into DATADISK (under StoRM)

  • CMS reports (raw view) -
    • LHC / CMS
      • Rereconstruction of 2012 data progressing well -- utilizing all T1 resources
    • CERN / central services and T0
      • Working on reconfiguring for reprocessing
        • Some delay in reconfiguring LSF for this purpose
        • Castor disk pools being cleaned and will ask soon to move them to EOS
        • HLT cloud commissioning progressing after network reconfig.
    • Tier-1:
      • IN2P3: Following issues with direct dcap reads, switching to xrootd reads revealed xrootd file access limitations (dCache team ticket filed)
    • Tier-2:
      • Tested and moved some T1 workflows to larger T2's with xrootd input, due to high T1 use.
        • US T2's at first, expanding to German and Italian T2's soon.

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Mainly user jobs. MC production on hold for new requests.
    • T0:
      • NTR
    • T1:
      • NTR

Sites / Services round table:

  • ASGC: ntr
  • KIT: ntr
  • NL-T1: Scheduled intervention for SARA on Monday 2013/03/11.
  • PIC: A successful change of ENSTORE interface took place on Tuesday 2013/03/05.
  • RAL: (sent by email due to audioconf problems) As mentioned by ATLAS, there were network problems at RAL yesterday and today. The problem was caused by high CPU load on the site firewall and router. The network team reported it fixed at approx 15:00 GMT yesterday, but problems re-occurred for a brief period last night and again today. The network team at RAL is still investigating this.
  • NDGF: There is a downtime this afternoon scheduled in GOCDB.
  • OSG: ntr
  • IN2P3: ntr
  • CNAF: CE05 and CE06 are in downtime for the 8-14 March period.

  • CASTOR interventions next week:
    • 11-MAR-2013: 10:00 CERN time 1h ATLAS 2.1.13-9 (transparent)
    • 11-MAR-2013: 11:00 CERN time 1h Name Server patching all instances (transparent)
    • 11-MAR-2013: 12:00 CERN time Name Server schema change (OUTAGE)
    • 11-MAR-2013: 14:00 CERN time 1h LHCb 2.1.13-9 (transparent)
    • 13-MAR-2013: 10:00 CERN time 1h ALICE 2.1.13-9 (transparent)
    • 13-MAR-2013: 14:00 CERN time 1h CMS 2.1.13-9 (transparent)

  • Dashboards: ALICE SAM tests were failing during the 28/2-4/3 period, which affected dashboard views. This is now fixed.

  • GGUS: Update of GGUS host certificate with the April release on Wednesday 2013/04/24. All interfaces affected. Interface developers received email on 2013/03/01. No reply so far. Details in Savannah:136227
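
The CASTOR interventions listed above can be turned into time windows for planning purposes. The following is only a minimal sketch: the tuple layout and the helper names `parse` and `outages` are illustrative assumptions, not part of any CASTOR or WLCG tool. Times are treated as naive CERN local times, and the schema change has no stated duration in the minutes, so its end time is left unset rather than guessed.

```python
from datetime import datetime, timedelta

# Schedule copied from the minutes: (date, start, duration in hours, what, kind).
# The schema change has no stated duration, so it is recorded as None.
INTERVENTIONS = [
    ("11-MAR-2013", "10:00", 1, "ATLAS 2.1.13-9", "transparent"),
    ("11-MAR-2013", "11:00", 1, "Name Server patching all instances", "transparent"),
    ("11-MAR-2013", "12:00", None, "Name Server schema change", "OUTAGE"),
    ("11-MAR-2013", "14:00", 1, "LHCb 2.1.13-9", "transparent"),
    ("13-MAR-2013", "10:00", 1, "ALICE 2.1.13-9", "transparent"),
    ("13-MAR-2013", "14:00", 1, "CMS 2.1.13-9", "transparent"),
]


def parse(entry):
    """Convert one schedule tuple into a dict with begin/end datetimes."""
    day, start, hours, what, kind = entry
    # title() turns "11-MAR-2013" into "11-Mar-2013" so that %b matches.
    begin = datetime.strptime(f"{day.title()} {start}", "%d-%b-%Y %H:%M")
    end = begin + timedelta(hours=hours) if hours is not None else None
    return {"begin": begin, "end": end, "what": what, "kind": kind}


def outages(entries):
    """Return only the interventions flagged as OUTAGE (not transparent)."""
    return [e for e in map(parse, entries) if e["kind"] == "OUTAGE"]


if __name__ == "__main__":
    for item in outages(INTERVENTIONS):
        print(item["begin"].isoformat(), item["what"])
```

Filtering on the kind field separates the single non-transparent step, the Name Server schema change, from the transparent upgrades.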

AOB:

Topic revision: r8 - 2013-03-07 - MariaDimou
 