Week of 120220

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Andrew, Dan, Ignacio, Maarten, Maria D, Massimo, Mike, Steve, Xavier);remote(Gonzalo, Jhen-Wei, Joel, Jose, Lisa, Mette, Michael, Onno, Rob, Rolf, Stefano P, Tiju).

Experiments round table:

  • ATLAS reports -
    • LYON LFC. SAV:126485, GGUS:79347 ALARM ticket
      • FR cloud was set offline for DDM, production and analysis
      • table space was increased, DDM transfers were resumed at 17:50, FR cloud was set online at 19:40
    • Dan: reminder for T1 sites to fill in the FTS 2.2.8 deployment spreadsheet announced Thu last week

  • CMS reports -
    • NTR. Quiet weekend.

Sites / Services round table:

  • ASGC
    • downtime for all services Thu 01:00-04:00 UTC for core switch maintenance
  • BNL - ntr
  • CNAF - ntr
  • FNAL
    • network problem yesterday 08:30-16:30 CST, still reduced capacity, but users should not notice that
  • IN2P3 - ntr
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • PIC - ntr
  • RAL
    • CMS CASTOR intervention ongoing as planned, should finish at 16:00

  • dashboards - ntr
  • GGUS/SNOW - ntr
  • grid services
    • AFS UI: current version for gLite 3.2/SL5 now pointing to 3.2.11
      • bug: glite-ce-job-output did not work, GGUS:79384 opened - fixed
    • FTS pilot updated to official EMI FTS 2.2.8 release, too early to tell if it is working fine, more news tomorrow
    • all CREAM CEs updated to EMI-1 release
  • storage
    • EOS-ATLAS updated with fix for memory leak
    • CASTOR LHCb default pool overloaded by 2 users
    • CASTOR public update on Wed 09:30-12:30 CET with SAM tests for experiments "at risk"

AOB:

  • Steve: sites running an EMI-1 DPM should watch out for GGUS:79354!

Tuesday

Attendance: local(Alessandro, Dan, Maarten, Maria D, Michail, Mike, Steve, Xavier);remote(Burt, Gareth, Gonzalo, Jeremy, Joel, Kyle, Mette, Michael, Oliver, Paco, Rolf, Stefano P).

Experiments round table:

  • CMS reports -
    • Tier-0:
      • next Mid Week Global Run (Cosmics data taking without magnetic field): Wednesday, Feb 22 - Thursday, Feb 23
    • Processing activities:
      • 8 TeV MC simulation on Tier-1 sites, ramping up on Tier-2 sites
    • Tier-1 site issues, open GGUS tickets (went through all open CMS GGUS tickets today):
      • GGUS:79259: T1_IT_CNAF: request to update FTSMonitor, will be handled within 2-4 weeks due to separation from lemon that runs on the same server, last update 2012-02-21
      • GGUS:79257: T1_DE_KIT: request to update FTSMonitor, will be looked at next week, last update 2012-02-16
      • GGUS:78725: T1_FR_CCIN2P3: CREAM CE proxy expiration issue, should have been fixed on Feb. 7th update, should be closed today
      • GGUS:79162: T1_FR_CCIN2P3: JobRobot errors, "Error while loading shared library", should be closed today
      • GGUS:75983: T1_FR_CCIN2P3: network degradation between ALL US T2s and IN2P3, 2nd 10 Gbit interface provided by RENATER, strong advice to use LHCOne, last update 2012-01-26
      • Oliver: SRM problem at KIT, no ticket yet
    • Services and Infrastructure issues, open GGUS tickets (went through all open CMS GGUS tickets today):
      • GGUS:79389: BDII 2nd Line Support: Discrepancy between OSG interoperability BDII and top level CERN BDII, last updated 2012-02-21
      • GGUS:79331: glite WMS: Condor version in current gliteWMS causes churn on GRAM5 Compute Elements, last update 2012-02-21
      • GGUS:76713: lcg_util Development: lcg-get-checksum command issues, request to lower priority, last update 2011-11-30
        • Maria D: you can use the escalation button if you disagree with how a ticket is treated
      • GGUS:76686: gLite Yaim Core: glexec error on CREAM, traced back to a missing group in /etc/group, installation problem, last update 2012-01-26
      • GGUS:74318: VOMS-Admin: output of the 'voms-admin list-members' contains multiple field separators, on hold, last update 2012-01-26
      • GGUS:66764: gLite VOBOX: Problem renewing proxies in the CMS VOBOX (phedex), suggestion to use new VOBox release, last update 2012-02-20
      • Oliver: keep generic tickets associated with CMS?
      • Maria D: keep such tickets with CMS as concerned VO while assigned to the correct support unit

  • ALICE reports -
    • SARA job submission stopped working when the VOBOX host certificate was recently changed to have a new subject DN, not configured in myproxy.cern.ch (GGUS:79423 opened for that now)

  • LHCb reports -
    • T0
      • SLS issue: someone removed a read-only volume yesterday around 3pm, and one of our tests that uses this volume was timing out. The problem is fixed now.
    • T1
      • IN2P3 : (GGUS:79356) : using the queue with higher memory limit helps to run our jobs.
        • Rolf: now that your T2 jobs are running OK, will you keep the higher memory requirement and update the VO card?
        • Joel: there is a memory leak in the current application, a new version is being tested; we will inform you when the problem has been fixed
      • PIC : Request for space token migration (GGUS:79305)
      • GridKa : Request for space token migration (GGUS:79303) "nearly finished"
      • SARA : Request for space token migration (GGUS:79307)
    • T2
      • LAL (GGUS:79200). CREAM CE problem?
        • Maarten: you can use the ticket escalation button when there is no response

Sites / Services round table:

  • BNL
    • intervention tomorrow 15:00-19:00 UTC to move conditions data infrastructure to new HW and Oracle 11g, will affect data access and streaming
  • CNAF - ntr
  • FNAL
    • still reduced network capacity after Sunday outage, but meeting the operational requirements; non-disruptive intervention Thu morning CST to get back to full capacity
  • GridPP - ntr
  • IN2P3 - ntr
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • PIC
    • planned network intervention went OK this morning, traffic was low
  • RAL
    • yesterday's CMS CASTOR update went OK, ATLAS instance will be done tomorrow
    • Oracle security patch will be applied on ATLAS and LHCb 3D databases on Thu, rolling intervention, services "at risk"

  • dashboards - ntr
  • GGUS/SNOW - ntr
  • grid services
    • CERN FTS: the Lyon Monitor alias https://fts-monitor.cern.ch/ will be re-directed from https://fts300.cern.ch/ (version 1.5.1) to https://fts200.cern.ch/ (version 1.6.1) on Thursday 24th at 09:00 UTC. There will be a 10-minute outage for the alias. The old service will be destroyed quietly in two weeks' time.
    • FTS 2.2.8 pilot looks fairly OK, though a small failure rate has been seen, not yet confirmed if that is due to the new code; proposal to upgrade T2 service tomorrow; known issues will be made available to T1 sites
  • storage
    • reminder - upgrade of CASTOR public instance tomorrow 09:30-12:30 CET
      • better Xrootd plugin
      • tape access improvements
    • upgrade of experiment instances to be decided after experience with public instance

AOB:

Wednesday

Attendance: local (AndreaV, Mike, Dan, Alessandro, Eva, Edoardo, Massimo, Xavi, MariaDZ); remote (Gonzalo/PIC, Mette/NDGF, Lisa/FNAL, Alexander/NLT1, Jhen-Wei/ASGC, Pavel/KIT, Tiju/RAL, Rolf/IN2P3, Rob/OSG, Michael/BNL; Joel/LHCb, Oliver/CMS).

Experiments round table:

  • ATLAS reports -
    • Tier0:
      • The castorpublic outage lists the following services under "Affected Service Endpoints": prod-lfc-atlas.cern.ch, fts22-t0-export.cern.ch, fts-t2-service.cern.ch. Are those really affected? The ATLAS auto-exclusion agents can act on this list to exclude these services, so it should be accurate. [Massimo: long ago there was an internal dependency of tests on castorpublic, now gone. It must be the CE/FTS tests that depend on it. Andrea: who is responsible for these tests? Alessandro: maybe IT-GT or IT-PES. Massimo: should be IT-PES, will follow up with them.]
      • [Massimo: in general, please remember that critical services should not depend on castorpublic for availability testing! ]
    • Tier1:
      • BNL CondDB outage just started. Clients should failover, so we are not taking any actions to disable Panda queues.
    • ADC Central Services:
      • Disk failure (warning) in one of the pandacache servers used to host user input sandboxes. If it is taken offline, 50% of user jobs will fail. Instead we have taken the machine out of the load balancer and are now waiting until all dependent analysis jobs have completed. A more resilient design is being implemented.

  • CMS reports -
    • Tier-0:
      • next Mid Week Global Run (Cosmics data taking without magnetic field): Wednesday, Feb 22 - Thursday, Feb 23
    • Processing activities:
      • 8 TeV MC simulation on Tier-1 sites, ramping up on Tier-2 sites
    • Tier-1 site issues
      • GGUS:79441: T1_ES_PIC: problem with PIC FTS server, CMS group was missing from the common groups.conf for yaim, SOLVED 2012-02-21
      • GGUS:79259: T1_IT_CNAF: request to update FTSMonitor, will be handled within 2-4 weeks due to separation from lemon that runs on the same server, last update 2012-02-21
      • GGUS:79257: T1_DE_KIT: request to update FTSMonitor, will be looked at next week, last update 2012-02-16
      • GGUS:75983: T1_FR_CCIN2P3: network degradation between ALL US T2s and IN2P3, 2nd 10 Gbit interface provided by RENATER, strong advice to use LHCOne, problem was handed to Internet2 who will contact Dante, set to SOLVED 2012-02-22

  • LHCb reports -
    • Experiment activities: User analysis, Validation productions for restripping
    • T0
    • T1
      • PIC : Request for space token migration (GGUS:79305)
      • GridKa : Request for space token migration (GGUS:79303) "nearly finished"
      • SARA : Request for space token migration (GGUS:79307)
    • T2
      • LAL (GGUS:79200). CREAM CE problem? [Joel: not happy about the reply from LAL about this issue, "we are on holiday". MariaDZ: is Rolf representing the French NGI here too? Rolf: no, representing only the French T1, not the French NGI. MariaDZ: will follow up on this.]

Sites / Services round table:

  • Gonzalo/PIC: ntr
  • Mette/NDGF: ntr
  • Lisa/FNAL: ntr
  • Alexander/NLT1: ntr
  • Jhen-Wei/ASGC: one CMS disk had a partition failure, we are looking into it. [Oliver: any idea which data it contains? Jhen-Wei: will follow up.]
  • Pavel/KIT: about FTS, our expert is away this week, will look into the issue next week
    • [Joel: any news about CVMFS? Pavel: last week reported that it will take a few weeks, no news since then.]
  • Tiju/RAL: successfully upgraded Castor for ATLAS
  • Rolf/IN2P3: ntr
  • Rob/OSG: we are progressing on the issue of attachment exchange with GGUS
  • Michael/BNL: intervention on ATLAS CondDB is starting and will take 4h

  • Xavi/Storage: castorpublic intervention is finished; interventions on the VOs will be announced
  • Mike/Dashboard: ntr
  • Eva/Databases: ntr
  • Edoardo/Network: ntr

  • Steve/Grid: CERN FTS T2 service (CERN ITSSB). Upgrade to EMI 2.2.8 from gLite 2.2.4 completed by 12:30, overrunning by 30 minutes. Not quite as transparent as intended: the service endpoint was removed for 30 minutes. Propose to update the T0 in similar fashion on Tuesday 28th February. [Alessandro: this is in line with what was agreed with Steve, ATLAS would like to wait a few days to make sure all is OK before this is deployed on a more general scale. Oliver: also true for CMS, no particular urgency on our side.]

AOB:

Thursday

Attendance: local(Alessandro, Alex, Dan, Eva, Giuseppe, Maarten, Maria D, Mike, Oliver K, Xavier);remote(Burt, Gonzalo, Jhen-Wei, Joel, Mette, Oliver G, Rob, Rolf, Ronald).

Experiments round table:

  • ATLAS reports -
    • Tier0:
      • NTR
    • Tier1:
      • It took RAL some extra time (~2hrs) to come out of their downtime yesterday due to a fault in the new ATLAS SSB status switcher. Apologies for that.
    • ADC Central Services:
      • There was a drop in pilots coming out of the three CERN pilot factories yesterday afternoon due to a failed CondorG upgrade. Fixed last night.

  • CMS reports -
    • Tier-0:
      • next Mid Week Global Run (Cosmics data taking without magnetic field): Wednesday, Feb 22 - Thursday, Feb 23
    • Processing activities:
      • 8 TeV MC simulation on Tier-1 sites, ramping up on Tier-2 sites
    • Issues:
      • NTR

  • LHCb reports -
    • T0
      • CERN : EMI WMS problem fixed
                 2012-02-22 14:13:25 UTC WorkloadManagement/TaskQueueDirector/gLitePilotDirector
                 INFO: Reference https://wms301.cern.ch:9000/GnF9HdDoA5-r5LwXWlT29Q for
                 TaskQueue 2758122
              
                 2012-02-22 14:15:28 UTC WorkloadManagement/TaskQueueDirector/gLitePilotDirector
                 INFO: Reference https://wms301.cern.ch:9000/Fpl3XHXTKY_-_ps3D-z9SQ for
                 TaskQueue 2758135
                 
        This means the problem is solved and LHCb is now successfully using this WMS. The other WMS instances that need to be updated with this patch are at least:
        • wms01.pic.es
        • wms02.pic.es
        • wms-lhcb.grid.cnaf.infn.it
        • lcgwms01.gridpp.rl.ac.uk
      • Maarten: GGUS:79096 provides a few rpms that fix the problem; other sites could go ahead and deploy them as well; an official release should appear some time in March
    • T1
    • T2
      • LAL (GGUS:79200). CREAM CE problem?
        • Maria D: will look into how the ticket got handled

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • NDGF - ntr
  • NLT1 - ntr
  • OSG
    • problem with ticket attachment exchanges between GGUS and OSG: solution expected next week
    • ticket exchange between FNAL and OSG currently broken, being looked into
  • PIC - ntr

  • dashboards - ntr
  • databases - ntr
  • GGUS/SNOW - ntr
  • storage
    • proposal for CASTOR ATLAS upgrade on March 6
      • Alessandro: probably OK, we will let you know

AOB:

  • (MariaDZ) CERN Grid Services: please note that GGUS:77157 requires attention, as requested in the AOB of WLCGDailyMeetingsWeek120213#Tuesday.
  • FTS 2.2.8
    • Oliver K: release notes will have the CERN experience with the upgrade; let us know ASAP if a gLite 3.2 release is needed
    • Alessandro: some T1 already indicated when they can do the upgrade
      • missing/incomplete: IN2P3, INFN-T1, KIT, PIC, SARA, TRIUMF
      • let's try to agree a schedule at the next T1 Service Coordination Meeting
      • each upgrade can be smooth, no real worries

Friday

Attendance: local(Massimo, Oliver, Dan, Alex, Eva, David);remote(Gonzalo, Onno, Xavier, Rolf, Jhen-Wei, Lisa, Yiju, Joel, Roger, Rob, Michael).

Experiments round table:

  • ATLAS reports -
    • Tier0: NTR
    • Tier1:
      • RAL: transfer errors at RAL-LCG2, mainly with destination RAL-LCG2_DATADISK, with errors
        • [SRM_INTERNAL_ERROR] Too many threads busy with Castor at the moment.
        • [SRM_FAILURE] No valid space tokens found matching query
        • submitted GGUS:79592
        • "the problem looks to be load related and to the fact that there is a script generating a lot of load as it cleans tape backed files from disk that aren't needed". Need to pause that script.
        • problem solved.

  • CMS reports -
    • Tier-0:
    • Processing activities:
      • 8 TeV MC simulation on Tier-1 sites, ramping up on Tier-2 sites
    • Issues:
      • GGUS:79623: T0_CH_CERN: CASTOR-SRM_CMS is not available, last update 2012-02-24

Sites / Services round table:

  • KIT (Xavier)
    • GridKa's annual site downtime is now scheduled for March 13th, 4 a.m., till 15th, 11 a.m. (UTC). All services will be down, except for FTS, which will be down only March 14th.
    • Because of the preparations for the downtime, the priority of CVMFS has been lowered a bit. We target the week after the downtime (cw 12) for first production tests.
  • PIC
    • NTR
    • LHCb asked whether PIC is going to put in place a workaround for the EMI WMS. This will be asked next week.
  • NLT1: NTR
  • IN2P3: NTR
  • ASGC: NTR
  • FNAL: NTR
  • NDGF: Network problems. Investigating.
  • OSG: Fix for ticket with attachment available; it will be in production on Monday
  • BNL: NTR
  • RAL: The site asks ATLAS why they have not seen activity since yesterday 5 GMT. It looks to be a problem for other sites as well.

  • Central Services:
  • Dashboard: NTR
  • Storage: Investigating the CMS ticket
  • DB: WLCG prod will be unavailable ~15 Monday at 9:00 (reboot needed for enabling the Ora11 compatibility mode)

AOB: (MariaDZ) There was an (admittedly cryptic) update on the LAL (GGUS:79200) CREAM CE problem and the ticket is now waiting for a reply from LHCb. Please don't forget the unusual date for the GGUS release: MONDAY 27 Feb.

-- JamieShiers - 18-Jan-2012

Topic revision: r22 - 2012-03-07 - JaroslavaSchovancova