Week of 110418

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Steven, Fernando, Alessandro, Jamie, Maria, Dan, Maarten, Dirk, Massimo, Zbyszek, Peter, David, MariaDZ);remote(Kyle, Onno, Gareth, Rolf, Federico, Michael, Gonzalo, Jon, Jhen-Wei, Andreas, Giovanni).

Experiments round table:

  • ATLAS reports -
    • ATLAS
      • Stable beams over the weekend, with 336 bunches.
      • Physics run, 336b, today.
    • GGUS
      • GGUS down Monday morning (Oracle connection problems). GGUS team contacted.
    • T1s
      • NDGF-T1 partial tape system downtime continued over the weekend.
      • ASGC failing all FTS transfers on Friday evening (CERN and ASGC FTS servers). Alarm ticket sent, GGUS:69743.
      • RAL batch system opened to 200 jobs on Saturday, allowing fast data11 reprocessing to finish. SRM stable over the weekend.

  • CMS reports -
    • LHC / CMS detector
      • Good stable fill on 336 bunches on Sunday (unfortunately CMS ran without a fraction of the ECAL). Plan is to repeat run today.
    • CERN / central services
    • Tier-0 / CAF
      • Tier-0 farm was saturated with Sunday run (Pending + Running jobs == 2.5k)
      • CAF getting filled with jobs now [ Steve - loads of "swap full" alarms yesterday ]
    • Tier-1
      • Finishing one "older" data reprocessing request
      • MC production in progress
      • WMAgent testing on-going at various sites
    • Tier-2
      • CMS Tier-2 Readiness pretty good lately: e.g. in the last 2 weeks, 78% of sites had a CMS Readiness fraction > 80%
      • MC production and analysis in progress
      • CMS still has the general issue that CREAM CE SAM tests are not reporting properly to the SSB, hence triggering wrong alarms, see Savannah:113192
    • Other
      • New CRC-on-Duty from tomorrow: Ian Fisk

  • ALICE reports -
    • T0 site
      • Strange behaviour of the WMS nodes used by ALICE led to the discovery of a few k jobs "lost and found" by LSF.
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations

  • LHCb reports - Data taking over the weekend, no data taking now. Reconstruction and stripping activities ongoing.
    • T0
      • Tape drive allocation has been done; modification of space tokens at CERN (GGUS:69709).
    • T1
      • glexec test failures at IN2P3-CC (GGUS:69793) [ Maarten - CEs that fail tests are part of T2 at IN2P3. T1 CEs are fine. Will follow up with IN2P3. ]
    • T2

Sites / Services round table:

  • NL-T1 - announcement: SARA SRM at risk on Thursday to update certificates.
  • RAL - comment on problems with the ATLAS SRM Thu-Fri and much reduced batch over the w/e. Not completely understood but a workaround is in place. During the morning batch work has been increasing - going OK; batch is almost back to normal. AT RISK tomorrow morning for a couple of hours due to a network problem causing small breaks in connectivity. Will pause batch just before and drain FTS. Hopefully just a very short stop.
  • BNL - ntr
  • FNAL - ntr
  • PIC - ntr; reminder that we are scheduled downtime tomorrow for electrical interventions, batch queues currently being drained
  • IN2P3 - nta
  • NDGF - on Friday we mentioned a planned downtime tomorrow for a scheduled upgrade of the dCache head node. Scheduled until 17:00 but will probably finish earlier. No communication with the tape team regarding the Danish tapes - they look as if they are still out according to the ATLAS report. At noon we had a sudden failure of some network equipment - reading is unavailable from both tape and disk for some ALICE and ATLAS data. No ETA for a fix.
  • CNAF - the downtime announced last Friday is confirmed for tomorrow, 10:00 - 14:00 UTC; the CNAF tape library will not be available.
  • ASGC - ntr
  • OSG - GGUS outage?

  • GGUS - 09:30 KIT unscheduled network downtime. Was this eventually published? Back available at 10:30.

  • CERN Storage: the case of 1 file (raw data) late to tape seen by LHCb was fixed on the spot. A few files are still hanging around in a strange status.

  • CERN DB - first upgrade to 11g performed on a TEST DB; all went smoothly. We will perform the production upgrades in the same fashion at the end of the year. 1 incident with replication to T1s: due to a deadlock of Streams processes, replication was stuck for ATLAS conditions and LHCb conditions+LFC. Problem with the Oracle s/w; from time to time it happens... LFC and LHCb conditions were affected for 30'; ATLAS a bit longer but all up by 12:00 (took about 2h for ATLAS).

AOB:

Tuesday:

Attendance: local(Fernando, Maria, Jamie, Maarten, Eva, Nilo, David, Alessandro, Andrea V, Dan, Massimo, MariaDZ);remote(Gonzalo, John, Jon, Ronald, Rolf, Felix, Rob, Federico, Giovanni, Andreas, Dimitri).

Experiments round table:

  • ATLAS reports -
    • ATLAS
      • Physics with 336b - 72b/injection but moderate intensities
      • Calibration during inter fill period, then start physics run at injection (data11_7TeV)
    • T1s
      • RAL scheduled a downtime between 10:00 and 12:00 CERN time for a network interruption. The downtime was expected to be short so the UK cloud was not excluded from DDM. However during the interruption internal network problems appeared and as a consequence FTS is still not available - DDM Site Services are not able to contact the RAL FTS.
      • PIC: Outage of all services at PIC due to the yearly electrical maintenance. Panda queues for ES cloud were set to offline on Monday around 6:00pm in order to drain. Later at 1:00am the cloud was set offline in DDM.
      • INFN-T1: Scheduled downtime at INFN-T1 between 10am and 6:00pm for a tape library upgrade. All data on tape will be unavailable, but the data stored on disk will be available during the intervention. According to accounting for INFN-T1 it looks like there is plenty of free space on the tape buffer so INFN-T1 has not been excluded from Santa Claus.

  • CMS reports -
    • LHC / CMS detector
      • Good filling.
    • CERN / central services
      • CMS CVS repository moved from the current LCGCVS service to the CERN CVS service on Monday 18th April, see announcement here. CMS identified small issues that are being addressed at the moment by experts (doing a re-sync).
    • Tier-0 / CAF
      • Tier-0 farm
      • CAF getting filled with jobs now
    • Tier-1
      • Will hopefully launch reprocessing of 2010 Data on Wednesday/Thursday
      • MC production in progress. A number of open tickets about tape families for custodial MC (RAL, PIC, CCIN2P3)
    • Tier-2
      • MC production and analysis in progress
    • Other
      • new CRC-on-Duty : Ian Fisk

  • ALICE reports -
    • T0 site
      • WMS behaviour back to normal after the "lost and found" jobs were cleaned up yesterday.
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations

  • LHCb reports - Data taking over the weekend, no data now. Reconstruction and stripping activities ongoing. Hitting memory limits due to application problems: a temporary solution was found; special thanks to IN2P3 and CNAF for letting us complete.
    • T0
      • New space tokens have been set up - thanks to CERN! [ Massimo - the installation of the h/w required for this operation finished yesterday; we hope the space will be available soon. ]
    • T1
      • Two GGUS tickets opened (GGUS:69812, GGUS:69813) and closed quickly by RAL and IN2P3 to increase memory there. Thanks for increasing the memory to run our stripping over the next days.
      • GGUS:69827 slowness with FTS transfers at Lyon. The IN2P3-IN2P3 channel was closed.

Sites / Services round table:

  • PIC - ntr; scheduled downtime on-going
  • RAL - FTS has just started again - about 20' ago. Had a switch that was misbehaving and had to be reconfigured. After this some ports were not working - now should be OK. Roughly 30K ALICE jobs are queued at RAL; Gareth is sending an e-mail to ALICE asking them to look into it. Maarten - will do.
  • FNAL - ntr
  • NL-T1 - tomorrow 10:00 - 11:00 local time SARA storage at risk. The tape back-end machines need a reboot. Reading from tape will not work during that time, writing is OK.
  • IN2P3 - ntr
  • ASGC - ntr
  • CNAF - tape library downtime on-going; 14:00 UTC is the scheduled end of this downtime. Nothing else to report. Alessandro - it is very difficult for the experiment to verify whether all is OK for tape libraries. Please send a confirmation when the intervention is over, e.g. to the IT cloud support team.
  • NDGF - today's dCache upgrade worked fine, finished at noon. Network problem yesterday due to equipment breaking, fixed yesterday pm. Danish tape libs should be online since late last night. Can't confirm personally. ATLAS will look at it.
  • KIT - ntr
  • OSG - looks like downtime for GGUS has been added to AOB - this is what we still needed. Trying to understand if this will have affected ticket exchange. Pretty unlikely as early morning OSG time. Will look at this time period to make sure all messages were transferred as needed.

  • CERN Storage - on top of what was mentioned for LHCb, a strange alarm on the ALICE side; a bug on our side affecting only the probes, not the service. Two files with problems for CMS - following up. A very strange, unusual case.

AOB: (MariaDZ) Answer to OSG's question yesterday: The KIT network problems on Monday 2011/04/18 started around 9:15am CEST. The end of downtime mail reached the GGUS developers at 10:23. Not sure GGUS was completely unavailable during all of that period (~1 hour).

  • CNAF - question on the LCG CE in the calculation of availability - is it possible to exclude it from the calculation? Ale - working on this; trying to put an algorithm in place that works with LCG or CREAM CEs, as available and applicable. Not yet finalised. Maarten - the idea is that by end June it should no longer be necessary to have an LCG CE, either for the experiments or for the availability tests. If you have a working LCG CE keep it running for these remaining couple of months!

Wednesday

Attendance: local(Ewan, Maarten, Fernando, David, Alessandro, Jan, Maria, Jamie, Edoardo, Luca, MariaDZ);remote(John, Federico, Rob, Ian, Gerard, Jon, Xavier, Jhen-Wei, Michael, Giovanni, Onno).

Experiments round table:

  • ATLAS reports -
    • ATLAS
      • start physics run at injection (data11_7TeV)
      • 16:00-19:00: Calibration period
        • Access for Pixel
        • BCM clock tests
      • Night: Physics (LHC aiming for next step in intensity 480 bunches)
    • T1s
      • PIC: Extended downtime until 8PM. The ES cloud remains offline in Panda and DDM and has been excluded temporarily from Santa Claus
    • Central Services
      • DDM Central Catalogues: Alarms in the evening of Apr 19 due to known problems (load balancing and connections closed by client throwing IOErrors).

  • CMS reports -
    • LHC / CMS detector
      • Expecting physics after yesterday's cryo issues
    • CERN / central services
      • We have a couple of tickets for stuck jobs we can't kill. We have a repacker job stuck and holding the system. The operations team would like access to bkill -r from the cmsprod account (a hedged sketch of such a force-kill is shown at the end of this CMS report). [ Ewan - CMSPROD should already be able to issue this for any jobs run by CMSPROD. Ian - can I put you in touch with the person in charge? Will get more details and submit a GGUS ticket if necessary. Incident report on the CMS report Twiki ]
    • Tier-0 / CAF
      • Runs over the weekend working through Prompt Reco. Good utilization but ramping down now
    • Tier-1
      • Will hopefully launch reprocessing of 2010 Data on Thursday
      • MC production in progress. A number of open tickets about tape families for custodial MC (RAL, PIC, CCIN2P3)
    • Tier-2
      • MC production and analysis in progress
    • Other
      • new CRC-on-Duty : Ian Fisk
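    • Illustration only: a minimal sketch of the force-kill discussed above, assuming LSF's bkill is on the PATH and is run by the job owner (e.g. cmsprod) or an LSF administrator; the job IDs are hypothetical, not the actual stuck repacker jobs.
      #!/usr/bin/env python
      # Hedged sketch: force-remove stuck LSF jobs with "bkill -r", which
      # removes a job from LSF's bookkeeping even if the execution host is
      # unreachable (e.g. a near-dead node). Job IDs below are hypothetical.
      import subprocess

      STUCK_JOB_IDS = ["1234567", "1234568"]  # hypothetical stuck job IDs

      for job_id in STUCK_JOB_IDS:
          # -r removes the job regardless of the state of the execution host
          result = subprocess.run(["bkill", "-r", job_id],
                                  capture_output=True, text=True)
          print(job_id, result.returncode,
                (result.stdout or result.stderr).strip())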

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • RAL: huge number of waiting jobs due to the CREAM CE often reporting zero waiting jobs instead of the true number, thereby inviting more jobs to be submitted (GGUS:69856) [ John - only just saw that the ticket was reopened. Have (probably) removed the jobs - will set it to solved again ]
      • KIT: some 4k running jobs do not appear in MonALISA; they might be stuck on something; under investigation
    • T2 sites
      • Usual operations

  • LHCb reports - Low rate of reconstruction and stripping activities ongoing.
    • T1 - Closed ticket GGUS:69827 (IN2P3-IN2P3 FTS transfers)
    • A number of tests - job submission in particular - failing at IN2P3. Seems to be a problem in the way that CEs are advertised in BDII.

Sites / Services round table:

  • CNAF - down time of yesterday afternoon for tape library over without problem
  • NL-T1 - this morning we had an unscheduled downtime to fix a problem on the tape back-end. Some tape files were therefore not available. Maintenance completed OK - now everything works. Since we had a downtime anyway, we did tomorrow's downtime today(!): certificates were updated this morning and tomorrow's downtime for SARA SRM has been cancelled.
  • RAL - nta
  • PIC - yesterday's downtime was extended to today due to big problems in the cooling system. There is still a water leak but we hope to get the green light at 16:00; it will then take ~4 hours to get everything back online, i.e. by 20:00.
  • FNAL - ntr
  • KIT - ntr
  • ASGC - ntr
  • BNL - ntr
  • OSG - there was a change of the ESNET CRL location yesterday. Resolved and redirections were put in place yesterday. Some OSG users with DOE certificates cannot sign on to CERN SSO. Will try to resolve this morning (local time).
  • GridPP - ntr

  • CERN Storage - 30' unavailability for CASTOR ALICE yesterday. S/w bug combined with network trouble (bad switch).

AOB: (MariaDZ)

  • There is a request to send an email notification to sites on every ticket update (Savannah:120243). When 'direct site notification' was implemented in July 2008 for Tier0/1 and generalised in January 2009, sites insisted that they don't want email floods, just to know there is a ticket for them. Should we change now? What do WLCG sites think? Please say now or comment in the ticket.
  • As already announced in https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek101115#Wednesday (AOB) the email templates for submitting TEAM and ALARM tickets will be decommissioned with the GGUS May Release on 2011/05/25.

Thursday

Attendance: local(Steven, Maarten, Ian, Fernando, Andrea, Jan, Alessandro, Maria, Jamie, Simone, Zbyszek, Stephane);remote(Kyle, Roberto, Rolf, Ronald, Michael, Jon, Giovanni, Jhen-Wei, Andreas, Gerard, Xavier, Gareth).

Experiments round table:

  • ATLAS reports -
    • ATLAS
      • Calibration period until injection of new fill
      • Start physics run at injection (data11_7TeV)
      • LHC planning physics with 480 bunches / 36 bunches per injection and 336 bunches / 72 bunches per injection
    • T1s

  • CMS reports -
    • LHC / CMS detector
      • Good run last night. Depending on when the online updates happen, we may be running at high rates throughout the weekend
    • CERN / central services
      • Wrote directly to an LSF supporter. bkill -r was disabled, but is hopefully being re-enabled. Had the near-dead node that was holding 8 jobs put out of its misery.
    • Tier-0 / CAF
      • A number of jobs in stuck state.
    • Tier-1
      • Will hopefully launch reprocessing of 2010 Data on Thursday
      • Open (Savannah) tickets about tape families for custodial MC (RAL, PIC, CCIN2P3); this is delaying MC now. [ Rolf - can't find any ticket for CMS. Ian - a Savannah ticket? It is possible that the bridge to GGUS didn't work. Rolf - we have automatic procedures for GGUS only ]
      • PIC had PhEDEx down after the maintenance, but back this morning
    • Tier-2
      • MC production and analysis in progress. Reduced effort over the holiday
    • Other
      • CRC-on-Duty : Ian Fisk

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • RAL: yesterday's ticket (GGUS:69856) was solved after the meeting
      • KIT: jobs missing in MonALISA - problem mostly fixed by upgrading to latest AliEn version; less than 500 such jobs remain, we may kill them at some point [ by now only 350 left - will probably die by themselves ]
      • KIT: configuration error caused ALICE::FZK::Tape SE to fail write requests; quickly fixed by KIT admins
    • T2 sites
      • Usual operations

  • LHCb reports -
    • RAW data distribution and its FULL reconstruction (Magnet Polarity UP) are going on w/o major problems at all T1s.
    • Decided to stop the stripping of reconstructed data with Magnet Polarity DOWN, having found a problem in the options used. Most likely will restart today with just one input file per job (instead of 3), due to a known memory leak issue at the application level.
    • T0
      • Setup two new space tokens for LHCb: LHCb_Disk and LHCb_Tape. Migration of disk servers after Easter.
      • Problems staging files from tape (72 files). The files were requested to be staged yesterday and we would have expected them online by now. A GGUS ticket will most likely be filed; Philippe is looking into it. (These are the same 72 files as below.)
    • T1
      • PIC: back from the downtime, no major problems to report.
      • RAL: reported 72 files corrupted, trying to recover them from CERN

Sites / Services round table:

  • IN2P3 - ntr
  • NL-T1 - ntr
  • BNL - ntr
  • FNAL - ntr
  • CNAF - ntr
  • ASGC - this morning some jobs failed because no data could be read from DPM. The network was not set up properly on one disk server - fixed and now should be OK
  • NDGF - quiet running today
  • PIC - ntr
  • KIT - ntr
  • RAL - trying to find the GGUS ticket relating to CMS tape families - maybe it never made it into GGUS from Savannah. ATLAS SRM - some significant problems yesterday and this morning; believe it is fixed. A DB problem with optimisation behind the SRM. [ Ale - is this the same problem as last Friday? A - believe so. We made a network intervention on Tuesday; during that, Oracle decided to move the CASTOR name server DB to a different node due to failover - it ended up on the same node as the ATLAS SRM. Pushed that back yesterday evening. Probably broadly the same. The DB people looked at it all morning and think it is OK now, but the load is low. ]
  • OSG - the CERN SSO issue created by the ESNET CRL move seems to be resolved other than browser caching issues; if you restart the browser it is OK.

  • CERN DB - for ATLAS there is a GEOMETRY schema added to the replication to T1s. Seeing quite a high load on the DB - 100K queries. Contacted Gancho, who is following this up.

  • CERN VOMS service - the certificate for the LHC VOMS services on voms.cern.ch will be updated on Wednesday April 27th. The current version of lcg-vomscerts is 6.4.0, released 2 weeks ago. It should certainly be applied to gLite 3.1 WMS and FTS services. [ It was put into the release for those services a few weeks ago. T1s running FTS services should make sure that they have the latest version of the RPM; a hedged version-check sketch follows below. ]
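    A minimal sketch (an illustration, not an official procedure) for checking whether the installed lcg-vomscerts RPM is at least the version announced above (6.4.0); it assumes an RPM-based gLite node with the rpm command available and plain dotted-integer version numbers.
      #!/usr/bin/env python
      # Hedged sketch: query the installed lcg-vomscerts version via rpm and
      # compare it against the version announced in the minutes above.
      import subprocess

      EXPECTED = "6.4.0"  # version mentioned in the announcement above

      proc = subprocess.run(["rpm", "-q", "--qf", "%{VERSION}", "lcg-vomscerts"],
                            capture_output=True, text=True)
      if proc.returncode != 0:
          print("lcg-vomscerts does not appear to be installed")
      else:
          installed = proc.stdout.strip()
          ok = tuple(map(int, installed.split("."))) >= tuple(map(int, EXPECTED.split(".")))
          print("installed %s -> %s" % (installed,
                "OK" if ok else "needs update to >= " + EXPECTED))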

AOB:

Friday

  • No meeting - CERN closed

-- JamieShiers - 15-Apr-2011
