Week of 111010

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Cedric, Dirk, Eva, Iouri, John, Maarten, Maria G, Massimo, Mattia, Steve);remote(Dimitri, Gonzalo, Joel, Kyle, Lisa, Maria D, Markus, Michael, Onno, Rolf, Thomas, Tiju).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • DB issues :
        • ATLR : "There are plenty of ATLAS_COOL_READER_U sessions on inst 2 and 3 on the ATLR. In particular there are tens of sessions from OSUSER usatlas1 and usatlas2." Under investigation
        • ATLR :High load generated by HTL repro (known issue).
        • ADCR : High load due to deletion (problem with 2 indexes) fixed by Gancho.
      • CERN-PROD_SCRATCHDISK (EOS) claimed to be over quota. Was due to a bad configuration (fixed by Ale).
    • T1 sites
      • Transfer timeouts to Lyon : GGUS:75114
        • Rolf: the ticket is solved since ~10 AM
      • Files not accessible at FZK : GGUS:75125
    • T2 sites
    • Mattia: during the weekend the ASGC CE tests timed out, Alessandro has been notified
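
As a minimal sketch of the kind of diagnostic the DBAs would run for the session build-up mentioned above: count ATLAS_COOL_READER_U sessions per instance and OS user. It assumes cx_Oracle and SELECT rights on gv$session; the DSN, credentials and the "tens of sessions" threshold are illustrative placeholders, not the real production values.

    # Diagnostic sketch: count ATLAS_COOL_READER_U sessions per instance and OS user.
    # DSN and credentials are placeholders; requires SELECT on gv$session.
    import cx_Oracle

    DSN = "atlr-placeholder.cern.ch:1521/ATLR"   # illustrative, not the real TNS entry

    def cool_reader_sessions(user, password):
        conn = cx_Oracle.connect(user, password, DSN)
        try:
            cur = conn.cursor()
            cur.execute("""
                SELECT inst_id, osuser, COUNT(*)
                  FROM gv$session
                 WHERE username = 'ATLAS_COOL_READER_U'
                 GROUP BY inst_id, osuser
                 ORDER BY COUNT(*) DESC
            """)
            return cur.fetchall()
        finally:
            conn.close()

    if __name__ == "__main__":
        for inst, osuser, n in cool_reader_sessions("monitor_user", "secret"):
            flag = "  <-- check" if n > 20 else ""   # illustrative alert threshold
            print("inst %s  osuser %-12s sessions %3d%s" % (inst, osuser, n, flag))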

  • CMS reports -
    • CAF job submission problem again. Solved, see INC:070575
      • Steve: the issue is due to a non-transparent replacement of old by new HW
        • the old HW is being drained of long jobs, while the new HW is not yet available

  • LHCb reports -
    • Experiment activities
      • Reconstruction and stripping at CERN
      • Reprocessing at T1 sites and T2 sites
    • T0
      • Nagios: No probes being sent to prod-lfc-lhcb.ro (GGUS:74775)
    • T1 sites:
    • T2 sites:
      • IN2P3-t2: (GGUS:75131) Increase the queue length
        • Rolf: that ticket was handled this morning and is waiting for a reply
    • Mattia: KIT tests are failing, the WMS cannot match the affected CEs

Sites / Services round table:

  • BNL - ntr
  • FNAL - ntr
  • IN2P3 - nta
  • KIT - ntr
  • NDGF
    • Friday's network problem was solved Fri afternoon by the replacement of a faulty transceiver card
    • this afternoon's network maintenance has been done
    • this evening's network maintenance has been postponed to Thu
  • NLT1
    • during the weekend 1 SARA pool node crashed and only rebooted OK after a configuration error had been corrected today
    • a few other pool nodes will be rebooted this afternoon to pick up a new 10 GbE driver - some files will be unavailable for a few minutes, but those nodes are new and only contain some 3 TB of data in total
    • NIKHEF are draining their queues to prepare for a downtime starting this afternoon. All worker nodes will be switched off, to do maintenance on the power infrastructure. They expect to switch them on tomorrow at the end of the afternoon.
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr

  • CASTOR/EOS
    • EOS update for ATLAS tomorrow afternoon
  • dashboards - nta
  • databases
    • ADCR was impacted by a network issue affecting the SafeHost data center
  • grid services - ntr

AOB: (MariaDZ) As there was no update from KIT to last Friday's AOB, here is the info from the GGUS developers on the web services' interface problems that GGUS experienced on 2011/10/06: the interface problems were caused by an update of the intrusion prevention system (IPS). Due to this update the KIT DNS was not able to contact DNS servers outside. After rolling back to the previous version of the IPS it took some time until DNS resolution worked correctly again (a simple resolution-check sketch follows below).

  • the GGUS team will provide a Service Incident Report for WLCG
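
A simple outbound-DNS probe of the kind that would have caught the IPS-induced resolution failure described above: try to resolve a few well-known external names and alert if any lookup fails. The host list is illustrative; the check just uses the local resolver configuration.

    # Outbound-DNS health probe: fail if any of the probe hosts cannot be resolved.
    import socket

    PROBE_HOSTS = ["ggus.eu", "cern.ch", "wlcg.web.cern.ch"]   # illustrative set

    def check_dns(hosts):
        failures = []
        for host in hosts:
            try:
                socket.getaddrinfo(host, None)       # uses the local resolver
            except socket.gaierror as exc:
                failures.append((host, str(exc)))
        return failures

    if __name__ == "__main__":
        failed = check_dns(PROBE_HOSTS)
        if failed:
            for host, err in failed:
                print("DNS lookup FAILED for %s: %s" % (host, err))
            raise SystemExit(1)
        print("all probe hosts resolved OK")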

Tuesday:

Attendance: local(Jamie, Oliver, Yuri, Steve, Ookey, Massimo, John, Maarten, Mattia, Maria, Alessandro);remote(Paco, Joel, Marc, Xavier, Thomas, Tiju, Rob, Lorenzo, Gonzalo, MariaDZ).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • DB issues :
      • ADCR: many blocking sessions affecting availability and causing some issues on panda-server, DDM, etc. Network/HW device repaired. Resolved yesterday.
      • Oracle issue with the DQ2 catalogues. A fix is in progress (as of noon today).
    • CERN AFS problem caused ssh connection failures and slowness on lxplus and lxvoadm: GGUS:75183 solved.
    • Transfer failures to EOS space tokens at CERN-PROD: GGUS:75191 assigned. Under investigation.
  • T1 sites
    • NDGF-T1 MCTAPE recalls failing: GGUS:75174 in progress - one of the tape pools has h/w problems. Downtime announced. [ Thomas - the tape pool we had problems with is one of several, attached to just one of the tape libraries, so it did not seem appropriate to declare a downtime for the entire storage. We are working to get a working tape pool up, but no progress to report yet. ] Yuri - tape recalls temporarily disabled, so at least the monitoring will improve (!)
    • SARA : file transfer failures, a CRL had expired. GGUS:75188 - problem on some new dCache pool nodes with the fetch-crl script, fixed this morning (a CRL-freshness check sketch follows this report).
    • Taiwan-LCG2 production job failures "Voms proxy certificate does not exist or is too short". GGUS:75192 in progress: problem with the shared library loading. [ the library was obtained 1h ago, so this should be fixed now ]
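
A sketch of the CRL-freshness check that fetch-crl automates, relevant to the SARA item above: warn when any CRL under the grid certificates directory is past (or close to) its nextUpdate time. It assumes the openssl CLI is available and that CRLs are stored as *.r0 files in the conventional directory; the warning window is illustrative.

    # Warn about expired or soon-to-expire CRLs; assumes the openssl CLI.
    import glob, subprocess, datetime

    CRL_DIR = "/etc/grid-security/certificates"      # conventional location
    WARN = datetime.timedelta(hours=6)               # illustrative warning window

    def next_update(path):
        for inform in ("PEM", "DER"):                # CRLs may be stored in either encoding
            proc = subprocess.Popen(
                ["openssl", "crl", "-in", path, "-inform", inform,
                 "-noout", "-nextupdate"],
                stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            out, _ = proc.communicate()
            if proc.returncode == 0:
                # output looks like: nextUpdate=Oct 14 12:00:00 2011 GMT
                stamp = out.decode().strip().split("=", 1)[1]
                return datetime.datetime.strptime(stamp, "%b %d %H:%M:%S %Y %Z")
        return None

    if __name__ == "__main__":
        now = datetime.datetime.utcnow()
        for crl in sorted(glob.glob(CRL_DIR + "/*.r0")):
            expiry = next_update(crl)
            if expiry is None:
                print("UNREADABLE %s" % crl)
            elif expiry < now:
                print("EXPIRED    %s (nextUpdate %s)" % (crl, expiry))
            elif expiry - now < WARN:
                print("EXPIRING   %s (nextUpdate %s)" % (crl, expiry))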


  • CMS reports -
  • LHC / CMS detector
    • not much data in the last 24 hours
    • HighPU runs were taken successfully, will go into PromptReco after the 48-hour waiting period
  • CERN / central services
    • lsf-rrd.cern.ch sends the wrong host certificate (annoying) - will open a ticket (a certificate-inspection sketch follows this report)
    • Oracle rebooted again yesterday. Seems to happen at least twice a week. Very disruptive. Ticket reference GGUS:74993 (have to update it again). Is there an update about a possible fix? [ The whole CMS Oracle instance will be rebooted at 16:00 to enable more logging, so that more information can be passed to Oracle, who will hopefully provide a fix. ]
    • CAF job submission problem was solved (due to nodes being retired); all is OK again from our side, ticket was closed.
    • 5' ago an alarm ticket was issued to CERN - user jobs cannot open xroot files. GGUS:75218
  • T0 and CAF:
    • cmst0 : very busy processing data !
  • T1 sites:
    • all sites used at 100%, big MC reprocessing campaigns going on
    • next downtime: FNAL, Oct. 15th, 10 AM to Oct. 16th, 1 AM (CERN time)
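
For the wrong-host-certificate item above, a quick inspection sketch: fetch the certificate a service actually presents and print its subject and expiry. The same probe would also flag an expiring certificate on a monitoring box. It assumes the openssl CLI; the target host list is illustrative.

    # Print the subject and expiry date of the certificate a host presents.
    import ssl, subprocess

    def describe_certificate(host, port=443):
        pem = ssl.get_server_certificate((host, port))        # whatever the server sends
        proc = subprocess.Popen(
            ["openssl", "x509", "-noout", "-subject", "-enddate"],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        out, _ = proc.communicate(pem.encode())
        return out.decode().strip()

    if __name__ == "__main__":
        for host in ["lsf-rrd.cern.ch"]:                      # illustrative target list
            print("%s:\n  %s" % (host, describe_certificate(host).replace("\n", "\n  ")))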



  • LHCb reports -
  • Experiment activities
    • Reconstruction and stripping at CERN
    • Reprocessing at T1 sites and T2 sites
  • T0
    • Nagios: No probes being sent to prod-lfc-lhcb.ro (GGUS:74775) [ LFC configuration in the BDII - ATLAS is also affected by this misconfiguration ]
  • T1 sites:
    • IN2P3 : (GGUS:75132) installation of LFC version 1.8.2-2
    • CNAF : (GGUS:75162) : problem with CREAM CE [ Lorenzo - ticket mentioned by LHCb is "in progress". People are working on it now ]
    • GRIDKA : (GGUS:75170) wrong information in the BDII
  • T2 sites:


Sites / Services round table:

  • NL-T1 - ntr
  • KIT - ntr.
  • IN2P3 - issue with the Oracle Grid Engine: unexpected scheduler behaviour has been observed several times; sometimes no scheduling is done for several tens of minutes (a pending-job watchdog sketch follows this list). Working on the configuration to see if there is something that could explain the behaviour. Submitted jobs remain in the queue and are eventually dispatched. Working with Oracle to find the root cause. Maria - is there any update on the SIR asked for the GGUS problems? No, sorry.
  • NDGF - as said earlier broken tape pool but otherwise NTR
  • RAL - ATLAS instance of CASTOR was down for 1h due to a DB problem. Fixed and OK now.
  • ASGC - some jobs failed for ATLAS. We will see what is going on. CMS: some files could not be scratched from the temp disk. Large # of jobs pending in the transfer manager. Doing disk draining. Stopped the draining and let CASTOR focus on staging requests; will see if this fixes it. Oliver - we had a Savannah ticket open but not a GGUS one.
  • CNAF - site at risk to update the StoRM s/w. Affects only the ATLAS VO.
  • PIC - ntr
  • OSG - in prod services maintenance window - started 20' ago and lasts 2 more hours. Should be transparent.
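
For the IN2P3 scheduler stalls above, a watchdog sketch: flag pending ("qw") jobs that have been waiting longer than a threshold. It assumes a Grid Engine qstat whose default output has the usual columns (job-ID, prior, name, user, state, submit date, submit time, ...); the threshold is illustrative.

    # Flag Grid Engine jobs pending longer than THRESHOLD; assumes standard qstat output.
    import subprocess, datetime

    THRESHOLD = datetime.timedelta(minutes=30)       # illustrative

    def long_pending_jobs():
        out = subprocess.Popen(["qstat", "-u", "*", "-s", "p"],
                               stdout=subprocess.PIPE).communicate()[0]
        now = datetime.datetime.now()
        stuck = []
        for line in out.decode().splitlines()[2:]:   # skip the two header lines
            fields = line.split()
            if len(fields) < 7 or "qw" not in fields[4]:
                continue
            submitted = datetime.datetime.strptime(
                fields[5] + " " + fields[6], "%m/%d/%Y %H:%M:%S")
            if now - submitted > THRESHOLD:
                stuck.append((fields[0], now - submitted))
        return stuck

    if __name__ == "__main__":
        for job_id, waited in long_pending_jobs():
            print("job %s pending for %s" % (job_id, waited))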

  • CERN EOS: regarding the SRM problem last night, this was due to the FUSE layer. It has been addressed now.

AOB:

Wednesday

Attendance: local(Yuri, Jamie, Daniel, Mattia, Luca, Maarten, Edoardo, Massimo, John, Alessandro);remote(Lisa, Marc, John, Joel, Lorenzo, Thomas, Pavel, Ron, Oliver, Rob).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • ATLR Oracle DB problem: account locked since 1am. ALARM GGUS:75234. Affected T0 reconstruction and access to the Conditions DB until the morning. Unlocked at ~7:50 UTC by the ATLAS DBA. Job submission resumed. [ Luca - followed up with Gancho and Armin. Since a couple of months we see high load from time to time from ATLAS T0 jobs, as yesterday. The DB was inaccessible due to the high load. The rate of job submission was so high that the account had to be locked. GGUS ticket: opened by Armin, and at the same time the DB monitoring alarms fired. The GGUS alarm arrived in SNOW at 10:00, but the usual SMSes and phone call were OK; the DBA called the operators. Massimo - the EOS best-effort people were called before the DB team; will check the procedures. Maarten - is the high load understood? A: yes, as a consequence of the high submission rate. Action plan: some throttling of jobs implemented on the ATLAS side; on the DBA side the max number of connections for this particular account was decreased (a throttling sketch follows this report). The long-term plan is to use FroNTier for this type of workload, which typically queries the same kind of data from many clients. Now 150 connections x 2. ]
      • After the meeting Dario Barberis commented the following: There was nothing different in the job submission rate last night from what it always was. The throttling was implemented years ago and was always working OK so far. The problem is NOT understood as there was no unusual action from the ATLAS side.
        • Luca Canali confirmed this on Thu and reported that IT-DB and ATLAS have agreed an action plan to solve the matter.
  • T1 sites
    • NDGF-T1: MCTAPE recalls still fail: GGUS:75174 in progress, updated at ~9:00 UTC. [ Thomas - just in process of testing new pool. Should be available very soon. ]
    • TRIUMF-LCG2: transfer failures: AGENT errors. GGUS:75215. The endpoint had disappeared from the services.xml file and was recovered with a newly generated version. No more errors in the last 8h.
    • SARA: LFC migration is in progress. NL cloud set offline (no new job submission, file transfer stopped temporarily).
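
A sketch of the kind of client-side throttling mentioned in the ATLR item above: cap the number of concurrent COOL/Conditions connections a job framework opens by going through a bounded session pool, rather than letting every worker connect directly. The pool size, DSN and credentials are illustrative placeholders, not the production settings; it assumes cx_Oracle.

    # Bounded Oracle session pool as a client-side throttle; values are placeholders.
    import cx_Oracle

    POOL = cx_Oracle.SessionPool(
        user="ATLAS_COOL_READER_U",                   # account cited in the report
        password="secret",                            # placeholder
        dsn="atlr-placeholder.cern.ch:1521/ATLR",     # placeholder DSN
        min=2, max=20, increment=2,                   # hard cap on concurrent sessions
        getmode=cx_Oracle.SPOOL_ATTRVAL_WAIT)         # wait for a free session, never exceed max

    def run_query(sql, params=()):
        """Borrow a pooled session, run one query, and always give it back."""
        conn = POOL.acquire()                         # blocks instead of opening a new session
        try:
            cur = conn.cursor()
            cur.execute(sql, params)
            return cur.fetchall()
        finally:
            POOL.release(conn)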


  • CMS reports -
  • LHC / CMS detector
    • data taking resumed, good fill in since about 4 AM CERN time
  • CERN / central services
    • lsf-rrd.cern.ch: sends wrong host certificate, INC071288
    • CMSR Oracle instance rebooted yesterday, 4 PM CERN time, updated ticket GGUS:74993 so that we follow up with the additional logging info with Oracle
    • MyProxy problem yesterday, reported by user support GGUS:75228 and another team ticket by me GGUS:75226
      • both tickets have been opened before 2 PM UTC
      • assigned to MyProxy 2nd Line Support at 14:51 UTC
      • no reaction by experts since then
      • this does not look normal; we would have expected a reply from the experts yesterday
      • this morning, MyProxy is still only at 80% http://sls.cern.ch/sls/service.php?id=MyProxy
      • tickets were closed at 7 AM UTC today; the MyProxy outage is understood and preventive measures were taken. The SLS monitoring at 80% is also understood and will be fixed
    • CAF file access problems GGUS:75218 was solved, kerberos machine had an issue handing out kerberos tokens
  • T0 and CAF:
    • cmst0 : very busy processing data !
  • T1 sites:
    • all sites used at 100%, big MC reprocessing campaigns going on
    • Small problems with a few files at T1_TW_ASGC (access problems); following up by comparing checksums locally and in the bookkeeping systems
    • next downtime: FNAL, Oct. 15th, 10 AM to Oct. 16th, 1 AM



  • LHCb reports -
  • Experiment activities
    • Reconstruction and stripping at CERN
    • Reprocessing at T1 sites and T2 sites
    • Again no notification received for more than 24h about a downtime (GGUS:75243)
  • T0
    • Nagios: No probes being sent to prod-lfc-lhcb.ro (GGUS:74775) [ Mattia - problem needs to be redirected to LFC people at CERN. LHCb needs to be added to list of VOs for that site. Should be assigned to CERN-PROD. ]
  • T1 sites:
    • CNAF : (GGUS:75162) : problem with CREAM CE [ ticket will be updated ]
    • GRIDKA : (GGUS:75170) wrong information in the BDII [ Pavel - the problem is fixed now and correct values should be published soon. Not a problem unique to GRIDKA - other sites have this too. Maarten - one has to be very careful when comparing sites; it depends on whether queues are dedicated to VOs. ] (A BDII query sketch follows this report.)
    • GRIDKA: (GGUS:74915 ) space token migration [ Philippe discussed with Andreas and we will get some space at end of week ]
  • T2 sites:
    • IN2P3-t2: (GGUS:75131) done, but no jobs are running. We should discuss this issue with IN2P3. [ working lunch with IN2P3 people today; a summary will come in the GGUS ticket ]
    • IN2P3 - issue with streaming being followed up.
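
For "wrong information in the BDII" reports like the GRIDKA one above, the natural first step is to look at what the site actually publishes. A sketch, assuming the ldapsearch CLI; the top-level BDII host and the Glue 1.3 attributes queried are illustrative.

    # Query CE state attributes published in a top-level BDII via ldapsearch.
    import subprocess

    BDII = "ldap://lcg-bdii.cern.ch:2170"            # illustrative top-level BDII
    BASE = "o=grid"

    def query_ce_state(ce_pattern):
        cmd = ["ldapsearch", "-x", "-LLL", "-H", BDII, "-b", BASE,
               "(&(objectClass=GlueCE)(GlueCEUniqueID=%s))" % ce_pattern,
               "GlueCEUniqueID", "GlueCEStateWaitingJobs",
               "GlueCEStateRunningJobs", "GlueCEStateEstimatedResponseTime"]
        out = subprocess.Popen(cmd, stdout=subprocess.PIPE).communicate()[0]
        return out.decode()

    if __name__ == "__main__":
        # wildcard match on the CE host name; adjust to the site being checked
        print(query_ce_state("*gridka.de*"))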


Sites / Services round table:

  • FNAL - more details about Saturday's downtime: outage from 3am to 6pm local time. Whole-site computing downtime: email, web, CMS interactive, grid and transfers will all be down.
  • IN2P3 - ntr
  • RAL - ntr
  • CNAF -
  • NDGF - nta
  • KIT -
  • NL-T1 - ntr
  • OSG - maintenance yesterday went smoothly and transparently.

  • CERN DB - collaborating on LFC merge of SARA into CERN

  • Network - tomorrow we will reconnect the 10 Gig link from CERN to KIT. Start at 11:00 - interruption of 10-15'; traffic will be re-routed so the site will not be disconnected.

AOB:

Thursday

Attendance: local(Steve, Mattia, Maria, Jamie, Yuri, Dario, Oliver, Ookey, Eva, Massimo, John, Maarten, Simone, Alessandro, AndreaS, MariaDZ);remote(Joel, Michael, Paco, Thomas, Lisa, Rolf, Rob).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • another ATLR Oracle DB (and db-dashboard) problem. DBAs contacted, no GGUS ticket, just an Elog entry created for the record (see also the T2 issue below). [ Eva - the DB problems are all the same one: high load from COOL T0 job processing. The rate of job submission has not changed, but we are getting more load. Investigating together with ATLAS. From time to time we observe spikes of high load; the DB becomes inaccessible, and several node reboots were caused by the clusterware due to the high load. All applications are affected by this high load and from time to time connections are not possible. Job failures create a lot of sessions on the cluster, consuming a lot of resources and creating listener problems. The DBAs were removing these sessions manually and have now implemented a job that runs every 5' to do this automatically (a sketch of such a cleanup job follows this report). Dario - the Muon calibration problem occurred during the above - we should not investigate it too much now. Problems don't occur when jobs are initializing; from the T0 monitoring side nothing unusual. Maria - we need two SIRs, one on the DB problem and one on why GGUS tickets don't get through. Massimo - did an investigation: will report later. ]
  • T1 sites
    • NDGF-T1: MCTAPE issue resolved.
    • TRIUMF-LCG2: issue resolved.
    • SARA: LFC migration completed. NL cloud transfers were activated, under testing with production and analysis jobs.
  • T2 sites
    • Michigan muon calibration DB replication to CERN delayed by 4-11h. Under investigation. [ Michigan thinks it is OK at their end and perhaps the problem is with the Streams replication from CERN. The ATLAS DBA reported something wrong with the apply process, which was restarted; now OK. ]
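
A sketch of the kind of 5-minute cleanup job the DBAs describe above: find sessions on the cluster matching a problem pattern and kill them. It is written against gv$session with ALTER SYSTEM KILL SESSION; the selection criteria, DSN, credentials and interval are illustrative, and a real DBA job would be far more careful about what it kills.

    # Periodic cleanup of stale sessions; all connection details are placeholders.
    import time
    import cx_Oracle

    DSN = "atlr-placeholder.cern.ch:1521/ATLR"       # placeholder DSN
    INTERVAL = 300                                   # seconds, i.e. every 5 minutes

    STALE_SQL = """
        SELECT sid, serial#, inst_id
          FROM gv$session
         WHERE username = 'ATLAS_COOL_READER_U'
           AND status = 'INACTIVE'
           AND last_call_et > 3600    -- idle for more than an hour (illustrative)
    """

    def cleanup_once(conn):
        cur = conn.cursor()
        cur.execute(STALE_SQL)
        for sid, serial, inst_id in cur.fetchall():
            # the ',@inst_id' form works cluster-wide on 11g RAC; on older
            # versions the kill has to be issued on each instance separately
            cur.execute("ALTER SYSTEM KILL SESSION '%d,%d,@%d' IMMEDIATE"
                        % (sid, serial, inst_id))

    if __name__ == "__main__":
        conn = cx_Oracle.connect("dba_user", "secret", DSN)   # placeholder credentials
        while True:
            cleanup_once(conn)
            time.sleep(INTERVAL)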


  • CMS reports -
  • LHC / CMS detector
    • data taking continued, another good fill in since about 3 AM CERN time
  • CERN / central services
    • CMSR Oracle instance rebooted yesterday, 4 PM CERN time, updated ticket GGUS:74993 so that we follow up with the additional logging info with Oracle
  • T0 and CAF:
    • cmst0 : very busy processing data !
  • T1 sites:
    • [T1_US_FNAL]: network connectivity lost, ALARM ticket: GGUS:75267, people are working on a solution. A problem with a switch, which was rebooted to clear possible errors; now back up.
    • [T1_TW_ASGC]: file access problems, MinBias MC files used for Pileup mixing not accessible by processing jobs, debugging ongoing GGUS:75258 [ Can access the files again; slowly ramping up, but the source of the error is not understood. We were running flat out at ASGC when we hit the problem. Will keep the ticket open until file access is confirmed OK. ]
    • next downtime: FNAL, Oct. 15th, 10 AM to Oct. 16th, 1 AM


  • LHCb reports -
  • Experiment activities: reached 1 fb-1 this morning.
    • Reconstruction and stripping at CERN
    • Reprocessing at T1 sites and T2 sites
  • T0
    • Nagios: No probes being sent to prod-lfc-lhcb.ro (GGUS:74775)
  • T1 sites:
  • T2 sites:
    • IN2P3-t2: (GGUS:75131) done, but no jobs are running. We discussed this issue with IN2P3 and Pierre will make a summary of our discussion.


Sites / Services round table:

  • BNL - ntr
  • NL-T1 - as soon as the problems with the BDII are solved we expect to see LHCb jobs running. Joel - not due to this issue: the requests stuck in the stager are not linked to the BDII; the BDII is only involved for the SAM tests.
  • ASGC - as reported CMS can now access files
  • NDGF - flaky tape pool replaced yesterday evening, so tape reads should be fine. Downtime for network maintenance postponed to next week. Downtime of 1h on Monday for some dCache upgrades.
  • FNAL - we are having power outage on Sat from 3am to 6pm Chicago time. Nothing to add on network problem
  • IN2P3 - ntr
  • OSG - ntr

  • CERN Grid: MyProxy. On Tuesday afternoon there was a 20' outage for a configuration change; we will avoid doing that from now on. MyProxy was at 80% for a day and a half due to a certificate expiring on a non-production Nagios box. The BDII issue mentioned by Maarten is now fixed.

  • CERN DB: currently adding 2 nodes to the ATLAS offline DB to help with the high load. Problem with the muon replication - when it was moved to production, the propagation to the integration DB was kept. ATLAS has been asked to remove the propagation to the integration DB.

AOB:

  • Massimo: the case of EOS being called for the ATLAS DB problem. Checked the operator instructions: on the front page some services like CASTOR are singled out, while other problems lead to a very long drop-down list. Discussed with Fabio putting the DBs on the front page too, to avoid human errors. They will add some info on how to treat emails so as to find the appropriate expert. Maite: something similar for the SERCO team. MariaDZ: two issues; 1) the operators must understand from the description who to call; 2) there is a special workflow for GGUS alarms that does not go to 2nd line support but to 3rd line, so that experts can be notified immediately. No one from the DB group is in this 3rd line list.

Friday

Attendance: local(Daniel, Ookey, Yuri, Jamie, Maria, John, Massimo, Mattia, Steve, Maarten, Eva);remote(Xavier, Onno, Oliver, Michael, Joel, Gonzalo, Lisa, Thomas, Rolf, Gareth, Rob).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • HLT reprocessing is in progress.
  • T1 sites
    • NIKHEF production and user analysis job failures, stage-in errors. GGUS:75293 filed at 21:31 Oct.13, solved this morning: appears to be a transient load-related problem, no more errors since 2:20 Oct.14.
    • IN2P3: the LFC is having some problems; a downtime was announced to fix some Oracle problems (just before the meeting).
  • T2 sites
    • ntr


  • CMS reports -
  • LHC / CMS detector
    • Machine development for 25 ns operations
  • CERN / central services
    • CMSR Oracle instance spontaneous reboot problem, GGUS:74993, kept open to follow up with increased logging information
  • T0 and CAF:
    • cmst0 : very busy processing data !
  • T1 sites:
    • [T1_TW_ASGC]: file access problems, MinBias MC files used for Pileup mixing not accessible by processing jobs, debugging ongoing GGUS:75258, problem reappeared, got more experts on the CMS side involved.
    • next downtime: FNAL, Oct. 15th, 10 AM to Oct. 16th, 1 AM



  • LHCb reports -
  • Experiment activities
    • Reconstruction and stripping at CERN
    • Reprocessing at T1 sites and T2 sites
  • T0
    • Nagios: No probes being sent to prod-lfc-lhcb.ro (GGUS:74775)
  • T1 sites:
    • CNAF : (GGUS:75162) : problem with CREAM CE
    • GRIDKA : (GGUS:75261) srm problem yesterday evening.
    • GRIDKA: (GGUS:74915 ) space token migration (51TB have just been added)
    • NIKHEF : (GGUS:75279) problem of WMS publication in the BDII
    • SARA : SRM problem (GGUS:75296) - we try to stage with a pin lifetime, but after 30" the file has been unpinned. We have to check in the LHCb code whether some pin value is set and then update the ticket. From our point of view about 1000 jobs were trying to stage files at SARA. Initially the issue was with lcg_getturl. [ Onno - not sure what dCache does when a file is not online and you ask for a TURL; best to ask the dCache developers. Can do that... It is possible that 1.9.12 behaves differently from 1.9.5. Comment - dCache will try to stage the file before returning the TURL; if this does not happen within the timeout you get a timeout error. ] (A retry-loop sketch follows this report.)
  • T2 sites:
  • AOB
    • BDII issue - still being worked on
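
A sketch of the client-side pattern implied by the SARA discussion above: when a file is on tape, dCache only returns a TURL once the file is online, so a short request timeout surfaces as an error and the request has to be re-issued or polled. The bring_online() helper below is hypothetical (standing in for whatever lcg_util / SRM call the job framework actually uses); only the retry/poll structure is the point.

    # Retry/poll wrapper around a (hypothetical) SRM bring-online / get-TURL call.
    import time

    class StageTimeout(Exception):
        """Raised by the hypothetical bring_online() when the SRM request times out."""

    def bring_online(surl, pin_lifetime):
        """Hypothetical placeholder for an SRM bring-online / get-TURL call."""
        raise NotImplementedError

    def get_turl_with_retries(surl, pin_lifetime=3600, attempts=6, wait=300):
        """Keep re-issuing the staging request instead of failing on the first timeout."""
        for attempt in range(1, attempts + 1):
            try:
                return bring_online(surl, pin_lifetime)
            except StageTimeout:
                # file is presumably still being recalled from tape; wait and retry
                time.sleep(wait)
        raise StageTimeout("%s not online after %d attempts" % (surl, attempts))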


Sites / Services round table:

  • KIT - this morning we had a problem with the GPFS clusters (some issue with internal communications); the vendor has been asked for more info. All dCache pools using GPFS mounts had to be restarted. Two downtimes on Tuesday: 08:00 - 09:00 UTC for the ATLAS 3D DB (down for Oracle configuration changes); 09:00 - 12:00 at risk for FTS and ATLAS 3D for Oracle patches.
  • NL-T1 - nta
  • BNL - ntr
  • PIC - ntr
  • FNAL - ntr
  • NDGF - about 1.5h ago the disk system at the Oslo subsite broke; it is an HP StorageWorks solution that failed, and we think it is the controller. The Titan cluster is down and some of the ATLAS disk data is unavailable. No estimated time to repair.
  • IN2P3 - as mentioned Oracle DB problem under investigation - no news further than that
  • RAL - next Tuesday morning we will do some microcode updates on the tape library, so no tape access 08:00 - 12:00 UTC. CASTOR will be up as normal.
  • ASGC - about the CMS issue: found that the files are on 2 disk servers; staged them to another disk server to see if jobs can access them, to confirm whether it is a disk issue or something else. Oli - slowly starting to send jobs again.
  • OSG -

AOB:

-- JamieShiers - 14-Sep-2011
