Week of 110516

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability SIRs, Open Issues & Broadcasts Change assessments
ALICE ATLAS CMS LHCb WLCG Service Incident Reports WLCG Service Open Issues Broadcast archive CASTOR Change Assessments

General Information

General Information GGUS Information LHC Machine Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions WLCG Blogs   GgusInformation Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Alessandro, Ewan, Ignacio, Jacek, Jamie, Jarka, Maarten, Maria D, Maria G, Michal, Ricardo);remote(Dimitri, Gonzalo, Jeremy, Joel, Jon, Oliver, Paolo, Rob, Roger, Rolf, Ron, Tiju).

Experiments round table:

  • ATLAS reports -
    • RAL: burst of ATLASDATADISK-related errors this morning: [STORAGE_INTERNAL_ERROR] GetSpaceMetaData for tokens [atlas:ATLASDATADISK] failed: [SRM_FAILURE] No valid space tokens found matching query] Duration 0. Issue went away, no GGUS filed. Can RAL comment, please?
      • Tiju: Oracle RAC node problem; the automatic failover worked, but the service still suffered a 10-minute interruption; not yet understood
    • NDGF-T1: stager issues continue GGUS:70511 in progress. Can NDGF-T1 comment, please?
      • Roger: some soft errors seem to be taken as hard errors; being investigated
    • FZK: similar issue: still a large number (~16000) of SRMV2STAGER errors. Elog 25394, 25420 on Sat. and Sun. evening. GGUS:70471, submitted on Wed., is marked as solved (as the files were brought online), but the errors still continue.
      • Dimitri: issue not yet clear, please open a new ticket
    • ADCR instance 1 has been flaky since last Friday; the load keeps fluctuating. The ATLAS DBAs and the DDM team are investigating. No ATLAS service is affected by this yet.
      • Jacek: high load caused by DQ2 application, not yet understood

  • CMS reports -
    • LHC / CMS detector
      • beam operation and data taking
    • CERN / central services
      • NTR
    • Tier-0 / CAF
      • VdM scans did not produce significant load on CASTOR (peaked at 700 MB/s into T0TEMP, with lower rates for all other pools)
      • degradation of the T0STREAMER CASTOR pool noticed today, possibly ongoing since Saturday; data rates do not look higher than normal. GGUS:70620
        • Ignacio: SLS probe shows delays due to contention for some servers, nothing pathological; some limits may be increased
      • now expecting routine data taking
    • Tier-1
      • lots of activity, MC and 2011 data re-reconstruction ongoing
    • Tier-2
      • MC production and analysis in progress (summer11 production)

  • ALICE reports -
    • T0 site
      • During the weekend the JobManager was in a strange state which caused >6k failed jobs to get stuck in the SAVING status; the problem was fixed this morning.
    • T1 sites
      • ntr
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities:
      • Data Taking is active again. Validation of a new reconstruction is ongoing.
    • New GGUS (or RT) tickets:
      • T0: 1
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
      • T1
        • SAM / SE tests failing for GridKa and SARA; currently being investigated by the site admins.
          • Ron: LHCb user space token issue due to dark data, being cleaned up
          • Joel: KIT could have the same problem
      • T2

Sites / Services round table:

  • ASGC - ntr
  • CNAF - ntr
  • FNAL - ntr
  • GridPP - ntr
  • IN2P3 - ntr
  • KIT
    • 9k files were lost due to a disk controller malfunction; ALICE has been informed. The files were all replicas, so this is not a big problem.
  • NDGF
    • small SRM upgrade went OK
  • NLT1
    • this morning 1 pool node crashed and was rebooted
    • LHCb user space token has been cleaned up now, dark data in other space tokens to follow; when that is done, the space tokens can be migrated
  • OSG
    • on Friday afternoon (UTC) FNAL was found to be missing from the CERN test BDII running the latest release, cause unknown; ticket GGUS:70595 was opened; by Friday evening FNAL was present again, but other unexpected differences with lcg-bdii.cern.ch were observed; the matter is being investigated
      • Ricardo: tentative new dates for upgrading lcg-bdii and sam-bdii are Monday and Tuesday next week
    • Maria D: question about GGUS upgrade, see item posted in AOB
  • PIC - ntr
  • RAL - nta

  • CASTOR
    • LHCb input needed for space token migration
      • Joel: will ping colleagues
  • dashboards - ntr
  • databases - nta
  • grid services - ntr

AOB: (MariaDZ) After discussions with OSG, the May GGUS release on 2011/05/25 will start at 6am UTC and last for max. 4 hours as usual. The OSG development part required will be done during the European afternoon. BNL and FNAL will act on ALARM tickets, if any, based on the email notifications, by putting updates into GGUS. Ticket synchronisation will be done automatically when both interventions are finished. Related ticket: Savannah:119899

Tuesday:

Attendance: local(Alessandro, Felix, Ignacio, Iouri, Jacek, Jamie, Maarten, Maria G, Michal, Peter, Philippe);remote(Foued, Gonzalo, Joel, Jon, Lorenzo, Michael, Paolo, Rob, Roger, Rolf, Ronald, Tiju).

Experiments round table:

  • ATLAS reports -
    • CERN: ADCR DB server instance 1 unavailable at ~16:20 yesterday, affecting various ATLAS services (CC, DDM_VOBOXes, Pandamon). Storage/disk problems. Temporarily fixed (workaround/reboot, but the DB remains at risk) at ~17:00. The same problem recurred between 22:00-23:00; the instance was brought online again. The scheduled intervention/outage on ADCR today at ~14:00-15:00 was canceled.
      • Jacek: issue caused by a failing disk that Oracle did not manage to evict; an intervention was successful, the disk was reintroduced and is being rebalanced, the system currently looks OK; the matter will be followed up with Oracle support
    • FZK-LCG: ~4K SRMV2STAGER errors. GGUS:70632 solved. There was a problem with tape staging from Saturday until Monday morning; the errors continued for a while afterwards due to the backlog. The tape connection has been working well since this morning.
    • NDGF: GGUS:70511 in progress, some corrupted files have been identified

  • CMS reports -
    • LHC / CMS detector
      • Beam operation and data taking: next collision run planned for tonight with 480 bunches
    • CERN / central services
      • NTR
    • Tier-0 / CAF
      • Expecting routine data taking
      • GGUS:70620 has been solved
        • Ignacio: 3 disk servers were in bad shape, causing delays while not triggering alarms in the monitoring; the trouble was with the system, not with the CASTOR software; the issues were cured by rebooting the machines
    • Tier-1
      • Lots of activity, MC and 2011 data re-reconstruction ongoing.
      • As of today, CMS workflows at Tier-1s are mostly done with the previous round of data and are waiting for the next one; the exception is FNAL, where a new round of re-reco (MinBias) for workflows previously left in a bad state will be re-launched today
    • Tier-2
      • MC production and analysis in progress (summer11 production)
    • AOB :
      • CMS and the Dashboard team are working in close collaboration, with high priority, on site downtime reporting, which will improve the overall site monitoring.

  • ALICE reports -
    • T0 site
      • Yesterday the number of running jobs unexpectedly decreased a lot for a while: several jobs were found without JDL records in the central services, causing the Job Broker to get stuck. This problem was not seen before and the cause is unknown at the moment. Removing those jobs cured the issue.
    • T1 sites
      • ntr
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities:
      • Data Taking is active again. Validation of a new reconstruction ongoing
    • New GGUS (or RT) tickets:
      • T0: 1
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
      • T1
        • PIC: 2 WMS nodes have rather full sandbox partitions
          • Gonzalo: being investigated
          • Maarten: at CERN we use an extra rpm to clean up those partitions (details sent offline); it is also possible that some users "abused" the service by putting large files in their input or output sandboxes and not cleaning them up (a hedged cleanup sketch is shown after this report)
      • T2
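
    Maarten's remark above about cleaning up WMS sandbox partitions can be illustrated with a minimal sketch of a periodic cleanup job. This is not the CERN rpm he refers to (whose details were only sent offline); the sandbox path and retention period below are assumptions for illustration only.

      # Hedged sketch: remove WMS sandbox directories untouched for more than
      # MAX_AGE_DAYS. Path and threshold are assumptions, not the actual
      # CERN cleanup rpm or its policy.
      import os
      import shutil
      import time

      SANDBOX_ROOT = "/var/glite/SandboxDir"   # assumed sandbox location
      MAX_AGE_DAYS = 30                        # assumed retention policy
      DRY_RUN = True                           # print instead of deleting

      cutoff = time.time() - MAX_AGE_DAYS * 86400
      for name in sorted(os.listdir(SANDBOX_ROOT)):
          path = os.path.join(SANDBOX_ROOT, name)
          if not os.path.isdir(path):
              continue
          # Use the directory mtime as a cheap "last activity" estimate.
          if os.path.getmtime(path) < cutoff:
              if DRY_RUN:
                  print("would remove", path)
              else:
                  shutil.rmtree(path, ignore_errors=True)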

Sites / Services round table:

  • ASGC
    • network maintenance downtime Fri May 20, 2-10 UTC; also some DPM DB optimization work will be performed
      • Q: will a downtime be declared for all services?
      • A: yes
      • note: at the time of writing only the DPM downtime appeared in the GOCDB
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3
    • Tue May 24 there will be a downtime for all services (incl. dCache) the whole day; the automatic downtime notifications and the CIC Portal may not work
      • confirmed by EGI broadcast sent after the meeting; the broadcast tool should work during the downtime
  • KIT
    • FTS downtime Tue May 24, 09:00-12:30 CEST: migration of Oracle backend to new servers
  • NDGF - nta
  • NLT1 - ntr
  • OSG
    • CERN top-level BDII investigation continues (GGUS:70595)
      • Maarten: since the latest BDII is stricter, 4 sites are always rejected because what they publish violates the schema; those are easy to fix; the worry is about the random rejection of good records (a minimal BDII query sketch is shown after this list)
  • PIC - nta
  • RAL - ntr
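
    As a small illustration related to GGUS:70595 and Maarten's comment above, the sketch below queries a top-level BDII for the published GlueSite entries, so that a missing site (e.g. FNAL) can be spotted quickly. The host, port and base DN are the conventional GLUE 1.3 values and the Python ldap3 module is assumed to be available; this is a sanity-check sketch, not the actual OSG/CERN debugging procedure.

      # Hedged sketch: list the sites published by a top-level BDII (GLUE 1.3).
      # Host, port and base DN follow the usual conventions (assumptions here).
      from ldap3 import ALL, Connection, Server

      BDII_HOST = "lcg-bdii.cern.ch"           # production top-level BDII
      BASE_DN = "Mds-Vo-name=local,o=grid"     # conventional GLUE 1.3 base

      server = Server(BDII_HOST, port=2170, get_info=ALL)
      conn = Connection(server, auto_bind=True)  # anonymous bind

      conn.search(BASE_DN, "(objectClass=GlueSite)",
                  attributes=["GlueSiteUniqueID"])
      sites = sorted(str(e.GlueSiteUniqueID) for e in conn.entries)
      print(len(sites), "sites published")
      print("FNAL present:", any("FNAL" in s for s in sites))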

  • CASTOR
    • LHCb space token migration finished, the recipe sent by Philippe Ch. has been followed
      • Q for LHCb: any T1 left?
      • Joel: only NLT1, who are first cleaning up the dark data; at IN2P3 there is some issue with the space reporting that is being looked into
  • dashboards - ntr
  • databases - nta
  • grid services - ntr

AOB:

Wednesday

Attendance: local(Alessandro, Ewan, Felix, Ignacio, Iouri, Jacek, Jamie, Maarten, Maria D, Maria G, Michal, Nicolo);remote(Andreas, Gonzalo, Joel, John, Jon, Kyle, Lorenzo, Michael, Paolo, Roger, Rolf, Ron).

Experiments round table:

  • ATLAS reports -
    • CERN/Central services
      • ADCR DB unavailable again, another disk failure (~11:20). DB restart on standby hardware.
        • Jacek: another failure caused the service to get stuck again, so the DB was moved to standby HW and Oracle support has been contacted; the standby HW has a slightly smaller capacity, in particular w.r.t. IOPS, but should be OK for now
        • Maria G: could the time to switch be reduced?
        • Jacek: normally it takes less than 1 hour
        • Alessandro: some ATLAS applications needed to have their configurations corrected to deal with this failover transparently; fixed with Marcin's help
    • Tier-1
      • IN2P3-CC unscheduled downtime (Regional-NAGIOS outage). May 18 8:30 - 17:30.
      • NDGF-T1/FI_HIP_T2 scheduled downtime (SE,SRM warning). May 17 22:00 - May 18 4:00.
    • Tier2

  • CMS reports -
    • LHC / CMS detector
      • Next PH collisions fill expected tonight with 768 bunches
    • CERN / central services
      • NTR
    • Tier-0 / CAF
      • Expecting routine data taking
    • Tier-1
      • Lots of activity, MC and 2011 data re-reconstruction ongoing.
        • T1_TW_ASGC : adjusting batch fairshare settings with new WMAgent submission tool (see Savannah:121021)
        • T1_DE_KIT : CMS forgot to request tape families for some datasets to be transferred custodially from T0, so the transfers failed and will need to be re-done manually. The mistake was on the CMS side, and we were warned by the CMS site contact at KIT; thanks and sorry! (see Savannah:121019)
    • Tier-2
      • MC production and analysis in progress (summer11 production)

  • ALICE reports -
    • T0 site
      • Very low activity.
    • T1 sites
      • ntr
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities:
      • Data Taking is active again. Validation of a new reconstruction is ongoing. We have a major issue: 9k jobs are in status "checking" / "input data resolution" and this number is increasing. We are trying to identify the problem but have not yet found its cause.
    • New GGUS (or RT) tickets:
      • T0: 1
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • network connectivity problem between volhcb20.cern.ch and the read-only (RO) instance of the LHCb LFC (INC:039361)
        • Joel: some pilots failed on ce130 - any known issues for that CE?
          • Maarten: it looked OK in the SAM tests
      • T1
        • SARA : CVMFS cache increased to 6 GB. Space token re-organisation done
        • PIC : the full sandbox partitions on wms01 and wms02 were due to a cron job that was not running.
      • T2

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT - ntr
  • NDGF
    • investigating GGUS:70511, problem with space, files failing to pin; there is a tape issue, 301 files may have been lost, not clear which VOs are affected; an ATLAS space is full
      • Alessandro/Iouri: a list of affected files has been requested
  • NLT1 - ntr
  • OSG
  • PIC - ntr
  • RAL - ntr

  • CASTOR - ntr
  • dashboards - ntr
  • databases - nta
  • grid services - ntr

AOB:

Thursday

Attendance: local(Alessandro, Ewan, Felix, Ignacio, Iouri, Jacek, Maarten, Maria D, Massimo, Michal, Peter);remote(Joel, John, Jon, Michael, Paco, Paolo, Rob, Roger, Rolf).

Experiments round table:

  • ATLAS reports -
    • CERN/Central services
      • ADCR DB on standby h/w was not accessible outside CERN (from ATLAS AMI DB in France). Fixed: the port was opened.
        • Jacek: the new HW of the ADCR standby DB had not been added to the correct firewall profile
        • Jacek: currently all physics DBs are accessible from outside, whereas ATLAS only needs outside access for AMI, whose DB might be moved to a dedicated instance, so that access to the rest can be restricted
        • Alessandro: a discussion about AMI has started, also concerning other aspects of that application
        • Iouri: PanDA DB connections look slow, being investigated
    • Tier-1
      • RAL-LCG2: File transfer failures to SCRATCHDISK: "DESTINATION error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] destination file failed on the SRM with error [SRM_ABORTED]" . GGUS:70698 filed 05-18 23:51, solved: 05-19 09:20. Seems a load issue: heavy load between 23:30 on 18/5 and 04:30 on 19/5 on LSF and a lot of gridFTP connections hitting two disk servers where space is available. Failed transfers have now gone through OK.

  • CMS reports -
    • LHC / CMS detector
      • PH collisions fill expected tonight on 768 bunches
    • CERN / central services
      • NTR
    • Tier-0 / CAF
      • Expecting routine data taking
    • Tier-1
      • MC and Data re-reconstruction ongoing.
      • 2 CMS SAM errors at T1s this morning :
        • T1_IT_CNAF : CMS SAM-SRMv2 error (lcg-cp), see GGUS:70711
        • T1_UK_RAL : CMS SAM-SRMv2 error (lcg-cp), see GGUS:70712
        • Both errors above were temporary glitches that quickly recovered; however, the sites stayed in an error state on the central CMS monitors (and triggered alarms from the shifters) because a dashboard collector was stuck. Now fixed.
        • However, this pointed us to a more general problem: the distributed site monitoring in the CMS Computing Shift Procedures is entirely based on the production Site Status Board (http://dashb-ssb.cern.ch/dashboard/request.py/siteviewhome), which still points to the "old" CMS SAM tests (CE/CREAMCE/SRMv2), while the Nagios tests are preferred nowadays. Hence, we need to replace the "old" CMS SAM columns with Nagios in the production SSB.
    • Tier-2
      • MC production and analysis in progress

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • CNAF: GGUS:70717. Submission is not working for one of the CREAMs (ce01-lcg.cr.cnaf.infn.it), also failing the SAM tests. Experts looking at it.
      • RAL: VOBox upgraded to the latest AliEn version. Job submission was not working due to an AliEn configuration error, fixed.
    • T2 sites
      • Several operations

  • LHCb reports -
    • Experiment activities:
      • Data Taking is active again. Validation of a new reconstruction is OK. The LFC problem has been fixed by changing the timeout and the time between two retries.
    • New GGUS (or RT) tickets:
      • T0: 1
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • network connectivity problem between volhcb20.cern.ch and the read-only (RO) instance of the LHCb LFC (INC:039361). In contact with the network team.
      • T1
        • SARA : space token re-organisation done and in place (there was a problem during the night because the new tokens were put in production before we had modified our Configuration Service).
      • T2

Sites / Services round table:

  • CASTOR - ntr
  • dashboards
    • both ATLAS DDM dashboards were upgraded this morning: the new version that is being validated and the old version that is in production
  • databases - nta
  • grid services - ntr

AOB:

  • Maria D was contacted as European Registration Agent for the DOEgrids CA to approve a certificate request by a user from Oregon who wants to join the "pheno" VO (a.k.a. "phenogrid") based in the UK
    • OSG cannot help here, because no site in Oregon participates in OSG
    • Maria will contact the VO managers to see if the user is known in their community and allowed to join the VO

Friday

Attendance: local (AndreaV, Michal, Felix, Ewan, Massimo, Yuri, Lola, Jacek, Peter); remote (Michael, Jon, Gonzalo, Gareth, Rolf, Rob, Christian, Paolo; Joel).

Experiments round table:

  • ATLAS reports -
    • CERN/T0/Central services
      • Since ~18:15 cron jobs via acrontab did not work; the ATLAS Tier-0 monitoring, which is based on acron, was not available either. ALARM GGUS:70735 in progress: one of the central AFS servers (running acron) failed. The acron service was restored at ~03:15. Down again at 7-8am? The gangarbt acrontab has not been running since 08:13 and acrontab -l was hanging, so this also affects user analysis.
        • [Massimo: hardware failure tonight on the machine serving users A-K; it was impossible to reboot it. The service was moved overnight to another machine, which worked initially; further instabilities were later fixed by restarting the machines. All is OK now (the alarm is closed), but the service is still running on a single machine instead of two for all users. Eventually all users will be served by a single, more powerful machine, and a second one will be kept as a hot backup. Reminder: this is not considered a critical service.]
        • [Peter: was the problem announced? Some people from CMS saw it and complained it was not announced. Massimo: it was announced on SSB.]
    • Tier-1
      • RAL-LCG2: AtlasDataDisk (disk server GDSS515) out of service this morning (09:30-11:30 UTC) to replace a faulty fan.
    • [Yuri: in addition, a problem with the SRM at Taiwan, SAV:82483, seems to involve Oracle. Felix: investigating with help from the vendor.]

  • CMS reports -
    • LHC / CMS detector
      • Took a nice fill in a single long run yesterday night (19/pb)
      • This afternoon: a fill for physics with 768 bunches is planned; over the weekend the aim is 5 fills or 30 hours with 912 bunches.
    • CERN / central services
      • CASTOR/T0 file copying issue: although "rfcp" seems successful (exit code 0), the file is missing (or exists with size == 0) and the T0 job crashes. This happened for P5 --> T0STREAMER transfers, but also for T0 WN --> T0TEMP copies. It is a small effect (it happened in bursts of errors 3 times in the last 2 weeks), but operationally very heavy. This case was reported earlier with some examples here: INC:033080 (a hedged copy-verification sketch is shown after this report)
        • The CASTOR team has all the relevant logs and information; if they need anything more specific, they should ask.
        • [Massimo: had a look at earlier issue, suspected a glitch, but it was too complex to debug in detail. As there is now a more recent example from yesterday, will investigate that one.]
    • Tier-0 / CAF
      • Expecting routine data taking
    • Tier-1
      • MC and Data re-reconstruction ongoing.
    • Tier-2
      • MC production and analysis in progress
    • AOB :
      • NTR
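
    Given the rfcp issue reported above, one cheap operational mitigation is to verify the destination file after the copy instead of trusting the exit code alone. The sketch below is an illustration under stated assumptions (the paths are invented and the nsls -l size field is parsed ls-style); it is not the actual T0 procedure.

      # Hedged sketch: copy with rfcp, then cross-check the size on the CASTOR
      # name server with "nsls -l", since exit code 0 alone proved unreliable.
      # Paths are hypothetical; the size-field position assumes ls -l layout.
      import os
      import subprocess
      import sys

      local = "/data/run12345/streamer.dat"                    # hypothetical
      remote = "/castor/cern.ch/cms/t0streamer/streamer.dat"   # hypothetical

      rc = subprocess.call(["rfcp", local, remote])
      if rc != 0:
          sys.exit("rfcp failed with exit code %d" % rc)

      out = subprocess.check_output(["nsls", "-l", remote]).decode()
      remote_size = int(out.split()[4])        # assumed 5th field = size
      local_size = os.path.getsize(local)

      if remote_size == 0 or remote_size != local_size:
          sys.exit("copy verification failed: %d vs %d bytes"
                   % (remote_size, local_size))
      print("copy verified:", remote_size, "bytes")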

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • Nothing to report
    • T2 sites
      • Several operations

  • LHCb reports -
    • Data Taking is active again. Validation of a new reconstruction is OK.
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 4
      • T2: 0
    • Issues at the sites and services
      • T0
      • T1
        • CERN : problems with CE ce130.cern.ch (GGUS:70748) and with ce203 and ce207 (GGUS:70730). [Ewan: ce130 is not doing particularly well, will investigate.]
        • CNAF : gridmap file problem: yesterday at 12:35 CEST the mapping of Ricardo suddenly changed at CNAF from pillhcb003 to pillhcb027. The reason is not known; maybe the hard links in /etc/grid-security/gridmapdir were deleted (for all the experiments, not only LHCb). From that time, the gram job state files of the running jobs (owned by pillhcb003) could no longer be managed by pillhcb027, and this caused the aborts. New jobs should not be affected. What remains now is to understand why this happened, but that is another story. [Paolo: the cross-border network channel is down, but the backup is OK; other users are not reporting problems.]
        • PIC : autofs problem (GGUS:70733). [Gonzalo: now fixed; it was affecting CVMFS. The procedures will be changed to improve the reaction to such issues.]
      • T2

Sites / Services round table:

  • KIT [Xavier]: no issues to report, but we had a maintenance downtime on one of our tape libraries; reading files archived on tapes in that library was not possible from 10:30 to 13:30 local time.
  • FNAL: ntr
  • BNL: ntr
  • PIC: nta
  • RAL: ntr
  • IN2P3: ntr
  • OSG: question: is the BDII upgrade confirmed for next week? It was already postponed and should not be postponed much further. [Ewan: will check with Ricardo.]
  • NDGF: problem with a Finnish T2: the information system was reporting execution times of -12 million seconds, so all jobs went there. Being fixed; the queues are diminishing. (A toy ranking illustration is shown after this list.)
  • CNAF: nta
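
    To make the NDGF observation concrete: assuming the commonly used default JDL rank of -other.GlueCEStateEstimatedResponseTime (an assumption about the brokers involved, not a statement about the Finnish T2's configuration), a CE publishing a hugely negative estimated response time obtains the highest rank and therefore attracts all jobs, as this toy calculation shows.

      # Toy illustration (not broker code): why a CE publishing an estimated
      # response time of -12,000,000 s wins the assumed default ranking
      # Rank = -GlueCEStateEstimatedResponseTime. CE names are invented.
      published_ert = {
          "ce.normal-site-a.example": 300,        # 5 min of queue, plausible
          "ce.normal-site-b.example": 3600,       # 1 hour of queue, plausible
          "ce.finnish-t2.example": -12000000,     # bogus negative value
      }

      # Lowest published ERT = highest rank = chosen first by the broker.
      for ce, ert in sorted(published_ert.items(), key=lambda kv: kv[1]):
          print("rank %12d  %s" % (-ert, ce))
      # The CE with the bogus value ranks far above the healthy ones, so the
      # match-making sends essentially every job there.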

  • Storage: nta
  • Dashboard: ntr
  • Grid: 8 CEs will be drained next week: 6 LCG CEs (ce124-129) and 2 CREAM CEs (ce201-2). The replacements (for CREAM) are already in production.
  • Databases: ntr

AOB: none

-- JamieShiers - 10-May-2011
