Week of 110704

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  • The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Lukasz, Massimo, Ignacio, Dirk, Gavin, Maarten, Jamie, Maria, Zbyszek, Nico, Alessandro, MariaDZ);remote(Gonzalo, Felix, Ulf, Onno, Joel, Tiju, Rolf, ShuTing Liao, Paolo, Dimitri).

Experiments round table:

  • CMS reports -
  • LHC / CMS detector
    • LHC Technical stop
    • ECAL in global run tonight
  • CERN / central services
    • Mon 4th - job failures on T0 with "send2nsd: NS009 - fatal configuration error: Host unknown: rfio" errors staging out to CASTORCMS, recovered in the morning. Opened GGUS:72193 asking for information. [ Massimo - one of the main head nodes went into a strange state again and had to be rebooted. We have seen this multiple times in the last weeks and have replaced the box as a cure. Ignacio - we got replacements for all head nodes of the same model. This was seen for CMS and ALICE. ]
  • T1 sites:
    • KIT: stageout of CMS prompt skimming jobs not allowed due to a missing tape family. CMS asked to temporarily disable the tape family protection to complete the high-priority workflow; many thanks to KIT for the intervention during the weekend. SAV:121019 and GGUS:72177. It is now time for CMS Data Operations to work with KIT on a permanent solution rather than a temporary work-around.
    • KIT: SAM test failures on the CE (software area) and SRM (timeouts) during the weekend. Now looks OK; KIT may close the ticket if the situation remains stable. GGUS:72201
    • FNAL-->KIT transfers: one file subscribed for transfer doesn't exist anymore, will need to be invalidated globally SAV:121923
  • T2 sites:
    • Some T2 sites unavailable in SAM during weekend, now recovered except for T2_TR_METU SAV:121933
  • AOB:
    • New JobRobot dataset distributed to T1 sites, will be distributed also to T2s if all is OK.
    • Ian Fisk will be CMS Computing Run Coordinator on duty from tomorrow July 5th to 11th.

  • ALICE reports -
    • T0 site
      • A partial power cut on Sunday caused some of the central services to be down for many hours, affecting essentially all jobs.
    • T1 sites
      • Nothing to report
    • T2 sites
      • Nothing to report

  • LHCb reports - Processing of last taken data going smoothly.

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
    • CASTOR: ticket open: INC048765. Any news? [ Massimo - I submitted the ticket in order to have a placeholder; we were asking about the other filenames. Joel - some of them have become unblocked. Massimo - the root cause is unknown; we have a recipe but need to know which files to apply it to. ]
    • LFC: upgrade to SL5: postponed
  • T1
    • RAL: planned major intervention on their network on Tuesday morning; RAL plan to use this downtime slot to also upgrade to CASTOR 2.1.10-1.
    • GRIDKA: some intermittent problems with the shared software area [ Dimitri - we think the problem was due to NFS problems over the weekend; as far as I understand it is now OK. ] [ Nico - this also correlates with the CMS SAM test failures. ]

Sites / Services round table:

  • FNAL (Bakken): Holiday in the USA. We did NOT have any cooling outage last Friday or over the weekend. We had a 4-hour outage of the primary and secondary LHCOPN (Starlight) network on Saturday morning. The backup LHCOPN link continued to work, but bandwidth was reduced to 8 GE during the outage. The default ESNET link also continued to work normally. There was a fiber cut close to Chicago that was repaired promptly.
  • PIC - ntr
  • ASGC - ntr
  • NDGF - Severe thunderstorms over Copenhagen flooded at least one machine room. Compute power for NDGF-T1 is shut down in Denmark due to this. The dCache pools and tape pools should be online, but there might be short interruptions as the flooding damage is fixed. NDGF-T1 mailserver was down for ~18h due to this from Saturday evening onward.
  • NL-T1 - SARA SRM: this morning a disk server had a kernel panic; OK after a reboot. Some ATLAS and LHCb files were temporarily unavailable.
  • RAL - reminder of tomorrow's intervention on site: declared in GOCDB. Work on CASTOR and FTS. Site n/a for about 3h.
  • IN2P3 - ntr
  • CNAF - connectivity problem for about 1h during the weekend, solved on Saturday
  • KIT - nta

  • CERN The AFS UI new_3.2 link has now been updated to gLite 3.2.10. This version of the UI appears to include python2.6 modules.

  • CERN DB: PDBR will be moved back to original h/w. DB will not be available for 1h at 11:00. Tomorrow at 14:00 will upgrade CMS integration DB to 11g.

  • CERN Storage: tomorrow move to CASTOR 2.1.11 plus some optimisation to avoid the problems seen with the DB; after that ALICE and LHCb (Wed). Also a stress test. The ALICE front-end to CASTOR is changing from 8 machines to one big box; this will be scheduled when OK with ALICE online. Q: what about the LSF test that is ongoing? Massimo - we switched over from non-LSF to LSF plus some DB optimisation, then rolled back - and back again. Ale - so ATLAS will start with stress tests of this non-LSF "transfer manager".

AOB: (MariaDZ) GGUS Release this Wednesday. Please note the regular ALARM testing ticket Savannah:121939 including GGUS certificate change test Savannah:121803

* Gridops.org decommissioned:

 Here's the list of new addresses:
   • cic.gridops.org (old CIC portal) -> cic.egi.eu
   • goc.gridops.org (GOCDB) -> goc.egi.eu
   • gstat.gridops.org (gstat) -> gstat.egi.eu
 Addresses of other operational tools can be found on the following wiki page: https://wiki.egi.eu/wiki/Tools
 Please update links in all your custom tools and web pages, and forward this information to all known relevant contacts.
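
The mapping above lends itself to a quick scan of local tools and cron scripts for the decommissioned hostnames. The following is a minimal sketch, not an official migration tool; the search directory and file extensions are assumptions to adapt to your own setup.

#!/usr/bin/env python
# Sketch: scan local scripts/configs for decommissioned gridops.org hostnames
# and print the new egi.eu address for each hit (mapping from the broadcast above).
# SEARCH_ROOT and EXTENSIONS are placeholders; adjust them to your installation.
import os
import re

HOST_MAP = {
    "cic.gridops.org": "cic.egi.eu",      # old CIC portal
    "goc.gridops.org": "goc.egi.eu",      # GOCDB
    "gstat.gridops.org": "gstat.egi.eu",  # gstat
}
SEARCH_ROOT = "/opt/local-tools"              # hypothetical directory with custom tools
EXTENSIONS = (".sh", ".py", ".conf", ".cfg")

pattern = re.compile("|".join(re.escape(h) for h in HOST_MAP))

for dirpath, _dirs, filenames in os.walk(SEARCH_ROOT):
    for name in filenames:
        if not name.endswith(EXTENSIONS):
            continue
        path = os.path.join(dirpath, name)
        try:
            with open(path) as fh:
                for lineno, line in enumerate(fh, 1):
                    for match in pattern.finditer(line):
                        old = match.group(0)
                        print("%s:%d: %s -> %s" % (path, lineno, old, HOST_MAP[old]))
        except IOError:
            pass  # unreadable file, skip it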

  • acron service at CERN upgraded this morning. "Minimal perturbation".

Tuesday:

Attendance: local(Massimo, Lukasz, Jamie, Gavin, Maarten, Eva, Nilo);remote(Michael, Ulf, Gonzalo, Jon, Xavier, Rolf, Joel, Ronald, Tiju, Paolo, ShuTing, Rob).

Experiments round table:

  • ATLAS reports -
    • T0/CERN:
      • CERN-PROD GGUS:72229 it seems that ATLAS cannot run more than a few jobs in parallel. Fairshare issue? [ Gavin - investigated: the ATLAS share is fine, but the issue was that one local ATLAS user had submitted a large number of jobs to a very long queue. The user has stopped that activity and a limit is now in place so that local users are restricted in the longer queues. ]

  • CMS reports -
  • LHC / CMS detector
    • LHC Technical stop
  • CERN / central services
    • Castor upgrade. Slowed down by stager DB migration
  • T1 sites:
    • Problem with the conditions used in the weekend reprocessing. Jobs resubmitted.
    • KIT: stageout of CMS prompt skimming jobs not allowed due to a missing tape family. CMS asked to temporarily disable the tape family protection to complete the high-priority workflow; many thanks to KIT for the intervention during the weekend. CMS Data Operations to get in contact with KIT for a more permanent solution. SAV:121019 and GGUS:72177. We will be writing here again. [ Xavier - we just got word from DE CMS that an admin intervention is needed. We need to go to our tape experts and ask if this can be fixed in the DB. ]
  • T2 sites:
    • A number of transfer tickets
  • AOB:
    • New JobRobot dataset distributed to T1 sites, will be distributed also to T2s if all is OK.
    • Ian Fisk is the CMS Computing Run Coordinator on duty from July 5th to 11th.


  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • Nothing to report
    • T2 sites
      • Nothing to report

  • LHCb reports - Processing of last taken data going smoothly.
    • T0
      • CASTOR: Ticket open: INC048765 Philippe will send the two remaining problematic files?
      • LFC: upgrade to SL5: probably today (rolling upgrade by adding an SLC5 m/c) [ Massimo - for a number of reasons haven't had time but will look into it this afternoon. ]
    • T1
      • RAL: intervention over
      • NIKHEF agreed to increase the memory limit on the WNs until September - we found where in our application the problem is, and the next release should need less memory.

Sites / Services round table:

  • BNL - yesterday morning we observed a slight degradation of transfer efficiency, related to a huge number of activities on the T0D1 space token area. Looking further into the application, we found it was related to the central deletion service trying to delete files/directories with > 200K subdirectories from a T0D1-related path. This affects PNFS responsiveness and impacts other transfer-related activities due to extended DB table locking. We deleted these unnecessary subdirectories (mostly empty) and restored performance.
  • NDGF - Copenhagen will stay down until Friday (WNs) but storage still up - no interruptions to storage.
  • PIC - ntr
  • FNAL - ntr
  • KIT - nta
  • IN2P3 - put into production two linux servers for xrootd for ALICE
  • NL-T1 - SARA MATRIX had short downtime to restart dcache following upgrade of CA RPMs
  • RAL - update: completed intervention successfully and all services back online now
  • CNAF - ntr
  • ASGC - ntr
  • OSG - ntr

  • CERN Storage: CMS upgrade still ongoing; the reason is that maintenance on the stager DB is taking much longer. The complexity of the DB should not be an issue; the procedure is now the issue - we hope to restore the service in 30' or so.

  • CERN DB: the LHCb online DB died yesterday afternoon after a transparent intervention done by the sysadmins in the pit - had to switch to the standby in the CC. This morning the PDBR DB was moved back to RAC6, and tomorrow morning the COMPASS DB move is scheduled.

  • CERN DASHBOARD: migration of the dashboard cluster from SL4 to SL5; the alias for the CMS WMAgent had to be switched - transparent to users.

  • CERN BATCH - will remove local batch queues in the next two weeks and announce on service status board. Queues are *80, e.g. 1nh80, 8nh80 etc.
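
For users who still submit to the local queues being retired, a small check could flag affected jobs before the removal. This is a sketch only, assuming LSF's bjobs command is available and that the queue name is the fourth column of its default wide output; the "*80" name match simply follows the examples given above (1nh80, 8nh80).

#!/usr/bin/env python
# Sketch: list the current user's LSF jobs sitting in the soon-to-be-removed
# local "*80" queues (e.g. 1nh80, 8nh80). Assumes bjobs is on the PATH and
# that the queue name appears in the 4th column of its default wide output.
import subprocess

def jobs_in_deprecated_queues():
    out = subprocess.Popen(["bjobs", "-w"], stdout=subprocess.PIPE).communicate()[0]
    out = out.decode("utf-8", "replace")
    flagged = []
    for line in out.splitlines()[1:]:          # skip the header line
        fields = line.split()
        if len(fields) >= 4:
            jobid, queue = fields[0], fields[3]
            if queue.endswith("80"):           # matches 1nh80, 8nh80, ...
                flagged.append((jobid, queue))
    return flagged

if __name__ == "__main__":
    for jobid, queue in jobs_in_deprecated_queues():
        print("job %s is in deprecated queue %s - please resubmit to a public queue" % (jobid, queue))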

AOB: (MariaDZ)

Wednesday

Attendance: local(Alessandro, Jamie, Maarten, Lukasz, Ignacio, Massimo);remote(Ulf, Jon, Michael, Xavier, Gonzalo, Onno, Felix, Tiju, Rolf, Paolo, Ian, ShuTing, Joel, Rob).

Experiments round table:

  • ATLAS reports -
    • CERN-IT/ATLAS CASTOR tests: the new CASTOR setup is validated under a realistic ATLAS environment. It is proposed to keep it after the MD+TS; a rollback in case of instability under stress conditions is TBD with the experts.
    • CERN CASTOR: many instabilities due to h/w problems affecting some files - ongoing for 10 days. Please declare the filenames and we will take action.

  • CMS reports -
  • LHC / CMS detector
    • LHC Technical stop
  • CERN / central services
    • Castor upgrade finished yesterday evening
    • No files are being marked as migrated to tape by PhEDEx. Ticket submitted: https://ggus.eu/ws/ticket_info.php?ticket=72314 [ Massimo - yes, the upgrade took much longer and finished with us rebooting the DB, which had gone into a strange state. This needs deeper investigation - the full procedure went fine in preparation but not in the real thing; all DB operations took factors longer than in the tests. Second point: we acknowledged the ticket and see data accumulating and being copied to tape. Ignacio - please send us the names of files that should have been migrated; we see user files going for migration. ] Ian - since the upgrade we have had lots of "expired" files; will update the ticket.
    • CERN dropped from BDII, and was recovered after site BDII restarted
  • T1 sites:
    • Failures of transfers from FNAL to CERN. Normally this is Release Validation and somewhat time-critical. [ Maarten - what size? A: 100 GB files. We have timeouts set for 20 GB; if a RAW data file is expected to produce a RECO file > 20 GB it is kicked into a recovery stream to be handled later - see the sketch after this report. ]
  • T2 sites:
    • NTR
  • AOB:
    • NTR
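
The 20 GB cut-off mentioned in the FNAL-to-CERN discussion above amounts to a simple routing rule. The toy sketch below only restates that rule; the function name, the size estimate and the stream labels are illustrative assumptions, not CMS code.

# Toy sketch of the rule described above: transfer timeouts are tuned for files
# up to ~20 GB, so any RAW file expected to yield a RECO output larger than that
# is diverted to a recovery stream handled later. Names and values are illustrative.

TRANSFER_LIMIT_BYTES = 20 * 10**9   # ~20 GB, the timeout tuning point quoted above

def route_reco_output(expected_reco_size_bytes):
    """Return which stream an expected RECO output should follow."""
    if expected_reco_size_bytes > TRANSFER_LIMIT_BYTES:
        return "recovery"   # handled later, outside the normal transfer timeouts
    return "prompt"

# Example: a 100 GB output (as in the reported failures) goes to the recovery stream.
print(route_reco_output(100 * 10**9))   # -> recovery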


  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • NIKHEF: VOBOX keeps running out of memory (4 GB), under investigation
      • RAL: no jobs running since a few days, under investigation [ Tiju - ALICE is above its fair share, so it might see delays before jobs start ]
    • T2 sites
      • Nothing to report

  • LHCb reports - Processing of last taken data going smoothly.
    • T0 LFC: upgrade to SL5: in production
    • The CERN root CA has changed. Is there a procedure to follow? Maarten - for most people nothing needs to be done. (A quick way to check which CA your certificate chains to is sketched after this report.)
    • T1 CERN: intervention over
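
For anyone who wants to verify which CA issued their grid certificate after the change above, the following is a minimal sketch. The ~/.globus/usercert.pem location is the common convention but still an assumption, and it relies on the standard openssl x509 command being installed.

#!/usr/bin/env python
# Sketch: print the issuer and expiry date of a grid user certificate, to see
# which CERN CA it was issued by. The certificate path is the usual Globus
# default and may differ on your system; openssl must be installed.
import os
import subprocess

usercert = os.path.expanduser("~/.globus/usercert.pem")   # common default, adjust if needed
proc = subprocess.Popen(
    ["openssl", "x509", "-in", usercert, "-noout", "-issuer", "-enddate"],
    stdout=subprocess.PIPE)
print(proc.communicate()[0].decode("utf-8", "replace"))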

Sites / Services round table:

  • NDGF:
    • srm.ndgf.org downtime for dCache upgrade 12.7.2011 13-14 UTC
    • Finnish pools of srm.ndgf.org down for fibre works on 13.7.2011 16-17 UTC.
    • No news about situation in Denmark
  • FNAL - ntr
  • BNL - ntr
  • KIT - ntr
  • PIC - ntr
  • NL-T1 - ntr
  • ASGC - ntr
  • RAL - ntr
  • CNAF - ntr
  • IN2P3 - problem deploying the new certificate authority list: at 11:30 the deployment was done, after which all CREAM CEs crashed, among other problems, due to a quota problem in the file system. The error was corrected two hours later; we suppose all LHC experiments that had jobs running will have seen some problems - by now all is OK.
  • OSG - ntr

  • CERN Storage - we upgraded not just LHCb but also ALICE (CASTOR to 2.1.11). Faster than foreseen.

AOB:(MariaDZ) Progress on the test ALARMs following today's GGUS release can be followed via Savannah:121939. The alarm test for CNAF is untouched except for a message by a CNAF T1 admin (GGUS:72289). Rob - should we in the US expect alarms? Maria - yes, they were not launched earlier due to timezones. Jon - please can I make a small request? It is great that you don't wake us up, but you send them at 12:00 noon; maybe an hour earlier would be better. Maarten - people used to receiving GGUS email updates might have noticed a problem this morning (no URLs) - this has since been fixed.

Thursday

Attendance: local(Maria, Jamie);remote(Michael, Jon).

Experiments round table:

  • ATLAS reports - Reconstruction tests are being conducted by the Tier-0 team; databases are affected for a short (<1h) period.

  • CMS reports -
  • LHC / CMS detector
    • Getting ready for end of technical stop. Swapping out CMS prompt reco versions.

  • CERN / central services
    • CAF was switched to EOS for all user access.

  • T1 sites:
    • Slow transfers reported from KIT; we believe this has been traced to an overload of the STAR channel at KIT. Improvements have been suggested in a ticket to the site.

  • T2 sites:
    • NTR

  • AOB:
    • NTR

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • yesterday's issues at NIKHEF and RAL have been fixed
    • T2 sites
      • Nothing to report

  • LHCb reports - Processing of last taken data going smoothly.
    • T0 - LFC: upgrade to SL5: one instance OK. The second one will be put in production this afternoon.

Sites / Services round table:

AOB:

Friday

Attendance: local(Massimo, Lukasz, Jamie, Maria, Eva, Nilo, Maarten, Gavin);remote(Michael, Ian, Jon, Gareth, Gonzalo, Felix, Rolf, Andreas, Dimitri, Joel, Alexei, Onno, Rob).

Experiments round table:

  • ATLAS reports - data11_7TeV run later this afternoon. RAW data export from CERN will be started.

  • CMS reports -
  • LHC / CMS detector
    • Ramping up detector (i.e. magnet) for end of technical stop
  • CERN / central services
    • Lost the BDII at CERN again. Ticket submitted. [ Gavin - a permanent fix is in preparation, which is to update OpenLDAP, hopefully next week ]
  • T1 sites:
    • Slow transfers reported from KIT; we believe this has been traced to an overload of the STAR channel at KIT. Improvements have been suggested in a ticket to the site.
    • Big problems with file access at KIT. Tickets submitted; impacting production and analysis. Report from KIT that the problem (with dCache) is understood - expected to restabilize.
  • T2 sites:
    • NTR
  • AOB:
    • NTR

  • ALICE reports - rate test ongoing this morning and afternoon. Still not finished but going fine. Things look fine from CASTOR side too.

  • LHCb reports - Processing of last taken data going smoothly.
    • T0 - LFC: upgrade to SL5: two instances running on SLC5 with virtual machine

Sites / Services round table:

  • BNL - ntr
  • FNAL - ntr
  • RAL - had 1 disk server out from ATLAS scratch - back in R/O now.
  • PIC - ntr
  • ASGC - ntr
  • IN2P3 - ntr
  • NDGF - update on DK: the thunderstorms and flooding are under control. All services have been started again and the monitoring looks OK. A small snag with a tape pool should have been fixed during the day. Yesterday's disk server with ATLAS data that went missing is now back in R/O mode - it looks more stable than it has for a long time. Some outages next week, as in the minutes from two days ago.
  • KIT - planned firmware update for tape library. Tuesday 10:00 - 12:00
  • NL-T1 - ntr
  • OSG - short network outage at the OPS center Sunday morning, < 15'; services failed over but there was still a short outage early in the morning, East Coast time.

  • CERN Storage - ALICE has just finished test - seems ok!

  • CERN DB: LHCb online DB switched back to hardware in pit

  • CERN FTS: the "Tier2" FTS service at CERN (fts-t2-service.cern.ch) now has the following new channels started: BNL-CERN, CERN-CERN, IN2P3-CERN, INFN-CERN, KIPT-STAR, RAL-CERN and STAR-KIPT. All these agents run on the first SL5 FTS service node at CERN.
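
A quick way to confirm the new channels exist and see their settings would be to query the channel management interface with the gLite FTS client. The sketch below is an assumption-laden example: it presumes the glite-transfer-channel-list command is installed and accepts a "-s <endpoint>" option, and the endpoint URL shown is a placeholder for the real fts-t2-service.cern.ch ChannelManagement endpoint.

#!/usr/bin/env python
# Sketch: query the newly announced Tier2 FTS channels via the gLite FTS CLI.
# Both the "-s" option usage and the endpoint URL below are assumptions; check
# the FTS client documentation and your site configuration before relying on them.
import subprocess

FTS_ENDPOINT = "https://fts-t2-service.cern.ch:8443/glite-data-transfer-fts/services/ChannelManagement"  # placeholder

NEW_CHANNELS = [
    "BNL-CERN", "CERN-CERN", "IN2P3-CERN", "INFN-CERN",
    "KIPT-STAR", "RAL-CERN", "STAR-KIPT",
]

for channel in NEW_CHANNELS:
    proc = subprocess.Popen(
        ["glite-transfer-channel-list", "-s", FTS_ENDPOINT, channel],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out = proc.communicate()[0]
    print("=== %s ===" % channel)
    print(out.decode("utf-8", "replace"))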

AOB:

-- JamieShiers - 27-Jun-2011
