Week of 120326

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO summaries of site usability: ALICE, ATLAS, CMS, LHCb
SIRs, open issues & broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS information: GgusInformation
LHC machine information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Cedric, Ale, Ulrich, Mike, Maarten, Eva, Dirk); remote(Gonzalo/PIC, Michael/BNL, Mette/NDGF, Joel/LHCb, Tiju+John/RAL, Kyle/OSG, Burt/FNAL, Onno/NL-T1, Pavel/KIT, Jhen-Wei/ASGC, Daniele/CMS, Rolf/IN2P3, Stefano/CNAF).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • Alarm ticket : GGUS:80602, CERN-IN2P3 FTS channel is not processing ATLAS transfers
    • T1s
      • Alarm to INFN-T1 : GGUS:80582, failed to contact on remote SRM.
      • Alarm to TAIWAN : GGUS:80586, failed to contact on remote SRM [httpg://f-dpm001.grid.sinica.edu.tw:8446/srm/managerv2]. "Informed by ISP that 10Gb network between Taipei and Amsterdam broken from 06:00 - 10:00 UTC time. During the period we were using 2.5Gb backup link for all WLCG activities, thus some transfers failed because there was no sufficient bandwidth. 10Gb link up at 10:15am UTC timezone."
      • Triumf : GGUS:80569, "The problem is that the CRL of ca_NIKHEF expired this morning on our FTS. I just ran fetch-crl manually to fix this problem. Now I saw the transfers from SARA succeed."
      • Tape families : https://twiki.cern.ch/twiki/bin/viewauth/Atlas/ADCOperationsProjectTags
    • T2s
      • INFN-MILANO-ATLASC : GGUS:80503, SRM problem.
      • UKI-NORTHGRID-LIV-HEP : GGUS:80583, power and cooling failure for many hours

  • CMS reports -
    • LHC machine
      • beta* measurements, injection protection set-up, ramp and squeeze with pilots and tight collimator settings (and more...)
    • Plans (as from this morning's LHC Ops meeting):
      • Collimator set-up at flat top
      • moved to Wed: ramp up of 2-3 bunches and collisions. Need luminosity from experiments
      • moved to Wed: collimators set up at collisions
      • Next weekend: plan to have FIRST COLLISIONS with STABLE BEAMS. The aim is to have the 3-bunch fill and the 48-bunch fill during next weekend.
    • CERN / central services
      • CERN SRM: SLS availability down to 75% in the CMS Critical Services, availability info = "Error on Delete", caused by "Too many threads busy with Castor at the moment". Assumed to be caused by load; no ticket opened. No major impact on ops.
      • CERN CMS build machines: SLS availability down to 0% over the weekend; catching up today with the experts to know more, no tickets opened (and no complaints).
    • Tier-0:
      • no more crashes in the express workflow, all under control
    • Tier-1:
      • [new] RAL: tape migration backlog (minor, just a few files), being taken care of, not a worry for ops (Savannah:127387)
      • [new] FNAL: 35 files not in "ready to transfer" state at FNAL, probably a PhEDEx stager agent needs to be checked, and probably not a site issue at all (Savannah:127396)
      • [new] KIT/FNAL: some transfers KIT->FNAL expiring in PhEDEx/FTS, probably need to check at the FNAL FTS server level (Savannah:127397)
    • Tier-2:
      • business as usual, with a slight increase in the failure rate of Job Robot; a few Savannah tickets opened to track with sites

  • LHCb reports -
    • Data re-stripping and user analysis at Tier-1 sites going well
      • Stripping jobs (both 17b & 18) are ~complete at all sites except GridKa and IN2P3
      • The associated merging jobs are ongoing.
    • MC simulation at Tier-2 sites ongoing
    • T0
    • T1
      • GRIDKA : slow tape recall (GGUS:80589)
        • Onno: one tape library broken since Friday and under investigation. This may also affect some CMS pools.
      • CNAF : (GGUS:80592) some CVMFS endpoint not accessible (wn-103-03-31-07-a :: df: `/cvmfs/lhcb.cern.ch': Transport endpoint is not connected) - see the probe sketch after this report
      • IN2P3 : low number of running jobs (GGUS:80595)
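
The "Transport endpoint is not connected" error quoted in the CNAF ticket above is the usual symptom of a dead CVMFS FUSE mount on a worker node. Below is a minimal Python sketch of how such a mount could be detected; the repository path and the remount hint are illustrative assumptions, not the procedure CNAF actually followed.

```python
# Minimal probe for a stale CVMFS mount as in GGUS:80592 ("Transport endpoint
# is not connected"). Illustrative only; path and follow-up action are assumptions.
import errno
import os

def cvmfs_mount_ok(repo="/cvmfs/lhcb.cern.ch"):
    """Return True if the repository answers, False if the FUSE endpoint is dead."""
    try:
        os.listdir(repo)  # any metadata operation on the mount point will do
        return True
    except OSError as exc:
        if exc.errno == errno.ENOTCONN:  # errno 107: Transport endpoint is not connected
            return False
        raise  # other errors (ENOENT, EACCES, ...) point to a different problem

if __name__ == "__main__":
    if not cvmfs_mount_ok():
        print("stale CVMFS mount detected; the repository typically needs to be remounted")
```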

Sites / Services round table:

  • Gonzalo/PIC - ntr
  • Michael/BNL - follow-up on last week's job failures: high load on local AFS was suspected initially (now replicated), but the problem turned out to be the use of setup scripts from AFS at CERN. This needs ATLAS follow-up and may also affect other sites.
  • Mette/NDGF - tomorrow downtime from 1-4 in the afternoon (SRM and dCache pools), another one on Wed 1-3 for a dCache update
  • Tiju/RAL - ntr
  • Burt/FNAL - the CMS ticket about FNAL/KIT transfers has been solved. For the second CMS ticket, in contact with experts; hoping for a solution by the afternoon US time.
  • Onno/NL-T1 - ntr
    • Joel: SARA unscheduled downtime (warning concerning one disk server which may go down) should really be a scheduled downtime, as this is not fully transparent and LHCb will ban the site. Ale: had this discussion already some time in the past - ATLAS would prefer not to put the site in downtime for a single server, but rather run with the remaining resources. Onno: the long duration of the warning period is just due to the uncertainty about when the vendor technician will do the intervention.
  • Pavel/KIT - ntr
  • Jhen-Wei/ASGC - On Sunday a 3h network outage, but the backup link was used. Still some transfer failures due to insufficient bandwidth.
  • Rolf/IN2P3 - ntr
  • Stefano/CNAF - LHCb CVMFS problem: experts are investigating. On the 29th there is a scheduled downtime for the FTS upgrade; queues will start to be drained the day before.
  • Kyle/OSG - ntr
  • Ulrich/CERN - problem with the batch farm during the last days: a campaign to change a BIOS setting was applied to too many nodes at the same time (due to a configuration problem). Now fixed.

AOB:

Tuesday

Attendance: local(Daniele, Eva, Luca M, Maarten, Mike, Simone, Ulrich);remote(Gerard, Joel, Kyle, Lisa, Mette, Michael, Paco, Rolf, Tiju, Xavier).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • Alarm ticket : GGUS:80602 solved and verified. All FTS servers at CERN have been patched for a known issue. Thanks.
      • Some (non-critical) files cannot be exported from T0. Not clear if the problem is in staging from tape or in accessing data from the disk server (or both). Under investigation by CERN+ATLAS DDM. GGUS:80656
        • Luca: at first the copies were from CASTOR to CASTOR
        • Simone: that would be recalls, the stage-in then failed somehow
        • Luca: the NS update today might have led to timeouts for a few minutes
        • Simone: the situation looks OK since this morning
      • High LSF bsub average time of nearly 10,000 ms during the early morning. Then it started to come back down; it would be good to know what happened. No ticket.
        • Ulrich: problem during weekend reported yesterday led to 200k jobs in the queue, which may have put pressure on the system; it also could have been due to some user, hard to tell now; let us know when you see it again
    • T1s
      • Alarm to INFN-T1 : GGUS:80582 solved. Certificate problem at the StoRM FE.
      • Alarm to TAIWAN : GGUS:80586 solved. Problem with the primary 10Gb link Taipei-Amsterdam. The 2.5Gb link was being used instead.
      • FZK : GGUS:80596 : ATLASMCTAPE (TAPE buffer for MC) not cleaned because of slow/no migration to TAPE. Currently affects archival of MC data but will soon affect MC production in the DE cloud and later data export from T0
        • Xavier: we are working on it, no estimates yet
      • PIC: FTS upgrade to FTS 2.2.8 underway

  • CMS reports -
    • LHC machine
      • Cryo problems, commissioning work...
      • Trying collisions today, or on THU
      • Aiming for first fill with stable beams (3x3) by the end of this week.
      • To be followed by the 48x48 fills, then the scrubbing run. To be confirmed.
    • CERN / central services
      • CERN SRM (srm-cms.cern.ch SRM v2.2 endpoint) quite unstable in SLS (here), and getting worse recently. No tickets.
        • Luca: did it cause problems for CMS?
        • Daniele: no
        • Luca: looking into it, no clues yet
      • CRAB servers availability dropped down to 70% today at ~4am for ~1 hr. No impact.
    • Tier-0:
      • brilliant solution of the CMST0 black-hole WN issue (here). Tracked down to an error from malloc at line 151 of payload.c of the Frontier library. CMST0 rebooted and kernel updated (target update to SLC5.8, kernel 2.6.18-308.1.1.el5). Well coordinated with CMS and well executed by CERN-IT. Thanks (among others) to Samir Cury, Gavin McCance and Dave Dykstra.
        • Ulrich: what was the error message?
        • Daniele: details in the ticket
    • Tier-1:
      • [follow-up] RAL: tape migration backlog (minor, just few files) (Savannah:127387). SOLVED.
      • [follow-up] FNAL: 35 files not in "ready to transfer" state at FNAL, probably a PhEDEx stager agent needs to be checked, and probably not a site issue at all (Savannah:127396). SOLVED.
      • [follow-up] KIT/FNAL: some transfers KIT->FNAL expiring in PhEDEx/FTS, probably need to check at the FNAL FTS server level (Savannah:127397). SOLVED.
    • Tier-2:
      • business as usual.

  • LHCb reports -
    • Data re-stripping and user analysis at Tier-1 sites going well
    • Stripping jobs (both 17b & 18) are ~complete at all sites except GridKa and IN2P3
    • The associated merging jobs are ongoing.
    • MC simulation at Tier-2 sites ongoing
    • New GGUS (or RT) tickets
      • T0
        • CERN : waiting for implementation ...(GGUS:79685)
          • Ulrich: indeed waiting for implementation, low priority
          • Maarten: LHCb should anyway fall back on using '.' if TMPDIR is not defined (see the sketch after this report), so maybe CERN does not need to do anything?
          • Ulrich: we had already started preparing the implementation
      • T1
        • GRIDKA : slow tape recall (GGUS:80589)
        • Rolf: high LHCb activity at IN2P3; increasing the share would lead to stalled jobs; dCache disk cache full with pinned data sets!
        • Joel: we limit the activity on our side when needed
        • Rolf: for GGUS:80338 (corrupted files) we need further info from LHCb
        • Joel: will follow up
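
The TMPDIR fallback Maarten mentions above can be illustrated with a minimal Python sketch: use $TMPDIR when the batch system defines it, otherwise the job's current working directory. This shows the intended logic only and is not the actual LHCb/DIRAC code.

```python
# Illustration of the TMPDIR fallback mentioned above: use $TMPDIR when the
# batch system defines it, otherwise fall back on the job's current working
# directory ('.'). A sketch only, not the actual LHCb/DIRAC implementation.
import os
import tempfile

def job_scratch_dir():
    """Pick a scratch directory: $TMPDIR if set and writable, else '.'."""
    tmpdir = os.environ.get("TMPDIR")
    if tmpdir and os.path.isdir(tmpdir) and os.access(tmpdir, os.W_OK):
        return tmpdir
    return "."  # current working directory of the job

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(dir=job_scratch_dir()) as tmp:
        print("scratch file created in:", tmp.name)
```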

Sites / Services round table:

  • BNL - ntr
  • FNAL - ntr
  • IN2P3 - nta
  • KIT - nta
  • NDGF
    • 2h downtime tomorrow afternoon for dCache pool upgrades
  • NLT1
    • downtime tomorrow 10:30-12:30 CEST to restart all NIKHEF grid services with new kernel
  • OSG - ntr
  • PIC
    • FTS 2.2.8 upgrade went OK
  • RAL - ntr

  • dashboards
    • Mike: timeline for FNAL FTS to publish info in new monitoring system?
    • Lisa: FTS experts first needed to debug transfer errors, now fixed; monitoring is next
  • databases - ntr
  • grid services
    • CERN CVMFS: starting Monday 2nd April at 09:30, duration 24 hours, the CVMFS Stratum 0 service will be unavailable for /cvmfs/cms.cern.ch. Stratum 1s will continue to serve the last copy of /cvmfs/cms.cern.ch and CVMFS client sites should be unaffected by the migration. Updates: ITSSB.
  • storage
    • CASTOR Name Server updated; clients may have experienced timeouts for about 5 minutes

AOB:

Wednesday

Attendance: local(Daniele, LucaC, Simone, Mike, Maarten, LucaM, Ulrich, Dirk); remote(Mette/NDGF, Alexander/NL-T1, Joel/LHCb, John/RAL, Jhen-Wei/ASGC, Burt/FNAL, Xavier/KIT, Jeremy/GridPP, Kyle/OSG, Stefano/CNAF).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • NTR
    • T1s
      • Tape migration problem at FZK persists (news?). The whole DE cloud has been put in "brokeroff", i.e. it will drain the jobs assigned to it but no new tasks will be assigned. More than 70TB of backlog for tape migration. FZK has also been blacklisted for T0 exports.
      • Problem contacting FTS in PIC after the upgrade yesterday. GGUS:80692 has been submitted. Problem fixed.
      • Problem contacting FTS at Triumf. GGUS:80675. The script fetching CRLs had been failing for a while (since March 23). The script was run by hand and the FTS Tomcat was restarted. Problem fixed. Triumf also put in place a more robust fetch-crl script (see the sketch after this report).
      • IN2P3 upgraded FTS to 2.2.8. Things look OK from the ATLAS side
    • T2s
      • Problem exporting calibration data from CERN to IFIC-LCG2: GGUS:80684. Problem fixed restarting SRM.
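
On the TRIUMF fetch-crl item above: a "more robust" setup essentially means running fetch-crl regularly and raising an alarm when it keeps failing, so an expiring CRL cannot go unnoticed for days. A minimal Python sketch of such a wrapper is below; the retry count, back-off and any follow-up action (e.g. restarting the FTS Tomcat) are assumptions for illustration, not TRIUMF's actual script.

```python
# Sketch of a more robust CRL refresh wrapper in the spirit of the TRIUMF fix:
# run fetch-crl and fail loudly if it keeps failing. Retry count, back-off and
# follow-up actions are assumptions; this is not TRIUMF's actual script.
import subprocess
import sys
import time

MAX_ATTEMPTS = 3

def refresh_crls():
    for attempt in range(1, MAX_ATTEMPTS + 1):
        result = subprocess.run(["fetch-crl"], capture_output=True, text=True)
        if result.returncode == 0:
            return True
        print(f"fetch-crl attempt {attempt} failed (rc={result.returncode}): "
              f"{result.stderr.strip()}", file=sys.stderr)
        time.sleep(60)  # back off before retrying
    return False

if __name__ == "__main__":
    if not refresh_crls():
        # A real wrapper would raise an alarm here (and possibly restart services
        # that cache CRLs, such as the FTS Tomcat); this sketch just exits non-zero.
        sys.exit(1)
```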

  • CMS reports -
    • LHC machine
      • Nominal bunches in collisions foreseen for Thu M
    • CERN / central services
      • NTR
    • Tier-0:
      • some CAF hosts (lxbst2807, lxbst2808) have high load average, correlated with high activity, under control so far
    • Tier-1:
    • Tier-2:
      • at sites, business as usual
      • one note on distributed analysis. We had issues with wms003.cnaf.infn.it over the last 7-10 days (Savannah:127174). Brief summary follows. Jobs aborted, gridftp threads busy, uberftp hangs. The maximum number of gridftp connections was changed several times and the service restarted, which did not fix it. wms003 was removed from the WMS@CNAF pool. Investigated offline, found a problem with the pool accounts, which was fixed. Uberftp worked. Then another problem came (the one on March 22, VOMS-related, see earlier WLCG Ops reports). Fixed. All cleaned up again. Confirmed that we now have 2000 pool accounts for CMS, and 500 gridftp connections, on wms003. Confirmed it is working now. Next: protective actions in the queue, suggesting to the CNAF WMS support team to add more machines to the pool to load-balance better in case of issues (these should trigger a quicker exclusion of a problematic machine from the pool, leaving a well-dimensioned working pool in production).

  • LHCb reports -
    • Data re-stripping and user analysis at Tier-1 sites going well
      • Stripping jobs (both 17b & 18) are ~complete at all sites except GridKa and IN2P3
      • The associated merging jobs are ongoing.
    • MC simulation at Tier-2 sites ongoing
    • T0
      • CERN : waiting for implementation ...(GGUS:79685)
    • T1

Sites / Services round table:

  • Michael/BNL - FTS upgrade scheduled for tomorrow.
  • Mette/NDGF - ntr
  • Alexander/NL-T1 - ntr
  • John/RAL - ntr
  • Jhen-Wei/ASGC - ntr
  • Burt/FNAL - FTS monitoring enabled, both servers upgraded.
  • Xavier/KIT - ntr
  • Stefano/CNAF - FTS upgrade scheduled for tomorrow. Site will start draining beforehand
  • Jeremy/GridPP - ntr
  • Kyle/OSG - ntr
  • LucaM/CERN: the problem with SLS reporting has been solved
  • Mike/CERN - VM for FTS monitoring went down and is still inaccessible - looking for a new machine. Thanks to FNAL for monitoring setup.

AOB:

Thursday

Attendance: local(Daniele, Ulrich, Simone, Luca, Ivan, Maarten, Eva, Dirk);remote(Michael/BNL, Mette/NDGF, Xavier/KIT, John/RAL, Lisa/FNAL, Joel/LHCb, Ronald/NL-T1, Kyle/OSG, Rolf/IN2P3 ).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • NTR
    • T1s
      • Tape migration at FZK looks OK. DE cloud back in production for Monte Carlo. Data export from CERN still not reactivated (waiting for one day of MC activity). Need to re-discuss tape monitoring in general (next T1SCM)
      • Transfers from CERN to T1s were failing at a 20% error rate yesterday afternoon (TRANSFER ERROR in TRANSFER PHASE). BNL was the most evident, but the problem existed for many T1s. GGUS:80702. Problem went away by itself.
      • Issues observed in contacting FTS at Triumf (approx 50% failure rate) yesterday afternoon. GGUS:80700. Problem went away by itself.
      • Yesterday afternoon for approx 2 hours (2PM to 4PM), Frontier servers were marked "degraded" in SLS. Exceptions were RAL and CERN. The issue was flagged as a "monitoring problem" since there was no real problem in the Frontier servers per se nor in the ATLAS production using them. Problem went away by itself.
      • Is there a correlation between the three problems above? They happened at the same time and disappeared at the same time without any external action. For example, Hiro provided perfSONAR plots showing degraded ping tests (packet losses) between CERN and many T1s. But there was no problem in perfSONAR between CERN and BNL (while BNL was affected by both the transfer problem and the Frontier problem). Hard to say; opinions or comments?
      • Michael/BNL: investigations at BNL suggest that this was not caused by a drop of the primary connectivity, but that the routing was affected: transfers were failing because of "no route to host". The problem was not apparent in perfSONAR as the perfSONAR machine might be on a different subnet, or only selective routing information had disappeared. Dirk: will involve the CERN networking experts to analyze the problem. (A sketch illustrating the "no route to host" vs. packet-loss distinction follows this report.)
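
To make the distinction discussed above concrete: a routing problem typically surfaces immediately as "No route to host" (EHOSTUNREACH) on connect, whereas plain packet loss tends to show up as timeouts. The small Python probe below illustrates this; the host and port are placeholders, not the endpoints involved in GGUS:80702.

```python
# Distinguish a routing failure ("No route to host") from packet loss (timeouts)
# with a simple TCP connect probe. Host and port are placeholders.
import errno
import socket

def probe(host="srm.example.org", port=8443, timeout=5.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "reachable"
    except socket.timeout:
        return "timeout (consistent with packet loss or filtering)"
    except OSError as exc:
        if exc.errno == errno.EHOSTUNREACH:
            return "no route to host (routing problem)"
        return f"connection failed: {exc}"

if __name__ == "__main__":
    print(probe())
```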

  • CMS reports -
    • LHC machine
      • collisions moved to Friday
      • collimator set up at 4 TeV - bunch length instabilities under study
      • inject and dump work
    • CERN / central services
      • Kerberos availability in SLS dropped to 0% at ~midnight for ~1 hr. No ticket needed.
    • Tier-0:
      • NTR in this meeting
    • Tier-1:
      • CNAF: scheduled FTS upgrade (9:00-11:00). The service will be restarted with a fresh database, so old transfers not completed before 9 AM will not be re-attempted by FTS
    • Tier-2:
      • business as usual

Sites / Services round table:

  • Michael/BNL - reminder: FTS upgrade underway; anticipate a short break to swap the head node to new hardware
  • Mette/NDGF - ntr
  • Xavier/KIT - ntr
  • John/RAL - ntr
  • Lisa/FNAL - ntr
  • Ronald/NL-T1 - ntr
  • Kyle/OSG - ntr
  • Rolf/IN2P3 - ntr

AOB:

Friday

Attendance: local(Simone, Luca, Ivan, Ulrich, Dirk, Maarten);remote(Daniele/CMS, Mette/NDGF, Alexander/NL-T1, Joel/LHCb, Jhen-Wei/ASGC, John/RAL, Rolf/IN2P3, Guenter/GGUS, Andreas/KIT, Lisa/FNAL, Jeremy/GridPP, Lorenzo/CNAF).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • NTR
    • T1s
      • Yesterday afternoon, problem at NIKHEF SRM. GGUS:80733. Promptly fixed with a restart.
      • Channel allocation problem in Lyon FTS was detected yesterday night (11:30PM). Reported with TEAM ticket GGUS:80745. Promptly fixed before 9:00 AM.
      • RAL reported a network problem was fixed yesterday (unresponsive switch rebooted). ATLAS did not notice (at least I did not)
      • Unofficial news is that FZK will upgrade FTS next Tuesday. Confirmed?
        • Andreas/KIT: confirmed (Tue 8:30-10, transparent)
      • FZK back in business for all activities after the tape issues in previous days (T0 exports reactivated this morning)

  • CMS reports -
    • LHC machine / CMS detector
      • Fri M: ramp for RF, Fri MA: nominal bunches in
      • CMS ready for beams
    • CERN / central services
      • CASTOR-SRM_CMS: a few availability drops in SLS reported by the shifters. No evident impact. No tickets.
    • Tier-0:
      • NTR for this meeting
    • Tier-1:
      • IN2P3: 1) still CREAM-CE errors (Savannah:127473); 2) errors ("No space left on device") in inbound transfers (Savannah:127520)
      • ASGC: SRMv2 critical CMS test "org.cms.SRM-VOPut" on srm2.grid.sinica.edu.tw failing since yesterday at ~18:30. It does not seem to be a glitch. Experts are looking at it, one problematic server identified (Savannah:127528 , GGUS:80760)
    • Tier-2:
      • business as usual

  • LHCb reports -
    • Data re-stripping and user analysis at Tier-1 sites going well
      • Stripping jobs (both 17b & 18) are ~complete at all sites except GridKa and IN2P3
      • The associated merging jobs are ongoing.
    • MC simulation at Tier-2 sites ongoing
    • T0
    • T1
      • GRIDKA : Back in production

Sites / Services round table:

  • Mette/NDGF - scheduled downtime Monday at 12:00 for 2h (Norway; affects some ATLAS data)
  • Alexander/NL-T1 - ntr
  • Jhen-Wei/ASGC - ntr
  • John/RAL - 3am-5:30am switch reboot
  • Rolf/IN2P3 - ntr
  • Andreas/KIT - ntr
  • Lisa/FNAL - ntr
  • Lorenzo/CNAF - ntr
  • Kyle/OSG - ntr
  • Jeremy/GridPP - ntr
  • Guenter/GGUS - downtime on Mon 2nd 4-6 UTC due to FZK central router emergency update.
  • Ivan/dashboards - a recent schema upgrade resulted in a missing index and hence slow generation of statistics. The problem is fixed now but the displayed statistics will only catch up in a few hours.

AOB:

-- JamieShiers - 31-Jan-2012
