Week of 121029

Daily WLCG Operations Call details

To join the call at 15.00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Alessandro, Alexey, Felix, Jan, Jarka, Maarten, Maria D, Torre, Ulrich);remote(Dimitri, Elisa, Ian, Lisa, Michael, Miguel, Onno, Rob, Rolf, Tiju).

Experiments round table:

  • ATLAS reports -
    • Central services
      • CERN-PROD: short-lived problem Friday evening with transfer errors to EOS, "hardware problem with the main headnode, and the namespace crashed. We are failing over to the second box." GGUS:87850
        • Jan: the machine's HW was suspect, but the crash was due to a SW error
        • Alessandro: is a failover machine ready?
        • Jan: there are 2 machines, both of which have HW errors to be repaired
      • Discuss FTS bug fix here today
        • Alessandro: since the FTS test instance has been working OK in the last few days, as of 09:30 CET today the new patch is in use on the FTS instances at CERN; we suggest observing the behavior for a few days and discussing the plans for the T1 during the operations coordination meeting on Thu
        • for T1 sites that want to try upgrading already, FTS developer Michail Salichos provided this info after the meeting:
          • the repository details are here
          • only the transfer-fts package must be updated and tomcat restarted (no re-yaim is required)
    • T0/T1
      • IN2P3-CC: SRM failure early this morning, T0 export failing, raised to ALARM after 3hrs but addressed and resolved shortly thereafter by SRM restart. GGUS:87872
      • IN2P3-CC: recurrence of AsyncWait source errors, ticket from 10/14 updated. GGUS:87344
      • FZK: urgent issue with UNAVAILABLE files on ATLASDATADISK needed for high-priority production jobs; recovery efforts continue; the list of lost files will be provided to us when complete. GGUS:87766
      • FZK: DATATAPE remains out of Tier-0 export. Staging errors continued over the weekend. GGUS:87510 GGUS:87526 GGUS:87630
      • BNL: at risk due to Hurricane Sandy; we would be interested in any new information from BNL at this meeting. The ATLAS production and analysis communities have been informed of the risk.
      • TAIWAN-LCG2 DATATAPE: [SRM_FILE_UNAVAILABLE] due to tape mounting hanging, reported 10/25, still awaiting word. GGUS:87796

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • Operations teams are seeing that SNOW tickets at CERN are slow to receive responses, and are wondering whether it is always better to submit GGUS tickets, even when they are not TEAM or ALARM tickets.
        • Maarten: in practice a GGUS ticket has better visibility, in particular when routed directly to CERN, as most of the potentially involved service managers will at least get a notification of the ticket creation; SNOW tickets typically have lower priority and hence may not stand out; furthermore, such tickets typically have to go through one or two levels before reaching the experts
    • Tier-1:
      • Transfer quality problems to KIT (GGUS:87886). The ticket initially was not bridged to GGUS, but it is now.
      • Still tape archiving problems at ASGC
    • Tier-2:
      • NTR

  • LHCb reports -
    • T0:
      • CERN:
        • redundant pilots (GGUS:87448): discussions are still ongoing, not yet understood.
        • 4 files with bad checksum on EOS (GGUS:87702) Ongoing.
    • T1:
      • GRIDKA: expired proxy in FTS system (GGUS:87367) can be closed.

Sites / Services round table:

  • ASGC
    • CASTOR-ATLAS issue being worked on
    • CMS tape situation should have improved by now
  • BNL
    • Hurricane Sandy: for now we will keep the Tier-1 facility running, but a shutdown on short notice may be needed if the main power is lost; we will see in the next 12h
  • FNAL
    • FTS has been upgraded with the first expired proxy patch
  • IN2P3 - ntr
  • KIT
    • the tape and disk issues are being worked on
  • NDGF
    • a few head nodes were updated OK
  • NLT1 - ntr
  • OSG - ntr
  • RAL - ntr

  • dashboards - ntr
  • GGUS/SNOW
    • File ggus-tickets.xls is up-to-date and attached to page WLCGOperationsMeetings. There is a net drop of ALARMs compared to past months (none last week). All test tickets following the last GGUS Release of 2012/10/24 were analysed and found successful. Details in Savannah:132626#comment6
  • grid services - ntr
  • storage
    • EOS-CMS had 3 automatic SRM restarts during the weekend, being followed up with the BeStMan developers
    • EOS-ATLAS update Wed morning?
      • Alessandro: we will let you know tomorrow
    • CASTOR: transparent update of central back-end services (authZ, tape scheduler) tomorrow, hot fix being replaced with official rpms

AOB:

Tuesday

Attendance: local(Alessandro, Alexey, Eva, Felix, Jan, Jarka, Maarten, Oliver, Ulrich);remote(Elisa, Lisa, Miguel, Rolf, Tiju, Xavier).

Experiments round table:

  • ATLAS reports -
    • ATLAS general info
    • Tier0/1s
      • IN2P3-CC discussed with ATLAS the possibility of a 3-day downtime ("on 10-12 December 2012 a global shutdown of IN2P3-CC will take place due to an intervention on the main power"). From the ATLAS point of view this Tier-1 downtime is OK, provided that not many other T1s have scheduled downtimes at the same time. If other T1s need downtime in the same period, those downtime schedules need to be coordinated (in the WLCG context, rather than by ATLAS).
        • Rolf: the computer center will be connected to a second power network; the intervention has been planned for about a year and the dates are now fixed; there will be mostly electrical work, but critical interventions on services could also be carried out, if needed
      • FZK-LCG2 re-included in Tier-0 data export after an email from German Cloud Support explaining that the slow tape stage-in would not be affected by the T0 export (i.e. migration to tape). No problems observed so far.
    • FTS patch deployed on the CERN FTS T0 and T2 servers. Good news: submission errors have decreased drastically. Still checking further.

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • NTR
    • Tier-1:
      • Tape archiving problems at ASGC: dark tape has been found and will be added quickly to the robot
      • Transfer problems from CERN to ASGC:
        • the proxy of the primary contact is not able to submit FTS transfers, while his colleague's proxy works
        • both recently needed to update their user office registration and experiment affiliation
        • the primary contact's proxy is unable to submit FTS transfers both from the local VOBOX at ASGC and from lxplus
        • SNOW ticket INC:182880
      • ASGC 10 Gb uplink is unavailable due to a cable in the US damaged by the hurricane
      • Rolf: GGUS:87708 waiting on a reply from CMS
    • Tier-2:
      • NTR

  • LHCb reports -
    • T0:
      • CERN:
        • redundant pilots (GGUS:87448). Still ongoing.
        • 4 files with bad checksum on EOS (GGUS:87702) Ongoing.
    • T1:
      • NTR

Sites / Services round table:

  • ASGC
    • the 10 Gbit uplink to the US and Amsterdam is down; the backup link is being used, but only has 2.5 Gbit/s
  • BNL
    • Michael, after the meeting: facility services at BNL remained intact during the storm, which has weakened over the course of the night and is moving slowly inland. The risk of having to shut down the facility is significantly reduced by now, though there is a slight chance that we lose power, given all the restoration work that is currently ongoing island-wide (1 million households on Long Island are currently without power).
  • FNAL - ntr
  • IN2P3 - nta
  • KIT
    • during the weekend 1 tape with ATLAS data got stuck in a drive; today it was recovered by a technician and is again serving ATLAS read requests
      • Alessandro: this does not solve the general read slowness we have seen since early October, right?
      • Xavier: no, but that problem may have been masked by the stuck tape
  • NDGF
    • there is a problem with the primary network link to CERN; using the backup link and looking into it
      • Alessandro: I do not see the backup link in the LHCOPN dashboard, will look into that
  • RAL
    • this morning CASTOR-ALICE was upgraded OK

  • dashboards - ntr
  • databases - ntr
  • GGUS/SNOW
    • Experiments/sites with open GGUS tickets that are not receiving adequate support: please submit them to MariaDZ for presentation at the WLCG Ops Coord. meeting this Thursday, 2012/11/01.
  • grid services
    • EMI-2 WN upgrade being prepared for deployment in preprod tomorrow (~10% of the farm)
  • storage
    • EOS-ALICE suffered a few crashes due to tests with a new version of the ALICE token
      • Maarten: the people involved have been informed, those requests were sent by accident
    • CASTOR-CMS: high activity on t0export, t1transfer and tape; access to tape has been throttled
    • CASTOR: central back-end services have been upgraded transparently as foreseen
    • EOS-ATLAS: update tomorrow morning OK?
      • Alessandro: OK

AOB:

Wednesday

Attendance: local(Alessandro, Alexey, Felix, Jan, Jarka, Luca C, Maarten, Maria D, Ulrich);remote(Burt, Elisa, Marc, Michael, Miguel, Oliver, Pavel, Rob, Ron, Stefano, Tiju).

Experiments round table:

  • ATLAS reports -
    • ATLAS general info
      • Reprocessing campaign: still some minor fixes ongoing. More news tomorrow.
    • Tier0/1s
      • INFN-T1 testing EMI-2: ATLAS no longer sees problems, but we would like to know what was wrong (and then fixed) GGUS:87346
        • Maarten: libaio (32- and/or 64-bit) was absent; it is a standard SL5 component, i.e. not coming from EMI-2 (see the check sketch below)
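        A minimal check for that diagnosis, added here as an illustration only (it assumes an RPM-based SL5 worker node and queries both architectures of libaio):

            import subprocess

            # 'rpm -q' exits non-zero when the queried package is not installed,
            # so a non-zero return code flags the missing libaio flavour.
            for arch in ("i386", "x86_64"):
                pkg = "libaio.%s" % arch
                missing = subprocess.call(["rpm", "-q", pkg]) != 0
                print(pkg, "MISSING" if missing else "installed")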

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • high tape-writing load not understood; we checked but cannot explain the situation; INC:184679 to follow up
        • perhaps staging of user files to T1TRANSFER for offsite transfers causes the high load?
        • Jan: we see an overlap of various high load patterns and would like CMS to verify which ones are expected
        • Oliver: will follow up further
    • Tier-1:
      • Transfer problems from CERN to ASGC:
        • the proxy of the primary contact is not able to submit FTS transfers, while his colleague's proxy works
        • both recently needed to update their user office registration and experiment affiliation
        • the primary contact's proxy is unable to submit FTS transfers both from the local VOBOX at ASGC and from lxplus
        • INC:182880
        • Maarten: this problem can affect any FTS instance, see FTS info below
      • ASGC 10 Gb uplink is unavailable due to a cable in the US damaged by the hurricane
      • IN2P3 asked for an answer to ticket GGUS:87708; we are looking into the problem
    • Tier-2:
      • NTR
    • Maria: can we soon discuss the Savannah-GGUS bridge with the GGUS developers?
    • Oliver: will follow up offline

  • LHCb reports -
    • T0:
      • CERN:
        • 4 files with bad checksum on EOS (GGUS:87702) Ongoing.
          • Jan: those files are lost, the ticket can be closed; we are trying to reproduce the problem on the preprod system
          • Elisa: what about the overwrite flag discrepancy?
          • Jan: not yet understood, hopefully the tests will reveal what happened
    • T1:
      • NTR

Sites / Services round table:

  • ASGC
    • the 10 Gbit link has been back since this morning
    • the ASGC-CNAF throughput was found to have been low for a long time; we will contact the CNAF network experts for a joint investigation
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3
    • will not connect Thu and Fri because of a bank holiday
  • KIT
    • cream-{4,5}-kit.gridka.de will be drained starting Mon Nov 5 for upgrade to EMI-2; meanwhile the new CEs cream-{1,2}-kit.gridka.de can be used instead
      • Maarten: the former are the only ones supporting ALICE, please look into that
      • Pavel (after the meeting): confirmed, we will fix that
        • done
    • will not connect Thu because of a bank holiday
  • NDGF
    • the problem with the primary link to CERN has been fixed
  • NLT1 - ntr
  • OSG
    • there is an issue with certificates from our new DigiCert CA affecting ATLAS; we need to ensure this gets resolved in time
      • Maarten: all people involved seem to be in the loop already?
      • Michael: John Hover is driving this from the ATLAS side
  • RAL - ntr

  • dashboards - ntr
  • GGUS/SNOW - ntr
  • grid services
    • EMI-2 WN has been deployed on preprod (~10% of the farm) this morning
  • storage
    • EOS-ATLAS: the upgrade took 2.5 h instead of 2 h and we could not switch to the new HW; we will need to do that at the next opportunity
    • FTS T2 import errors in GGUS:87927 appear to be due to issues at the source (McGill, Canada)
      • Alessandro: confirmed, will follow up

AOB:

FTS update

FTS developer Michail Salichos provided the following details after the meeting:

  • a new transfer-fts rpm is available in this repository
  • only transfer-fts is to be updated
  • tomcat then needs to be restarted
  • automatic restarts of tomcat should be disabled (such service interruptions are no longer needed and should be avoided)
  • the new rpm contains fixes for:
    • submitter's proxy expiration (GGUS:81844)
    • jobs wrongly allocated to different VOs (GGUS:87929)
    • VOMS attrs can't be read from CRLs, delegation ID can't be generated (GGUS:87975, GGUS:86775)
    • mkgridmap cron job has old gLite paths, thereby preventing the addition of new VO members to the submit-mapfile

How to update transfer-fts package

Downgrading transfer-fts package instructions
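
As a rough illustration only (not an official recipe; everything except the transfer-fts package name is an assumption based on a typical SL5 gLite FTS node), the update steps above could be scripted along these lines:

    import subprocess

    def run(cmd):
        # Echo and execute a command, aborting on the first failure.
        print("+", " ".join(cmd))
        subprocess.check_call(cmd)

    # 1. Update only the transfer-fts package; no re-yaim is required.
    run(["yum", "-y", "update", "transfer-fts"])

    # 2. Restart tomcat so the patched code is picked up
    #    ("tomcat5" is the usual servlet container on SL5, assumed here).
    run(["service", "tomcat5", "restart"])

    # 3. Disable any cron-driven automatic tomcat restarts, as such service
    #    interruptions are no longer needed (the cron file name is illustrative).
    run(["rm", "-f", "/etc/cron.d/restart-tomcat"])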

Thursday

Attendance: local(Alessandro, Alexey, Felix, Jan, Maarten, Mike, Oliver, Stefan, Ulrich);remote(Gareth, Jeff, Lisa, Michael, Miguel, Rob, Ronald).

Experiments round table:

  • ATLAS reports -
    • ATLAS general info
      • The reprocessing campaign is starting in the next few hours.
    • Tier0/1s
      • INFN-T1 testing EMI-2: reported yesterday, still waiting for an answer to GGUS:87346
      • FZK-LCG2 staging issue GGUS:88008
  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • high tape-writing load not understood; we checked but cannot explain the situation; INC:184679 to follow up
        • still working on this
        • Oliver: the problem looks understood now
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR

  • ALICE reports -
    • On Nov 1 between 00:00 and 01:00 local time, job submissions were found failing on multiple (possibly all) gLite 3.2 VOBOX nodes at various sites, with the CREAM client complaining that the proxy had supposedly expired even though it was still valid. This happened for different DNs from different CAs. GGUS:87997 opened for the CREAM developers; see the sketch below for an independent check of the proxy lifetime.
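
    A minimal diagnostic sketch, added here as an illustration (it assumes the proxy sits at $X509_USER_PROXY or the usual /tmp/x509up_u<uid> path and that the Python 'cryptography' package is available), to check the remaining proxy lifetime independently of the CREAM client:

        import os
        from datetime import datetime, timezone
        from cryptography import x509

        # Locate the proxy the way grid clients usually do.
        proxy_path = os.environ.get("X509_USER_PROXY", "/tmp/x509up_u%d" % os.getuid())

        with open(proxy_path, "rb") as f:
            pem = f.read()

        # The first PEM certificate in the file is the proxy certificate itself.
        cert = x509.load_pem_x509_certificate(pem)
        expiry = cert.not_valid_after.replace(tzinfo=timezone.utc)

        print("proxy subject:", cert.subject.rfc4514_string())
        print("valid until  :", expiry)
        print("time left    :", expiry - datetime.now(timezone.utc))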

  • LHCb reports -
    • T0:
      • CERN:
        • 4 files with bad checksum on EOS (GGUS:87702) Closed (as we said yesterday)
          • Jan: we are still trying to reproduce the problem, in particular the overwrite flag discrepancy
        • redundant pilots (GGUS:87448). Still verifying if the problem is solved
        • 252 FTS jobs in the system that some time ago landed in the STAR-STAR channel as ATLAS jobs (GGUS:87686): still to be fixed on the LHCb side.
          • Alessandro: ATLAS did not see this problem, while for CMS it has not recurred so far since the patch was applied
          • Stefan: also for LHCb no new errors were seen
    • T1:
      • SARA: (GGUS:87975) yesterday FTS submissions to SARA failed, promptly fixed by the site
    • Stefan: the reprocessing activity is ramping down for a few days, but should go up again next week

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • FNAL - ntr
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • RAL - ntr

  • dashboards - ntr
  • grid services - ntr
  • storage
    • EOS-CMS: SRM was stuck for 2h this morning
    • EOS-LHCb: SRM was stuck for 20 min this morning
    • EOS-CMS: update next Wed?
      • Oliver: will follow up
    • CASTOR-LHCb: 3h intervention needed to move DB to new HW
      • Stefan: will follow up
    • CASTOR-CMS: one file with a size of 1.4 TB (sic) is causing issues in various areas; we intend to enforce a cap of 500 GB on the file size, with a recommendation to stay below ~200 GB; the problematic file is an archive that should be repacked into a number of smaller files (see the sketch below)
      • Oliver: will follow up
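
    Purely as an illustration of the limits mentioned above (this is not CASTOR tooling; the staging directory path and the thresholds as constants are assumptions), a small script could flag oversized files before they are sent to CASTOR:

        import os

        HARD_CAP = 500 * 1000**3      # 500 GB cap to be enforced on the file size
        RECOMMENDED = 200 * 1000**3   # recommendation to stay below ~200 GB

        def check_sizes(top="/data/staging"):
            # Walk a local staging area and report files above the limits.
            for dirpath, _, names in os.walk(top):
                for name in names:
                    path = os.path.join(dirpath, name)
                    size = os.path.getsize(path)
                    if size > HARD_CAP:
                        print("REJECT  %s: %.2f TB exceeds the 500 GB cap" % (path, size / 1000**4))
                    elif size > RECOMMENDED:
                        print("WARNING %s: %.0f GB above the ~200 GB guideline" % (path, size / 1000**3))

        if __name__ == "__main__":
            check_sizes()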

AOB:

Friday

Attendance: local(Alexey, Eva, Felix, Jan, Maarten, Mike, Oliver, Ulrich);remote(Burt, Elisa, Gareth, Kyle, Lisa, Onno, Stefano, Ulf, Xavier).

Experiments round table:

  • ATLAS reports -
    • ATLAS general info
      • Bulk reprocessing campaign was started.
    • Tier0/1s
      • CERN-PROD ALARM GGUS:88064: inaccessible files on CASTOR/T0ATLAS; one disk server has a controller problem
        • Jan: the machine is expected back on Tue, the RAID rebuild will continue through the weekend and then the file system needs to be checked on Mon
        • Alexey: any data loss expected?
        • Jan: probably not, we have seen such problems without data loss
      • INFN-T1 testing EMI-2: still waiting for an answer to GGUS:87346
    • CENTRAL SERVICES
      • ATLAS TEAM members from the UK were no longer able to edit/create GGUS TEAM tickets; solved. GGUS:88020
    • Mike: the SUM dashboard shows FZK tests failing for the past day; Alessandro will look into it

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • Cream CE problems noticed in HammerCloud tests for CERN: GGUS:87987 & GGUS:88042
        • Error messages like:
          • Transfer to CREAM failed due to exception: CREAM Register raised std::exception The endpoint is blacklisted
          • Transfer to CREAM failed due to exception: CREAM Start failed due to error [the job has a status not compatible with the JOB_START command!]
      • 1.4 TB file destined to be written to tape will be removed, GGUS:87940
      • RAW files vanished from the T0EXPORT pool before the prompt reconstruction jobs that access them could run. There are 48 hours between the RAW files being written to CASTOR and the prompt-reco jobs starting. CASTOR made the files available on the T0EXPORT pool very quickly after the ticket yesterday in the late afternoon; we have not seen missing files again since 4 AM. The cause of the missing files is under investigation. GGUS:88014
        • Jan: the cause was again the presence of the 1.4 TB file!
      • EOS:
        • Problems copying files from CASTOR to EOS; known problem: the xrootd process crashes and transfers go into an illegal state; older ticket INC:179785 reused
          • Jan: only copying from t0export was affected; an rpm update that should fix the problem was deployed this morning
        • Upgrade on Wednesday is OK; for how many hours will EOS be unavailable? Is the problem above related to the update?
          • Jan: for that upgrade EOS will be unavailable for ~2h; start at 10:00?
          • Oliver: OK
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR

  • ALICE reports -
    • KIT: the EMI-2 CREAM CEs did not work for ALICE due to the absence of the workaround for a known issue (WN proxy lacks SSL client purpose), fixed this morning (GGUS:88021)

  • LHCb reports -
    • T0:
      • CERN:
        • redundant pilots (GGUS:87448): ongoing
          • Ulrich: the cause is not yet understood, being looked into also by the developers
    • T1:
      • CNAF: yesterday some staging requests that were stuck have been cleaned (promptly fixed by the site, no need to open a ticket!)
    • Elisa: KIT tape staging performance discussion Mon at 10:00?
      • Xavier: that is the plan
    • Jan: CASTOR-LHCb intervention (3h downtime) OK on Tue afternoon?
      • Elisa: OK

Sites / Services round table:

  • ASGC
    • 1 file system on a CMS pool failed this morning, fixed
    • Tue Nov 6 01:00-05:00 UTC there will be a core router network intervention affecting Tier-1 operations
  • BNL - ntr
  • CNAF - ntr
  • FNAL
    • FTS rpm update foreseen for next week, but we are waiting for documentation describing the rollback procedure, as agreed during yesterday's operations coordination meeting
      • the recipe was sent to the fts-users list
  • KIT
    • restoring ATLAS data from a broken disk: the list of lost files has already been sent; the rest is OK now
  • NDGF
    • 1 site has a slow tape pool affecting the staging of ATLAS data; the trouble seems to be with Fibre Channel connectivity; on Mon they will have a short downtime to relocate the pools and/or fix the FC
  • NLT1
    • 1 dCache pool node was down for 1h this morning to replace a broken CPU
    • there is an issue with the deletion of files from tape, looking into it
  • OSG - ntr
  • RAL
    • 2 ATLAS disk servers unavailable, hopefully back tomorrow morning
    • SAM SRM tests failed between midnight and 03:00 UTC, also elsewhere in the UK and in other regions
      • Maarten: will inform SAM experts; could have been an issue affecting the SAM UI used for ATLAS tests

  • dashboards - ntr
  • databases - ntr
  • grid services
    • CREAM ce203 was overloaded for a while due to 1 CMS user; this might explain some of the trouble reported by CMS
  • storage
    • CMS tape recall response time has grown to 4d due to large write activity having precedence over reads
      • Oliver: is the migration to EOS involved?
      • Jan: no, that is disk to disk

AOB:

-- JamieShiers - 18-Sep-2012
