Week of 100510

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs & Broadcasts: WLCG Service Incident Reports, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Ueda, Harry, MariaG, Jamie, Eva, Patricia, Nilo, Maarten, Jean-Philippe, Nicolo, Carlos, Roberto, Gavin, Alessandro, Julia, MariaDZ, Dirk);remote(Jon/FNAL, Andrea/CNAF, Xavier/KIT, Gang/ASGC, Michael/BNL, Vera/NDGF, Rolf/IN2P3, Rob/OSG, John/RAL).

Experiments round table:

  • ATLAS reports - (Ueda)
    • SARA-MATRIX -- FTS errors (GGUS:58082)
    • NDGF-T1 -- Files on tape unavailable for several days (staying NEARLINE) (GGUS:57635)
    • SARA-MATRIX -- Files on tape unavailable for several days (UNAVAILABLE) (GGUS:58071)

  • Vera: is the ticket for NDGF a new ticket? Ueda: yes. Vera will follow up when online again.

  • CMS reports - (Nicolo)
    • T0 Highlights:
      • 1000 files (3 TB) waiting for migration to tape on CASTORCMS T0EXPORT for up to 36 hours, GGUS:58099 TEAM
      • Long-standing issue with JobRobot tests failing with Maradona errors at CERN solved.
    • T1 Highlights:
      • Running tails of MC reprocessing
        • Last 2 jobs resubmitted and completed at CNAF
      • ReReco train workflows running
        • Some job failures at FNAL (3 GB memory limit hit, or waiting for dcap read errors) Elog #2257
      • Skimming running
        • 35 custodial input files not accessible at CNAF Savannah #114343
      • Backfill tests running
      • KIT - some SRM SAM test timeouts during weekend. Savannah #114342
      • ASGC - 7 TB waiting for migration to tape Savannah #114353
      • IN2P3->CNAF: some transfer timeouts during weekend, Savannah #114320
    • Detailed report on ticket progress:
      • [ OPEN ] T0_CH_CERN - Files on archived tape at T0 - Remedy #CT663457, and Savannah #112927 Update 8/3: files recovered from other sites when possible, but other files only have a copy on the damaged tape.
      • [ CLOSED ] T0_CH_CERN - Remedy #653289 - CERN has a high fraction of aborted JobRobot jobs with the Maradona error. Update 10/3: trying two additional options in the LSF configuration, following advice from Platform Computing. Update 17/3: an additional issue was identified, but the changes did not solve the problem - contact with Platform continues. Update 19/3: efficiency improved. Update 23/3: efficiency back to 80%, and today's values are still at 80%. Update 24/3: failures look correlated to one particular CE. Update 26/4: situation much improved in the last few weeks. Update 10/5: closing ticket.
      • [ OPEN ] T0_CH_CERN - Investigation on low level of "source error during transfer preparation phase" from T0 Remedy #CT678223
      • [ OPEN ] T1_ES_PIC - ongoing investigation on network issues to T2_US_Caltech Savannah #113493.
      • [ CLOSED ] T1_IT_CNAF - verification on 2 unreadable files Savannah #114202 - update 7/5: deleted and invalidated.
      • [ OPEN ] T1_IT_CNAF - 12 files lost from StoRM before migration to tape Savannah #114130
      • [ CLOSED ] T1_IT_CNAF - 2 jobs running for 3 days at CNAF with 0 CPU time used, site contact notified Savannah #114311 - update 10/5: jobs succeeded on resubmission.
      • [ CLOSED ] T1_IT_CNAF - cleanup of failed production Savannah #114296
      • [ OPEN ] T1_UK_RAL - some files waiting for tape migration for a long time (related to low volume of files to migrate) Savannah #114278
      • [ OPEN ] T1_DE_KIT - some SRM SAM test timeouts during weekend. Savannah #114342
      • [ OPEN ] T1_TW_ASGC - 7 TB waiting for migration to tape Savannah #114353
      • [ CLOSED ] T1_FR_CCIN2P3 to T1_IT_CNAF - some transfer timeouts during weekend, Savannah #114320
    • T2 Highlights:
      • Running MC production
        • More jobs stuck in "Submitted" state in WMS.
        • T2_AT_Vienna back to pledged number of batch slots.
        • Upload of MC from T2_RU_SINP to T1_US_FNAL failing with permission errors.
      • T2_EE_Estonia - GlueServiceUniqueID tag for new SRM endpoint not propagating from BalticGrid BDII to top-level WLCG BDII GGUS:58016, preventing use of the new SRM for production transfers with FTS (see the BDII query sketch after this report).
      • T2_UA_KIPT - SAM/JobRobot errors.
      • T2_TW_Taiwan - SAM CE/SRM errors.
      • T2_RU_JINR - site receiving excessive amount of CMS glidein pilot jobs, under investigation Savannah #114329 and GGUS:58056
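  • Note on the BDII propagation check mentioned in the CMS T2 report: querying the regional and the top-level BDII directly shows at which step the GlueServiceUniqueID publication stops. The sketch below is illustrative only - it assumes the OpenLDAP client (ldapsearch) is available, the usual BDII port 2170 and base DN "Mds-Vo-name=local,o=grid", and uses placeholder host/endpoint names rather than the actual BalticGrid BDII or Estonian SRM.

```python
#!/usr/bin/env python
"""Minimal sketch: check whether a GlueServiceUniqueID is published by a BDII.

Assumptions (not taken from the minutes): ldapsearch is installed, the BDII
listens on port 2170 with base DN "Mds-Vo-name=local,o=grid", and the host
and endpoint names below are placeholders for illustration.
"""
import subprocess


def bdii_publishes(bdii_host, service_id, base="Mds-Vo-name=local,o=grid"):
    """Return True if the given BDII returns an entry for this service ID."""
    cmd = [
        "ldapsearch", "-x", "-LLL",
        "-H", "ldap://%s:2170" % bdii_host,
        "-b", base,
        "(GlueServiceUniqueID=%s)" % service_id,
        "GlueServiceUniqueID",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    return "GlueServiceUniqueID:" in result.stdout


if __name__ == "__main__":
    # Hypothetical endpoint name, only for illustration.
    srm = "httpg://srm.example-t2.ee:8443/srm/managerv2"
    for host in ("regional-bdii.example.org", "lcg-bdii.cern.ch"):  # regional vs top-level (names assumed)
        state = "publishes" if bdii_publishes(host, srm) else "does NOT publish"
        print("%s %s %s" % (host, state, srm))
```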

  • Andrea/CNAF: working on CNAF ticket - will report back later today.

  • ALICE reports - (Patricia)
    • GENERAL INFORMATION: Pass1 reconstruction + 1 Train analysis + 2 MC runs during the weekend. Large number of jobs running in the Grid during the weekend
    • T0 site
      • Good behavior of the 2 backends for Pass1 reconstruction purposes
      • An alarm was triggered at voalicefs01, 02, 03 and 04 (high_load). The persons responsible for the nodes were warned and the issue was solved immediately
      • CAF nodes presenting problems during the last week (lxfssl3311, lxfssl3310, lxfssl3312, lxfssl3313, lxfssl3315, lxfssl3317, lxbsq1415) have been recovered during the weekend. Two have been put in maintenance for Alice operations (3311 and 3317)
    • T1 sites
      • Good behavior of these sites during the weekend
    • T2 sites
      • KFKI: manual operations were required this morning at the site to restart all services (the AliEn user proxy was corrupted on the VOBOX)
      • Kolkata: grid.tier2-kol.res.in still unreachable. GGUS:58094

  • LHCb reports - (Roberto)
    • GGUS (or RT) tickets:
      • T0: 2
      • T1: 1
      • T2: 2
    • Issues at the sites and services
      • T0 site issues:
        • On Friday it was reported that CERN-SARA FTS transfers were stuck in Active status (GGUS:58048).
        • LFC-RO issue (apparently not because of exhausted threads this time, GGUS:58067).
      • T1 site issues:
        • pic: DIRAC services at pic are down again; the MySQL database was down. (ALARM ticket GGUS:58063 raised at 04:00 UTC)
      • T2 sites issues:
        • Shared area issue at INFN-TO and BG01-IPP

  • Roberto: 2.7 TB collected but possibly affected by a problem with one subdetector. Reco jobs started only today, after the DIRAC restart this morning.
  • Gavin: will check FTS issues.
  • Jean-Philippe: has not yet received the LFC ticket from Saturday. Maarten: the procedure should be streamlined.

Sites / Services round table:

  • FNAL - ntr
  • CNAF - back from short maintenance
  • KIT - ntr
  • ASGC - internal network problem last Saturday afternoon: WNs and CEs lost contact. The problem is now fixed but the reasons are not yet clear. The SAM / SRM errors reported were due to one faulty disk server; it has recovered and the ticket is closed. The CMS migration issue is currently being looked at.
  • BNL - ntr
  • NDGF - ntr
  • OSG - one (redundant) BDII will go down - expected to be transparent. Ticket from Wednesday concerning BNL (GGUS:57986).
  • RAL - ntr
  • NL-T1 - SARA is currently in scheduled downtime. Will review GGUS tickets once back.
    • Alexei: downtime schedule? Onno: till 6pm local time.
  • IN2P3 - ntr

  • CERN/Eva: On Saturday, high load on the ATLAS DB cluster was caused by a user application, followed by a node reboot due to a misconfiguration. Plan to apply security patches to several integration clusters and to move the CMS integration cluster to new hardware.
  • CERN/Carlos: new hardware and software for the core router - should be transparent.
  • CERN/Gavin: the MyProxy removal was signed off by the experiments on Monday. Maarten: additional users have now been found (SAM and Nagios tests) and the person responsible is on vacation. Gavin: will take this into account. Also a Linux upgrade of the CERN VOBOXes to SLC 5.5.

AOB:

Tuesday:

Attendance: local(Simone, Eva, Jean-Philippe, Stephane, Maarten, Jamie, MariaG, Gavin, Roberto, Patricia, Ignacio, Harry, Nicolo, Alessandro, Andrea, Steve, MariaDZ, Dirk);remote(Jon/FNAL, Michael/BNL, Paco/NL-T1, Gonzalo/PIC, Xavier/KIT, Andrea/CNAF, IN2P3/Rolf, John/RAL, Vera/NDGF, Rob/OSG) .

Experiments round table:

  • ATLAS reports - (Simone)
    • No beam until 16:00. Data taking will be resumed at 18:00.
    • MC AOD distribution within clouds is in progress (except for the US); dataset consolidation at the Tier-1s is progressing well. CERN and many Tier-1s already have 1000+ replicas.
    • gridftp error this morning (CERN->FZK), GGUS:58123 - see the connectivity-check sketch after this report
      • Error text: "Unable to connect to lxfsrc2906.cern.ch:20187; globus_xio: System error in connect: Connection refused; globus_xio: A system call failed: Connection refused"
    • voatlas75 is down (ATLAS nightly build is affected)
    • Preparation for the May reprocessing: site pre-validation can be started today.
    • vo.racf.bnl.gov is an ATLAS VOMS server (question from the SARA folks)
      • From DeSalvo: it is an official ATLAS VOMS replica, and an official announcement has already been sent via an EGEE broadcast.
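  • Note on the gridftp error above: the "Connection refused" on port 20187 points at the data port on the disk server rather than at the FTS channel, so a plain TCP probe of the control port and the failing data port is often enough to tell whether the server process or its firewall/port-range configuration is the culprit. The sketch below is a stand-alone illustration, not part of GGUS:58123.

```python
#!/usr/bin/env python
"""Minimal sketch: probe the ports behind a gridftp "Connection refused".

The host and data port are taken from the error text above; port 2811 is the
standard GridFTP control port. This is a generic reachability check, not a
reproduction of the actual transfer.
"""
import socket


def probe(host, port, timeout=5.0):
    """Try a plain TCP connect and describe the outcome."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "open"
    except socket.timeout:
        return "filtered (no answer within %.0f s)" % timeout
    except OSError as err:
        return "closed: %s" % err  # e.g. [Errno 111] Connection refused
    finally:
        s.close()


if __name__ == "__main__":
    host = "lxfsrc2906.cern.ch"
    for port in (2811, 20187):  # control port and the failing data port from the error
        print("%s:%d -> %s" % (host, port, probe(host, port)))
```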

  • CMS reports - (Nicolo)
    • T0 Highlights:
      • Large migration queue on CASTORCMS T0EXPORT, GGUS:58099 TEAM - mostly streamer files. Up to 150k files yesterday, decreased back to 50k this morning, now stable at 50k.
      • 2 files not migrating to tape on CASTORCMS T1TRANSFER - files are stuck in STAGEOUT state after the FTS transfer failed with a gridftp timeout, but have correct size/checksum. Remedy #683052.
    • T1 Highlights:
      • ReReco train workflows completing.
      • Skimming running
        • 35 custodial input files not accessible at CNAF Savannah #114343
      • Backfill tests running
      • ASGC - 7 TB waiting for migration to tape Savannah #114353 - migration hunter restarted, closed.
    • Detailed report on ticket progress:
      • [ OPEN ] T0_CH_CERN - Files on archived tape at T0 - Remedy #CT663457, and Savannah #112927 Update 8/3: files recovered from other sites when possible, but other files only have a copy on the damaged tape.
      • [ OPEN ] T0_CH_CERN - Investigation on low level of "source error during transfer preparation phase" from T0 Remedy #CT678223
      • [ OPEN ] T0_CH_CERN - Large backlog in tape migration on CASTORCMS T0EXPORT, GGUS:58099 TEAM
      • [ OPEN ] T0_CH_CERN - 2 files stuck in tape migration in STAGEOUT state on CASTORCMS T1TRANSFER - Remedy #683052
      • [ OPEN ] T1_ES_PIC - ongoing investigation on network issues to T2_US_Caltech Savannah #113493.
      • [ OPEN ] T1_IT_CNAF - 12 files lost from StoRM before migration to tape Savannah #114130
      • [ OPEN ] T1_IT_CNAF - 35 custodial input files not accessible at CNAF Savannah #114343
      • [ CLOSED ] T1_UK_RAL - some files waiting for tape migration for a long time (related to low volume of files to migrate) Savannah #114278
      • [ OPEN ] T1_DE_KIT - some SRM SAM test timeouts during weekend. Savannah #114342 - Update 11/5: still occasional instability in CE tests (Maradona) and SRM tests (timeouts)
      • [ CLOSED ] T1_TW_ASGC - 7 TB waiting for migration to tape Savannah #114353
    • T2 Highlights:
      • Running MC production
        • Upload of MC from T2_RU_SINP to T1_US_FNAL failing with permission errors.
      • T2_UA_KIPT - SAM/JobRobot errors.
      • T2_RU_JINR - site receiving excessive amount of CMS glidein pilot jobs, under investigation Savannah #114329 and GGUS:58056
      • T2_RU_RRC_KI - SAM CE errors
      • Analysis jobs failing with input dataset issues at T2_IT_Pisa and T2_UK_London_Brunel

  • ALICE reports - (Patricia)
    • T0 site
      • Yesterday evening a new CREAM-CE was announced (ce203, on SL5). The idea is to migrate all available CREAM-CEs at the T0 to SL5 as soon as this new service shows good performance in production. Since this morning ALICE has included the service in the Pass1 reconstruction partition together with the previous CREAM-CE
      • GGUS:58133 for ce201.cern.ch: authentication problems at submission time
    • T1 sites
      • RAL: cleanup operations were required by the site admin on one of the ALICE VOBOXes yesterday afternoon. In addition, the site has announced the setup of a 2nd CREAM-CE that will shortly enter production.
    • T2 site
      • Minor issues observed at some Russian T2 sites, already communicated to the responsible persons: the local batch system at RRC-KI seems to have problems accepting new ALICE requirements, while production at ITEP was stopped due to an obsolete credential of the AliEn user at the site.
      • Cagliari: the available CRL has expired, causing authentication problems when logging into the local VOBOX. GGUS:58136

  • LHCb reports -
    • Experiment activities:
      • Most of the data collected from the long fill this weekend will be marked as BAD. Some MC productions are running without problems (less than 1% failure rate). Furthermore, the Reco-2 activity, currently at about 2.5K concurrent running jobs, presents no special issue to report. Interestingly, the integrated throughput yesterday (with real data) reached 1 GB/s (RAW data export + DST replication) - impressive considering the small size of the reconstructed (and stripped) DSTs.

    • GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0 site issues: none
      • T1 site issues:
        • SARA-NIKHEF: the SE was down yesterday, which caused a backlog that had not yet fully recovered this morning.
        • pic: accounting system down (application level; not much the pic site admins can do).
      • T2 sites issues: none

Sites / Services round table:

  • FNAL- ntr
  • BNL - ntr
  • NL-T1 - moved the dCache SRM, namespace and BDII to new hardware without problems. Ticket GGUS:58119: a user-copied file could not be accessed due to incomplete registration (despite a success notification). Fixed by the site in Chimera; the dCache dev team has been contacted about the issue.
  • PIC - LHCb accounting issues: an expert on site is now working with the experiment - hopefully recovered this afternoon.
  • KIT - ntr
  • NDGF - following up on yesterday's ticket.
  • IN2P3 - ntr
  • RAL - ntr
  • CNAF - working on the CMS ticket - will update as soon as a solution has been found.
  • OSG - ntr

  • Ignacio/CERN: HTAR firewall reconfiguration planned for tomorrow. The large number of disk servers has filled up the available space for firewall rules - some rule re-factorisation is required to gain space. More planning details for this transparent change tomorrow.
  • Gavin/CERN: the Linux software update for the CERN VOBOXes, which was planned for yesterday, had to be rescheduled to next Monday.

AOB:

Wednesday

Attendance: local(Maarten, Ignacio, Nicolo, Nilo, Eva, Eduardo, Akshat, Gavin,JPB, Roberto, Patricia, Dirk);remote( Jon FNAL, Vera/NDGF,John/RAL, Onno/NL-T1, Rolf/IN2P3, Andrea/CNAF, Kyle/OSG, Rod, Alexei, Xavier/KIT).

Experiments round table:

  • ATLAS reports - (Alexei)
    • voatlas75 issue is solved
    • TAIWAN subscription backlog (since ~May 9), data10*f* dataset status [staged] in the DDM dashboard. See GGUS:58178.
    • NAPOLI is at the limit of its import capability: it is currently importing at 100 MB/s (~1 Gb/s).
      • Several possibilities to reduce the number of transfers were discussed (ADC+GP): stop export of data10*.x* datasets to NAPOLI CALIBDISK.
    • DDM FT is resumed
    • AGLT2_DATADISK and MPPMU have problems
    • LHC and ATLAS
      • LHC Plan
        • Finish missing tests to increase intensity to 8x10^10 p/beam:
        • Power off test of LBDS with beam circulating
        • Test ramp
        • BPM calibration
        • Collimation during ramp
        • New since yesterday (MPP): will not increase bunch charge in the next two weeks for physics fills (2x10^10 p/bunch), but will increase the number of bunches: 4 tomorrow (4b_2_2_2) and then 6 bunches (6b_3_3_3) towards the end of the weekend. More discussions today in the LMC.
      • ATLAS
        • Stop ATLAS run, go back to new HLT release, resume ATLAS run (data10_7TeV)
        • Not sure whether there are stable beams this night
      • ADC
        • MC data distribution.
        • May 2010 reprocessing: final site validation can be started today

  • CMS reports - (Nicolo)
    • T0 Highlights:
      • Small amount of timeouts for large files (7-36 GB) on FTS channels (CERN-FNAL, CERN-ASGC) - considering a request for a size-based timeout (a rough sizing sketch follows this report).
    • T1 Highlights:
      • Prompt Skimming running
        • Large activity at KIT (1300 jobs) - transfer and SAM quality degraded, correlated?
        • 35 custodial input files not accessible at CNAF Savannah #114343 - corrupt, retransferring.
      • Backfill tests running
      • PIC
        • Import of MC files from T2_ES_IFCA failing with "Marking Space as Being Used failed >Already have 1 record(s) with pnfsPath..." Savannah #114413.
    • Detailed report on ticket progress:
      • [ OPEN ] T0_CH_CERN - Files on archived tape at T0 - Remedy #CT663457, and Savannah #112927 Update 8/3: files recovered from other sites when possible, but other files only have a copy on the damaged tape.
      • [ OPEN ] T0_CH_CERN - Investigation on low level of "source error during transfer preparation phase" from T0 Remedy #CT678223
      • [ CLOSED ] T0_CH_CERN - Large backlog in tape migration on CASTORCMS T0EXPORT, GGUS:58099 TEAM
      • [ OPEN ] T0_CH_CERN - 2 files stuck in tape migration in STAGEOUT state on CASTORCMS T1TRANSFER - Remedy #683052
      • [OPEN ] T1_ES_PIC - ongoing investigation on network issues to T2_US_Caltech Savannah #113493.
      • [ OPEN ] T1_ES_PIC - Transfer failures from T2_ES_IFCA Savannah #114413.
      • [ OPEN ] T1_IT_CNAF - 12 files lost from StoRM before migration to tape Savannah #114130
      • [ OPEN ] T1_IT_CNAF - 35 custodial input files not accessible at CNAF Savannah #114343
      • [ CLOSED ] T1_DE_KIT - some SRM SAM test timeouts during weekend. Savannah #114342 - Update 11/5: still occasional instability in CE tests (Maradona) and SRM tests (timeouts)
    • T2 Highlights:
      • Running MC production
        • Upload of MC from T2_RU_SINP to T1_US_FNAL failing with permission errors.
      • T2_RU_JINR - site receiving excessive amount of CMS glidein pilot jobs, under investigation Savannah #114329 and GGUS:58056
      • T2_RU_ITEP - SAM CE/SRM errors
      • Low performance of SW area at T2_IT_Bari
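  • Note on the size-based timeout idea in the CMS T0 report: one simple way to realise it is to derive the per-transfer timeout from a fixed setup overhead plus the time the file would need at a worst-case throughput. The sketch below is illustrative only - the overhead, floor rate and minimum are placeholders, not agreed FTS channel settings.

```python
#!/usr/bin/env python
"""Minimal sketch of a size-based transfer timeout.

Assumptions (placeholders, not FTS settings): 10 minutes of setup overhead,
a 5 MB/s worst-case throughput per file, and a one-hour minimum timeout.
"""


def transfer_timeout(size_bytes, floor_mb_per_s=5.0, overhead_s=600, minimum_s=3600):
    """Timeout = setup overhead + time to move the file at the floor rate."""
    expected_s = overhead_s + size_bytes / (floor_mb_per_s * 1024 * 1024)
    return max(minimum_s, int(expected_s))


if __name__ == "__main__":
    GB = 1024 ** 3
    for size in (2 * GB, 7 * GB, 36 * GB):  # typical file, and the 7-36 GB range from the report
        print("%5.1f GB -> timeout %6d s" % (size / float(GB), transfer_timeout(size)))
```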

  • ALICE reports - (Patricia)
    • GENERAL INFORMATION: Pass1 and Pass2 (900 GeV) reconstruction tasks together with 2 analysis trains and 2 MC cycles
    • T0 site
      • GGUS:58133, submitted yesterday concerning ce201.cern.ch: waiting for news. Not a showstopper for the experiment, but it would be much safer to have the system back in production.
    • T1 sites
      • nothing remarkable to report
    • T2 site
      • Prague-T2: already migrated to CREAM 1.6; problems today with the resource BDII associated with the system. The issue is unrelated to the new version of CREAM, but it is nevertheless stopping ALICE production at this site. Reported to the experts.
      • The issues reported yesterday concerning the Russian T2 sites have been successfully corrected.

  • LHCb reports - (Roberto)
    • Experiment activities: no data taken recently; the next fill will be tonight or tomorrow. It will be for MagDown, as we have less data than with MagUp. Stripping-02 progress so far is 66.2% for MagUp and 86.4% for MagDown (productions launched on Friday). Jobs during the night were all declared stalled (DIRAC issue). The workflow for Stripping-03 is commissioned and ready to go. A clean-up campaign of Stripping-01 data will be done on Monday.
    • GGUS (or RT) tickets:
      • T0: 1
      • T1: 0
      • T2: 3
    • Issues at the sites and services
      • T0 site issues:
        • CREAM CE: all jobs failing on the 10th (GGUS:58120)
      • T1 site issues: none
      • T2 sites issues:
        • too many jobs failing at: USC-LCG2, INFN-NAPOLI, RO-07-NIPNE

Sites / Services round table:

  • FNAL - ntr
  • NDGF - ntr
  • RAL - ntr
  • NL-T1 - yesterday's issue with the NIKHEF WMS (GGUS:58151) was solved with a restart; the NIKHEF CE daemon was found down and has been restarted; the high load is under investigation.
  • IN2P3 - ntr
  • CNAF - ntr
  • KIT - ntr
  • OSG - ntr

  • Eva/CERN - yesterday's DB interventions went fine; planning the migration of the remaining integration clusters over the next two weeks.
  • Ignacio/CERN - will keep an eye on the CMS T0EXPORT migrations together with the CASTOR developers.
  • Eduardo/CERN: bypass ruleset has been successfully reconfigured. The change was applied at midday.

AOB:

  • Nicolo: GGUS seems to be down right now - the GGUS team is aware of the problem. (MariaDZ) This is due to the DNS problem in Germany, potentially affecting all .de domains, as reported by Andreas Heiss (KIT.de) in an email to wlcg-operations.

Thursday

No meeting - CERN closed

Friday

No meeting - CERN closed

-- JamieShiers - 05-May-2010
