Week of 120514

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Jamie, Maarten, Rajah, Lukasz, Simone, Claudio, Ignacio, Massimo, Alex, MariaDZ, Eva);remote(Michael, Lisa, Roger, Paolo, Jhen-Wei, Onno, Tiju, Rob, Dimitri).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • NTR
  • T1s/CalibrationT2s
    • RAL storage started failing at 17:00 on Friday with an authentication error (CRL). GGUS:82148. RAL re-ran the script by hand to fetch CRLs, but an ~80% failure rate remained. On Friday evening we decided not to escalate to ALARM, since the problem had already been reported and in any case no dataset was subscribed to RAL TAPE. On Saturday morning RAL experts fixed the problem by re-running the CRL script on all components (SRM, gridFTP). The issue is not completely understood.
    • On Friday afternoon, a problem in transferring data from/to NIKHEF was spotted. GGUS:82143. The problem appeared to be selective and to affect only NIKHEF-Canada transfers, so people started investigating the network. NL network experts spotted a problem in the OPN link NIKHEF/SARA -> TRIUMF, where all packets were dropped on the hop just after gw-sara-local.lhc-tier1.triumf.ca towards srm.triumf.ca. The NL network experts excluded a problem in the NL segment and passed the ticket to CA. CA fixed the issue promptly (the problem was in the BGP session between TRIUMF and SARA). The network issue had been present since ~06:00 Wednesday (UTC) but from the ATLAS point of view it was masked by several concurrent factors (low CA-NL traffic, error flooding from SARA-MATRIX, an issue in the functional-test machinery).
    • NAPOLI (a calibration Tier-2) suffered a power cut and was unavailable for 1 day over the weekend.
    • A DPM disk server failure at ASGC reduced the ATLAS capacity at the TW T1 by 140 TB (in the GROUPDISK and DATADISK space tokens). The problem needs one week for a fix. Some iteration is needed with the TW people to understand the impact on existing data. [ Jhen-Wei: the reduction affects GROUPDISK and SCRATCHDISK. In DPM we configure multiple controllers for all DPM servers. Last week one controller had a hardware failure, making throughput slow. To minimize the impact on ATLAS this disk server was set read-only, keeping read requests but disallowing new write requests. Hence the reduction in logical space for ATLAS, but there should be no impact on existing data (dual access paths), though maybe some performance impact. No data lost, but possibly slow access. ]
    • The ATLAS DB monitoring reported blocking sessions several times in various ATLAS databases: ADCR (instances 1 and 2), ATLR and ATONR. From the application point of view, no issue has been observed. A discussion with the ATLAS DB experts is ongoing (a query sketch is shown below).
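
      As an illustration only, a minimal sketch (Python with cx_Oracle; the DSN, credentials and query are placeholders, not the actual ATLAS DB monitoring) of how blocking sessions can be listed from v$session:

        # Minimal sketch: list sessions currently blocked by another session.
        # Assumes cx_Oracle is installed and the account can read v$session.
        import cx_Oracle

        def list_blocking_sessions(user, password, dsn):
            """Return (sid, serial#, username, blocking_session, seconds_in_wait) rows."""
            with cx_Oracle.connect(user=user, password=password, dsn=dsn) as conn:
                cur = conn.cursor()
                cur.execute(
                    "SELECT sid, serial#, username, blocking_session, seconds_in_wait "
                    "FROM v$session WHERE blocking_session IS NOT NULL"
                )
                return cur.fetchall()

        # Example (placeholder connect string):
        # rows = list_blocking_sessions("monitor", "secret", "dbhost.example.cern.ch:1521/ADCR")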

  • CMS reports -
  • LHC machine / CMS detector
    • Physics data taking
  • CERN / central services and T0
    • Tier-0 suspended operations during CMSR intervention (upgrade of offline DB)
  • Tier-1/2:
    • Still file-access errors at ASGC (intermittent timeouts when accessing files; many jobs fail).



  • LHCb reports -
  • Data reprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s
  • Prompt reprocessing of data

  • T0
    • Disk servers down at CERN. This also created a few problems for LHCb FTS transfers, due to the logic used for submitting the FTS requests (a DIRAC issue - being looked at for improvement). [ Massimo - two cases: one disk server had problems from Monday and was being drained but broke completely on Friday; the list of unavailable files is in the ticket. The alarm ticket on Saturday concerns the same server; the ticket refers to raw data but only .dst files were on that machine. Rajah - will explain offline: there is some new logic in our data-management system which the people on call did not fully understand. Massimo - last night another machine crashed; we are trying to recover it and will bring in new machines. ]
  • T1
    • SARA : Problem with FTS (GGUS:82178).
    • IN2P3 : Various corrupted files - some files apparently get corrupted (checksum mismatch) after some time, even if the file seems fine right after being transferred to IN2P3; a verification sketch is given after this list. There is an old, verified ticket (GGUS:80338), but the problem is ongoing. The IN2P3 contact (Aresh) is aware of this. Corruption is seen in files from different disk servers. [ Rolf - we still see a lot of short-running pilot jobs here, so we have limited the number of running jobs to avoid overloading the CEs. Rajah - is this a limitation of SGE? Rolf - no; there are too many short-running pilot jobs, over 90%. ]
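
      As an illustration of the kind of check involved, a minimal sketch (Python; the file path and reference checksum are placeholders, and the catalogue lookup itself is not shown) that recomputes a file's adler32 and compares it with the value recorded at transfer time:

        # Minimal sketch: recompute an adler32 checksum and compare it with the
        # value stored in the catalogue when the file was first transferred.
        import zlib

        def adler32_hex(path, blocksize=1024 * 1024):
            """Return the adler32 of a file as an 8-character lowercase hex string."""
            value = 1  # adler32 starting value
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(blocksize), b""):
                    value = zlib.adler32(chunk, value)
            return "%08x" % (value & 0xFFFFFFFF)

        def is_corrupted(path, catalogue_checksum):
            # catalogue_checksum is a hex string, e.g. "0a1b2c3d" (placeholder)
            return adler32_hex(path) != catalogue_checksum.lower().zfill(8)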


Sites / Services round table:

  • BNL - ntr
  • FNAL - ntr
  • CNAF - ntr
  • ASGC - nta
  • NDGF - ntr
  • IN2P3 - nta
  • NL-T1 - two things: the SARA SRM goes into scheduled maintenance in ~10 minutes from now to activate an MSS configuration change; the FTS issue reported by LHCb and also noticed by ATLAS is probably caused by load on the SRM. The main ticket, GGUS:82032, is probably caused by the storage hosting the namespace DB, which is not optimally configured. We can try to reconfigure it using LVM, hopefully transparently, but it could take a while and decrease performance. The second option is to schedule a downtime and move the DB to a faster machine temporarily; it would have to be moved back later, so two downtimes, each an hour or so - TBD. Rajah - has this load on the SRM been seen regularly in the last month or so? Onno - we have had issues for some months now, recently getting worse. Rajah - could this mean that jobs are having trouble getting TURLs?
  • RAL - two things: the ATLAS issue is still under investigation; it also affected LHCb and CMS. Tomorrow from 10:00 to 11:00 local time there will be a downtime on the CASTOR LHCb instance to upgrade the tape gateway.
  • KIT - ntr
  • OSG - ntr

  • CERN storage - two interventions: Wed 16th - agreed with CMS to patch EOS CMS to fix a bug. Wed 23rd - intervention on the DB behind the name server - should be transparent - at risk for ~1h, most likely 10:00-11:00.

  • CERN grid - the MyProxy service was slow last week; solved by blacklisting one (biomed) user who was using 4 computers at 3 sites. PRODLFC ATLAS - increased the number of IPs exposed by the alias - now 4 out of 5 are exposed (see the sketch below).
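
    As an illustration only, a minimal sketch (Python; the alias name below is a placeholder, not the real PRODLFC alias) of how to list the IPs currently exposed by a DNS load-balanced alias:

      # Minimal sketch: list the IPv4 addresses currently returned for an alias.
      # "prod-lfc-alias.example.cern.ch" is a placeholder hostname.
      import socket

      def exposed_ips(alias):
          infos = socket.getaddrinfo(alias, None, socket.AF_INET)
          return sorted({info[4][0] for info in infos})

      # Example: print(exposed_ips("prod-lfc-alias.example.cern.ch"))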

  • CERN DB - Oracle security patch being applied
AOB: (MariaDZ) The file ggus-tickets.xls with the total numbers of GGUS tickets per experiment per week is up to date and attached to the WLCGOperationsMeetings page. There was one real ALARM last week, GGUS:82166, raised on Sunday 2012/05/13 by LHCb against CERN; detailed drills are attached to this page. As the 2012/05/15 WLCG MB was cancelled again, the drills for the new MB date 2012/06/05 will cover 12 weeks (14 real ALARMs so far since the last MB).

Tuesday

Attendance: local(Raja, Mike, Jamie, Alessandro, Eva, Massimo, Alex, Maarten);remote(Michael, Ronald, Lisa, Xavier, Jhen-Wei, Roger, Rolf, Tiju, Ian, Jeremy, Rob).

Experiments round table:

  • ATLAS reports -
  • ATLAS general: data-taking period; as much stable beam as possible is foreseen in the coming days.
  • T0/Central Services
    • NTR
  • T1s/CalibrationT2s
    • SARA-MATRIX storage instabilities GGUS:82032 (GGUS:82167 and GGUS:82169 are related to this first ticket). Onno replied that they have scheduled a downtime (23rd of May) to move the SRM database to other hardware.
    • Transfers from Romania to TRIUMF were failing due to a possible network problem between TRIUMF and Romania: the TRIUMF experts could not download the CRL of the Romanian CA on the FTS server. This problem has now been patched, i.e. they found a workaround to fetch the CRL, but it still needs to be sorted out properly. GGUS:82154
    • LCGR DB: yesterday's rolling intervention took one hour more than expected, but then everything was OK. Today the same interventions on ADCR and ATLR went smoothly.

  • CMS reports -
  • LHC machine / CMS detector
    • Physics data taking
  • CERN / central services and T0
    • Nothing to report
  • Tier-1/2:
    • We are asking some sites to check production priority since we're not getting as many slots as we expected.

  • ALICE reports -
    • NIKHEF VOBOX unavailable because of scheduled maintenance

  • LHCb reports -
  • Data reprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s
  • Prompt reprocessing of data

  • T0
    • Retiring disk servers : Waiting for information from CERN before deciding what to do about the files which are currently unavailable. [ We now have a list of files; only 1 file is not recovered - AOK ]
  • T1
    • GridKa : Problems accessing data (GGUS:82200) - restarted an LFC daemon which had died and had not restarted automatically. OK now.
    • IN2P3 :
      • Continuing problem with corrupted files - offline discussions ongoing.
      • Would like to know the DN of the "short" jobs coming into IN2P3 as reported yesterday, and if possible a plot of the extent of the problem. We cannot see any significant problem from the LHCb side - we are doing the same thing for IN2P3 as for all other sites. [ Rolf - the problem with short pilot jobs disappeared yesterday; the shortest-running jobs now take ~1000 seconds. For the corrupted files please open another ticket, as the previous one is closed and marked as verified. ]
      • Pilots failing at IN2P3-T2 (GGUS:82228). Apparently error-152

Sites / Services round table:

  • BNL - short incident yesterday affecting data processing caused by an NFS appliance that hung and needed a restart
  • NL-T1 - site maintenance at NIKHEF going to plan, about to restart services
  • FNAL - ntr
  • KIT - ntr
  • ASGC - regarding the DPM disk problem reported yesterday, the vendor came to fix it this morning and the problematic server returned to normal and is running smoothly. If no errors are observed we will restore the ATLAS settings. Reminder: scheduled downtime this Friday to patch the DPM head node and upgrade the DPM disk servers, 07:00 - 13:00 UTC. [ Alessandro - in case of future problems could we please discuss the actions a bit more, e.g. the space limits that were applied. ]
  • NDGF - ntr
  • IN2P3 - next week, on 22 May, we will have a batch outage, with a slowdown starting in the evening of the 21st; the mass storage system will be down all day on 22 May and only disk files will be available through dCache. There may also be some network problems for ~1h in the morning - everything else is in the downtime notifications.
  • RAL - successfully upgraded CASTOR for LHCb this morning; tomorrow morning the same for ATLAS and CMS. Tape gateway upgrade
  • GridPP - ntr
  • OSG - ntr

  • CERN DB - yesterday during the patching of LCGR there was a problem in the configuration of the listener - a human mistake; the DB was unavailable for ~1h. Today during the ATLAS patching there was a problem with one of the ATLR nodes, which died during the intervention - waiting for a motherboard replacement.

  • CERN dashboards - sites need to patch FTS to fix the ActiveMQ bug; only 2 have done so far. The Tier-0 FTS was patched on Friday and since then we have received no data from it.
AOB:

Wednesday

Attendance: local(Jamie, Rajah, Massimo, Alessandro, Eddie, Mike, Luca);remote(Michael, Stephane, Alexander, Paolo, Lisa, Roger, John, Pavel, Jhen-Wei, MariaDZ, Ian).

Experiments round table:

  • ATLAS reports -
    • NDGF-T1: GGUS:82231 (and GGUS:82238): high rate of jobs failing: "cvmfs problem on titan nodes, investigating". [ Roger - looks more like a permission problem on the LFC. Ale - will check; maybe some confusion between the two tickets. ]
    • RAL GGUS:82148: the site had to update the CRL manually. It is the second time this has happened in the past week; why is still to be understood.
    • TRIUMF - Romanian Cert. Auth. : GGUS:82154 : confirmation of an asymmetric network path affecting the TRIUMF FTS for Romania -> Canada transfers.
    • To all the Tier-0/1s running FTS servers: please check yesterday's WLCG daily call AOB and apply the patches for activemq-cpp and for gridftp v2.
    • It looks like the CRL renewal at CERN is done less than a day before the old one expires. This creates problems, as the CRL can be updated only once per day (it depends on a cron job at the Grid site); see the sketch after this list.
    • RAL: some raw data were not written to DATATAPE and have to be recovered from CERN. Reference for the files probably not written to tape: https://savannah.cern.ch/bugs/?94647
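
      Regarding the CRL renewal timing above, a minimal sketch (assuming PEM-encoded *.r0 CRL files under the conventional /etc/grid-security/certificates directory and the Python "cryptography" package; this is not the fetch-crl tool itself) that flags CRLs closer to expiry than one daily cron cycle:

        # Minimal sketch: warn about CRLs that expire within one daily fetch cycle.
        # Assumes PEM-encoded .r0 files; the directory is the conventional location.
        import datetime
        import glob
        from cryptography import x509

        MARGIN = datetime.timedelta(hours=24)  # one daily cron cycle

        def crls_expiring_soon(cert_dir="/etc/grid-security/certificates"):
            now = datetime.datetime.utcnow()
            soon = []
            for path in glob.glob(cert_dir + "/*.r0"):
                with open(path, "rb") as f:
                    crl = x509.load_pem_x509_crl(f.read())
                if crl.next_update - now < MARGIN:
                    soon.append((path, crl.next_update))
            return soon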


  • CMS reports -
  • LHC machine / CMS detector
    • Physics data taking
  • CERN / central services and T0
    • There was a failure of the CRL publication that impacted the central web servers at CERN (amongst other things). There was an alarm ticket issued last night since it was after hours, and these pages host services used for monitoring data taking. Computing operations said it was forwarded, and Maarten commented, but the person who resolved the issue did not comment or close the ticket even today. I closed it myself after asking for an update.
  • Tier-1/2:
    • Selected larger Tier-2s were asked to increase the production share for 3 weeks.


  • LHCb reports -
  • Data reprocessing of 2012 data at T1s with new alignment
  • MC simulation at Tier-2s
  • Prompt reprocessing of data

  • T0
    • Still "Input data resolution" problems by jobs at CERN. Suspect one of the faulty diskservers still does not have the files there recovered. [ Massimo - one diskserver not yet recovered; files declared not reachable. ]
  • T1
    • IN2P3 :
      • Continuing problem with corrupted files - in addition to the offline discussions, a GGUS ticket has been opened (GGUS:82247). [ Rolf - thanks for opening the ticket; we will do an initial update with all the history up to now, and our T1 coordinator will lead a small working group on this issue. ]
    • SARA :
      • The problem with the SRMs continues even after yesterday's unscheduled intervention on the PostgreSQL DB (GGUS:82178).
      • The disk cache in front of tape is very low (0.8 TB free out of 30 TB). Is there anything to be done about this? When does garbage collection start? Also see : https://sls.cern.ch/sls/service.php?id=SARA_LHCb-Tape [ Alexander - please open a GGUS ticket about the cache issue. ]

Sites / Services round table:

  • BNL - ntr
  • NL-T1 - ntr
  • CNAF - ntr
  • FNAL - ntr
  • NDGF - downtime on Friday for a reboot of the cache pools; this will affect the availability of some data.
  • RAL - had a data loss today due to a disk server failure; we tried to migrate the files but 34 files had bad checksums. CRL issue: we have a cron job that runs between midnight and 08:00; the jobs ran fine but all disk servers had an expired CRL - probably as in the ATLAS report. Today's downtimes all completed successfully.
  • KIT - ntr
  • ASGC - ntr
  • OSG - GGUS changes to SOAP in SAV:127763; please keep us updated! Maria - announced last week; not yet scheduled for release. A request has been opened for Service-Now. Only after all systems have done enough testing will we schedule the release, so at least one month from now!

  • CERN - ntr

AOB:

Thursday

No meeting - CERN closed

AOB:

Friday

Attendance: local (Andrea, Massimo, Stephane, Eva, Mike, Raja, Ulrich); remote (Michael/BNL, Alexander/NLT1, John/RAL, Mete/NDGF, Jhen-Wei/ASGC, Lisa/FNAL, Jeremy/GridPP, Rob/OSG, Paolo/CNAF; Ian/CMS).

Experiments round table:

  • ATLAS reports -
    • The backlog of T0 export to SARA reached a day -> SARA is excluded from T0 data export until the storage issue is solved. ATLAS will make sure that the current backlog is drained today. NIKHEF is not affected.
    • Lost files at RAL (did not reach tape) : Savannah:94647 : RAW data recovered as expected; 18 MC files completely lost.
    • RAL (GGUS:82317) : problem staging simulation files from tape. [John: this is related to the previous issue; the GGUS ticket is being updated with the list of lost files.]
    • FZK (GGUS:82306) : 2 FTS channels were stopped for more than 24 hours, partially blocking the simulation activity. Explanation : 'Probably a problem with logrotate'.
    • Many sites were automatically blacklisted by HammerCloud last night, related to an upgrade of the PanDA pilot conflicting with the installed 'AtlasSetup'. The pilot update was rolled back this morning and the sites were automatically unblacklisted after a few hours. The problem was solved quickly enough to avoid any impact on production activity.

  • CMS reports -
    • LHC machine / CMS detector
      • Physics data taking
    • CERN / central services and T0
      • Experimenting with running backfill simulation during CERN running. Seems to work and doesn't negatively impact Tier-0 running
    • Tier-1/2:
      • Selected larger Tier-2s were asked to increase the production share for 3 weeks.
    • Other:
      • Nicolo Magini will take over CRC on Monday. Much of CMS Computing will be at CHEP next week.

  • LHCb reports -
    • Data reprocessing of 2012 data at T1s with new alignment
    • MC simulation at Tier-2s
    • Prompt reprocessing of data
    • Disabled filling mode for pilots for next 10 days.
    • T0
      • Still "Input data resolution" problems by jobs at CERN. Waiting for the files from the faulty diskserver to be recovered.
    • T1
      • RAL : Problem with batch system / aborted pilots (GGUS:82304)
      • SARA :
        • Continuing huge problems with access to data (GGUS:82178). No new GGUS ticket opened.
        • Disk cache in front of tape very low (GGUS:82267). Solved.

Sites / Services round table:

  • KIT - closed today for a bridge holiday ("pont")
  • Michael/BNL: ntr
  • Alexander/NLT1: ntr
  • John/RAL: several problems on Wednesday related to the new CRL published by CERN; these should be published in advance, and follow-up is needed on this. CERN should publish the new CRL at least a couple of days before the old ones expire, because the sites must copy the new lists. Note that some CRLs will expire this Saturday, so something needs to be done urgently. [Stephane: this is related to GGUS:82148 for ATLAS, but it is a RAL ticket. Will open a ticket against CERN. Ulrich: thanks, will then dispatch it to the CA experts. Andrea after the meeting: opened INC:130120. ]
    • [Raja: status with batch? John: fixed a few issues, but others are still being followed up.]
  • Mette/NDGF: ntr
  • Jhen-Wei/ASGC: dpm maintenance finished two hours ago
  • Lisa/FNAL: issues with some jobs that seem to fail in the Condor queue (there are two savannah tickets about this), still being followed up
  • Jeremy/GridPP: ntr
  • Rob/OSG: ntr
  • Paolo/CNAF: ntr

  • CERN FTS - We added support for the na62.vo.gridpp.ac.uk VO today to all FTS instances at CERN.
  • Eva/Databases: ntr
  • Mike/Dashboard: ntr
  • Ulrich/Grid: ntr
  • Massimo/Storage: intervention on DB behind CASTOR on Tuesday, will affect all instances, will be marked as at risk in GOCDB
    • [Raja: what is the status of LHCb diskserver? Massimo: vendor will only come on Tuesday, sorry if it is not before]

AOB: none

-- JamieShiers - 11-Apr-2012

Topic attachments
  • ggus-data.ppt (PowerPoint, 2564.5 K, attached 2012-05-09 17:59 by MariaDimou) - GGUS slides for the MB. Drills complete, totals not yet available.